Getting Started with Data Clustering in Microsoft Fabric Data Warehouse

Data clustering is one of the most powerful performance features added to Microsoft Fabric Data Warehouse. It organizes your data physically in storage so that rows with similar values stay close together, and that drives two big wins: dramatically faster queries and lower compute costs on large datasets.
Reference: https://learn.microsoft.com/en-us/fabric/data-warehouse/data-clustering

If you are a data engineer or analytics professional working with Fabric, understanding how to leverage data clustering will make your analytics workloads faster and cheaper. In this article we are walking through how it works, when to use it, the CLUSTER BY syntax with examples, and practical workshop-ready scripts you can run right away.
Reference: https://learn.microsoft.com/en-us/fabric/data-warehouse/data-clustering


How Data Clustering Works

At its core, data clustering changes how rows are stored on disk during ingestion. Instead of landing rows in arbitrary order, Fabric uses a space-filling curve algorithm to organize rows so that similar values in your chosen clustering columns end up physically close together.
Reference: https://learn.microsoft.com/en-us/fabric/data-warehouse/data-clustering

This storage arrangement makes range filters and selective scans much faster because the engine can skip entire files or row groups that do not match your filter predicate.
Reference: https://learn.microsoft.com/en-us/fabric/data-warehouse/data-clustering

This grouping happens when the data is ingested into the warehouse. It is not just a sort order like a traditional index. It is a physical layout change that lets the query engine read only the minimal set of files needed to satisfy a filter.
Reference: https://learn.microsoft.com/en-us/fabric/data-warehouse/data-clustering

For example, if your query targets 10 percent of a table’s data by date range, clustering can ensure only that 10 percent of the table is scanned, reducing I/O and compute consumption.
Reference: https://learn.microsoft.com/en-us/fabric/data-warehouse/data-clustering

The result is reduced I/O, lower CPU, and faster results, especially on large fact tables where scanning everything would be wasteful.


When You Should Use Data Clustering

Not every table benefits equally from clustering. Here are the most common scenarios where it delivers the biggest gains.

Frequent filtered queries

If the workload includes frequent queries filtering specific columns, data clustering ensures that only relevant files are scanned during read queries.
Reference: https://learn.microsoft.com/en-us/fabric/data-warehouse/data-clustering

Large tables

Data clustering is most effective when applied to large tables where scanning the full dataset is costly.
Reference: https://learn.microsoft.com/en-us/fabric/data-warehouse/data-clustering

Mid to high cardinality columns

Columns with higher cardinality such as IDs or dates benefit more from data clustering because they allow the engine to isolate and colocate similar values.
Reference: https://learn.microsoft.com/en-us/fabric/data-warehouse/data-clustering

Columns with low cardinality such as gender or region offer limited opportunities for file skipping because values are naturally spread across more files.
Reference: https://learn.microsoft.com/en-us/fabric/data-warehouse/data-clustering

Selective queries with narrow scope

When queries typically target a small subset of data combined with a WHERE filter, clustering ensures only files that contain relevant rows are read.
Reference: https://learn.microsoft.com/en-us/fabric/data-warehouse/data-clustering

Fabric also provides Query Insights so you can measure CPU time and scanned data for clustered versus non-clustered queries.
Reference: https://learn.microsoft.com/zh-tw/fabric/data-warehouse/tutorial-data-clustering


CLUSTER BY Syntax

Data clustering is defined when you create a table using the CLUSTER BY clause.
Reference: https://learn.microsoft.com/en-us/fabric/data-warehouse/data-clustering

Basic CREATE TABLE with clustering

CREATE TABLE Sales (
SaleID INT,
CustomerID INT,
SaleDate DATE,
Amount DECIMAL(10,2)
)
WITH (CLUSTER BY (CustomerID, SaleDate));

This creates a clustered table that groups similar CustomerID and SaleDate values together on disk.
Reference: https://learn.microsoft.com/en-us/fabric/data-warehouse/data-clustering

CREATE TABLE AS SELECT with clustering

CREATE TABLE SalesClustered
WITH (CLUSTER BY (SaleDate))
AS
SELECT * FROM Sales;

This copies the existing Sales table into a new clustered table with clustering on the SaleDate column.
Reference: https://learn.microsoft.com/en-us/fabric/data-warehouse/data-clustering

Key rules:


Supported Data Types for Clustering

Only certain column types can be used in CLUSTER BY.
Reference: https://learn.microsoft.com/en-us/fabric/data-warehouse/data-clustering

Supported:

  • int, bigint, smallint, numeric
  • float, real
  • date, datetime2, time
  • char, varchar

Not supported:

  • bit
  • varchar(max), varbinary(max)
  • uniqueidentifier

For string types only the first 32 characters are used when statistics are produced, so long prefixes might have limited benefits.
Reference: https://learn.microsoft.com/en-us/fabric/data-warehouse/data-clustering


How to Inspect Existing Clustering

You can view clustering metadata using system views.

SELECT
t.name AS table_name,
c.name AS column_name,
ic.data_clustering_ordinal AS clustering_ordinal
FROM sys.tables t
JOIN sys.columns c
ON t.object_id = c.object_id
JOIN sys.index_columns ic
ON c.object_id = ic.object_id
AND c.column_id = ic.column_id
WHERE ic.data_clustering_ordinal > 0
ORDER BY
t.name,
ic.data_clustering_ordinal;

This shows all columns used in data clustering across the warehouse.
Reference: https://learn.microsoft.com/en-us/fabric/data-warehouse/data-clustering


Workshop Scripts and Examples

1. Create a clustered copy

CREATE TABLE nyTaxi_clustered
WITH (CLUSTER BY (lpepPickupDatetime))
AS
SELECT * FROM nyTaxi;

This creates a clustered version keyed on the pickup date time column.
Reference: https://learn.microsoft.com/zh-tw/fabric/data-warehouse/tutorial-data-clustering

2. Compare performance

SELECT
YEAR(lpepPickupDatetime),
AVG(fareAmount) as AvgFare
FROM nyTaxi
WHERE lpepPickupDatetime BETWEEN '2008-12-31' AND '2014-06-30'
GROUP BY YEAR(lpepPickupDatetime)
OPTION (LABEL = 'Regular');
SELECT
YEAR(lpepPickupDatetime),
AVG(fareAmount) as AvgFare
FROM nyTaxi_clustered
WHERE lpepPickupDatetime BETWEEN '2008-12-31' AND '2014-06-30'
GROUP BY YEAR(lpepPickupDatetime)
OPTION (LABEL = 'Clustered');

Use Query Insights to compare data scanned and CPU usage.
Reference: https://learn.microsoft.com/zh-tw/fabric/data-warehouse/tutorial-data-clustering


Best Practices

Remember that clustering adds ingestion overhead because data must be ordered during load.
Reference: https://learn.microsoft.com/en-us/fabric/data-warehouse/data-clustering


Final Thoughts

Data clustering in Microsoft Fabric Data Warehouse is a practical and powerful optimization for analytic workloads. By colocating similar rows, you reduce I/O, CPU, and overall capacity unit consumption while speeding up dashboards and reports.

Always validate with Query Insights before and after implementation. When aligned with real query predicates, clustering can transform large slow scans into highly efficient file skipping operations.


References

  1. Data Clustering in Fabric Data Warehouse
    https://learn.microsoft.com/en-us/fabric/data-warehouse/data-clustering
  2. Tutorial Using Data Clustering in Fabric
    https://learn.microsoft.com/zh-tw/fabric/data-warehouse/tutorial-data-clustering
  3. Query Insights Overview
    https://learn.microsoft.com/en-us/fabric/data-warehouse/query-insights

Discover more from SQLyard

Subscribe to get the latest posts sent to your email.

Leave a Reply

Discover more from SQLyard

Subscribe now to keep reading and get access to the full archive.

Continue reading