Data clustering is one of the most powerful performance features added to Microsoft Fabric Data Warehouse. It organizes your data physically in storage so that rows with similar values stay close together, and that drives two big wins: dramatically faster queries and lower compute costs on large datasets.
Reference: https://learn.microsoft.com/en-us/fabric/data-warehouse/data-clustering
If you are a data engineer or analytics professional working with Fabric, understanding how to leverage data clustering will make your analytics workloads faster and cheaper. In this article we are walking through how it works, when to use it, the CLUSTER BY syntax with examples, and practical workshop-ready scripts you can run right away.
Reference: https://learn.microsoft.com/en-us/fabric/data-warehouse/data-clustering
How Data Clustering Works
At its core, data clustering changes how rows are stored on disk during ingestion. Instead of landing rows in arbitrary order, Fabric uses a space-filling curve algorithm to organize rows so that similar values in your chosen clustering columns end up physically close together.
Reference: https://learn.microsoft.com/en-us/fabric/data-warehouse/data-clustering
This storage arrangement makes range filters and selective scans much faster because the engine can skip entire files or row groups that do not match your filter predicate.
Reference: https://learn.microsoft.com/en-us/fabric/data-warehouse/data-clustering
This grouping happens when the data is ingested into the warehouse. It is not just a sort order like a traditional index. It is a physical layout change that lets the query engine read only the minimal set of files needed to satisfy a filter.
Reference: https://learn.microsoft.com/en-us/fabric/data-warehouse/data-clustering
For example, if your query targets 10 percent of a table’s data by date range, clustering can ensure only that 10 percent of the table is scanned, reducing I/O and compute consumption.
Reference: https://learn.microsoft.com/en-us/fabric/data-warehouse/data-clustering
The result is reduced I/O, lower CPU, and faster results, especially on large fact tables where scanning everything would be wasteful.
When You Should Use Data Clustering
Not every table benefits equally from clustering. Here are the most common scenarios where it delivers the biggest gains.
Frequent filtered queries
If the workload includes frequent queries filtering specific columns, data clustering ensures that only relevant files are scanned during read queries.
Reference: https://learn.microsoft.com/en-us/fabric/data-warehouse/data-clustering
Large tables
Data clustering is most effective when applied to large tables where scanning the full dataset is costly.
Reference: https://learn.microsoft.com/en-us/fabric/data-warehouse/data-clustering
Mid to high cardinality columns
Columns with higher cardinality such as IDs or dates benefit more from data clustering because they allow the engine to isolate and colocate similar values.
Reference: https://learn.microsoft.com/en-us/fabric/data-warehouse/data-clustering
Columns with low cardinality such as gender or region offer limited opportunities for file skipping because values are naturally spread across more files.
Reference: https://learn.microsoft.com/en-us/fabric/data-warehouse/data-clustering
Selective queries with narrow scope
When queries typically target a small subset of data combined with a WHERE filter, clustering ensures only files that contain relevant rows are read.
Reference: https://learn.microsoft.com/en-us/fabric/data-warehouse/data-clustering
Fabric also provides Query Insights so you can measure CPU time and scanned data for clustered versus non-clustered queries.
Reference: https://learn.microsoft.com/zh-tw/fabric/data-warehouse/tutorial-data-clustering
CLUSTER BY Syntax
Data clustering is defined when you create a table using the CLUSTER BY clause.
Reference: https://learn.microsoft.com/en-us/fabric/data-warehouse/data-clustering
Basic CREATE TABLE with clustering
CREATE TABLE Sales ( SaleID INT, CustomerID INT, SaleDate DATE, Amount DECIMAL(10,2))WITH (CLUSTER BY (CustomerID, SaleDate));
This creates a clustered table that groups similar CustomerID and SaleDate values together on disk.
Reference: https://learn.microsoft.com/en-us/fabric/data-warehouse/data-clustering
CREATE TABLE AS SELECT with clustering
CREATE TABLE SalesClusteredWITH (CLUSTER BY (SaleDate))ASSELECT * FROM Sales;
This copies the existing Sales table into a new clustered table with clustering on the SaleDate column.
Reference: https://learn.microsoft.com/en-us/fabric/data-warehouse/data-clustering
Key rules:
- You can specify between one and four clustering columns
Reference: https://learn.microsoft.com/en-us/fabric/data-warehouse/data-clustering - Clustering must be defined at creation time and cannot be added later
Reference: https://learn.microsoft.com/en-us/fabric/data-warehouse/data-clustering - The order of columns in CLUSTER BY does not affect storage layout
Reference: https://learn.microsoft.com/en-us/fabric/data-warehouse/data-clustering
Supported Data Types for Clustering
Only certain column types can be used in CLUSTER BY.
Reference: https://learn.microsoft.com/en-us/fabric/data-warehouse/data-clustering
Supported:
- int, bigint, smallint, numeric
- float, real
- date, datetime2, time
- char, varchar
Not supported:
- bit
- varchar(max), varbinary(max)
- uniqueidentifier
For string types only the first 32 characters are used when statistics are produced, so long prefixes might have limited benefits.
Reference: https://learn.microsoft.com/en-us/fabric/data-warehouse/data-clustering
How to Inspect Existing Clustering
You can view clustering metadata using system views.
SELECT t.name AS table_name, c.name AS column_name, ic.data_clustering_ordinal AS clustering_ordinalFROM sys.tables tJOIN sys.columns c ON t.object_id = c.object_idJOIN sys.index_columns ic ON c.object_id = ic.object_id AND c.column_id = ic.column_idWHERE ic.data_clustering_ordinal > 0ORDER BY t.name, ic.data_clustering_ordinal;
This shows all columns used in data clustering across the warehouse.
Reference: https://learn.microsoft.com/en-us/fabric/data-warehouse/data-clustering
Workshop Scripts and Examples
1. Create a clustered copy
CREATE TABLE nyTaxi_clusteredWITH (CLUSTER BY (lpepPickupDatetime))ASSELECT * FROM nyTaxi;
This creates a clustered version keyed on the pickup date time column.
Reference: https://learn.microsoft.com/zh-tw/fabric/data-warehouse/tutorial-data-clustering
2. Compare performance
SELECT YEAR(lpepPickupDatetime), AVG(fareAmount) as AvgFareFROM nyTaxiWHERE lpepPickupDatetime BETWEEN '2008-12-31' AND '2014-06-30'GROUP BY YEAR(lpepPickupDatetime)OPTION (LABEL = 'Regular');
SELECT YEAR(lpepPickupDatetime), AVG(fareAmount) as AvgFareFROM nyTaxi_clusteredWHERE lpepPickupDatetime BETWEEN '2008-12-31' AND '2014-06-30'GROUP BY YEAR(lpepPickupDatetime)OPTION (LABEL = 'Clustered');
Use Query Insights to compare data scanned and CPU usage.
Reference: https://learn.microsoft.com/zh-tw/fabric/data-warehouse/tutorial-data-clustering
Best Practices
- Choose clustering columns based on real WHERE predicates
Reference: https://learn.microsoft.com/en-us/fabric/data-warehouse/data-clustering - Do not cluster by more columns than necessary
Reference: https://learn.microsoft.com/en-us/fabric/data-warehouse/data-clustering - Prefer mid to high cardinality columns
Reference: https://learn.microsoft.com/en-us/fabric/data-warehouse/data-clustering - Batch ingestion with at least one million rows
Reference: https://learn.microsoft.com/en-us/fabric/data-warehouse/data-clustering
Remember that clustering adds ingestion overhead because data must be ordered during load.
Reference: https://learn.microsoft.com/en-us/fabric/data-warehouse/data-clustering
Final Thoughts
Data clustering in Microsoft Fabric Data Warehouse is a practical and powerful optimization for analytic workloads. By colocating similar rows, you reduce I/O, CPU, and overall capacity unit consumption while speeding up dashboards and reports.
Always validate with Query Insights before and after implementation. When aligned with real query predicates, clustering can transform large slow scans into highly efficient file skipping operations.
References
- Data Clustering in Fabric Data Warehouse
https://learn.microsoft.com/en-us/fabric/data-warehouse/data-clustering - Tutorial Using Data Clustering in Fabric
https://learn.microsoft.com/zh-tw/fabric/data-warehouse/tutorial-data-clustering - Query Insights Overview
https://learn.microsoft.com/en-us/fabric/data-warehouse/query-insights
Discover more from SQLyard
Subscribe to get the latest posts sent to your email.


