When we query data in a Fabric Lakehouse, the engine needs to decide how to execute the query: which join strategy to use, which filters to push down, and how to parallelize work across partitions. These decisions are driven by the query optimizer, and one of its most important inputs is table statistics.
In this post, we’ll look at:
- What table statistics are,
- Why they matter,
- How they improve query performance,
- And some hands-on test results I observed in Fabric Lakehouse.
What Are Table Statistics?
Table statistics are metadata that describe the distribution of data in a table. In Fabric Lakehouse, statistics can include:
- Row count
- Column-level details: distinct counts, min/max values, null counts, average/max column length
- Extended statistics: histograms that give the optimizer a more detailed picture of data distribution
These statistics are collected automatically for Delta tables (when enabled) or can be triggered manually using ANALYZE command:
%%sql
ANALYZE TABLE tableName COMPUTE STATISTICS FOR ALL COLUMNS
Purpose of Table Statistics
The optimizer uses statistics to estimate cardinality (how many rows will pass through each stage of a query). This directly affects:
- Join strategy: Should it use a Broadcast Hash Join (fast, but requires small side to fit in memory) or a Sort-Merge Join?
- Filter pushdown: Can filters be applied earlier and more effectively?
- Aggregation strategy: How should partial aggregates be distributed?
- Query parallelism: How much work should be spread across partitions?
Without statistics, the optimizer guesses. With statistics, it makes informed decisions.
How Do They Improve Query Performance?
With good statistics:
- Queries avoid unnecessary scans.
- Joins can be executed with the most efficient algorithm.
- Broadcast vs shuffle decisions are more accurate.
- Overall execution time can decrease significantly, especially on large datasets with skewed distributions.
In other words: better statistics → better plans → faster queries.
Test Case
This test uses two tables: netflix_titles and netflix_titles_no_stats. After creating these tables, automated table statistics were disabled on the second table using the statement below:
%%sql
ALTER TABLE netflix_titles_no_stats SET TBLPROPERTIES(‘delta.stats.extended.collect’ = false, ‘delta.stats.extended.inject’ = false);
As a result, one table retains statistics while the other does not. Statistics can be checked with the Spark code below:
%%spark
spark.read.table(“table_name”).queryExecution.optimizedPlan.stats.attributeStats.foreach{case (attrName, colStat) =>
println(s”colName: $attrName distinctCount: ${colStat.distinctCount} min: ${colStat.min} max: ${colStat.max} nullCount: ${colStat.nullCount} avgLen: ${colStat.avgLen} maxLen: ${colStat.maxLen}”)
}
As shown, netflix_titles_no_stats does not contain statistics, while netflix_titles does.

To test query time, the time module needs to be imported:
%%pyspark
import time
The query below was executed 10 times to measure performance:
%%sql
SELECT
t1.show_id AS show1_id,
t1.title AS title1,
t1.release_year AS year1,
t2.show_id AS show2_id,
t2.title AS title2,
t2.release_year AS year2,
t1.type AS type1,
t2.type AS type2,
CASE
WHEN t1.rating = ‘TV-MA’ THEN 1
ELSE 0
END AS is_mature1,
CASE
WHEN t2.rating = ‘TV-MA’ THEN 1
ELSE 0
END AS is_mature2,
concat(t1.title, ‘ – ‘, t1.director) AS title_director1,
concat(t2.title, ‘ – ‘, t2.director) AS title_director2
FROM netflix_titles t1
LEFT JOIN netflix_titles t2
ON t1.director = t2.director
AND t1.show_id <> t2.show_id
WHERE t1.release_year >= 2010
AND t2.release_year >= 2010
ORDER BY t1.release_year DESC, t2.release_year DESC
Test Results
Table with statistics: (Average of 39.75 seconds)

Table without statistics: (Average of 44.13 seconds)

| Type | Time (seconds) |
| Statistics | 39.75 |
| No Statistics | 44.13 |
By examining the physical query plan with the code below, it can be seen that the query on the table without statistics includes an additional isnotnull() check.

Conclusion
Even with this relatively small and well-distributed table, there is a 10% difference simply by disabling automated table statistics. For unevenly distributed and larger tables, this difference may increase significantly (up to 45%).
In conclusion, Microsoft Fabric’s query optimization demonstrates tangible efficiency gains through the use of table statistics. By collecting and leveraging accurate statistics, the Fabric engine can generate more informed query plans, reduce unnecessary operations, and improve execution times—even for complex joins and large datasets. Ensight supports organizations in implementing Fabric effectively, ensuring optimized data workflows and high-performance analytics. Leveraging these capabilities enables faster insights, better resource utilization, and more reliable query results in modern data lakehouse environments.