Understanding Table Statistics in Microsoft Fabric Lakehouse

ensight
February 03, 2026

When we query data in a Fabric Lakehouse, the engine needs to decide how to execute the query: which join strategy to use, which filters to push down, and how to parallelize work across partitions. These decisions are driven by the query optimizer, and one of its most important inputs is table statistics.

In this post, we’ll look at:

What table statistics are,
Why they matter,
How they improve query performance,
And some hands-on test results I observed in Fabric Lakehouse.

What Are Table Statistics?

Table statistics are metadata that describe the distribution of data in a table. In Fabric Lakehouse, statistics can include:

Row count
Column-level details: distinct counts, min/max values, null counts, average/max column length
Extended statistics: histograms that give the optimizer a more detailed picture of data distribution

These statistics are collected automatically for Delta tables (when enabled) or can be triggered manually using ANALYZE command:

%%sql

ANALYZE TABLE tableName COMPUTE STATISTICS FOR ALL COLUMNS

Purpose of Table Statistics

The optimizer uses statistics to estimate cardinality (how many rows will pass through each stage of a query). This directly affects:

Join strategy: Should it use a Broadcast Hash Join (fast, but requires small side to fit in memory) or a Sort-Merge Join?
Filter pushdown: Can filters be applied earlier and more effectively?
Aggregation strategy: How should partial aggregates be distributed?
Query parallelism: How much work should be spread across partitions?

Without statistics, the optimizer guesses. With statistics, it makes informed decisions.

How Do They Improve Query Performance?

With good statistics:

Queries avoid unnecessary scans.
Joins can be executed with the most efficient algorithm.
Broadcast vs shuffle decisions are more accurate.
Overall execution time can decrease significantly, especially on large datasets with skewed distributions.

In other words: better statistics → better plans → faster queries.

Test Case

This test uses two tables: netflix_titles and netflix_titles_no_stats. After creating these tables, automated table statistics were disabled on the second table using the statement below:

%%sql

ALTER TABLE netflix_titles_no_stats SET TBLPROPERTIES(‘delta.stats.extended.collect’ = false, ‘delta.stats.extended.inject’ = false);

As a result, one table retains statistics while the other does not. Statistics can be checked with the Spark code below:

%%spark

spark.read.table(“table_name”).queryExecution.optimizedPlan.stats.attributeStats.foreach{case (attrName, colStat) =>

println(s”colName: $attrName distinctCount: ${colStat.distinctCount} min: ${colStat.min} max: ${colStat.max} nullCount: ${colStat.nullCount} avgLen: ${colStat.avgLen} maxLen: ${colStat.maxLen}”)

}

As shown, netflix_titles_no_stats does not contain statistics, while netflix_titles does.

To test query time, the time module needs to be imported:

%%pyspark

import time

The query below was executed 10 times to measure performance:

%%sql

SELECT

t1.show_id AS show1_id,

t1.title AS title1,

t1.release_year AS year1,

t2.show_id AS show2_id,

t2.title AS title2,

t2.release_year AS year2,

t1.type AS type1,

t2.type AS type2,

CASE

WHEN t1.rating = ‘TV-MA’ THEN 1

ELSE 0

END AS is_mature1,

CASE

WHEN t2.rating = ‘TV-MA’ THEN 1

ELSE 0

END AS is_mature2,

concat(t1.title, ‘ – ‘, t1.director) AS title_director1,

concat(t2.title, ‘ – ‘, t2.director) AS title_director2

FROM netflix_titles t1

LEFT JOIN netflix_titles t2

ON t1.director = t2.director

AND t1.show_id <> t2.show_id

WHERE t1.release_year >= 2010

AND t2.release_year >= 2010

ORDER BY t1.release_year DESC, t2.release_year DESC

Test Results

Table with statistics: (Average of 39.75 seconds)

Table without statistics: (Average of 44.13 seconds)

Type	Time (seconds)
Statistics	39.75
No Statistics	44.13

By examining the physical query plan with the code below, it can be seen that the query on the table without statistics includes an additional isnotnull() check.

Conclusion

Even with this relatively small and well-distributed table, there is a 10% difference simply by disabling automated table statistics. For unevenly distributed and larger tables, this difference may increase significantly (up to 45%).

In conclusion, Microsoft Fabric’s query optimization demonstrates tangible efficiency gains through the use of table statistics. By collecting and leveraging accurate statistics, the Fabric engine can generate more informed query plans, reduce unnecessary operations, and improve execution times—even for complex joins and large datasets. Ensight supports organizations in implementing Fabric effectively, ensuring optimized data workflows and high-performance analytics. Leveraging these capabilities enables faster insights, better resource utilization, and more reliable query results in modern data lakehouse environments.