Part 8 — Implementing the Lakehouse: Fabric, Databricks, and AWS

Part 8 — Implementing the Lakehouse: Fabric, Databricks, and AWS – SQLYARD
Deep Technical Series — Building a Modern Data Warehouse and Lakehouse

Part 8 — Implementing the Lakehouse: Fabric, Databricks, and AWS


Part 8 of 14 — Deep Technical Series: Building a Modern Data Warehouse and Lakehouse. Series index (Part 0) · ← Part 7: Implementing the Warehouse

Part 7 covered dedicated cloud warehouses — platforms where the engine manages physical storage for you and you interact primarily through SQL. This part covers lakehouses — platforms where data is stored in open table formats (Delta Lake or Apache Iceberg) on cloud object storage, and where SQL, Spark, Python, and ML workloads all operate on the same data without copying it.

The lakehouse architecture is the right choice when you need SQL analytics and Spark or ML workloads on the same data, when open formats and avoiding vendor storage lock-in matter, or when your data volume and transformation complexity make a managed warehouse too expensive or too constrained. This part covers the three major lakehouse implementations: Microsoft Fabric Lakehouse, Databricks, and AWS with Iceberg on S3.

How this connects to the series: The Bronze, Silver, and Gold layers designed in Part 3 are physically implemented here as Delta or Iceberg tables. The transformations from Parts 5 and 6 run against these tables. The SQL analytics endpoints exposed here are what the semantic layer (Part 9) and BI tools connect to — the same as a dedicated warehouse, just with open-format storage underneath.

1 What Makes a Lakehouse — Delta Lake and Iceberg Beginner

A data lake stores files on cheap object storage — S3, ADLS, GCS. It is flexible and cheap but historically unreliable for analytical workloads: no transactions, no consistent reads during writes, no schema enforcement, and poor performance for selective queries that need to find specific rows without scanning everything.

A lakehouse adds a table format layer on top of the object storage that provides the reliability guarantees of a warehouse. Delta Lake and Apache Iceberg are the two dominant table formats. They each sit between the raw Parquet files and the query engine, tracking which files belong to which table version, managing schema evolution, enabling ACID transactions, and providing the metadata needed for efficient query planning.

CapabilityRaw Data Lake (Parquet only)Lakehouse (Delta or Iceberg)
ACID transactionsNoYes
Concurrent reads and writesUnsafeSafe — snapshot isolation
Schema enforcementNo — any file format acceptedYes — writes validated against schema
Schema evolutionManual — rename files and hopeSafe ADD COLUMN, RENAME, DROP
UPDATE and DELETERewrite entire fileRow-level updates via MERGE
Time travelNoQuery any past version by timestamp or version
Query performanceFull scan unless manually partitionedFile skipping via metadata statistics
Multi-engine accessAny engine that reads ParquetStandardized engine APIs (Spark, Trino, Flink, DuckDB)

2 Delta Lake vs Apache Iceberg — Choosing the Right Format Beginner

Both formats solve the same core problem — adding warehouse reliability to lake storage — but they have different strengths and different ecosystem alignments. The choice between them is primarily determined by which platform you run on.

FeatureDelta LakeApache Iceberg
OriginDatabricks (open-sourced)Netflix (open-sourced to Apache)
Primary ecosystemDatabricks, Microsoft FabricSnowflake, AWS, BigQuery, multi-engine
Partition handlingExplicit partition columns requiredHidden partitioning — no partition columns in schema
Partition evolutionRequires table recreationAdd/change partitions without rewriting data
Engine supportExcellent on Spark/Databricks; growing elsewhereBroad — Spark, Trino, Flink, Athena, BigQuery, Snowflake, DuckDB
Z-ordering / clusteringZ-Order and Liquid Clustering (Databricks)Sort orders only (no Z-ordering)
Time travelVersion and timestampSnapshot and timestamp
Microsoft Fabric supportNative — Fabric uses Delta internallyVia shortcuts and external table access
Best forDatabricks, Fabric, single-engine Spark pipelinesMulti-engine architectures, Snowflake lakehouses, AWS

The practical rule: Use Delta Lake if your primary engine is Databricks or Microsoft Fabric. Use Iceberg if you need multi-engine access — Snowflake reading Iceberg tables on S3, Athena and Spark on the same tables, or BigQuery reading Iceberg via BigLake. If you are locked into one engine, Delta Lake’s Z-ordering and Liquid Clustering give it a performance edge. If your data needs to be read by multiple engines, Iceberg’s broader engine support makes it the safer choice.

3 Microsoft Fabric Lakehouse

Microsoft Fabric Intermediate

A Fabric Lakehouse is a Delta Lake-based storage layer inside Microsoft Fabric. It stores all data in OneLake — Fabric’s single organizational data lake — as Delta Parquet files. Every Lakehouse has two access paths: a Files section for raw file storage (Bronze), and a Tables section for managed Delta tables (Silver and Gold) that are automatically exposed through a SQL Analytics Endpoint.

Fabric Lakehouse Structure

-- Fabric Lakehouse: two zones in OneLake
-- Files/  -- raw file storage (Bronze, landing zone)
--   bronze/crm/customer/*.json
--   bronze/ecommerce/orders/*.parquet
-- Tables/  -- managed Delta tables (Silver, Gold)
--   silver_crm_customer/      -- Delta table directory
--   gold_dim_customer/
--   gold_fct_order_line/

-- SQL Analytics Endpoint: auto-generated read-only SQL interface
-- Server: .sql.fabric.microsoft.com
-- Database: 
-- All Tables/ are immediately queryable via standard T-SQL

Creating Delta Tables in Fabric Lakehouse

# Fabric Notebook (PySpark): create and manage Delta tables
# Tables created here appear automatically in the SQL Analytics Endpoint

# Method 1: Create from DataFrame (most common for Silver/Gold loads)
from pyspark.sql.functions import current_timestamp, lit

silver_df = spark.sql("""
    SELECT
        CAST(customer_id AS BIGINT)      AS customer_id,
        INITCAP(TRIM(customer_name))     AS customer_name,
        LOWER(TRIM(email))               AS email,
        COALESCE(city, 'Unknown')        AS city,
        state                            AS state_province,
        country,
        loyalty_tier,
        CAST(updated_at AS TIMESTAMP)    AS updated_at
    FROM bronze_crm_customer
    WHERE customer_id IS NOT NULL
    QUALIFY ROW_NUMBER() OVER (
        PARTITION BY customer_id
        ORDER BY updated_at DESC
    ) = 1
""")

# Write to Tables section -- automatically appears in SQL Endpoint
silver_df.write \
    .format("delta") \
    .mode("overwrite") \
    .option("overwriteSchema", "true") \
    .saveAsTable("silver_crm_customer")

# Method 2: Create using Spark SQL DDL
spark.sql("""
    CREATE TABLE IF NOT EXISTS gold_dim_customer (
        customer_key         BIGINT,
        customer_natural_key STRING        NOT NULL,
        customer_name        STRING,
        email                STRING,
        city                 STRING,
        state_province       STRING,
        country              STRING,
        loyalty_tier         STRING,
        effective_from       DATE          NOT NULL,
        effective_to         DATE          NOT NULL,
        is_current           INT           NOT NULL
    )
    USING DELTA
    TBLPROPERTIES (
        'delta.enableChangeDataFeed' = 'true',    -- enables CDC from this table
        'delta.autoOptimize.optimizeWrite' = 'true',
        'delta.autoOptimize.autoCompact'   = 'true'
    )
""")

OneLake Shortcuts — Accessing External Data

# Fabric Shortcuts: mount external storage as if it were in OneLake
# No data copying -- OneLake shortcuts are metadata pointers

# Shortcut types:
# 1. ADLS Gen2 shortcut -- access external Azure Data Lake
# 2. Amazon S3 shortcut -- access S3 buckets from Fabric
# 3. Google Cloud Storage shortcut -- access GCS buckets

# Create via Fabric UI: Lakehouse → ... → New Shortcut
# Or via REST API:
import requests

shortcut_payload = {
    "path": "Files/external/crm",          # where shortcut appears in OneLake
    "name": "crm_s3_bronze",
    "type": "AmazonS3",
    "target": {
        "url": "https://s3.amazonaws.com/your-bucket/bronze/crm/",
        "connectionId": "your-connection-id"
    }
}
# After shortcut creation, Files/external/crm/ shows S3 data as if local

Fabric Lakehouse SQL Analytics Endpoint

-- Connect SSMS or Power BI to the SQL Analytics Endpoint
-- Server: .sql.fabric.microsoft.com
-- Authentication: Azure Active Directory (Entra ID)

-- All Delta tables in the Tables/ section are immediately queryable:
SELECT
    d.MonthName,
    d.Year,
    SUM(f.extended_amount)   AS Revenue,
    COUNT(DISTINCT f.customer_key) AS UniqueCustomers
FROM gold_fct_order_line f
JOIN gold_dim_date       d ON f.order_date_key = d.date_key
JOIN gold_dim_customer   c ON f.customer_key   = c.customer_key
                           AND c.is_current    = 1
WHERE d.Year = 2025
GROUP BY d.MonthName, d.Year
ORDER BY d.Year, MIN(d.MonthNumber);

-- The SQL endpoint is read-only -- all writes go through Spark notebooks
-- Queries run on the Fabric distributed SQL engine (same as Synapse serverless)

V-Order — Fabric’s Power BI Optimization

# Fabric V-Order: optimizes Delta files for Power BI Direct Lake reads
# V-Order applies a special sorting and encoding that dramatically speeds up
# Power BI column scans in Direct Lake mode
# Enabled by default in Fabric Spark -- confirm it is active:

spark.conf.get("spark.microsoft.delta.optimizeWrite.enabled")   # should be 'true'
spark.conf.get("spark.sql.parquet.vorder.enabled")              # should be 'true'

# If not enabled, set explicitly before writing Gold tables:
spark.conf.set("spark.sql.parquet.vorder.enabled", "true")
spark.conf.set("spark.microsoft.delta.optimizeWrite.enabled", "true")

# V-Order is applied automatically when writing -- no code change needed
# after enabling the config
df.write.format("delta").mode("overwrite").saveAsTable("gold_dim_customer")

4 Databricks Lakehouse

Databricks Intermediate

Databricks is the origin of Delta Lake and the most mature lakehouse platform available. It provides the Delta Lake table format, the Photon execution engine for near-warehouse SQL performance, Unity Catalog for unified governance across data and ML assets, and Databricks SQL Warehouse for BI tool connectivity. The entire medallion architecture runs natively on Databricks from Bronze ingestion through Gold dimensional models to SQL analytics.

Unity Catalog — Governance Structure

-- Unity Catalog: three-level namespace (catalog.schema.table)
-- Replaces the two-level Hive metastore (schema.table)
-- Provides RBAC, column masking, row filters, lineage, and auditing

-- Create catalog structure for the medallion architecture
CREATE CATALOG IF NOT EXISTS prod
    COMMENT 'Production analytics data';

CREATE SCHEMA IF NOT EXISTS prod.bronze
    COMMENT 'Raw ingestion layer -- immutable source data';

CREATE SCHEMA IF NOT EXISTS prod.silver
    COMMENT 'Cleaned and conformed staging layer';

CREATE SCHEMA IF NOT EXISTS prod.gold
    COMMENT 'Dimensional models and analytics-ready Gold layer';

-- Grant access by layer (RBAC in Unity Catalog)
-- Data engineers: full access to all layers
GRANT USE CATALOG, USE SCHEMA, SELECT, MODIFY ON CATALOG prod TO `data-engineers`;

-- Analysts: read-only access to Gold and Silver only
GRANT USE CATALOG ON CATALOG prod TO `analysts`;
GRANT USE SCHEMA, SELECT ON SCHEMA prod.gold   TO `analysts`;
GRANT USE SCHEMA, SELECT ON SCHEMA prod.silver TO `analysts`;

-- BI team: Gold only via Databricks SQL Warehouse
GRANT USE CATALOG ON CATALOG prod TO `bi-team`;
GRANT USE SCHEMA, SELECT ON SCHEMA prod.gold TO `bi-team`;

Gold Layer — Delta Tables with Unity Catalog

-- Databricks: complete Gold layer DDL using Unity Catalog
-- Delta tables with Liquid Clustering (recommended over Z-ORDER for new tables)

-- Dimension table
CREATE TABLE IF NOT EXISTS prod.gold.dim_customer (
    customer_key         BIGINT         NOT NULL,
    customer_natural_key STRING         NOT NULL,
    customer_name        STRING,
    email                STRING,
    city                 STRING,
    state_province       STRING,
    country              STRING,
    loyalty_tier         STRING,
    effective_from       DATE           NOT NULL,
    effective_to         DATE           NOT NULL,
    is_current           INT            NOT NULL
)
USING DELTA
CLUSTER BY (customer_key)       -- Liquid Clustering: auto-managed clustering
TBLPROPERTIES (
    'delta.enableChangeDataFeed'           = 'true',
    'delta.autoOptimize.optimizeWrite'     = 'true',
    'delta.autoOptimize.autoCompact'       = 'true',
    'delta.columnMapping.mode'             = 'name'  -- enables column rename/drop
);

-- Fact table: cluster by most common filter columns
CREATE TABLE IF NOT EXISTS prod.gold.fct_order_line (
    order_line_key    BIGINT         NOT NULL,
    order_natural_key STRING         NOT NULL,
    order_date        DATE           NOT NULL,
    order_date_key    INT            NOT NULL,
    customer_key      BIGINT         NOT NULL,
    product_key       BIGINT         NOT NULL,
    store_key         BIGINT         NOT NULL,
    quantity          INT            NOT NULL,
    unit_price        DECIMAL(18,2)  NOT NULL,
    discount_amount   DECIMAL(18,2)  NOT NULL,
    extended_amount   DECIMAL(18,2)  NOT NULL
)
USING DELTA
CLUSTER BY (order_date, customer_key)  -- Liquid Clustering on most common filters
TBLPROPERTIES (
    'delta.enableChangeDataFeed'       = 'true',
    'delta.autoOptimize.optimizeWrite' = 'true',
    'delta.autoOptimize.autoCompact'   = 'true'
);

Liquid Clustering vs Z-ORDER

-- Databricks recommends Liquid Clustering for new tables (DBR 13.3+)
-- It is incremental and automatic -- no need to run OPTIMIZE manually
-- Z-ORDER is still used for existing tables and when Liquid Clustering is unavailable

-- Z-ORDER: run manually after bulk loads or weekly
OPTIMIZE prod.gold.fct_order_line
ZORDER BY (customer_key, order_date_key);

-- Liquid Clustering: enabled at CREATE TABLE time (shown above)
-- Databricks automatically re-clusters as data is written
-- No OPTIMIZE command needed for Liquid Clustering tables

-- Check clustering status
DESCRIBE DETAIL prod.gold.fct_order_line;
-- Look for: clusteringColumns field to confirm Liquid Clustering is active

Databricks SQL Warehouse — BI Connectivity

-- Create a Databricks SQL Warehouse for BI tool connections
-- SQL Warehouses use the Photon execution engine for fast analytical queries
-- Configure via Databricks UI: SQL → SQL Warehouses → Create

-- Connection details for BI tools (from SQL Warehouse settings):
-- Server hostname: .azuredatabricks.net
-- HTTP Path: /sql/1.0/warehouses/
-- Authentication: Personal Access Token or OAuth

-- Test query after connecting Power BI, Tableau, or Looker:
SELECT
    c.customer_name,
    c.loyalty_tier,
    SUM(f.extended_amount)  AS total_revenue,
    COUNT(*)                AS order_count
FROM prod.gold.fct_order_line f
JOIN prod.gold.dim_customer   c
    ON f.customer_key = c.customer_key
    AND c.is_current  = 1
WHERE f.order_date >= '2025-01-01'
GROUP BY c.customer_name, c.loyalty_tier
ORDER BY total_revenue DESC
LIMIT 20;

Time Travel — Querying Historical Versions

-- Delta time travel: query any past version of a table
-- Invaluable for debugging pipeline errors and auditing

-- Query as of a specific timestamp
SELECT COUNT(*) AS row_count
FROM prod.gold.fct_order_line
TIMESTAMP AS OF '2025-01-15 00:00:00';

-- Query as of a specific version number
SELECT COUNT(*) AS row_count
FROM prod.gold.fct_order_line
VERSION AS OF 42;

-- List all available versions
DESCRIBE HISTORY prod.gold.fct_order_line;
-- Shows: version, timestamp, operation (WRITE, MERGE, DELETE, OPTIMIZE, etc.)

-- Restore a table to a previous version (use carefully in production)
RESTORE TABLE prod.gold.fct_order_line
TO VERSION AS OF 40;
-- This is the "undo" button for accidental data corruption

5 AWS Lakehouse — Iceberg on S3

AWS Advanced

The AWS lakehouse architecture centers on Apache Iceberg tables stored in S3, registered in the AWS Glue Data Catalog, and queryable by multiple compute engines — Amazon Athena (serverless SQL), Apache Spark on EMR, Redshift Spectrum (from the Redshift warehouse), and AWS Glue ETL jobs. The Glue Catalog acts as the central metadata registry that all engines share, enabling true multi-engine access to the same data.

Creating Iceberg Tables via Glue Catalog

# AWS Glue ETL Job: create Iceberg tables on S3 using Spark
# These tables are registered in the Glue Catalog and accessible by all AWS engines

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
    .config("spark.sql.catalog.glue_catalog",
            "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.glue_catalog.warehouse",
            "s3://your-bucket/lakehouse/") \
    .config("spark.sql.catalog.glue_catalog.catalog-impl",
            "org.apache.iceberg.aws.glue.GlueCatalog") \
    .config("spark.sql.catalog.glue_catalog.io-impl",
            "org.apache.iceberg.aws.s3.S3FileIO") \
    .getOrCreate()

# Create Gold dimension table as Iceberg
spark.sql("""
    CREATE TABLE IF NOT EXISTS glue_catalog.gold.dim_customer (
        customer_key         BIGINT         NOT NULL,
        customer_natural_key STRING         NOT NULL,
        customer_name        STRING,
        email                STRING,
        city                 STRING,
        state_province       STRING,
        country              STRING,
        loyalty_tier         STRING,
        effective_from       DATE           NOT NULL,
        effective_to         DATE           NOT NULL,
        is_current           INT            NOT NULL
    )
    USING iceberg
    LOCATION 's3://your-bucket/lakehouse/gold/dim_customer/'
    TBLPROPERTIES (
        'write.format.default'       = 'parquet',
        'write.parquet.compression-codec' = 'zstd',
        'format-version'             = '2',     -- v2 enables row-level deletes/updates
        'write.target-file-size-bytes' = '536870912'  -- 512 MB target file size
    )
""")

# Create Gold fact table with hidden partitioning (Iceberg advantage over Delta)
spark.sql("""
    CREATE TABLE IF NOT EXISTS glue_catalog.gold.fct_order_line (
        order_line_key    BIGINT         NOT NULL,
        order_natural_key STRING         NOT NULL,
        order_date        DATE           NOT NULL,
        order_date_key    INT            NOT NULL,
        customer_key      BIGINT         NOT NULL,
        product_key       BIGINT         NOT NULL,
        quantity          INT            NOT NULL,
        unit_price        DECIMAL(18,2)  NOT NULL,
        extended_amount   DECIMAL(18,2)  NOT NULL
    )
    USING iceberg
    PARTITIONED BY (months(order_date))   -- hidden partitioning: no partition column in schema
    LOCATION 's3://your-bucket/lakehouse/gold/fct_order_line/'
    TBLPROPERTIES (
        'format-version'             = '2',
        'write.target-file-size-bytes' = '536870912'
    )
""")

Hidden Partitioning — Iceberg’s Key Advantage

-- Iceberg hidden partitioning: the partition transform is defined in table metadata
-- The column 'order_date' stays as a DATE in the schema -- no 'year'/'month' columns
-- Iceberg automatically routes writes to the correct partition
-- Queries on order_date automatically benefit from partition pruning

-- Query with automatic partition pruning (no special syntax needed)
SELECT SUM(extended_amount) AS revenue
FROM glue_catalog.gold.fct_order_line
WHERE order_date BETWEEN DATE '2025-01-01' AND DATE '2025-03-31';
-- Iceberg reads only January, February, March partitions
-- No explicit partition filter needed -- works automatically

-- Compare: Delta Lake requires explicit partition columns in the schema
-- Users must know to filter on 'order_date_month' not 'order_date' for pruning
-- Iceberg hides this complexity entirely

Multi-Engine Access — Same Iceberg Tables, Multiple Engines

-- Athena: query Iceberg tables with serverless SQL (no cluster needed)
-- Connect from any SQL client using Athena JDBC or the AWS Console Query Editor

SELECT
    DATE_TRUNC('month', order_date)     AS order_month,
    SUM(extended_amount)                AS revenue
FROM gold.fct_order_line
WHERE order_date >= DATE '2025-01-01'
GROUP BY 1
ORDER BY 1;

-- Redshift Spectrum: query the same Iceberg tables from Redshift SQL
-- Create external schema pointing to Glue Catalog
CREATE EXTERNAL SCHEMA iceberg_gold
FROM DATA CATALOG
DATABASE 'gold'
IAM_ROLE 'arn:aws:iam::account:role/RedshiftSpectrumRole';

-- Now query in Redshift with the same SQL:
SELECT SUM(f.extended_amount) AS revenue, c.loyalty_tier
FROM iceberg_gold.fct_order_line f
JOIN iceberg_gold.dim_customer   c
    ON f.customer_key = c.customer_key
    AND c.is_current  = 1
GROUP BY c.loyalty_tier;

-- EMR Spark: the same tables accessed via Spark
df = spark.read.format("iceberg") \
    .load("glue_catalog.gold.fct_order_line")
df.filter("order_date >= '2025-01-01'").groupBy("customer_key").sum("extended_amount")

Iceberg Maintenance — Snapshots and Compaction

# Iceberg maintenance: expire old snapshots and compact small files
# Run weekly or after large batch loads

# Expire snapshots older than 7 days (reclaim S3 storage)
spark.sql("""
    CALL glue_catalog.system.expire_snapshots(
        table      => 'gold.fct_order_line',
        older_than => TIMESTAMP '2025-01-01 00:00:00',
        retain_last => 5    -- always keep at least 5 snapshots for time travel
    )
""")

# Compact small files into larger ones (rewrite data files)
spark.sql("""
    CALL glue_catalog.system.rewrite_data_files(
        table   => 'gold.fct_order_line',
        options => map('target-file-size-bytes', '536870912')  -- 512 MB
    )
""")

# Rewrite manifest files (improves query planning performance)
spark.sql("""
    CALL glue_catalog.system.rewrite_manifests('gold.fct_order_line')
""")

6 Snowflake Native Iceberg Tables

Snowflake Intermediate

Snowflake supports Native Iceberg Tables — tables stored as Iceberg format on your own cloud storage (S3, ADLS, or GCS) but fully managed and queried through Snowflake’s engine. This is Snowflake’s answer to the open-format demand: you get Snowflake’s SQL engine and performance, but the underlying data is in open Iceberg format that other engines can also read.

-- Snowflake Native Iceberg Tables
-- Data stored on your S3/ADLS/GCS in Iceberg format
-- Queried and managed through Snowflake SQL
-- Other engines (Spark, Athena) can also read the same files

-- Step 1: Create an external volume pointing to your S3 storage
CREATE OR REPLACE EXTERNAL VOLUME iceberg_vol
    STORAGE_LOCATIONS = (
        (
            NAME = 'iceberg-s3'
            STORAGE_PROVIDER = 'S3'
            STORAGE_BASE_URL = 's3://your-bucket/iceberg/'
            STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::account:role/SnowflakeIcebergRole'
        )
    );

-- Step 2: Create a Native Iceberg Table
CREATE TABLE gold.dim_customer_iceberg (
    customer_key         BIGINT         NOT NULL,
    customer_natural_key VARCHAR(100)   NOT NULL,
    customer_name        VARCHAR(200),
    email                VARCHAR(200),
    city                 VARCHAR(100),
    state_province       VARCHAR(100),
    country              VARCHAR(100),
    loyalty_tier         VARCHAR(50),
    effective_from       DATE           NOT NULL,
    effective_to         DATE           NOT NULL,
    is_current           TINYINT        NOT NULL
)
CATALOG              = 'SNOWFLAKE'
EXTERNAL_VOLUME      = 'iceberg_vol'
BASE_LOCATION        = 'gold/dim_customer_iceberg/'
-- Data is stored in Iceberg format on S3
-- Fully readable by Spark, Athena, and other Iceberg-compatible engines

7 Table Maintenance — Compaction, Vacuuming, and Statistics Intermediate

Lakehouse tables require ongoing maintenance that managed warehouses handle automatically. Without regular maintenance, small files accumulate, metadata grows, and query performance degrades. Establish a maintenance schedule for every production lakehouse table.

Maintenance by Platform

OperationFabricDatabricksAWS Iceberg
File compaction Auto-compact (enabled by TBLPROPERTIES) or manual Optimize() OPTIMIZE table or auto-optimize TBLPROPERTIES rewrite_data_files procedure
Old version cleanup VACUUM table RETAIN 168 HOURS VACUUM table RETAIN 168 HOURS expire_snapshots procedure
Statistics update Auto on write Auto on write; ANALYZE TABLE for manual rewrite_manifests procedure
Recommended frequency Auto (no manual schedule needed) Weekly OPTIMIZE + VACUUM, or after large loads Weekly expire_snapshots + rewrite_data_files
-- Databricks: weekly maintenance script for all Gold tables
-- Run as a scheduled Databricks Job (Workflows) every Sunday

-- Compact and Z-order (or Liquid Clustering tables handle this automatically)
OPTIMIZE prod.gold.fct_order_line
ZORDER BY (customer_key, order_date_key);

OPTIMIZE prod.gold.dim_customer;

-- Remove old file versions (keep 7 days for time travel)
VACUUM prod.gold.fct_order_line RETAIN 168 HOURS;
VACUUM prod.gold.dim_customer   RETAIN 168 HOURS;

-- Check table health after maintenance
DESCRIBE DETAIL prod.gold.fct_order_line;
-- Look for: numFiles (should be manageable, not thousands)
-- sizeInBytes / numFiles = average file size (target 128MB-1GB per file)

8 Multi-Engine Access — Querying the Same Data Across Tools Advanced

One of the most compelling lakehouse advantages is the ability to query the same physical data from multiple engines without copying it. A single Gold fact table in Iceberg on S3 can be queried simultaneously by Athena (serverless SQL for ad-hoc analytics), Redshift Spectrum (from the warehouse), Spark on EMR (for large-scale aggregations), and a Python notebook (for ML feature engineering).

# Multi-engine access pattern: same Iceberg table, three different engines

# 1. Python / Spark (EMR or Databricks): read for ML feature engineering
import boto3
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
features_df = (
    spark.read.format("iceberg")
    .load("glue_catalog.gold.fct_order_line")
    .filter("order_date >= '2024-01-01'")
    .groupBy("customer_key")
    .agg(
        {"extended_amount": "sum",  "order_line_key": "count"}
    )
    .withColumnRenamed("sum(extended_amount)", "total_spend")
    .withColumnRenamed("count(order_line_key)", "order_count")
)
features_df.write.format("iceberg") \
    .mode("overwrite") \
    .saveAsTable("glue_catalog.ml.customer_features")
-- 2. Athena: ad-hoc SQL analytics on the same table
-- Serverless -- no cluster to start, pay per query
SELECT
    c.loyalty_tier,
    COUNT(DISTINCT f.customer_key)          AS customers,
    SUM(f.extended_amount)                  AS revenue,
    AVG(f.extended_amount)                  AS avg_order_value
FROM gold.fct_order_line  f
JOIN gold.dim_customer    c
    ON f.customer_key = c.customer_key
    AND c.is_current  = 1
WHERE f.order_date >= DATE '2025-01-01'
GROUP BY c.loyalty_tier
ORDER BY revenue DESC;

No data duplication, no synchronization, no staleness. Every engine reads from the same Iceberg files on S3. When the ETL pipeline writes new rows to the Gold table, every engine sees them immediately — there is no import job, no snapshot to refresh, no secondary copy to keep in sync. This is the fundamental value of multi-engine lakehouse access over traditional warehouse-plus-extract patterns.

9 Lakehouse vs Warehouse — When to Use Each Beginner

After two parts covering both patterns, the decision framework is straightforward. The right answer for most organizations is not one or the other — it is both, with the warehouse serving as the Gold SQL endpoint for BI tools and the lakehouse providing the Bronze and Silver storage foundation and the ML/Spark compute layer.

Use caseWarehouse (Part 7)Lakehouse (Part 8)
SQL analytics and BI dashboardsExcellent — purpose-builtGood — via SQL endpoint
ML feature engineeringPossible but not idealExcellent — native Spark/Python
Raw data archivalToo expensiveCheap object storage
Multi-engine accessVendor-specificOpen formats — any engine
Storage cost at scaleHigher — proprietary storageLower — object storage prices
BI tool integrationNative drivers, Direct LakeSQL endpoint — same as warehouse
ACID transactionsFullFull (Delta or Iceberg)
Operational overheadLow — managed serviceMedium — maintenance scripts needed
Streaming data ingestionRequires staging tablesNative Spark Structured Streaming

The modern best practice: Use the lakehouse for Bronze and Silver storage — cheap, flexible, Spark-accessible. Use either a lakehouse SQL endpoint or a dedicated warehouse for Gold — whichever gives your BI tools the best query performance and developer experience. On Fabric, the Lakehouse and Warehouse share OneLake storage so you get both automatically. On AWS, Redshift queries Iceberg via Spectrum so you get both. On Databricks, the SQL Warehouse serves both analytical SQL and BI connectivity over the same Delta tables.

10 Workshops

Novice

Build a Fabric Lakehouse

  • Create a new Fabric Lakehouse in your workspace
  • Upload a CSV to the Files/bronze/ section
  • Open a Fabric Notebook and read the file into a DataFrame
  • Apply basic Silver cleaning (type casting, deduplication)
  • Write to the Tables/ section as a Delta table
  • Query the table from the SQL Analytics Endpoint in SSMS

Intermediate

Build a Databricks Delta Lakehouse

  • Create Bronze, Silver, and Gold schemas in Unity Catalog
  • Load raw data as a Bronze Delta table
  • Build a Silver cleaning notebook with deduplication
  • Create Gold dimensional tables with Liquid Clustering
  • Connect Databricks SQL Warehouse to a BI tool (Power BI or Tableau)
  • Run a time-travel query to verify Delta history is working

Advanced

Build a Multi-Engine AWS Iceberg Lakehouse

  • Create Iceberg tables on S3 with hidden partitioning via Glue
  • Load Bronze data via Glue ETL job, Silver and Gold via Spark
  • Query the Gold fact table from Athena (serverless SQL)
  • Query the same table from Redshift Spectrum
  • Run a Python ML feature extraction job on EMR against the same table
  • Schedule weekly expire_snapshots and rewrite_data_files maintenance

References

Up next → Part 9: Designing the Semantic Layer and Metrics Layer — Power BI semantic models, LookML, dbt Metrics, Databricks SQL, Snowflake dynamic tables, RLS, and governed metric definitions.

Discover more from SQLYARD

Subscribe to get the latest posts sent to your email.

Leave a Reply

Discover more from SQLYARD

Subscribe now to keep reading and get access to the full archive.

Continue reading