Part 2 — Choosing the Architecture: Warehouse, Lake, or Lakehouse
Once your dimensional model is defined, the next major decision is architectural. Your entire platform — how data is stored, how it is queried, what it costs, how it scales, and how much engineering it requires to maintain — depends on choosing the right combination of warehouse, data lake, and lakehouse patterns.
This part breaks down each architecture at a professional architect level. You will learn when to choose each platform, how the major cloud vendors differ in their approaches, how these patterns support the dimensional designs built in Part 1, and how to use a decision framework to choose the right fit for your organization. Every platform covered here — Snowflake, Microsoft Fabric, Databricks, Google BigQuery, Amazon Redshift, and Azure Synapse — is evaluated honestly with its strengths, limitations, and best-fit scenarios.
What this part covers: Architecture patterns and platform selection. Physical implementation details for each platform — clustering keys, distribution strategies, Delta Lake ACID, Iceberg table design — are covered in Parts 7 and 8. This part gives you the knowledge to make the right architectural choice. The later parts show you how to build it.
- The Three Core Architecture Patterns
- Snowflake — The Managed Cloud Warehouse
- Microsoft Fabric — The Unified Platform
- Databricks — The Lakehouse Origin
- Google BigQuery — The Serverless Warehouse
- Amazon Redshift + AWS Glue + S3
- Azure Synapse Dedicated SQL Pool
- Architecture Selection Framework
- Architecture Diagrams
- Mapping Architecture to Your Dimensional Model
- Workshops
- References
1 The Three Core Architecture Patterns (Beginner)
Modern analytical systems fall into one of three patterns — or a combination of them. Understanding the distinction is essential before evaluating any specific platform, because the same platform can be used in different architectural patterns depending on how you configure it.
Pattern 1: Data Warehouse
A cloud data warehouse is a SQL-optimized compute and storage engine designed for structured, curated data. It stores your dimensional model directly and serves queries from BI tools. The warehouse is the destination for your Gold layer.
✓ Strengths
- Dimensional models run extremely fast — columnar engines are purpose-built for star schema aggregations
- BI tools connect directly with no additional infrastructure
- Handles massive concurrent aggregations efficiently
- Predictable performance with well-understood tuning knobs
- Strong concurrency and isolation
- Low operational overhead on managed platforms
✗ Limitations
- Less flexible with semi-structured and unstructured data
- Historical raw data usually lives outside the warehouse in a lake
- Cost can increase with large-volume raw data ingestion
- Proprietary storage formats on some platforms limit portability
Best examples: Snowflake, Fabric Warehouse, BigQuery, Redshift, Synapse Dedicated SQL Pool
Pattern 2: Data Lake
A data lake stores all data — structured, unstructured, semi-structured — on low-cost object storage. It is the raw archive layer and the foundation for Bronze and Silver in the medallion architecture. It does not provide SQL query performance on its own — that requires a compute engine layered on top.
✓ Strengths
- Extremely low storage cost at any scale
- Holds raw, historical, and experimental data
- Supports all formats — JSON, CSV, Parquet, Delta, Iceberg, ORC
- Enables ML workloads, Spark processing, and data experimentation
- Open formats allow multi-engine access (Spark, Athena, Presto, Trino)
✗ Limitations
- Requires engineering discipline to avoid becoming a “data swamp”
- SQL query performance depends on the compute engine layered on top
- Harder to govern without a dedicated metadata catalog
- No ACID transactions without Delta Lake or Iceberg table format
Best examples: Azure Data Lake Storage Gen2 (ADLS), Amazon S3, Google Cloud Storage, OneLake (Fabric)
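A lake by itself is just files; the SQL layer comes from whichever engine you point at it. The sketch below uses Athena-style (Hive-compatible) DDL to expose raw Parquet files as a queryable table. The bucket, path, and column list are hypothetical placeholders.
-- Athena-style external table over raw Parquet files in S3
-- (bucket, path, and columns are illustrative placeholders)
CREATE EXTERNAL TABLE bronze_orders (
order_id BIGINT,
customer_id BIGINT,
order_date DATE,
amount DECIMAL(18,2)
)
STORED AS PARQUET
LOCATION 's3://your-lake-bucket/bronze/orders/';
-- The engine (Athena, Trino, Spark SQL) now provides SQL over the files
SELECT customer_id, SUM(amount) AS total_amount
FROM bronze_orders
GROUP BY customer_id;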
Pattern 3: Lakehouse
A lakehouse merges the reliability and SQL performance of a warehouse with the flexibility and cost efficiency of a lake. It stores data in open table formats (Delta Lake or Apache Iceberg) on object storage, adds ACID transaction guarantees, and exposes a SQL endpoint for BI tools while simultaneously supporting Spark, Python, and ML workloads on the same data.
✓ Strengths
- One storage layer serves both SQL analytics and ML/data science
- Open formats (Delta, Iceberg) — no vendor lock-in on storage
- ACID transactions on lake storage — reliable incremental updates
- Time travel and version history built in
- Cost-efficient — lake storage prices at petabyte scale
- Multi-engine access — Spark, SQL, Python all work on the same tables
✗ Limitations
- Requires stronger engineering discipline than a managed warehouse
- Some BI tools still prefer warehouse endpoints over lakehouse SQL endpoints
- File management (compaction, vacuuming) requires ongoing attention
- Ecosystem maturity varies — Databricks most mature, others still catching up
Best examples: Databricks Delta Lake, Microsoft Fabric Lakehouse, Snowflake with Native Iceberg Tables, AWS with Iceberg on S3, BigQuery with BigLake External Tables
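The ACID and time-travel bullets above are concrete features of the table format, not marketing. A minimal Delta Lake sketch (the table names are illustrative): MERGE gives a transactional upsert on lake storage, and VERSION AS OF reads an earlier snapshot.
-- Delta Lake: transactional upsert into a lakehouse table
-- (silver.orders and bronze.orders_new are illustrative names)
MERGE INTO silver.orders AS t
USING bronze.orders_new AS s
ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
-- Time travel: read the table as it was at an earlier version
SELECT * FROM silver.orders VERSION AS OF 42;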
2 Snowflake — The Managed Cloud Warehouse (Beginner)
Snowflake is a fully managed cloud warehouse that separates compute and storage. It handles nearly all infrastructure decisions for you — auto-scaling, auto-suspend, micro-partitioning, and query optimization — making it the lowest-friction path to production SQL analytics. It also supports native Apache Iceberg tables, allowing Snowflake to function as a lakehouse for organizations that want open formats on Snowflake-managed storage.
Snowflake Strengths
Auto-scaling and auto-suspend virtual warehouses. Near-zero maintenance. Consistent fast SQL. Excellent for dimensional models. Time travel up to 90 days. Native Iceberg and external table support. Supports Python, Java, SQL, and Snowpark. Multi-cloud (AWS, Azure, GCP).
Snowflake Best Fit
Organizations that want a simple, fast, SQL-first warehouse. Teams with minimal operational overhead requirements. When you want great performance without managing infrastructure. When your workload is primarily SQL with some Python via Snowpark.
-- Snowflake: your Gold dimensional model sits here directly
-- No additional configuration needed for basic star schema performance
-- For large tables (100M+ rows) add a clustering key:
ALTER TABLE gold.FctOrderLine
CLUSTER BY (OrderDateKey, CustomerKey);
-- Verify clustering state
SELECT SYSTEM$CLUSTERING_INFORMATION('gold.FctOrderLine');
-- Check micro-partition efficiency for a query
EXPLAIN
SELECT d.MonthName, SUM(f.ExtendedAmount)
FROM gold.FctOrderLine f
JOIN gold.DimDate d ON f.OrderDateKey = d.DateKey
WHERE d.Year = 2025
GROUP BY d.MonthName;
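If open-format storage matters, the same Gold table can also live in Iceberg format managed by Snowflake. A sketch, assuming an external volume named iceberg_vol has already been created and configured:
-- Sketch: Snowflake-managed Iceberg table for open-format Gold storage
-- (the external volume 'iceberg_vol' is an assumed, pre-created object)
CREATE ICEBERG TABLE gold.FctOrderLineIceberg (
OrderLineKey BIGINT,
OrderDateKey INT,
CustomerKey BIGINT,
ExtendedAmount DECIMAL(18,2)
)
CATALOG = 'SNOWFLAKE'
EXTERNAL_VOLUME = 'iceberg_vol'
BASE_LOCATION = 'gold/fct_order_line';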
Reference: Snowflake Key Architecture Concepts
3 Microsoft Fabric — The Unified Platform (Beginner)
Microsoft Fabric unifies data engineering, data science, Power BI, streaming analytics, and governance under a single platform backed by OneLake — a single organizational data lake that all Fabric workloads share. It provides both a Lakehouse (Delta tables with a SQL analytics endpoint) and a Warehouse (a dedicated SQL compute engine) within the same environment. Power BI is native — semantic models connect directly to Fabric data with no data movement.
Fabric Strengths
Lakehouse and Warehouse coexist on the same OneLake storage. Power BI is native — Direct Lake mode for zero-copy reporting. Fabric Mirroring brings operational SQL Server data in within seconds. Copilot integration across all workloads. Strong governance via Microsoft Purview.
Fabric Best Fit
Organizations heavily invested in Power BI and the Microsoft ecosystem. Teams that want warehouse and lakehouse capabilities together. When governance, Purview integration, and Entra ID identity matter. When you want near-real-time operational analytics via Fabric Mirroring.
-- Fabric Warehouse: Gold dimensional model runs as standard T-SQL
-- Same DDL as SQL Server -- familiar for SQL Server DBAs
CREATE TABLE gold.FctOrderLine (
OrderLineKey BIGINT NOT NULL,
OrderDateKey INT NOT NULL,
CustomerKey BIGINT NOT NULL,
ProductKey BIGINT NOT NULL,
Quantity INT NOT NULL,
UnitPrice DECIMAL(18,2) NOT NULL,
ExtendedAmount DECIMAL(18,2) NOT NULL
);
-- Fabric Lakehouse: same data as Delta tables with SQL analytics endpoint
-- Connect SSMS or Power BI to the SQL analytics endpoint:
-- Server: <workspace-endpoint>.sql.fabric.microsoft.com
-- Database: <your lakehouse name>
-- Query Gold Delta tables from the SQL endpoint (read-only)
SELECT d.MonthName, SUM(f.ExtendedAmount) AS Revenue
FROM gold.FctOrderLine f
JOIN gold.DimDate d ON f.OrderDateKey = d.DateKey
GROUP BY d.MonthName;
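Because both engines share OneLake, Fabric also allows cross-database queries inside a workspace: a Warehouse query can join Lakehouse Delta tables by three-part name. A sketch with hypothetical item names:
-- Cross-database query inside one Fabric workspace
-- ('SalesWarehouse' and 'SalesLakehouse' are hypothetical item names)
SELECT d.MonthName, SUM(f.ExtendedAmount) AS Revenue
FROM SalesWarehouse.gold.FctOrderLine AS f
JOIN SalesLakehouse.gold.DimDate AS d
ON f.OrderDateKey = d.DateKey
GROUP BY d.MonthName;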
Reference: Microsoft Fabric Lakehouse Overview
4 Databricks — The Lakehouse Origin (Intermediate)
Databricks created Delta Lake and coined the term “lakehouse.” It remains the most mature and most capable open-source-aligned lakehouse platform, particularly for organizations with heavy Spark workloads, ML pipelines, and complex data engineering. The Photon execution engine provides near-warehouse SQL performance on Delta tables. Unity Catalog provides unified governance across all data and ML assets.
Databricks Strengths
Best-in-class Spark engine with Photon acceleration. Delta Lake ACID transactions on all tables. Unity Catalog for unified data and ML governance. Excellent for ML, AI, and streaming at scale. SQL Warehouse provides fast BI-ready queries. Open formats — no storage vendor lock-in.
Databricks Best Fit
Large data volumes with complex transformations. Organizations running ML and AI pipelines alongside analytics. Teams comfortable with Spark and Python. When open-format storage and maximum flexibility matter. When Unity Catalog governance is required across data and models.
-- Databricks: Gold layer as Delta tables in Unity Catalog
-- Create catalog and schema structure
CREATE CATALOG IF NOT EXISTS prod;
CREATE SCHEMA IF NOT EXISTS prod.gold;
-- Create dimensional tables as Delta
CREATE TABLE prod.gold.dim_customer (
customer_key BIGINT NOT NULL,
customer_natural_key STRING NOT NULL,
full_name STRING,
email STRING,
city STRING,
state_province STRING,
country STRING,
loyalty_tier STRING,
effective_from DATE NOT NULL,
effective_to DATE NOT NULL,
is_current INT NOT NULL
) USING DELTA
CLUSTER BY (customer_key); -- liquid clustering (available since DBR 13.3)
-- Optimize a Delta table and apply Z-ordering on high-cardinality join columns
-- (Z-ordering applies to tables that do not use liquid clustering)
OPTIMIZE prod.gold.fct_order_line
ZORDER BY (customer_key, order_date_key);
-- Time travel -- query Gold as it existed 7 days ago
SELECT * FROM prod.gold.fct_order_line
TIMESTAMP AS OF date_sub(current_date(), 7)
LIMIT 100;
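Unity Catalog governance is also plain SQL. A minimal sketch granting a BI reader group access to the Gold schema; the group name analysts is an assumption:
-- Unity Catalog: grant read access to the Gold schema
-- ('analysts' is an assumed account-level group)
GRANT USE CATALOG ON CATALOG prod TO `analysts`;
GRANT USE SCHEMA ON SCHEMA prod.gold TO `analysts`;
GRANT SELECT ON SCHEMA prod.gold TO `analysts`;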
Reference: Databricks Lakehouse Platform
5 Google BigQuery — The Serverless Warehouse (Beginner)
BigQuery is a fully serverless columnar warehouse with no infrastructure to manage and automatic scaling to essentially unlimited compute. It separates storage (Colossus) from compute (Dremel) at the architecture level, charges per query by bytes scanned, and handles petabyte-scale datasets without any configuration. Its serverless model makes it the fastest path to querying very large datasets — but also requires careful cost management through partitioning and clustering.
BigQuery Strengths
Zero infrastructure maintenance. Automatic scaling to enormous datasets. Excellent semi-structured data support (ARRAY, STRUCT, JSON). Fast ingestion from GCS, Pub/Sub, and Dataflow. Native ML with BigQuery ML. Gemini AI integration. Strong for global-scale analytics workloads.
BigQuery Best Fit
Organizations on Google Cloud. Teams that want serverless simplicity with massive scalability. When cost-per-query model suits the workload. Marketing analytics, advertising measurement, and Google product integration. When you need to query petabyte-scale datasets without cluster management.
BigQuery cost management is critical. BigQuery charges by bytes scanned. A query without partition filters on a large table can scan terabytes and cost hundreds of dollars in seconds. Always partition large tables by date and cluster on join columns. Use the query validator to check bytes scanned before running expensive queries in production. Covered in depth in Part 12.
-- BigQuery: Gold layer with partitioning and clustering
-- Partition by order date, cluster by customer and product keys
CREATE TABLE `project.gold.fct_order_line`
(
order_line_key INT64,
order_date DATE,
order_date_key INT64,
customer_key INT64,
product_key INT64,
quantity INT64,
unit_price NUMERIC,
extended_amount NUMERIC
)
PARTITION BY order_date -- partition pruning on date filters
CLUSTER BY customer_key, product_key; -- clustering on most common join columns
-- Verify partition metadata
SELECT table_name, partition_id, total_rows, total_logical_bytes
FROM `project.gold.INFORMATION_SCHEMA.PARTITIONS`
WHERE table_name = 'fct_order_line'
ORDER BY partition_id DESC
LIMIT 10;
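To keep the bytes-scanned risk visible day to day, you can audit query cost from SQL itself. A sketch against the jobs metadata view; the region-us qualifier is an assumption, so adjust it to your dataset location:
-- Find the most expensive queries of the last 7 days by bytes processed
-- (the 'region-us' qualifier is an assumption; use your dataset's region)
SELECT user_email, job_id, creation_time,
total_bytes_processed / POW(1024, 4) AS tib_processed
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE job_type = 'QUERY'
AND creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
ORDER BY total_bytes_processed DESC
LIMIT 10;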
Reference: BigQuery Architecture Overview
6 Amazon Redshift + AWS Glue + S3 (Intermediate)
AWS’s warehouse architecture combines Redshift (the MPP warehouse) with S3 (the data lake storage) and AWS Glue (the ETL and catalog service). RA3 nodes in Redshift separate compute from storage, allowing you to scale them independently. Redshift Spectrum queries S3 data directly, enabling hybrid warehouse-plus-lake architectures where cold or raw data stays in S3 and hot analytical data lives in Redshift.
Redshift Strengths
Tight AWS ecosystem integration. RA3 nodes separate compute and storage. Redshift Spectrum queries S3 directly. Glue Catalog unifies metadata across Redshift and S3. Concurrency scaling handles burst workloads. Mature MPP engine for structured analytics.
Redshift Best Fit
Organizations already invested in AWS. Teams with SQL Server or Oracle migration background. When you need tight integration between warehouse and S3 lake via Spectrum. When Glue ETL is already part of your pipeline. Cost-effective for predictable large-scale workloads.
-- Redshift: distribution and sort key design for Gold dimensional model
-- Distribution key: spread rows evenly for parallel processing
-- Sort key: optimize for most common filter patterns
CREATE TABLE gold.fct_order_line (
order_line_key BIGINT NOT NULL,
order_date_key INT NOT NULL,
customer_key BIGINT NOT NULL,
product_key BIGINT NOT NULL,
quantity INT NOT NULL,
unit_price DECIMAL(18,2) NOT NULL,
extended_amount DECIMAL(18,2) NOT NULL
)
DISTKEY (customer_key) -- distribute by customer_key for customer-centric joins
SORTKEY (order_date_key, customer_key); -- sort for date-range queries
-- Check table distribution quality
SELECT tbl_rows, skew_rows, pct_used
FROM svv_table_info
WHERE "table" = 'fct_order_line';
Reference: Amazon Redshift RA3 Node Types
7 Azure Synapse Dedicated SQL Pool (Intermediate)
Azure Synapse Analytics includes a Dedicated SQL Pool — a massively parallel processing (MPP) warehouse descended from SQL Server PDW. It is the right choice for organizations migrating large on-premises SQL Server data warehouses to Azure, as it is highly T-SQL compatible. Distribution strategy design requires more manual planning than Snowflake or BigQuery but gives fine-grained control over physical data layout.
Synapse Strengths
Massive MPP scale for structured data. T-SQL compatible — low learning curve for SQL Server teams. Tight Azure ecosystem integration. Columnstore index compression. Good for SQL Server migration paths. Supports both serverless and dedicated pools.
Synapse Best Fit
Large enterprises migrating from on-premises SQL Server DW. Teams with deep T-SQL expertise. When Azure is the primary cloud.
Synapse vs Fabric: Microsoft’s strategic direction for analytics on Azure is Microsoft Fabric, not Azure Synapse. Synapse Dedicated SQL Pool remains supported and appropriate for large existing workloads, but for new projects in the Microsoft ecosystem, evaluate Fabric Warehouse first. The two products coexist and will continue to for many years, but investment in new features is concentrated in Fabric.
-- Synapse Dedicated SQL Pool: distribution strategy for Gold tables
-- HASH distribution: spreads rows based on a column value hash
-- ROUND_ROBIN distribution: spreads rows evenly (use for small tables or staging)
-- REPLICATED: copies entire table to each compute node (use for small dimensions)
-- Large fact table: HASH distribution on most common join column
CREATE TABLE gold.FctOrderLine
(
OrderLineKey BIGINT NOT NULL,
OrderDateKey INT NOT NULL,
CustomerKey BIGINT NOT NULL,
ProductKey BIGINT NOT NULL,
Quantity INT NOT NULL,
ExtendedAmount DECIMAL(18,2) NOT NULL
)
WITH (
DISTRIBUTION = HASH(CustomerKey), -- distribute on join column
CLUSTERED COLUMNSTORE INDEX -- columnar compression for analytics
);
-- Small dimension table: REPLICATE for fast joins without shuffle
CREATE TABLE gold.DimDate
(
DateKey INT NOT NULL,
FullDate DATE NOT NULL,
MonthName VARCHAR(10) NOT NULL,
Year SMALLINT NOT NULL
)
WITH (
DISTRIBUTION = REPLICATE, -- full copy on every node
CLUSTERED COLUMNSTORE INDEX
);
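Synapse also ships a serverless SQL pool (noted in the strengths above) that queries lake files directly, with nothing to load first. A sketch; the storage account and path are hypothetical:
-- Synapse serverless SQL pool: ad-hoc query over lake files
-- (storage account and path are hypothetical placeholders)
SELECT TOP 100 *
FROM OPENROWSET(
BULK 'https://yourlake.dfs.core.windows.net/bronze/orders/*.parquet',
FORMAT = 'PARQUET'
) AS raw_orders;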
Reference: Synapse SQL Best Practices
8 Architecture Selection Framework (Beginner)
Platform selection is not primarily a technical decision — it is an organizational one. The best platform is almost always the one your team already knows, your data already lives near, and your organization’s cloud strategy supports. Migration cost, retraining cost, and integration overhead almost always outweigh the performance differences between mature platforms in the same tier.
| If your organization… | Start here | Why |
|---|---|---|
| Runs SQL Server on-premises and uses Power BI | Microsoft Fabric | Fabric Mirroring, native Power BI, T-SQL compatibility, Purview governance |
| Is already on Azure or Microsoft 365 | Fabric Warehouse + Lakehouse | Entra ID, Copilot everywhere, zero-ETL Mirroring from Azure SQL |
| Is already on Snowflake | Snowflake | No migration, add Iceberg tables for lakehouse, Cortex AI for analytics |
| Has a strong data science team using Spark | Databricks | Best Spark engine, Delta Lake, Unity Catalog, MLflow |
| Is on Google Cloud with BigQuery | BigQuery + Vertex AI | Serverless, Gemini integration, Looker semantic layer |
| Is AWS-primary | Redshift + Glue + S3 | Native AWS integration, S3 lake + Redshift warehouse hybrid |
| Is migrating a large on-prem SQL DW to Azure | Synapse → Fabric | Synapse for lift-and-shift compatibility; plan migration to Fabric long-term |
| Wants maximum open-source flexibility | Databricks or Iceberg on S3 | Open formats, no storage lock-in, multi-engine access |
Choose a Warehouse When
- Fast SQL performance is the primary requirement
- BI dashboards are the main workload
- The team is SQL-first with limited Spark experience
- Low operational overhead is a priority
- Data is mostly structured
Choose a Data Lake When
- You need to store raw, historical, or unstructured data cheaply
- You run ML, Spark, or data science workloads
- You have petabyte-scale storage needs
- You want open formats with multi-engine access
Choose a Lakehouse When
- You want SQL and Spark on the same data without copying
- ACID transactions on lake storage matter
- Cost efficiency at scale is important
- You run both analytics and ML on the same datasets
- Open formats and avoiding vendor storage lock-in are priorities
9 Architecture Diagrams (Beginner)
[Diagram: Warehouse-Centric Architecture]
[Diagram: Lake-Centric Medallion Architecture]
[Diagram: Lakehouse Architecture (Databricks or Fabric)]
10 Mapping Architecture to Your Dimensional Model (Intermediate)
The dimensional model you built in Part 1 — DimCustomer, DimDate, DimProduct, FctOrderLine — is identical regardless of which architecture you choose. The schemas do not change. The grain does not change. The surrogate keys do not change. What changes is how the physical storage and compute are configured on each platform.
| Platform | Physical Difference | Performance Mechanism |
|---|---|---|
| Snowflake | Managed micro-partitions, optional clustering keys | Automatic pruning; clustering keys for large tables |
| Fabric Warehouse | Columnstore indexes, T-SQL compatible DDL | Columnstore compression; result-set caching |
| Databricks | Delta tables, Z-ordering or Liquid Clustering | File skipping via Z-order; Photon execution engine |
| BigQuery | Partition by date, cluster by join columns | Partition pruning; clustering reduces bytes scanned |
| Redshift | Distribution key + sort key per table | Parallel execution across nodes; co-located joins |
| Synapse | HASH/REPLICATE distribution + columnstore index | MPP parallel execution; columnstore compression |
The physical implementation details — clustering keys, distribution strategies, Z-ordering, partitioning — are covered in Parts 7 and 8 for each platform. The dimensional model schema is the same. Only the CREATE TABLE options differ.
11 Workshops
Novice
Compare Engine Behaviors
- Load a simple Orders CSV into one warehouse (Snowflake, Fabric, or BigQuery)
- Run: SELECT CustomerID, SUM(Amount) FROM Orders GROUP BY CustomerID
- Note query elapsed time, bytes scanned, and query plan
- Enable query profile or execution details
- Identify which operations consumed the most cost
Intermediate
Build a Lake + Warehouse Hybrid
- Create a Bronze folder in ADLS, S3, or GCS
- Ingest raw data using Fivetran, ADF, or Glue
- Create a Silver Delta or Parquet table
- Load Gold dimensional tables into Snowflake or Fabric Warehouse
- Compare query performance: lake SQL endpoint vs warehouse
- Document the latency and cost difference
Advanced
Build a Delta or Iceberg Lakehouse
- Create Delta Bronze tables in Databricks or Fabric
- Build Silver and Gold layers using Spark SQL
- Expose Gold via SQL Warehouse or SQL Analytics Endpoint
- Connect Power BI or Looker to the SQL endpoint
- Apply Z-ordering or clustering on join columns
- Compare query times against an equivalent dedicated warehouse
References
- Snowflake — Key Architecture Concepts
- Microsoft Fabric — Lakehouse Overview
- Databricks — Lakehouse Platform
- BigQuery — Architecture Overview
- Amazon Redshift — RA3 Node Types
- Azure Synapse — SQL Best Practices
- Delta Lake Documentation
- Apache Iceberg
- Microsoft Fabric Warehouse Documentation
- Databricks Unity Catalog