Part 2 — Choosing the Architecture: Warehouse, Lake, or Lakehouse
Once your dimensional model is defined, the next major decision is architectural. Your entire platform — how data is stored, how it is queried, what it costs, how it scales, and how much engineering it requires to maintain — depends on choosing the right combination of warehouse, data lake, and lakehouse patterns.
This part breaks down each architecture at a professional architect level. You will learn when to choose each platform, how the major cloud vendors differ in their approaches, how these patterns support the dimensional designs built in Part 1, and how to use a decision framework to choose the right fit for your organization. Every platform covered here — Snowflake, Microsoft Fabric, Databricks, Google BigQuery, Amazon Redshift, and Azure Synapse — is evaluated honestly with its strengths, limitations, and best-fit scenarios.
What this part covers: Architecture patterns and platform selection. Physical implementation details for each platform — clustering keys, distribution strategies, Delta Lake ACID, Iceberg table design — are covered in Parts 7 and 8. This part gives you the knowledge to make the right architectural choice. The later parts show you how to build it.
- The Three Core Architecture Patterns
- Snowflake — The Managed Cloud Warehouse
- Microsoft Fabric — The Unified Platform
- Databricks — The Lakehouse Origin
- Google BigQuery — The Serverless Warehouse
- Amazon Redshift + AWS Glue + S3
- Azure Synapse Dedicated SQL Pool
- Architecture Selection Framework
- Architecture Diagrams
- Mapping Architecture to Your Dimensional Model
- Workshops
- References
1 The Three Core Architecture Patterns (Beginner)
Modern analytical systems fall into one of three patterns — or a combination of them. Understanding the distinction is essential before evaluating any specific platform, because the same platform can be used in different architectural patterns depending on how you configure it.
Pattern 1: Data Warehouse
A cloud data warehouse is a SQL-optimized compute and storage engine designed for structured, curated data. It stores your dimensional model directly and serves queries from BI tools. The warehouse is the destination for your Gold layer.
✓ Strengths
- Dimensional models run extremely fast — columnar engines are purpose-built for star schema aggregations
- BI tools connect directly with no additional infrastructure
- Handles massive concurrent aggregations efficiently
- Predictable performance with well-understood tuning knobs
- Strong concurrency and isolation
- Low operational overhead on managed platforms
✗ Limitations
- Less flexible with semi-structured and unstructured data
- Historical raw data usually lives outside the warehouse in a lake
- Cost can increase with large-volume raw data ingestion
- Proprietary storage formats on some platforms limit portability
Best examples: Snowflake, Fabric Warehouse, BigQuery, Redshift, Synapse Dedicated SQL Pool
Pattern 2: Data Lake
A data lake stores all data — structured, unstructured, semi-structured — on low-cost object storage. It is the raw archive layer and the foundation for Bronze and Silver in the medallion architecture. It does not provide SQL query performance on its own — that requires a compute engine layered on top.
✓ Strengths
- Extremely low storage cost at any scale
- Holds raw, historical, and experimental data
- Supports all formats — JSON, CSV, Parquet, Delta, Iceberg, ORC
- Enables ML workloads, Spark processing, and data experimentation
- Open formats allow multi-engine access (Spark, Athena, Presto, Trino)
✗ Limitations
- Requires engineering discipline to avoid becoming a “data swamp”
- SQL query performance depends on the compute engine layered on top
- Harder to govern without a dedicated metadata catalog
- No ACID transactions without Delta Lake or Iceberg table format
Best examples: Azure Data Lake Storage Gen2 (ADLS), Amazon S3, Google Cloud Storage, OneLake (Fabric)
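A lake by itself is just files; the SQL layer comes from whichever engine you point at it. The sketch below uses Athena-style (Hive-compatible) DDL to expose raw Parquet files as a queryable table. The bucket, path, and column list are hypothetical placeholders.
-- Athena-style external table over raw Parquet files in S3
-- (bucket, path, and columns are illustrative placeholders)
CREATE EXTERNAL TABLE bronze_orders (
order_id BIGINT,
customer_id BIGINT,
order_date DATE,
amount DECIMAL(18,2)
)
STORED AS PARQUET
LOCATION 's3://your-lake-bucket/bronze/orders/';
-- The engine (Athena, Trino, Spark SQL) now provides SQL over the files
SELECT customer_id, SUM(amount) AS total_amount
FROM bronze_orders
GROUP BY customer_id;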
Pattern 3: Lakehouse
A lakehouse merges the reliability and SQL performance of a warehouse with the flexibility and cost efficiency of a lake. It stores data in open table formats (Delta Lake or Apache Iceberg) on object storage, adds ACID transaction guarantees, and exposes a SQL endpoint for BI tools while simultaneously supporting Spark, Python, and ML workloads on the same data.
✓ Strengths
- One storage layer serves both SQL analytics and ML/data science
- Open formats (Delta, Iceberg) — no vendor lock-in on storage
- ACID transactions on lake storage — reliable incremental updates
- Time travel and version history built in
- Cost-efficient — lake storage prices at petabyte scale
- Multi-engine access — Spark, SQL, Python all work on the same tables
✗ Limitations
- Requires stronger engineering discipline than a managed warehouse
- Some BI tools still prefer warehouse endpoints over lakehouse SQL endpoints
- File management (compaction, vacuuming) requires ongoing attention
- Ecosystem maturity varies — Databricks most mature, others still catching up
Best examples: Databricks Delta Lake, Microsoft Fabric Lakehouse, Snowflake with Native Iceberg Tables, AWS with Iceberg on S3, BigQuery with BigLake External Tables
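The ACID and time-travel bullets above are concrete features of the table format, not marketing. A minimal Delta Lake sketch (the table names are illustrative): MERGE gives a transactional upsert on lake storage, and VERSION AS OF reads an earlier snapshot.
-- Delta Lake: transactional upsert into a lakehouse table
-- (silver.orders and bronze.orders_new are illustrative names)
MERGE INTO silver.orders AS t
USING bronze.orders_new AS s
ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
-- Time travel: read the table as it was at an earlier version
SELECT * FROM silver.orders VERSION AS OF 42;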
2 Snowflake — The Managed Cloud Warehouse (Beginner)
Snowflake is a fully managed cloud warehouse that separates compute and storage. It handles nearly all infrastructure decisions for you — auto-scaling, auto-suspend, micro-partitioning, and query optimization — making it the lowest-friction path to production SQL analytics. It also supports native Apache Iceberg tables, allowing Snowflake to function as a lakehouse for organizations that want open formats on Snowflake-managed storage.
Snowflake Strengths
Auto-scaling and auto-suspend virtual warehouses. Near-zero maintenance. Consistent fast SQL. Excellent for dimensional models. Time travel up to 90 days. Native Iceberg and external table support. Supports Python, Java, SQL, and Snowpark. Multi-cloud (AWS, Azure, GCP).
Snowflake Best Fit
Organizations that want a simple, fast, SQL-first warehouse. Teams with minimal operational overhead requirements. When you want great performance without managing infrastructure. When your workload is primarily SQL with some Python via Snowpark.
-- Snowflake: your Gold dimensional model sits here directly
-- No additional configuration needed for basic star schema performance
-- For large tables (100M+ rows) add a clustering key:
ALTER TABLE gold.FctOrderLine
CLUSTER BY (OrderDateKey, CustomerKey);
-- Verify clustering state
SELECT SYSTEM$CLUSTERING_INFORMATION('gold.FctOrderLine');
-- Check micro-partition efficiency for a query
EXPLAIN
SELECT d.MonthName, SUM(f.ExtendedAmount)
FROM gold.FctOrderLine f
JOIN gold.DimDate d ON f.OrderDateKey = d.DateKey
WHERE d.Year = 2025
GROUP BY d.MonthName;
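If open-format storage matters, the same Gold table can also live in Iceberg format managed by Snowflake. A sketch, assuming an external volume named iceberg_vol has already been created and configured:
-- Sketch: Snowflake-managed Iceberg table for open-format Gold storage
-- (the external volume 'iceberg_vol' is an assumed, pre-created object)
CREATE ICEBERG TABLE gold.FctOrderLineIceberg (
OrderLineKey BIGINT,
OrderDateKey INT,
CustomerKey BIGINT,
ExtendedAmount DECIMAL(18,2)
)
CATALOG = 'SNOWFLAKE'
EXTERNAL_VOLUME = 'iceberg_vol'
BASE_LOCATION = 'gold/fct_order_line';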
Reference: Snowflake Key Architecture Concepts
3 Microsoft Fabric — The Unified Platform (Beginner)
Microsoft Fabric unifies data engineering, data science, Power BI, streaming analytics, and governance under a single platform backed by OneLake — a single organizational data lake that all Fabric workloads share. It provides both a Lakehouse (Delta tables with a SQL analytics endpoint) and a Warehouse (a dedicated SQL compute engine) within the same environment. Power BI is native — semantic models connect directly to Fabric data with no data movement.
Fabric Strengths
Lakehouse and Warehouse coexist on the same OneLake storage. Power BI is native — Direct Lake mode for zero-copy reporting. Fabric Mirroring brings operational SQL Server data in within seconds. Copilot integration across all workloads. Strong governance via Microsoft Purview.
Fabric Best Fit
Organizations heavily invested in Power BI and the Microsoft ecosystem. Teams that want warehouse and lakehouse capabilities together. When governance, Purview integration, and Entra ID identity matter. When you want near-real-time operational analytics via Fabric Mirroring.
-- Fabric Warehouse: Gold dimensional model runs as standard T-SQL
-- Same DDL as SQL Server -- familiar for SQL Server DBAs
CREATE TABLE gold.FctOrderLine (
OrderLineKey BIGINT NOT NULL,
OrderDateKey INT NOT NULL,
CustomerKey BIGINT NOT NULL,
ProductKey BIGINT NOT NULL,
Quantity INT NOT NULL,
UnitPrice DECIMAL(18,2) NOT NULL,
ExtendedAmount DECIMAL(18,2) NOT NULL
);
-- Fabric Lakehouse: same data as Delta tables with SQL analytics endpoint
-- Connect SSMS or Power BI to the SQL analytics endpoint:
-- Server: <workspace-endpoint>.sql.fabric.microsoft.com
-- Database: <your lakehouse name>
-- Query Gold Delta tables from the SQL endpoint (read-only)
SELECT d.MonthName, SUM(f.ExtendedAmount) AS Revenue
FROM gold.FctOrderLine f
JOIN gold.DimDate d ON f.OrderDateKey = d.DateKey
GROUP BY d.MonthName;
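Because both engines share OneLake, Fabric also allows cross-database queries inside a workspace: a Warehouse query can join Lakehouse Delta tables by three-part name. A sketch with hypothetical item names:
-- Cross-database query inside one Fabric workspace
-- ('SalesWarehouse' and 'SalesLakehouse' are hypothetical item names)
SELECT d.MonthName, SUM(f.ExtendedAmount) AS Revenue
FROM SalesWarehouse.gold.FctOrderLine AS f
JOIN SalesLakehouse.gold.DimDate AS d
ON f.OrderDateKey = d.DateKey
GROUP BY d.MonthName;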
Reference: Microsoft Fabric Lakehouse Overview
4 Databricks — The Lakehouse Origin (Intermediate)
Databricks created Delta Lake and coined the term “lakehouse.” It remains the most mature and most capable open-source-aligned lakehouse platform, particularly for organizations with heavy Spark workloads, ML pipelines, and complex data engineering. The Photon execution engine provides near-warehouse SQL performance on Delta tables. Unity Catalog provides unified governance across all data and ML assets.
Databricks Strengths
Best-in-class Spark engine with Photon acceleration. Delta Lake ACID transactions on all tables. Unity Catalog for unified data and ML governance. Excellent for ML, AI, and streaming at scale. SQL Warehouse provides fast BI-ready queries. Open formats — no storage vendor lock-in.
Databricks Best Fit
Large data volumes with complex transformations. Organizations running ML and AI pipelines alongside analytics. Teams comfortable with Spark and Python. When open-format storage and maximum flexibility matter. When Unity Catalog governance is required across data and models.
-- Databricks: Gold layer as Delta tables in Unity Catalog
-- Create catalog and schema structure
CREATE CATALOG IF NOT EXISTS prod;
CREATE SCHEMA IF NOT EXISTS prod.gold;
-- Create dimensional tables as Delta
CREATE TABLE prod.gold.dim_customer (
customer_key BIGINT NOT NULL,
customer_natural_key STRING NOT NULL,
full_name STRING,
email STRING,
city STRING,
state_province STRING,
country STRING,
loyalty_tier STRING,
effective_from DATE NOT NULL,
effective_to DATE NOT NULL,
is_current INT NOT NULL
) USING DELTA
CLUSTER BY (customer_key); -- liquid clustering (available since DBR 13.3)
-- Optimize a Delta table and apply Z-ordering on high-cardinality join columns
-- (Z-ordering applies to tables that do not use liquid clustering)
OPTIMIZE prod.gold.fct_order_line
ZORDER BY (customer_key, order_date_key);
-- Time travel -- query Gold as it existed 7 days ago
SELECT * FROM prod.gold.fct_order_line
TIMESTAMP AS OF date_sub(current_date(), 7)
LIMIT 100;
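Unity Catalog governance is also plain SQL. A minimal sketch granting a BI reader group access to the Gold schema; the group name analysts is an assumption:
-- Unity Catalog: grant read access to the Gold schema
-- ('analysts' is an assumed account-level group)
GRANT USE CATALOG ON CATALOG prod TO `analysts`;
GRANT USE SCHEMA ON SCHEMA prod.gold TO `analysts`;
GRANT SELECT ON SCHEMA prod.gold TO `analysts`;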
Reference: Databricks Lakehouse Platform
5 Google BigQuery — The Serverless Warehouse (Beginner)
BigQuery is a fully serverless columnar warehouse with no infrastructure to manage and automatic scaling to essentially unlimited compute. It separates storage (Colossus) from compute (Dremel) at the architecture level, charges per query by bytes scanned, and handles petabyte-scale datasets without any configuration. Its serverless model makes it the fastest path to querying very large datasets — but also requires careful cost management through partitioning and clustering.
BigQuery Strengths
Zero infrastructure maintenance. Automatic scaling to enormous datasets. Excellent semi-structured data support (ARRAY, STRUCT, JSON). Fast ingestion from GCS, Pub/Sub, and Dataflow. Native ML with BigQuery ML. Gemini AI integration. Strong for global-scale analytics workloads.
BigQuery Best Fit
Organizations on Google Cloud. Teams that want serverless simplicity with massive scalability. When cost-per-query model suits the workload. Marketing analytics, advertising measurement, and Google product integration. When you need to query petabyte-scale datasets without cluster management.
BigQuery cost management is critical. BigQuery charges by bytes scanned. A query without partition filters on a large table can scan terabytes and cost hundreds of dollars in seconds. Always partition large tables by date and cluster on join columns. Use the query validator to check bytes scanned before running expensive queries in production. Covered in depth in Part 12.
-- BigQuery: Gold layer with partitioning and clustering
-- Partition by order date, cluster by customer and product keys
CREATE TABLE `project.gold.fct_order_line`
(
order_line_key INT64,
order_date DATE,
order_date_key INT64,
customer_key INT64,
product_key INT64,
quantity INT64,
unit_price NUMERIC,
extended_amount NUMERIC
)
PARTITION BY order_date -- partition pruning on date filters
CLUSTER BY customer_key, product_key; -- clustering on most common join columns
-- Verify partition metadata
SELECT table_name, partition_id, total_rows, total_logical_bytes
FROM `project.gold.INFORMATION_SCHEMA.PARTITIONS`
WHERE table_name = 'fct_order_line'
ORDER BY partition_id DESC
LIMIT 10;
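To keep the bytes-scanned risk visible day to day, you can audit query cost from SQL itself. A sketch against the jobs metadata view; the region-us qualifier is an assumption, so adjust it to your dataset location:
-- Find the most expensive queries of the last 7 days by bytes processed
-- (the 'region-us' qualifier is an assumption; use your dataset's region)
SELECT user_email, job_id, creation_time,
total_bytes_processed / POW(1024, 4) AS tib_processed
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE job_type = 'QUERY'
AND creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
ORDER BY total_bytes_processed DESC
LIMIT 10;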
Reference: BigQuery Architecture Overview
6 Amazon Redshift + AWS Glue + S3 (Intermediate)
AWS’s warehouse architecture combines Redshift (the MPP warehouse) with S3 (the data lake storage) and AWS Glue (the ETL and catalog service). RA3 nodes in Redshift separate compute from storage, allowing you to scale them independently. Redshift Spectrum queries S3 data directly, enabling hybrid warehouse-plus-lake architectures where cold or raw data stays in S3 and hot analytical data lives in Redshift.
Redshift Strengths
Tight AWS ecosystem integration. RA3 nodes separate compute and storage. Redshift Spectrum queries S3 directly. Glue Catalog unifies metadata across Redshift and S3. Concurrency scaling handles burst workloads. Mature MPP engine for structured analytics.
Redshift Best Fit
Organizations already invested in AWS. Teams with SQL Server or Oracle migration background. When you need tight integration between warehouse and S3 lake via Spectrum. When Glue ETL is already part of your pipeline. Cost-effective for predictable large-scale workloads.
-- Redshift: distribution and sort key design for Gold dimensional model
-- Distribution key: spread rows evenly for parallel processing
-- Sort key: optimize for most common filter patterns
CREATE TABLE gold.fct_order_line (
order_line_key BIGINT NOT NULL,
order_date_key INT NOT NULL,
customer_key BIGINT NOT NULL,
product_key BIGINT NOT NULL,
quantity INT NOT NULL,
unit_price DECIMAL(18,2) NOT NULL,
extended_amount DECIMAL(18,2) NOT NULL
)
DISTKEY (customer_key) -- distribute by customer_key for customer-centric joins
SORTKEY (order_date_key, customer_key); -- sort for date-range queries
-- Check table distribution quality
SELECT tbl_rows, skew_rows, pct_used
FROM svv_table_info
WHERE "table" = 'fct_order_line';
Reference: Amazon Redshift RA3 Node Types
7 Azure Synapse Dedicated SQL Pool (Intermediate)
Azure Synapse Analytics includes a Dedicated SQL Pool — a massively parallel processing (MPP) warehouse descended from SQL Server PDW. It is the right choice for organizations migrating large on-premises SQL Server data warehouses to Azure, as it is highly T-SQL compatible. Distribution strategy design requires more manual planning than Snowflake or BigQuery but gives fine-grained control over physical data layout.
Synapse Strengths
Massive MPP scale for structured data. T-SQL compatible — low learning curve for SQL Server teams. Tight Azure ecosystem integration. Columnstore index compression. Good for SQL Server migration paths. Supports both serverless and dedicated pools.
Synapse Best Fit
Large enterprises migrating from on-premises SQL Server DW. Teams with deep T-SQL expertise. When Azure is the primary cloud.
Synapse vs Fabric: Microsoft’s strategic direction for analytics on Azure is Microsoft Fabric, not Azure Synapse. Synapse Dedicated SQL Pool remains supported and appropriate for large existing workloads, but for new projects in the Microsoft ecosystem, evaluate Fabric Warehouse first. The two products coexist and will continue to for many years, but investment in new features is concentrated in Fabric.
-- Synapse Dedicated SQL Pool: distribution strategy for Gold tables
-- HASH distribution: spreads rows based on a column value hash
-- ROUND_ROBIN distribution: spreads rows evenly (use for small tables or staging)
-- REPLICATED: copies entire table to each compute node (use for small dimensions)
-- Large fact table: HASH distribution on most common join column
CREATE TABLE gold.FctOrderLine
(
OrderLineKey BIGINT NOT NULL,
OrderDateKey INT NOT NULL,
CustomerKey BIGINT NOT NULL,
ProductKey BIGINT NOT NULL,
Quantity INT NOT NULL,
ExtendedAmount DECIMAL(18,2) NOT NULL
)
WITH (
DISTRIBUTION = HASH(CustomerKey), -- distribute on join column
CLUSTERED COLUMNSTORE INDEX -- columnar compression for analytics
);
-- Small dimension table: REPLICATE for fast joins without shuffle
CREATE TABLE gold.DimDate
(
DateKey INT NOT NULL,
FullDate DATE NOT NULL,
MonthName VARCHAR(10) NOT NULL,
Year SMALLINT NOT NULL
)
WITH (
DISTRIBUTION = REPLICATE, -- full copy on every node
CLUSTERED COLUMNSTORE INDEX
);
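Synapse also ships a serverless SQL pool (noted in the strengths above) that queries lake files directly, with nothing to load first. A sketch; the storage account and path are hypothetical:
-- Synapse serverless SQL pool: ad-hoc query over lake files
-- (storage account and path are hypothetical placeholders)
SELECT TOP 100 *
FROM OPENROWSET(
BULK 'https://yourlake.dfs.core.windows.net/bronze/orders/*.parquet',
FORMAT = 'PARQUET'
) AS raw_orders;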
Reference: Synapse SQL Best Practices
8 Architecture Selection Framework (Beginner)
Platform selection is not primarily a technical decision — it is an organizational one. The best platform is almost always the one your team already knows, your data already lives near, and your organization’s cloud strategy supports. Migration cost, retraining cost, and integration overhead almost always outweigh the performance differences between mature platforms in the same tier.
| If your organization… | Start here | Why |
|---|---|---|
| Runs SQL Server on-premises and uses Power BI | Microsoft Fabric | Fabric Mirroring, native Power BI, T-SQL compatibility, Purview governance |
| Is already on Azure or Microsoft 365 | Fabric Warehouse + Lakehouse | Entra ID, Copilot everywhere, zero-ETL Mirroring from Azure SQL |
| Is already on Snowflake | Snowflake | No migration, add Iceberg tables for lakehouse, Cortex AI for analytics |
| Has a strong data science team using Spark | Databricks | Best Spark engine, Delta Lake, Unity Catalog, MLflow |
| Is on Google Cloud with BigQuery | BigQuery + Vertex AI | Serverless, Gemini integration, Looker semantic layer |
| Is AWS-primary | Redshift + Glue + S3 | Native AWS integration, S3 lake + Redshift warehouse hybrid |
| Is migrating a large on-prem SQL DW to Azure | Synapse → Fabric | Synapse for lift-and-shift compatibility; plan migration to Fabric long-term |
| Wants maximum open-source flexibility | Databricks or Iceberg on S3 | Open formats, no storage lock-in, multi-engine access |
Choose a Warehouse When
- Fast SQL performance is the primary requirement
- BI dashboards are the main workload
- The team is SQL-first with limited Spark experience
- Low operational overhead is a priority
- Data is mostly structured
Choose a Data Lake When
- You need to store raw, historical, or unstructured data cheaply
- You run ML, Spark, or data science workloads
- You have petabyte-scale storage needs
- You want open formats with multi-engine access
Choose a Lakehouse When
- You want SQL and Spark on the same data without copying
- ACID transactions on lake storage matter
- Cost efficiency at scale is important
- You run both analytics and ML on the same datasets
- Open formats and avoiding vendor storage lock-in are priorities
9 Architecture Diagrams (Beginner)
[Diagram: Warehouse-Centric Architecture]
[Diagram: Lake-Centric Medallion Architecture]
[Diagram: Lakehouse Architecture (Databricks or Fabric)]
10 Mapping Architecture to Your Dimensional Model (Intermediate)
The dimensional model you built in Part 1 — DimCustomer, DimDate, DimProduct, FctOrderLine — is identical regardless of which architecture you choose. The schemas do not change. The grain does not change. The surrogate keys do not change. What changes is how the physical storage and compute are configured on each platform.
| Platform | Physical Difference | Performance Mechanism |
|---|---|---|
| Snowflake | Managed micro-partitions, optional clustering keys | Automatic pruning; clustering keys for large tables |
| Fabric Warehouse | Columnstore indexes, T-SQL compatible DDL | Columnstore compression; result-set caching |
| Databricks | Delta tables, Z-ordering or Liquid Clustering | File skipping via Z-order; Photon execution engine |
| BigQuery | Partition by date, cluster by join columns | Partition pruning; clustering reduces bytes scanned |
| Redshift | Distribution key + sort key per table | Parallel execution across nodes; co-located joins |
| Synapse | HASH/REPLICATE distribution + columnstore index | MPP parallel execution; columnstore compression |
The physical implementation details — clustering keys, distribution strategies, Z-ordering, partitioning — are covered in Parts 7 and 8 for each platform. The dimensional model schema is the same. Only the CREATE TABLE options differ.
11 Workshops
Novice
Compare Engine Behaviors
- Load a simple Orders CSV into one warehouse (Snowflake, Fabric, or BigQuery)
- Run: SELECT CustomerID, SUM(Amount) FROM Orders GROUP BY CustomerID
- Note query elapsed time, bytes scanned, and query plan
- Enable query profile or execution details
- Identify which operations consumed the most cost
Intermediate
Build a Lake + Warehouse Hybrid
- Create a Bronze folder in ADLS, S3, or GCS
- Ingest raw data using Fivetran, ADF, or Glue
- Create a Silver Delta or Parquet table
- Load Gold dimensional tables into Snowflake or Fabric Warehouse
- Compare query performance: lake SQL endpoint vs warehouse
- Document the latency and cost difference
Advanced
Build a Delta or Iceberg Lakehouse
- Create Delta Bronze tables in Databricks or Fabric
- Build Silver and Gold layers using Spark SQL
- Expose Gold via SQL Warehouse or SQL Analytics Endpoint
- Connect Power BI or Looker to the SQL endpoint
- Apply Z-ordering or clustering on join columns
- Compare query times against an equivalent dedicated warehouse
References
- Snowflake — Key Architecture Concepts
- Microsoft Fabric — Lakehouse Overview
- Databricks — Lakehouse Platform
- BigQuery — Architecture Overview
- Amazon Redshift — RA3 Node Types
- Azure Synapse — SQL Best Practices
- Delta Lake Documentation
- Apache Iceberg
- Microsoft Fabric Warehouse Documentation
- Databricks Unity Catalog