Deep Technical Series: Building a Modern Data Warehouse and Data Lake
A well-designed data lake is the backbone of a modern analytics platform. Whether you ultimately use a warehouse like Snowflake, a lakehouse like Fabric or Databricks, or hybrid architectures across AWS and GCP, the lake must be structured in a way that preserves raw data, supports scalable transformations, and aligns cleanly to your dimensional model.
This part covers the complete medallion architecture: Bronze, Silver, and Gold. You will learn how to lay out folder structures, choose table formats, apply partitioning strategies, optimize file sizes, and build the transformation pipeline that feeds your dimensional models. All examples are cloud-agnostic and work in ADLS, S3, GCS, and OneLake.
Bronze, Silver, Gold: The Medallion Architecture
The medallion paradigm originated with Databricks and has since become the industry standard for lake and lakehouse design.
Reference: Databricks Medallion Architecture
https://www.databricks.com/glossary/medallion-architecture
It defines three layers, each with clear responsibilities:
Bronze Layer: Raw Ingestion
The Bronze layer stores data exactly as received, with:
- No transformations
- No deduplication
- No type casting
- No business rules
- Full schema drift allowed
- Immutable append-only behavior
Bronze is your “source-of-truth archive.”
It allows you to replay pipelines, audit upstream sources, and rebuild Silver or Gold whenever needed.
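The append-only contract can be illustrated with a minimal sketch (the `bronze_append` helper and its metadata fields are hypothetical, not part of any vendor API): every payload is stored verbatim, wrapped in ingestion metadata, and nothing is ever updated in place.

```python
import json
from datetime import datetime, timezone

def bronze_append(store: list, payload: dict, source: str) -> None:
    """Append a raw payload to Bronze untouched, wrapped in ingestion metadata."""
    store.append({
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "source": source,
        # The payload is stored verbatim: no casting, no dedup, no business rules
        "raw": json.dumps(payload),
    })

bronze = []
bronze_append(bronze, {"customer_id": "42", "email": "A@B.COM"}, source="crm")
bronze_append(bronze, {"customer_id": "42", "email": "A@B.COM"}, source="crm")  # duplicate kept

assert len(bronze) == 2  # duplicates are preserved; dedup happens in Silver
```

Because Bronze never mutates, any Silver or Gold table can be rebuilt at any time by replaying this archive.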
Typical Bronze Formats
- JSON (API payloads, logs, CDC streams)
- CSV (bulk exports, flat files)
- Parquet (optimized extracts)
- Delta Lake (https://docs.delta.io/latest/) for raw tables in Databricks or Fabric
Data arrives via:
- Fivetran (https://www.fivetran.com/docs)
- Airbyte (https://airbyte.com/)
- Databricks Auto Loader (https://docs.databricks.com/ingestion/auto-loader/index.html)
- Snowpipe (https://docs.snowflake.com/en/user-guide/data-load-snowpipe)
- AWS Glue Jobs (https://docs.aws.amazon.com/glue/)
- ADF or Fabric Data Pipelines (https://learn.microsoft.com/en-us/fabric/data-factory/)
Silver Layer: Clean and Conformed
Silver is where the lake begins to take shape. Bronze is raw, but Silver is:
- Cleaned
- Deduped
- Type-enforced
- Canonicalized
- Flattened from nested formats
- Validated
- Conformed on natural keys
- Standardized in null handling
Silver tables align with staging models in dbt or equivalent SQL transformation logic.
They feed both the warehouse and the Gold layer.
Silver is also where you begin to optimize performance by using:
- Delta Lake (https://docs.delta.io/latest/)
- Apache Iceberg (https://iceberg.apache.org/)
- Parquet with clean schemas
Silver tables represent the “single version of cleaned truth” before dimensional modeling.
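The dedup-and-keep-latest rule can be made concrete with a plain-Python sketch (the `dedupe_latest` helper is illustrative; in practice this runs as Spark or SQL): for each natural key, keep only the record with the most recent `updated_at`, and reject rows missing the key.

```python
def dedupe_latest(rows, key="customer_id", ts="updated_at"):
    """Keep the most recent record per natural key; drop rows missing the key."""
    latest = {}
    for row in rows:
        k = row.get(key)
        if k is None:
            continue  # standardized null handling: keyless rows are rejected
        if k not in latest or row[ts] > latest[k][ts]:
            latest[k] = row
    return list(latest.values())

bronze_rows = [
    {"customer_id": 1, "email": "OLD@X.COM", "updated_at": "2025-01-01"},
    {"customer_id": 1, "email": "new@x.com", "updated_at": "2025-01-15"},
    {"customer_id": None, "email": "orphan@x.com", "updated_at": "2025-01-10"},
]
silver_rows = dedupe_latest(bronze_rows)
assert len(silver_rows) == 1
assert silver_rows[0]["email"] == "new@x.com"
```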
Gold Layer: Dimensional and Analytics-Ready
Gold tables map directly to the dimensional model you built in Part 1.
Gold is where you apply:
- Surrogate keys
- SCD Type 2
- Fact table grain enforcement
- Metric definitions
- Business logic
- KPI shaping
- Conformed dimensions
For lakehouses like Fabric and Databricks, Gold tables are typically:
- Delta Lake tables with ACID
- Iceberg tables with hidden partitioning
- Queried directly via SQL Warehouse or a Fabric Warehouse endpoint
Gold is also where BI tools connect.
Folder Structure: Vendor-Neutral and Scalable
A clean folder design is critical. It affects discovery, performance, schema management, and downstream ingestion.
Recommended structure:
```text
/lake/
  /bronze/
    /crm/customer/year=2025/month=01/day=15/*.json
    /ecommerce/orders/year=2025/month=01/day=15/*.json
  /silver/
    /crm/customer/customer_clean.delta
    /ecommerce/orders/orders_clean.delta
  /gold/
    /dim/dim_customer.delta
    /dim/dim_product.delta
    /fact/fct_order_line.delta
```
This structure supports:
- Parquet partition pruning
- Delta Z-ordering in Databricks
- Iceberg hidden partitioning
- Fabric OneLake shortcuts (https://learn.microsoft.com/en-us/fabric/onelake/onelake-shortcuts)
- Snowflake external table mapping (https://docs.snowflake.com/en/user-guide/tables-external-intro)
- BigQuery BigLake external tables (https://cloud.google.com/biglake)
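The hive-style `year=/month=/day=` layout can also be generated programmatically. A small sketch (the `bronze_path` helper is illustrative) shows how a daily ingestion job would derive its target prefix so that engines can prune partitions on date filters:

```python
from datetime import date

def bronze_path(root: str, source: str, entity: str, d: date) -> str:
    """Build a hive-style partition path that query engines can prune on."""
    return (f"{root}/bronze/{source}/{entity}/"
            f"year={d.year:04d}/month={d.month:02d}/day={d.day:02d}")

path = bronze_path("/lake", "crm", "customer", date(2025, 1, 15))
assert path == "/lake/bronze/crm/customer/year=2025/month=01/day=15"
```

Zero-padding the month and day keeps lexicographic ordering aligned with chronological ordering, which matters for range scans.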
Choosing File Formats: Parquet, Delta, Iceberg
A production lake uses open columnar formats because they enable efficient scans and interoperability.
Parquet
The universal default.
Efficient, columnar, compressed.
Best for Bronze storage and simple Silver tables.
Reference: Parquet
https://parquet.apache.org/
Delta Lake
Extends Parquet with:
- ACID transactions
- Schema evolution
- Time travel
- File compaction
- Z-order indexing for selective queries (https://docs.databricks.com/delta/optimizations/zorder.html)
Ideal for Silver and Gold.
Reference: Delta Lake
https://docs.delta.io/latest/
Apache Iceberg
High-performance table format with:
- Hidden partitioning
- Partition evolution
- Snapshot isolation
- Git-like metadata
- Broad engine support: Snowflake, Databricks, AWS Athena, BigQuery, Flink
Reference: Iceberg
https://iceberg.apache.org/
Iceberg is ideal if you want open-format lakehouse storage with multi-engine flexibility.
Bronze Layer: Ingestion Examples
Databricks Auto Loader (Recommended for continuous file ingestion)
```python
df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    # Schema inference requires a location to persist the inferred schema
    .option("cloudFiles.schemaLocation", "/mnt/checkpoints/customer/schema")
    .load("/mnt/raw/crm/customer"))

(df.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/customer")
    .start("/mnt/bronze/crm/customer"))
```
Auto Loader handles incremental discovery with schema inference.
Reference: Auto Loader
https://docs.databricks.com/ingestion/auto-loader/index.html
Snowpipe (Auto-ingest into Snowflake)
```sql
CREATE OR REPLACE PIPE customer_pipe AS
COPY INTO bronze_crm_customer
FROM @raw_stage
FILE_FORMAT = (TYPE = 'JSON');
```
Reference: Snowpipe
https://docs.snowflake.com/en/user-guide/data-load-snowpipe
AWS Glue Job for Bronze
```python
# Read raw JSON from S3 as a Glue DynamicFrame
df = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://lake/raw/crm/customer"]},
    format="json",
)

# Append to the Bronze layer as Parquet
df.toDF().write.mode("append").parquet("s3://lake/bronze/crm/customer")
```
Reference: AWS Glue
https://docs.aws.amazon.com/glue/
Fabric Data Pipelines
Fabric uses streamlined Copy Activities with OneLake integration.
Reference: Fabric Data Pipelines
https://learn.microsoft.com/en-us/fabric/data-factory/
Silver Layer: Cleaning and Conformance
Silver transforms Bronze into a usable, structured schema.
Databricks Spark Silver Example
```python
from pyspark.sql.functions import initcap, lower

bronze = spark.read.format("delta").load("/lake/bronze/crm/customer")

silver = (bronze
    .dropDuplicates(["customer_id"])
    .withColumn("email", lower("email"))
    .withColumn("customer_name", initcap("customer_name"))
    .filter("customer_id IS NOT NULL"))

silver.write.format("delta").mode("overwrite").save("/lake/silver/crm/customer")
```
Fabric Lakehouse Silver Example
```sql
CREATE OR ALTER VIEW silver.Customer AS
SELECT
    CAST(customer_id AS BIGINT) AS CustomerID,
    -- T-SQL has no INITCAP; capitalize the first letter manually
    UPPER(LEFT(customer_name, 1)) + LOWER(SUBSTRING(customer_name, 2, LEN(customer_name))) AS CustomerName,
    LOWER(email) AS Email,
    City,
    State,
    Country,
    UpdatedAt
FROM lakehouse_raw.crm_customer;
```
Reference: Fabric Lakehouse SQL
https://learn.microsoft.com/en-us/fabric/data-engineering/lakehouse-sql-analytics-endpoint
BigQuery Silver Example
```sql
CREATE OR REPLACE TABLE silver.customer AS
SELECT
    SAFE_CAST(customer_id AS INT64) AS customer_id,
    INITCAP(customer_name) AS customer_name,
    LOWER(email) AS email,
    city,
    state,
    country,
    updated_at
FROM bronze.customer;
```
Gold Layer: Dimensional Modeling
Gold tables are the dimensional models from Part 1.
Gold DimCustomer (SQL Example)
```sql
CREATE TABLE gold.DimCustomer AS
SELECT
    ROW_NUMBER() OVER (ORDER BY CustomerID) AS CustomerKey,
    CustomerID AS CustomerNaturalKey,
    CustomerName,
    Email,
    City,
    State,
    Country,
    UpdatedAt AS EffectiveFrom,
    DATE '9999-12-31' AS EffectiveTo,
    1 AS IsCurrent
FROM silver.Customer;
```
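The `EffectiveFrom`/`EffectiveTo`/`IsCurrent` columns implement SCD Type 2. The merge logic can be sketched in plain Python (function and field names are illustrative; in production this is a SQL MERGE): when a tracked attribute changes, expire the current row and append a new version.

```python
HIGH_DATE = "9999-12-31"

def scd2_apply(dim, natural_key, incoming, change_date):
    """Expire the current version and append a new one if attributes changed."""
    current = next(
        (r for r in dim if r["natural_key"] == natural_key and r["is_current"]),
        None,
    )
    if current and current["attrs"] == incoming:
        return  # no change: nothing to do
    if current:
        current["effective_to"] = change_date  # close out the old version
        current["is_current"] = False
    dim.append({
        "natural_key": natural_key,
        "attrs": incoming,
        "effective_from": change_date,
        "effective_to": HIGH_DATE,
        "is_current": True,
    })

dim_customer = []
scd2_apply(dim_customer, 42, {"city": "Oslo"}, "2025-01-01")
scd2_apply(dim_customer, 42, {"city": "Bergen"}, "2025-03-01")  # city changed

assert len(dim_customer) == 2
assert dim_customer[0]["is_current"] is False
assert dim_customer[1]["attrs"] == {"city": "Bergen"}
```

Both versions share the natural key but get distinct surrogate keys, so facts always join to the version valid at transaction time.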
Gold Fact Table Example (FctOrderLine)
```sql
CREATE TABLE gold.FctOrderLine AS
SELECT
    ROW_NUMBER() OVER (ORDER BY o.order_id) AS OrderLineKey,
    d.DateKey,
    c.CustomerKey,
    p.ProductKey,
    o.quantity,
    o.unit_price,
    o.quantity * o.unit_price AS ExtendedAmount
FROM silver.orders o
JOIN gold.DimDate d ON o.order_date = d.FullDate
JOIN gold.DimCustomer c ON o.customer_id = c.CustomerNaturalKey
JOIN gold.DimProduct p ON o.product_id = p.ProductNaturalKey;
```
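A fact table is only trustworthy if its grain holds, here exactly one row per order line. A quick sketch of that check (the helper name is illustrative; in SQL this is a `GROUP BY ... HAVING COUNT(*) > 1` probe):

```python
from collections import Counter

def grain_violations(rows, grain_cols):
    """Return grain-key tuples that appear more than once in the fact table."""
    counts = Counter(tuple(r[c] for c in grain_cols) for r in rows)
    return [key for key, n in counts.items() if n > 1]

fact_rows = [
    {"order_id": 1, "line_no": 1, "quantity": 2},
    {"order_id": 1, "line_no": 2, "quantity": 1},
    {"order_id": 1, "line_no": 2, "quantity": 1},  # duplicate grain!
]
assert grain_violations(fact_rows, ["order_id", "line_no"]) == [(1, 2)]
```

Running a check like this after every Gold build catches accidental fan-out from joins before it reaches BI dashboards.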
Partitioning Strategy
Partitioning is critical for:
- Query performance
- Scan reduction
- Cost efficiency
Use partitions when:
- Data is large (100M+ rows)
- Date filters are common
- Tenant or region segmentation exists
Best practices
- Partition by date or region
- Avoid high-cardinality columns
- Target file sizes of 128 MB to 1 GB
- Avoid accumulating many small files, which degrade performance
Reference: Delta Lake File Sizing
https://docs.delta.io/latest/best-practices.html
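The 128 MB to 1 GB guidance translates directly into a target file count for a table. A back-of-the-envelope helper (illustrative, not a vendor API) makes the arithmetic explicit:

```python
import math

def target_file_count(total_bytes: int, target_file_bytes: int = 256 * 1024**2) -> int:
    """How many files a table should compact into at a given target file size."""
    return max(1, math.ceil(total_bytes / target_file_bytes))

# A 10 GB table at a 256 MB target compacts into 40 files
assert target_file_count(10 * 1024**3) == 40
```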
Compaction and File Optimization
Small files kill performance across Databricks, Fabric, Athena, Synapse Serverless, and BigQuery.
Databricks Delta Optimization
```sql
OPTIMIZE delta.`/lake/silver/crm/customer`
ZORDER BY (customer_id);
```
Reference: Z-ordering
https://docs.databricks.com/delta/optimizations/zorder.html
Fabric Optimization
Fabric automatically optimizes Delta files but provides maintenance tooling for compaction.
Reference: Fabric Optimization
https://learn.microsoft.com/en-us/fabric/data-engineering/optimize-lakehouse
AWS Glue Compaction
Use a Glue ETL job to merge small Parquet files into larger ones.
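The merge step is essentially bin-packing: group small files into batches close to the target size, then rewrite each batch as one file. A sketch of the grouping (greedy first-fit over sorted sizes; names are illustrative):

```python
def plan_compaction(file_sizes, target_bytes=128 * 1024**2):
    """Greedily group small files into batches of roughly target_bytes each."""
    batches, current, current_size = [], [], 0
    for size in sorted(file_sizes, reverse=True):
        if current and current_size + size > target_bytes:
            batches.append(current)   # close the batch before it overflows
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        batches.append(current)
    return batches

mb = 1024**2
# Nine 20 MB files become two rewrite batches instead of nine tiny files
batches = plan_compaction([20 * mb] * 9)
assert len(batches) == 2
assert sum(batches[0]) <= 128 * mb
```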
Workshop: Novice, Intermediate, Advanced
Novice Workshop: Bronze to Silver
Goal: Understand medallion flow.
- Load CRM JSON files into Bronze
- Create a Silver table with cleaned, deduped data
- Query both and compare counts
- Document differences
Intermediate Workshop: Silver to Gold
Goal: Build dimensional models.
- Build Silver orders and customers
- Create DimCustomer and DimDate
- Create FctOrderLine
- Validate grain consistency
Advanced Workshop: Optimize the Lakehouse
Goal: Apply performance engineering.
- Convert Silver to Delta or Iceberg
- Run compaction
- Apply Z-order or clustering
- Compare query times before and after
- Run filtering queries to prove partition pruning
References
Delta Lake Documentation
https://docs.delta.io/latest/
Apache Iceberg
https://iceberg.apache.org/
Databricks Medallion Architecture
https://www.databricks.com/glossary/medallion-architecture
Databricks Auto Loader
https://docs.databricks.com/ingestion/auto-loader/index.html
Fabric Lakehouse Overview
https://learn.microsoft.com/en-us/fabric/data-engineering/lakehouse-overview
Fabric OneLake Shortcuts
https://learn.microsoft.com/en-us/fabric/onelake/onelake-shortcuts
Snowflake External Tables
https://docs.snowflake.com/en/user-guide/tables-external-intro
Snowpipe
https://docs.snowflake.com/en/user-guide/data-load-snowpipe
AWS Glue
https://docs.aws.amazon.com/glue/
BigLake Tables
https://cloud.google.com/biglake
Parquet
https://parquet.apache.org/
Delta Z-order Optimization
https://docs.databricks.com/delta/optimizations/zorder.html