PART 3 — Building the Data Lake (Bronze, Silver, Gold Architecture)

Deep Technical Series: Building a Modern Data Warehouse and Data Lake

A well-designed data lake is the backbone of a modern analytics platform. Whether you ultimately use a warehouse like Snowflake, a lakehouse like Fabric or Databricks, or hybrid architectures across AWS and GCP, the lake must be structured in a way that preserves raw data, supports scalable transformations, and aligns cleanly to your dimensional model.

This part covers the complete medallion architecture: Bronze, Silver, and Gold. You will learn how to lay out folder structures, choose table formats, apply partitioning strategies, optimize file sizes, and build the transformation pipeline that feeds your dimensional models. All examples are cloud-agnostic and work in ADLS, S3, GCS, and OneLake.


Bronze, Silver, Gold: The Medallion Architecture

The medallion paradigm originated with Databricks and has since become a de facto standard for lake and lakehouse design.
Reference: Databricks Medallion Architecture
https://www.databricks.com/glossary/medallion-architecture

It defines three layers, each with clear responsibilities:


Bronze Layer: Raw Ingestion

The Bronze layer stores data exactly as received, with:

  • No transformations
  • No deduplication
  • No type casting
  • No business rules
  • Full schema drift allowed
  • Immutable append-only behavior

Bronze is your “source-of-truth archive.”
It allows you to replay pipelines, audit upstream sources, and rebuild Silver or Gold whenever needed.

Typical Bronze Formats

  • JSON (API payloads, logs, CDC streams)
  • CSV (bulk exports, flat files)
  • Parquet (optimized extracts)
  • Delta Lake for raw tables in Databricks or Fabric
    https://docs.delta.io/latest/

Data arrives via:

  • Batch file drops from source systems
  • Streaming ingestion (Databricks Auto Loader)
  • Auto-ingest pipes (Snowpipe)
  • ETL jobs (AWS Glue)
  • Managed copy pipelines (Fabric Data Pipelines)

Silver Layer: Clean and Conformed

Silver is where the lake begins to take shape. Bronze is raw, but Silver is:

  • Cleaned
  • Deduplicated
  • Type-enforced
  • Canonicalized
  • Flattened from nested formats
  • Validated against data quality rules
  • Conformed on natural keys
  • Standardized in its null handling
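
The cleaning steps above can be sketched in plain Python (a toy illustration using dicts in place of a real table; the Spark and SQL examples later in this part show engine-native versions):

```python
# Toy Silver pass: dedupe by natural key, canonicalize values, drop invalid rows.
raw = [
    {"customer_id": 1, "customer_name": "ada lovelace", "email": "ADA@EXAMPLE.COM", "updated_at": "2025-01-14"},
    {"customer_id": 1, "customer_name": "ada lovelace", "email": "ada@example.com", "updated_at": "2025-01-15"},
    {"customer_id": None, "customer_name": "ghost", "email": None, "updated_at": "2025-01-15"},
]

def to_silver(records):
    latest = {}
    for r in records:
        if r["customer_id"] is None:       # validation: drop rows missing the natural key
            continue
        key = r["customer_id"]
        # dedupe: keep only the most recent version per natural key
        if key not in latest or r["updated_at"] > latest[key]["updated_at"]:
            latest[key] = r
    return [
        {
            "customer_id": int(r["customer_id"]),          # type enforcement
            "customer_name": r["customer_name"].title(),   # canonicalized casing
            "email": (r["email"] or "").lower() or None,   # standardized null handling
            "updated_at": r["updated_at"],
        }
        for r in latest.values()
    ]

silver = to_silver(raw)
```

Only one cleaned row survives: the duplicate collapses to the latest version, and the row without a natural key is rejected.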

Silver tables align with staging models in dbt or equivalent staging SQL.
They feed both the warehouse and the Gold layer.

Silver is also where you begin to optimize performance by using:

  • Open table formats such as Delta Lake or Iceberg
  • Partitioning on common filter columns
  • File compaction and clustering

Silver tables represent the “single version of cleaned truth” before dimensional modeling.


Gold Layer: Dimensional and Analytics-Ready

Gold tables map directly to the dimensional model you built in Part 1.

Gold is where you apply:

  • Surrogate keys
  • SCD Type 2
  • Fact table grain enforcement
  • Metric definitions
  • Business logic
  • KPI shaping
  • Conformed dimensions
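
As an illustration of the SCD Type 2 mechanics listed above, here is a minimal, engine-agnostic sketch (field names are hypothetical; in practice this runs as a MERGE against the Delta or Iceberg table):

```python
from datetime import date

HIGH_DATE = date(9999, 12, 31)

def apply_scd2(dim_rows, incoming, today):
    """Close out changed dimension rows and append new versions (toy in-memory version)."""
    current = {r["natural_key"]: r for r in dim_rows if r["is_current"]}
    out = list(dim_rows)
    next_key = max((r["surrogate_key"] for r in dim_rows), default=0) + 1
    for row in incoming:
        existing = current.get(row["natural_key"])
        if existing and existing["attrs"] == row["attrs"]:
            continue                          # unchanged: no new version needed
        if existing:
            existing["is_current"] = False    # expire the old version
            existing["effective_to"] = today
        out.append({
            "surrogate_key": next_key,
            "natural_key": row["natural_key"],
            "attrs": row["attrs"],
            "effective_from": today,
            "effective_to": HIGH_DATE,
            "is_current": True,
        })
        next_key += 1
    return out
```

After two loads where a customer moves city, the dimension holds two versions of that customer: the expired row with a closed effective window, and the current row flagged is_current.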

For lakehouses like Fabric and Databricks, Gold tables are typically:

  • Delta Lake tables with ACID guarantees
  • Iceberg tables with hidden partitioning
  • Served directly through a Databricks SQL Warehouse or a Fabric Warehouse endpoint

Gold is also where BI tools connect.


Folder Structure: Vendor-Neutral and Scalable

A clean folder design is critical. It affects discovery, performance, schema management, and downstream ingestion.

Recommended structure:

text
/lake/
    /bronze/
        /crm/customer/year=2025/month=01/day=15/*.json
        /ecommerce/orders/year=2025/month=01/day=15/*.json

    /silver/
        /crm/customer/customer_clean.delta
        /ecommerce/orders/orders_clean.delta

    /gold/
        /dim/dim_customer.delta
        /dim/dim_product.delta
        /fact/fct_order_line.delta
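
The date-partitioned Bronze paths shown above can be generated with a small helper (a sketch; the root path follows the layout above):

```python
from datetime import date

def bronze_path(source: str, entity: str, d: date, root: str = "/lake/bronze") -> str:
    """Build a Hive-style year=/month=/day= landing path for a Bronze file drop."""
    return f"{root}/{source}/{entity}/year={d.year}/month={d.month:02d}/day={d.day:02d}"

print(bronze_path("crm", "customer", date(2025, 1, 15)))
# /lake/bronze/crm/customer/year=2025/month=01/day=15
```

Zero-padding the month and day keeps partition folders lexicographically sorted, which most engines rely on for pruning.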

This structure supports:

  • Easy discovery of sources and entities
  • Partition pruning on the date-organized Bronze folders
  • Consistent schema management per entity
  • Predictable paths for downstream ingestion

Choosing File Formats: Parquet, Delta, Iceberg

A production lake uses open columnar formats because they enable efficient scans and interoperability.


Parquet

The universal default.
Efficient, columnar, compressed.

Best for Bronze storage and simple Silver tables.

Reference: Parquet
https://parquet.apache.org/


Delta Lake

Extends Parquet with:

  • ACID transactions
  • Time travel across table versions
  • Schema enforcement and evolution
  • MERGE, UPDATE, and DELETE support

Ideal for Silver and Gold.

Reference: Delta Lake
https://docs.delta.io/latest/


Apache Iceberg

High-performance table format with:

  • Hidden partitioning
  • Partition evolution
  • Snapshot isolation
  • Git-like metadata
  • Broad engine support: Snowflake, Databricks, AWS Athena, BigQuery, Flink

Reference: Iceberg
https://iceberg.apache.org/

Iceberg is ideal if you want open-format lakehouse storage with multi-engine flexibility.


Bronze Layer: Ingestion Examples

Databricks Auto Loader (Recommended for continuous file ingestion)

python
df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")
      .option("cloudFiles.schemaLocation", "/mnt/checkpoints/customer/schema")
      .load("/mnt/raw/crm/customer"))

(df.writeStream
   .format("delta")
   .option("checkpointLocation", "/mnt/checkpoints/customer")
   .start("/mnt/bronze/crm/customer"))

Auto Loader handles incremental discovery with schema inference.
Reference: Auto Loader
https://docs.databricks.com/ingestion/auto-loader/index.html


Snowpipe (Auto-ingest into Snowflake)

sql
CREATE OR REPLACE PIPE customer_pipe
  AUTO_INGEST = TRUE
AS
COPY INTO bronze_crm_customer
FROM @raw_stage
FILE_FORMAT = (TYPE = 'JSON');

Reference: Snowpipe
https://docs.snowflake.com/en/user-guide/data-load-snowpipe


AWS Glue Job for Bronze

python
# glue_context is the GlueContext created in the standard Glue job boilerplate
df = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://lake/raw/crm/customer"]},
    format="json",
)

df.toDF().write.mode("append").parquet("s3://lake/bronze/crm/customer")

Reference: AWS Glue
https://docs.aws.amazon.com/glue/


Fabric Data Pipelines

Fabric uses streamlined Copy Activities with OneLake integration.
Reference: Fabric Data Pipelines
https://learn.microsoft.com/en-us/fabric/data-factory/


Silver Layer: Cleaning and Conformance

Silver transforms Bronze into a usable, structured schema.

Databricks Spark Silver Example

python
from pyspark.sql.functions import initcap, lower

bronze = spark.read.format("delta").load("/lake/bronze/crm/customer")

silver = (bronze
    .dropDuplicates(["customer_id"])
    .withColumn("email", lower("email"))
    .withColumn("customer_name", initcap("customer_name"))
    .filter("customer_id IS NOT NULL")
)

silver.write.format("delta").mode("overwrite").save("/lake/silver/crm/customer")

Fabric Lakehouse Silver Example

sql
-- Spark SQL in a Fabric Lakehouse notebook; T-SQL on the SQL analytics
-- endpoint has no INITCAP, so proper-casing is applied here in Spark SQL
CREATE OR REPLACE VIEW silver.Customer AS
SELECT
    CAST(customer_id AS BIGINT) AS CustomerID,
    initcap(customer_name) AS CustomerName,
    lower(email) AS Email,
    City,
    State,
    Country,
    UpdatedAt
FROM lakehouse_raw.crm_customer;

Reference: Fabric Lakehouse SQL
https://learn.microsoft.com/en-us/fabric/data-engineering/lakehouse-sql-analytics-endpoint


BigQuery Silver Example

sql
CREATE OR REPLACE TABLE silver.customer AS
SELECT
    SAFE_CAST(customer_id AS INT64) AS customer_id,
    INITCAP(customer_name) AS customer_name,
    LOWER(email) AS email,
    city,
    state,
    country,
    updated_at
FROM bronze.customer;
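
SAFE_CAST returns NULL instead of failing the query when a value cannot be converted. For intuition, the equivalent behavior sketched in plain Python:

```python
def safe_cast_int(value):
    """Mimic BigQuery SAFE_CAST(x AS INT64): return None on failure instead of raising."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None
```

This matters in Silver because one malformed source row should become a NULL to inspect later, not a failed pipeline run.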

Gold Layer: Dimensional Modeling

Gold tables are the dimensional models from Part 1.

Gold DimCustomer (SQL Example)

sql
CREATE TABLE gold.DimCustomer AS
SELECT
    ROW_NUMBER() OVER(ORDER BY CustomerID) AS CustomerKey,
    CustomerID AS CustomerNaturalKey,
    CustomerName,
    Email,
    City,
    State,
    Country,
    UpdatedAt AS EffectiveFrom,
    DATE '9999-12-31' AS EffectiveTo,
    1 AS IsCurrent
FROM silver.Customer;

Gold Fact Table Example (FctOrderLine)

sql
CREATE TABLE gold.FctOrderLine AS
SELECT
    ROW_NUMBER() OVER (ORDER BY o.order_id) AS OrderLineKey,
    d.DateKey,
    c.CustomerKey,
    p.ProductKey,
    o.quantity,
    o.unit_price,
    o.quantity * o.unit_price AS ExtendedAmount
FROM silver.orders o
JOIN gold.DimDate d ON o.order_date = d.FullDate
JOIN gold.DimCustomer c ON o.customer_id = c.CustomerNaturalKey
JOIN gold.DimProduct p ON o.product_id = p.ProductNaturalKey;
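
Grain enforcement means the fact table has exactly one row per declared grain (here, one row per order line). A quick engine-agnostic check, sketched in Python (in SQL this is the classic GROUP BY ... HAVING COUNT(*) > 1 query):

```python
from collections import Counter

def grain_violations(rows, grain_cols):
    """Return the grain key tuples that appear more than once in the fact rows."""
    counts = Counter(tuple(r[c] for c in grain_cols) for r in rows)
    return [key for key, n in counts.items() if n > 1]

fact = [
    {"order_id": 10, "line_no": 1, "quantity": 2},
    {"order_id": 10, "line_no": 2, "quantity": 1},
    {"order_id": 10, "line_no": 2, "quantity": 1},  # duplicate grain: bad load
]
```

Running this after each Gold load catches fan-out from bad joins before BI users see doubled metrics.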

Partitioning Strategy

Partitioning is critical for:

  • Query performance
  • Scan reduction
  • Cost efficiency

Use partitions when:

  • Data is large (100M+ rows)
  • Date filters are common
  • Tenant or region segmentation exists

Best practices

  • Partition by date or region
  • Avoid high-cardinality partition columns
  • Target file sizes of 128 MB to 1 GB
  • Avoid accumulating many small files, which degrade scan performance

Reference: Delta Lake File Sizing
https://docs.delta.io/latest/best-practices.html


Compaction and File Optimization

Small files kill performance across Databricks, Fabric, Athena, Synapse Serverless, and BigQuery.

Databricks Delta Optimization

sql
OPTIMIZE delta.`/lake/silver/crm/customer`
ZORDER BY (customer_id);

Reference: Z-ordering
https://docs.databricks.com/delta/optimizations/zorder.html
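
Z-ordering pays off because engines keep per-file min/max statistics and skip any file whose range cannot match the filter. A toy sketch of that data-skipping logic (illustrative file names and stats):

```python
def files_to_scan(file_stats, value):
    """Keep only files whose [min, max] range for the filter column can contain the value."""
    return [f["path"] for f in file_stats if f["min"] <= value <= f["max"]]

# After clustering on customer_id, per-file ranges barely overlap,
# so a point lookup touches very few files.
stats = [
    {"path": "part-0.parquet", "min": 1,    "max": 1000},
    {"path": "part-1.parquet", "min": 1001, "max": 2000},
    {"path": "part-2.parquet", "min": 2001, "max": 3000},
]
```

Without clustering, every file's range tends to span the whole key space and nothing can be skipped; clustering is what makes these ranges tight.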


Fabric Optimization

Fabric automatically optimizes Delta files but provides maintenance tooling for compaction.
Reference: Fabric Optimization
https://learn.microsoft.com/en-us/fabric/data-engineering/optimize-lakehouse


AWS Glue Compaction

Use a Glue ETL job to merge small Parquet files into larger ones.
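
The merge plan such a job needs can be sketched as a simple bin-packing pass that groups small files into target-size batches (plain Python, illustrative; Delta's OPTIMIZE does this for you):

```python
TARGET_BYTES = 256 * 1024 * 1024  # aim for ~256 MB outputs, inside the 128 MB-1 GB band

def plan_compaction(file_sizes):
    """Greedily group small files into batches of roughly TARGET_BYTES each."""
    batches, current, current_size = [], [], 0
    for size in sorted(file_sizes):
        current.append(size)
        current_size += size
        if current_size >= TARGET_BYTES:
            batches.append(current)
            current, current_size = [], 0
    if current:
        batches.append(current)
    return batches
```

Each batch is then read and rewritten as a single larger file, turning thousands of tiny objects into a handful of well-sized ones.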


Workshop: Novice, Intermediate, Advanced


Novice Workshop: Bronze to Silver

Goal: Understand medallion flow.

  1. Load CRM JSON files into Bronze
  2. Create a Silver table with cleaned, deduped data
  3. Query both and compare counts
  4. Document differences

Intermediate Workshop: Silver to Gold

Goal: Build dimensional models.

  1. Build Silver orders and customers
  2. Create DimCustomer and DimDate
  3. Create FctOrderLine
  4. Validate grain consistency

Advanced Workshop: Optimize the Lakehouse

Goal: Apply performance engineering.

  1. Convert Silver to Delta or Iceberg
  2. Run compaction
  3. Apply Z-order or clustering
  4. Compare query times before and after
  5. Run filtering queries to prove partition pruning

References

Delta Lake Documentation
https://docs.delta.io/latest/

Apache Iceberg
https://iceberg.apache.org/

Databricks Medallion Architecture
https://www.databricks.com/glossary/medallion-architecture

Databricks Auto Loader
https://docs.databricks.com/ingestion/auto-loader/index.html

Fabric Lakehouse Overview
https://learn.microsoft.com/en-us/fabric/data-engineering/lakehouse-overview

Fabric OneLake Shortcuts
https://learn.microsoft.com/en-us/fabric/onelake/onelake-shortcuts

Snowflake External Tables
https://docs.snowflake.com/en/user-guide/tables-external-intro

Snowpipe
https://docs.snowflake.com/en/user-guide/data-load-snowpipe

AWS Glue
https://docs.aws.amazon.com/glue/

BigLake Tables
https://cloud.google.com/biglake

Parquet
https://parquet.apache.org/

Delta Z-order Optimization
https://docs.databricks.com/delta/optimizations/zorder.html

