PART 3 — Building the Data Lake (Bronze, Silver, Gold Architecture)

Deep Technical Series: Building a Modern Data Warehouse and Data Lake

A well-designed data lake is the backbone of a modern analytics platform. Whether you ultimately use a warehouse like Snowflake, a lakehouse like Fabric or Databricks, or hybrid architectures across AWS and GCP, the lake must be structured in a way that preserves raw data, supports scalable transformations, and aligns cleanly to your dimensional model.

This part covers the complete medallion architecture: Bronze, Silver, and Gold. You will learn how to lay out folder structures, choose table formats, apply partitioning strategies, optimize file sizes, and build the transformation pipeline that feeds your dimensional models. All examples are cloud-agnostic and work in ADLS, S3, GCS, and OneLake.


Bronze, Silver, Gold: The Medallion Architecture

The medallion paradigm originated with Databricks and has since become a de facto standard for lake and lakehouse design.
Reference: Databricks Medallion Architecture
https://www.databricks.com/glossary/medallion-architecture

It defines three layers, each with clear responsibilities:


Bronze Layer: Raw Ingestion

The Bronze layer stores data exactly as received, with:

  • No transformations
  • No deduplication
  • No type casting
  • No business rules
  • Full schema drift allowed
  • Immutable append-only behavior

Bronze is your “source-of-truth archive.”
It allows you to replay pipelines, audit upstream sources, and rebuild Silver or Gold whenever needed.

Typical Bronze Formats

  • JSON (API payloads, logs, CDC streams)
  • CSV (bulk exports, flat files)
  • Parquet (optimized extracts)
  • Delta Lake for raw tables in Databricks or Fabric
    https://docs.delta.io/latest/

Data arrives via:

  • Batch file drops from source systems
  • Streaming ingestion (Databricks Auto Loader)
  • Auto-ingest pipes (Snowpipe)
  • ETL jobs (AWS Glue)
  • Managed copy pipelines (Fabric Data Pipelines)

Silver Layer: Clean and Conformed

Silver is where the lake begins to take shape. Bronze is raw, but Silver is:

  • Cleaned
  • Deduplicated
  • Type-enforced
  • Canonicalized
  • Flattened from nested formats
  • Validated against data quality rules
  • Conformed on natural keys
  • Standardized in its null handling
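
The cleaning steps above can be sketched in plain Python (a toy illustration using dicts in place of a real table; the Spark and SQL examples later in this part show engine-native versions):

```python
# Toy Silver pass: dedupe by natural key, canonicalize values, drop invalid rows.
raw = [
    {"customer_id": 1, "customer_name": "ada lovelace", "email": "ADA@EXAMPLE.COM", "updated_at": "2025-01-14"},
    {"customer_id": 1, "customer_name": "ada lovelace", "email": "ada@example.com", "updated_at": "2025-01-15"},
    {"customer_id": None, "customer_name": "ghost", "email": None, "updated_at": "2025-01-15"},
]

def to_silver(records):
    latest = {}
    for r in records:
        if r["customer_id"] is None:       # validation: drop rows missing the natural key
            continue
        key = r["customer_id"]
        # dedupe: keep only the most recent version per natural key
        if key not in latest or r["updated_at"] > latest[key]["updated_at"]:
            latest[key] = r
    return [
        {
            "customer_id": int(r["customer_id"]),          # type enforcement
            "customer_name": r["customer_name"].title(),   # canonicalized casing
            "email": (r["email"] or "").lower() or None,   # standardized null handling
            "updated_at": r["updated_at"],
        }
        for r in latest.values()
    ]

silver = to_silver(raw)
```

Only one cleaned row survives: the duplicate collapses to the latest version, and the row without a natural key is rejected.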

Silver tables align with staging models in dbt or equivalent staging SQL.
They feed both the warehouse and the Gold layer.

Silver is also where you begin to optimize performance by using:

  • Open table formats such as Delta Lake or Iceberg
  • Partitioning on common filter columns
  • File compaction and clustering

Silver tables represent the “single version of cleaned truth” before dimensional modeling.


Gold Layer: Dimensional and Analytics-Ready

Gold tables map directly to the dimensional model you built in Part 1.

Gold is where you apply:

  • Surrogate keys
  • SCD Type 2
  • Fact table grain enforcement
  • Metric definitions
  • Business logic
  • KPI shaping
  • Conformed dimensions
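
As an illustration of the SCD Type 2 mechanics listed above, here is a minimal, engine-agnostic sketch (field names are hypothetical; in practice this runs as a MERGE against the Delta or Iceberg table):

```python
from datetime import date

HIGH_DATE = date(9999, 12, 31)

def apply_scd2(dim_rows, incoming, today):
    """Close out changed dimension rows and append new versions (toy in-memory version)."""
    current = {r["natural_key"]: r for r in dim_rows if r["is_current"]}
    out = list(dim_rows)
    next_key = max((r["surrogate_key"] for r in dim_rows), default=0) + 1
    for row in incoming:
        existing = current.get(row["natural_key"])
        if existing and existing["attrs"] == row["attrs"]:
            continue                          # unchanged: no new version needed
        if existing:
            existing["is_current"] = False    # expire the old version
            existing["effective_to"] = today
        out.append({
            "surrogate_key": next_key,
            "natural_key": row["natural_key"],
            "attrs": row["attrs"],
            "effective_from": today,
            "effective_to": HIGH_DATE,
            "is_current": True,
        })
        next_key += 1
    return out
```

After two loads where a customer moves city, the dimension holds two versions of that customer: the expired row with a closed effective window, and the current row flagged is_current.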

For lakehouses like Fabric and Databricks, Gold tables are typically:

  • Delta Lake tables with ACID guarantees
  • Iceberg tables with hidden partitioning
  • Served directly through a Databricks SQL Warehouse or a Fabric Warehouse endpoint

Gold is also where BI tools connect.


Folder Structure: Vendor-Neutral and Scalable

A clean folder design is critical. It affects discovery, performance, schema management, and downstream ingestion.

Recommended structure:

text
/lake/
    /bronze/
        /crm/customer/year=2025/month=01/day=15/*.json
        /ecommerce/orders/year=2025/month=01/day=15/*.json

    /silver/
        /crm/customer/customer_clean.delta
        /ecommerce/orders/orders_clean.delta

    /gold/
        /dim/dim_customer.delta
        /dim/dim_product.delta
        /fact/fct_order_line.delta
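
The date-partitioned Bronze paths shown above can be generated with a small helper (a sketch; the root path follows the layout above):

```python
from datetime import date

def bronze_path(source: str, entity: str, d: date, root: str = "/lake/bronze") -> str:
    """Build a Hive-style year=/month=/day= landing path for a Bronze file drop."""
    return f"{root}/{source}/{entity}/year={d.year}/month={d.month:02d}/day={d.day:02d}"

print(bronze_path("crm", "customer", date(2025, 1, 15)))
# /lake/bronze/crm/customer/year=2025/month=01/day=15
```

Zero-padding the month and day keeps partition folders lexicographically sorted, which most engines rely on for pruning.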

This structure supports:

  • Easy discovery of sources and entities
  • Partition pruning on the date-organized Bronze folders
  • Consistent schema management per entity
  • Predictable paths for downstream ingestion

Choosing File Formats: Parquet, Delta, Iceberg

A production lake uses open columnar formats because they enable efficient scans and interoperability.


Parquet

The universal default.
Efficient, columnar, compressed.

Best for Bronze storage and simple Silver tables.

Reference: Parquet
https://parquet.apache.org/


Delta Lake

Extends Parquet with:

  • ACID transactions
  • Time travel across table versions
  • Schema enforcement and evolution
  • MERGE, UPDATE, and DELETE support

Ideal for Silver and Gold.

Reference: Delta Lake
https://docs.delta.io/latest/


Apache Iceberg

High-performance table format with:

  • Hidden partitioning
  • Partition evolution
  • Snapshot isolation
  • Git-like metadata
  • Broad engine support: Snowflake, Databricks, AWS Athena, BigQuery, Flink

Reference: Iceberg
https://iceberg.apache.org/

Iceberg is ideal if you want open-format lakehouse storage with multi-engine flexibility.


Bronze Layer: Ingestion Examples

Databricks Auto Loader (Recommended for continuous file ingestion)

python
df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")
      .option("cloudFiles.schemaLocation", "/mnt/checkpoints/customer/schema")
      .load("/mnt/raw/crm/customer"))

(df.writeStream
   .format("delta")
   .option("checkpointLocation", "/mnt/checkpoints/customer")
   .start("/mnt/bronze/crm/customer"))

Auto Loader handles incremental discovery with schema inference.
Reference: Auto Loader
https://docs.databricks.com/ingestion/auto-loader/index.html


Snowpipe (Auto-ingest into Snowflake)

sql
CREATE OR REPLACE PIPE customer_pipe
  AUTO_INGEST = TRUE
AS
COPY INTO bronze_crm_customer
FROM @raw_stage
FILE_FORMAT = (TYPE = 'JSON');

Reference: Snowpipe
https://docs.snowflake.com/en/user-guide/data-load-snowpipe


AWS Glue Job for Bronze

python
# glue_context is the GlueContext created in the standard Glue job boilerplate
df = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://lake/raw/crm/customer"]},
    format="json",
)

df.toDF().write.mode("append").parquet("s3://lake/bronze/crm/customer")

Reference: AWS Glue
https://docs.aws.amazon.com/glue/


Fabric Data Pipelines

Fabric uses streamlined Copy Activities with OneLake integration.
Reference: Fabric Data Pipelines
https://learn.microsoft.com/en-us/fabric/data-factory/


Silver Layer: Cleaning and Conformance

Silver transforms Bronze into a usable, structured schema.

Databricks Spark Silver Example

python
from pyspark.sql.functions import initcap, lower

bronze = spark.read.format("delta").load("/lake/bronze/crm/customer")

silver = (bronze
    .dropDuplicates(["customer_id"])
    .withColumn("email", lower("email"))
    .withColumn("customer_name", initcap("customer_name"))
    .filter("customer_id IS NOT NULL")
)

silver.write.format("delta").mode("overwrite").save("/lake/silver/crm/customer")

Fabric Lakehouse Silver Example

sql
-- Spark SQL in a Fabric Lakehouse notebook; T-SQL on the SQL analytics
-- endpoint has no INITCAP, so proper-casing is applied here in Spark SQL
CREATE OR REPLACE VIEW silver.Customer AS
SELECT
    CAST(customer_id AS BIGINT) AS CustomerID,
    initcap(customer_name) AS CustomerName,
    lower(email) AS Email,
    City,
    State,
    Country,
    UpdatedAt
FROM lakehouse_raw.crm_customer;

Reference: Fabric Lakehouse SQL
https://learn.microsoft.com/en-us/fabric/data-engineering/lakehouse-sql-analytics-endpoint


BigQuery Silver Example

sql
CREATE OR REPLACE TABLE silver.customer AS
SELECT
    SAFE_CAST(customer_id AS INT64) AS customer_id,
    INITCAP(customer_name) AS customer_name,
    LOWER(email) AS email,
    city,
    state,
    country,
    updated_at
FROM bronze.customer;
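
SAFE_CAST returns NULL instead of failing the query when a value cannot be converted. For intuition, the equivalent behavior sketched in plain Python:

```python
def safe_cast_int(value):
    """Mimic BigQuery SAFE_CAST(x AS INT64): return None on failure instead of raising."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None
```

This matters in Silver because one malformed source row should become a NULL to inspect later, not a failed pipeline run.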

Gold Layer: Dimensional Modeling

Gold tables are the dimensional models from Part 1.

Gold DimCustomer (SQL Example)

sql
CREATE TABLE gold.DimCustomer AS
SELECT
    ROW_NUMBER() OVER(ORDER BY CustomerID) AS CustomerKey,
    CustomerID AS CustomerNaturalKey,
    CustomerName,
    Email,
    City,
    State,
    Country,
    UpdatedAt AS EffectiveFrom,
    DATE '9999-12-31' AS EffectiveTo,
    1 AS IsCurrent
FROM silver.Customer;

Gold Fact Table Example (FctOrderLine)

sql
CREATE TABLE gold.FctOrderLine AS
SELECT
    ROW_NUMBER() OVER (ORDER BY o.order_id) AS OrderLineKey,
    d.DateKey,
    c.CustomerKey,
    p.ProductKey,
    o.quantity,
    o.unit_price,
    o.quantity * o.unit_price AS ExtendedAmount
FROM silver.orders o
JOIN gold.DimDate d ON o.order_date = d.FullDate
JOIN gold.DimCustomer c ON o.customer_id = c.CustomerNaturalKey
JOIN gold.DimProduct p ON o.product_id = p.ProductNaturalKey;
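
Grain enforcement means the fact table has exactly one row per declared grain (here, one row per order line). A quick engine-agnostic check, sketched in Python (in SQL this is the classic GROUP BY ... HAVING COUNT(*) > 1 query):

```python
from collections import Counter

def grain_violations(rows, grain_cols):
    """Return the grain key tuples that appear more than once in the fact rows."""
    counts = Counter(tuple(r[c] for c in grain_cols) for r in rows)
    return [key for key, n in counts.items() if n > 1]

fact = [
    {"order_id": 10, "line_no": 1, "quantity": 2},
    {"order_id": 10, "line_no": 2, "quantity": 1},
    {"order_id": 10, "line_no": 2, "quantity": 1},  # duplicate grain: bad load
]
```

Running this after each Gold load catches fan-out from bad joins before BI users see doubled metrics.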

Partitioning Strategy

Partitioning is critical for:

  • Query performance
  • Scan reduction
  • Cost efficiency

Use partitions when:

  • Data is large (100M+ rows)
  • Date filters are common
  • Tenant or region segmentation exists

Best practices

  • Partition by date or region
  • Avoid high-cardinality partition columns
  • Target file sizes of 128 MB to 1 GB
  • Avoid accumulating many small files, which degrade scan performance

Reference: Delta Lake File Sizing
https://docs.delta.io/latest/best-practices.html


Compaction and File Optimization

Small files kill performance across Databricks, Fabric, Athena, Synapse Serverless, and BigQuery.

Databricks Delta Optimization

sql
OPTIMIZE delta.`/lake/silver/crm/customer`
ZORDER BY (customer_id);

Reference: Z-ordering
https://docs.databricks.com/delta/optimizations/zorder.html
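
Z-ordering pays off because engines keep per-file min/max statistics and skip any file whose range cannot match the filter. A toy sketch of that data-skipping logic (illustrative file names and stats):

```python
def files_to_scan(file_stats, value):
    """Keep only files whose [min, max] range for the filter column can contain the value."""
    return [f["path"] for f in file_stats if f["min"] <= value <= f["max"]]

# After clustering on customer_id, per-file ranges barely overlap,
# so a point lookup touches very few files.
stats = [
    {"path": "part-0.parquet", "min": 1,    "max": 1000},
    {"path": "part-1.parquet", "min": 1001, "max": 2000},
    {"path": "part-2.parquet", "min": 2001, "max": 3000},
]
```

Without clustering, every file's range tends to span the whole key space and nothing can be skipped; clustering is what makes these ranges tight.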


Fabric Optimization

Fabric automatically optimizes Delta files but provides maintenance tooling for compaction.
Reference: Fabric Optimization
https://learn.microsoft.com/en-us/fabric/data-engineering/optimize-lakehouse


AWS Glue Compaction

Use a Glue ETL job to merge small Parquet files into larger ones.
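
The merge plan such a job needs can be sketched as a simple bin-packing pass that groups small files into target-size batches (plain Python, illustrative; Delta's OPTIMIZE does this for you):

```python
TARGET_BYTES = 256 * 1024 * 1024  # aim for ~256 MB outputs, inside the 128 MB-1 GB band

def plan_compaction(file_sizes):
    """Greedily group small files into batches of roughly TARGET_BYTES each."""
    batches, current, current_size = [], [], 0
    for size in sorted(file_sizes):
        current.append(size)
        current_size += size
        if current_size >= TARGET_BYTES:
            batches.append(current)
            current, current_size = [], 0
    if current:
        batches.append(current)
    return batches
```

Each batch is then read and rewritten as a single larger file, turning thousands of tiny objects into a handful of well-sized ones.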


Workshop: Novice, Intermediate, Advanced


Novice Workshop: Bronze to Silver

Goal: Understand medallion flow.

  1. Load CRM JSON files into Bronze
  2. Create a Silver table with cleaned, deduped data
  3. Query both and compare counts
  4. Document differences

Intermediate Workshop: Silver to Gold

Goal: Build dimensional models.

  1. Build Silver orders and customers
  2. Create DimCustomer and DimDate
  3. Create FctOrderLine
  4. Validate grain consistency

Advanced Workshop: Optimize the Lakehouse

Goal: Apply performance engineering.

  1. Convert Silver to Delta or Iceberg
  2. Run compaction
  3. Apply Z-order or clustering
  4. Compare query times before and after
  5. Run filtering queries to prove partition pruning

References

Delta Lake Documentation
https://docs.delta.io/latest/

Apache Iceberg
https://iceberg.apache.org/

Databricks Medallion Architecture
https://www.databricks.com/glossary/medallion-architecture

Databricks Auto Loader
https://docs.databricks.com/ingestion/auto-loader/index.html

Fabric Lakehouse Overview
https://learn.microsoft.com/en-us/fabric/data-engineering/lakehouse-overview

Fabric OneLake Shortcuts
https://learn.microsoft.com/en-us/fabric/onelake/onelake-shortcuts

Snowflake External Tables
https://docs.snowflake.com/en/user-guide/tables-external-intro

Snowpipe
https://docs.snowflake.com/en/user-guide/data-load-snowpipe

AWS Glue
https://docs.aws.amazon.com/glue/

BigLake Tables
https://cloud.google.com/biglake

Parquet
https://parquet.apache.org/

Delta Z-order Optimization
https://docs.databricks.com/delta/optimizations/zorder.html

