Part 5 — Transformations Part 1: ELT Strategy, dbt Modeling, and SQL Pipelines.

Part 5 — Transformations Part 1: ELT Strategy, dbt Modeling, and SQL Pipelines – SQLYARD
Deep Technical Series — Building a Modern Data Warehouse and Lakehouse

Part 5 — Transformations Part 1: ELT Strategy, dbt Modeling, and SQL Pipelines


Part 5 of 14 — Deep Technical Series: Building a Modern Data Warehouse and Lakehouse. Series index (Part 0) · ← Part 4: Ingestion

Transformations are the core of your analytical system. This is where Bronze raw data becomes Silver cleaned data and eventually Gold dimensional models. Parts 5 and 6 together cover the complete transformation stack — Part 5 focuses on strategy, dbt architecture, incremental patterns, and SQL MERGE across platforms. Part 6 covers the advanced mechanics: SCD Type 1 and 2, fact table loading, Spark transformations, and late-arriving dimension handling.

By the end of this part you will understand why modern systems choose ELT over ETL, how to structure a dbt project for a medallion architecture, how to write incremental Silver models that only process changed data, and how to apply MERGE patterns correctly across Snowflake, Fabric, BigQuery, and Databricks.

How this connects to the series: Transformations sit between ingestion (Part 4) and dimensional modeling implementation (Part 6). The Bronze data that arrived via Fivetran, Auto Loader, or Snowpipe in Part 4 is transformed here into Silver staging models. Those Silver models feed the SCD Type 2 dimensions and fact tables built in Part 6. The Gold layer is the dimensional model designed in Part 1, implemented through the transformation patterns in Parts 5 and 6.

1 ELT vs ETL — Why Modern Systems Choose ELT Beginner

Traditional ETL — Extract, Transform, Load — transformed data before it reached storage. A dedicated ETL server (SSIS, Informatica, Ab Initio) pulled data from sources, applied transformations in its own compute layer, and loaded the cleaned output into the warehouse. This worked when the warehouse was expensive and storage was limited, so only pre-cleaned data was worth storing.

Modern ELT — Extract, Load, Transform — inverts this. Raw data is extracted and loaded directly into the warehouse or lake first (Bronze), and transformations happen inside the warehouse or lakehouse engine afterward. This is possible because cloud warehouses and Spark engines provide effectively unlimited compute at the point where the data already lives.

ETL — Transform Before Loading

Source System
    ↓  Extract
ETL Server (SSIS, Informatica)
    ↓  Transform in middleware
    ↓  (data leaves the warehouse)
Warehouse
    ↓  Load cleaned data only
BI Tools

Problems:
- Middleware is a bottleneck
- Raw data is not preserved
- Hard to debug transformation errors
- Scaling requires bigger ETL servers

ELT — Load Then Transform

Source System
    ↓  Extract + Load (Fivetran/Airbyte)
Bronze (raw data preserved)
    ↓  Transform inside warehouse/lake
Silver (dbt, Spark, SQL)
    ↓  Transform inside warehouse/lake
Gold (dbt marts, dimensional models)
    ↓
BI Tools

Advantages:
- Compute scales with the warehouse
- Raw data always available for replay
- Transformations are SQL -- version controlled
- Debugging is transparent

Why ELT Wins in Modern Architectures

  • Pushes computation to the engine nearest the data. Snowflake, BigQuery, Databricks, and Fabric all have massive, scalable compute sitting right next to the data. Running transformations there is faster and cheaper than moving data to an external server.
  • Preserves raw data for replay and audit. Bronze holds the source-of-truth. If your Silver transformation logic was wrong, you can fix the logic and rebuild Silver from Bronze without re-extracting from the source system.
  • Transformations become SQL — version controlled and reviewable. dbt turns SQL transformations into version-controlled code with testing, documentation, and lineage. This replaces proprietary ETL tool configurations that are hard to review and impossible to diff.
  • Scales naturally. When data volume grows, you scale the warehouse compute — you do not buy a bigger ETL server.

Reference: dbt’s founding principle is ELT — using the warehouse as the transformation engine. Their introduction documentation explains the philosophy in detail and is recommended reading before building your first dbt project.

2 Transformations in the Medallion Architecture Beginner

Transformations happen at two layers in the medallion architecture. Understanding which type of transformation belongs at each layer prevents a common mistake — applying business logic too early (in Silver) or failing to clean data properly before applying business logic (in Gold).

LayerTransformation TypeToolsWhat Happens Here
Silver Cleaning and conformance dbt staging models, Spark, SQL views Type casting, deduplication, null handling, JSON flattening, name standardization, incremental MERGE
Gold Dimensional modeling and business logic dbt mart models, Spark SQL, warehouse SQL Surrogate key generation, SCD Type 2, fact grain enforcement, conformed dimension joins, metric-ready measures

Do not apply business logic in Silver. Silver’s job is cleaning and conformance — making raw data reliable and consistently typed. Business logic belongs in Gold. If you apply business rules (revenue calculation, customer segmentation, loyalty tier assignment) in Silver, you mix infrastructure concerns with business concerns and make both harder to maintain. Keep the layers clean and your pipeline will be easier to debug, test, and evolve.

3 dbt — The Standard for SQL Transformations Beginner

dbt (data build tool) is the most widely adopted transformation tool for warehouses and lakehouses as of 2026. It turns SQL SELECT statements into version-controlled, tested, documented data models that run directly inside your warehouse or lakehouse. dbt does not move data — it generates and executes SQL against your existing compute engine, making transformations as fast as the engine they run on.

Reference: dbt Documentation

What dbt Gives You

  • Models as version-controlled SQL. Each transformation is a .sql file in a Git repository. Changes are reviewed, approved, and tracked — just like application code.
  • Automated testing. Built-in tests for uniqueness, not-null, referential integrity, and accepted values. Custom SQL tests for business rules. Failing tests block downstream models from running.
  • Lineage graph. dbt automatically builds a DAG (Directed Acyclic Graph) of all model dependencies. You can see exactly which models a change affects before deploying it.
  • Documentation generation. Column descriptions, model descriptions, and test results are automatically compiled into a browsable documentation site.
  • Incremental models. dbt handles incremental refresh patterns — only processing new or changed rows — without requiring you to write watermark logic manually every time.
  • Cross-platform portability. The same dbt project runs on Snowflake, Fabric, BigQuery, Redshift, Databricks, and DuckDB. Platform differences are handled by adapters.

dbt Core vs dbt Cloud

dbt Coredbt Cloud
CostFree, open sourcePaid (free tier available)
SchedulingExternal (Airflow, Fabric Pipelines, cron)Built-in job scheduler
IDELocal VS Code or any text editorBrowser-based IDE with autocomplete
CI/CDManual GitHub Actions setupNative CI/CD integration
Best forEngineering teams with existing orchestrationTeams wanting managed scheduling and collaboration

4 dbt Project Structure — Staging, Intermediate, and Marts Beginner

dbt encourages a three-layer model structure that maps cleanly to the medallion architecture. Understanding this structure before writing any models prevents the most common dbt project organization mistakes.

Staging (Silver)

One model per source table. Clean, rename, cast types, deduplicate. No joins between sources. No business logic. Maps to Bronze → Silver.

Intermediate

Pre-join logic, complex CTEs, business rule preparation. Optional layer — use when Gold models need reusable intermediate logic. Maps within Silver.

Marts (Gold)

Dimensional models — DimCustomer, DimProduct, FctOrderLine. Business logic, SCD Type 2, surrogate keys, fact grain. Maps to Silver → Gold.

# dbt project folder structure
models/
├── staging/                    # Silver layer -- one model per source table
│   ├── crm/
│   │   ├── _crm__sources.yml   # source definitions
│   │   ├── stg_crm_customer.sql
│   │   └── stg_crm_contacts.sql
│   └── ecommerce/
│       ├── _ecommerce__sources.yml
│       ├── stg_ecommerce_orders.sql
│       └── stg_ecommerce_products.sql
├── intermediate/               # Optional -- complex pre-join logic
│   └── int_customer_with_orders.sql
└── marts/                      # Gold layer -- dimensional models
    ├── core/
    │   ├── dim_customer.sql
    │   ├── dim_product.sql
    │   ├── dim_date.sql
    │   └── fct_order_line.sql
    └── finance/
        └── fct_subscription.sql
# dbt_project.yml: configure materialization by layer
# This sets the default -- individual models can override

name: 'analytics'
version: '1.0.0'
config-version: 2

models:
  analytics:
    staging:
      +materialized: view          # staging models as views (no storage cost)
      +schema: silver              # writes to the 'silver' schema
    intermediate:
      +materialized: ephemeral     # ephemeral = inline CTE, not materialized
    marts:
      +materialized: table         # mart models as physical tables
      +schema: gold                # writes to the 'gold' schema
      core:
        +materialized: incremental # fact tables use incremental refresh

5 dbt Staging Models — Silver Layer Beginner

Staging models are the entry point for every source in your dbt project. One staging model per source table. Their job is cleaning and renaming only — no joins, no business logic, no aggregations. They are the contract between Bronze raw data and everything built on top of it.

Source Definition — Connecting dbt to Bronze

# models/staging/crm/_crm__sources.yml
# Defines the Bronze source tables that staging models read from
version: 2

sources:
  - name: crm_bronze
    database: bronze          # the Bronze database or schema
    schema: crm
    tables:
      - name: customer
        description: "Raw CRM customer data from Fivetran"
        freshness:
          warn_after: {count: 12, period: hour}   # warn if no new data in 12 hours
          error_after: {count: 24, period: hour}   # error if no new data in 24 hours
        loaded_at_field: _ingested_at              # column used for freshness check
        columns:
          - name: customer_id
            description: "Source CRM customer identifier"
          - name: updated_at
            description: "Last modified timestamp from source CRM"

Staging Model — stg_crm_customer.sql

-- models/staging/crm/stg_crm_customer.sql
-- Silver: clean, rename, cast -- nothing more

WITH source AS (
    -- Reference the Bronze source defined in _crm__sources.yml
    SELECT * FROM {{ source('crm_bronze', 'customer') }}
),

cleaned AS (
    SELECT
        -- Natural key: cast to consistent type
        CAST(customer_id AS {{ dbt.type_bigint() }})    AS customer_id,

        -- Name standardization: InitCap and trim whitespace
        INITCAP(TRIM(customer_name))                    AS customer_name,

        -- Email: always lowercase
        LOWER(TRIM(email))                              AS email,

        -- Location: apply defaults for nulls
        COALESCE(NULLIF(TRIM(city),   ''), 'Unknown')  AS city,
        COALESCE(NULLIF(TRIM(state),  ''), 'Unknown')  AS state_province,
        COALESCE(NULLIF(TRIM(country),''), 'Unknown')  AS country,

        -- Loyalty tier: valid values only
        CASE
            WHEN loyalty_tier IN ('Bronze','Silver','Gold','Platinum')
            THEN loyalty_tier
            ELSE 'Unknown'
        END                                             AS loyalty_tier,

        -- Timestamps: cast to consistent timezone-aware type
        CAST(updated_at AS TIMESTAMP)                  AS updated_at,

        -- Ingestion metadata passthrough
        _ingested_at,
        _source_file

    FROM source
    WHERE customer_id IS NOT NULL   -- exclude records without a natural key
),

-- Deduplication: keep the most recently updated record per customer_id
deduplicated AS (
    SELECT *,
        ROW_NUMBER() OVER (
            PARTITION BY customer_id
            ORDER BY updated_at DESC
        ) AS _row_num
    FROM cleaned
)

SELECT * EXCEPT (_row_num)
FROM deduplicated
WHERE _row_num = 1

Staging Model Tests

# models/staging/crm/_crm__models.yml
# Schema tests run after every dbt build to validate Silver data quality

version: 2

models:
  - name: stg_crm_customer
    description: "Cleaned and deduplicated CRM customer records"
    columns:
      - name: customer_id
        description: "Source CRM customer identifier"
        tests:
          - not_null
          - unique           # after deduplication, every customer_id must be unique
      - name: email
        tests:
          - not_null
      - name: loyalty_tier
        tests:
          - accepted_values:
              values: ['Bronze', 'Silver', 'Gold', 'Platinum', 'Unknown']
      - name: updated_at
        tests:
          - not_null

6 dbt Incremental Models Intermediate

Incremental models process only new or changed rows rather than rebuilding the entire table on every run. For large Silver tables with millions of rows, this is the difference between a 5-second run and a 2-hour run. dbt handles the incremental logic automatically — you define the filter condition and dbt wraps it correctly for your target platform.

How dbt Incremental Works

-- models/staging/crm/stg_crm_customer_incremental.sql
-- Incremental version: only processes rows updated since last run

{{ config(
    materialized     = 'incremental',
    unique_key       = 'customer_id',
    -- Strategy options depend on your warehouse adapter:
    -- 'append'        -- insert new rows only (no updates)
    -- 'merge'         -- upsert: insert new + update existing (recommended)
    -- 'delete+insert' -- delete matching rows then insert (Spark/BigQuery)
    incremental_strategy = 'merge',
    on_schema_change = 'append_new_columns'
) }}

WITH source AS (
    SELECT * FROM {{ source('crm_bronze', 'customer') }}

    -- The is_incremental() macro is TRUE when the model already exists
    -- and is being refreshed incrementally (not on first run or --full-refresh)
    {% if is_incremental() %}
    WHERE _ingested_at > (
        -- Only process rows ingested after the last run
        SELECT MAX(_ingested_at) FROM {{ this }}
    )
    {% endif %}
),

cleaned AS (
    SELECT
        CAST(customer_id AS BIGINT)     AS customer_id,
        INITCAP(TRIM(customer_name))    AS customer_name,
        LOWER(TRIM(email))              AS email,
        COALESCE(city, 'Unknown')       AS city,
        state                           AS state_province,
        country,
        loyalty_tier,
        CAST(updated_at AS TIMESTAMP)   AS updated_at,
        _ingested_at
    FROM source
    WHERE customer_id IS NOT NULL
),

deduplicated AS (
    SELECT *,
        ROW_NUMBER() OVER (
            PARTITION BY customer_id
            ORDER BY updated_at DESC
        ) AS _row_num
    FROM cleaned
)

SELECT * EXCEPT (_row_num)
FROM deduplicated
WHERE _row_num = 1

Run a full refresh when needed. The --full-refresh flag drops and rebuilds the incremental model from scratch: dbt run --select stg_crm_customer --full-refresh. Use this after a backfill, a schema change, or when you need to reprocess historical data. Incremental models are efficient for daily operations but always support full refresh for recovery scenarios.

Incremental Model Behavior by Strategy

StrategyHandles Updates?Handles Deletes?Best ForSupported On
appendNoNoImmutable event data (logs, clicks)All platforms
mergeYesSoft onlyMutable dimension data, SCD prepSnowflake, Fabric, BigQuery, Databricks
delete+insertYesYes (partition)Partitioned fact tablesBigQuery, Spark
insert_overwriteYesYesPartitioned tables with full partition overwritesSpark, BigQuery

7 dbt Merge Strategy — Warehouse-Native MERGE Intermediate

When using incremental_strategy = 'merge', dbt generates a MERGE statement natively for your target platform. The generated SQL is platform-optimized — Snowflake MERGE syntax for Snowflake, BigQuery MERGE syntax for BigQuery, and so on. You write one dbt model and get correct MERGE SQL on every platform.

-- dbt generates this MERGE for Snowflake when using merge strategy:
MERGE INTO silver.stg_crm_customer AS tgt
USING (
    -- The incremental source query from your model
    SELECT * FROM bronze.crm_customer
    WHERE _ingested_at > (SELECT MAX(_ingested_at) FROM silver.stg_crm_customer)
) AS src
ON tgt.customer_id = src.customer_id

WHEN MATCHED THEN UPDATE SET
    tgt.customer_name  = src.customer_name,
    tgt.email          = src.email,
    tgt.city           = src.city,
    tgt.state_province = src.state_province,
    tgt.loyalty_tier   = src.loyalty_tier,
    tgt.updated_at     = src.updated_at,
    tgt._ingested_at   = src._ingested_at

WHEN NOT MATCHED THEN INSERT (
    customer_id, customer_name, email, city,
    state_province, country, loyalty_tier,
    updated_at, _ingested_at
)
VALUES (
    src.customer_id, src.customer_name, src.email, src.city,
    src.state_province, src.country, src.loyalty_tier,
    src.updated_at, src._ingested_at
);

dbt handles the platform differences for you. The same incremental_strategy = 'merge' configuration generates Snowflake MERGE syntax for Snowflake, BigQuery DML MERGE for BigQuery, and Delta MERGE for Databricks. You do not need to write platform-specific MERGE SQL in your model — dbt’s adapters handle the dialect translation. This is one of the most valuable features of dbt for teams working across multiple platforms.

8 SQL Silver Transformations — All Platforms Intermediate

For teams not using dbt, or for platform-specific Silver patterns, here are direct SQL equivalents for each major warehouse and lakehouse platform.

Snowflake Silver — CREATE OR REPLACE TABLE

-- Snowflake: full Silver rebuild (use for initial load or after backfill)
CREATE OR REPLACE TABLE silver.customer AS
SELECT
    CAST(customer_id AS BIGINT)             AS customer_id,
    INITCAP(TRIM(customer_name))            AS customer_name,
    LOWER(TRIM(email))                      AS email,
    COALESCE(NULLIF(TRIM(city),''),'Unknown') AS city,
    state                                   AS state_province,
    country,
    loyalty_tier,
    CAST(updated_at AS TIMESTAMP_NTZ)       AS updated_at,
    CURRENT_TIMESTAMP()                     AS silver_loaded_at
FROM (
    SELECT *,
        ROW_NUMBER() OVER (
            PARTITION BY customer_id
            ORDER BY updated_at DESC
        ) AS rn
    FROM bronze.crm_customer
    WHERE customer_id IS NOT NULL
) AS deduped
WHERE rn = 1;

Fabric Warehouse Silver — T-SQL

-- Fabric Warehouse: Silver view over Bronze table
-- Views in Fabric Warehouse are always fresh -- no scheduling needed for read patterns
CREATE OR ALTER VIEW silver.Customer AS
SELECT
    CAST(customer_id AS BIGINT)                             AS CustomerID,
    INITCAP(TRIM(customer_name))                           AS CustomerName,
    LOWER(TRIM(email))                                     AS Email,
    COALESCE(NULLIF(TRIM(city), ''), 'Unknown')            AS City,
    COALESCE(NULLIF(TRIM(state), ''), 'Unknown')           AS StateProvince,
    country                                                 AS Country,
    loyalty_tier                                           AS LoyaltyTier,
    CAST(updated_at AS DATETIME2)                          AS UpdatedAt
FROM (
    SELECT *,
        ROW_NUMBER() OVER (
            PARTITION BY customer_id
            ORDER BY updated_at DESC
        ) AS rn
    FROM bronze.crm_customer
    WHERE customer_id IS NOT NULL
) AS deduped
WHERE rn = 1;

BigQuery Silver — CREATE OR REPLACE TABLE

-- BigQuery: Silver table with partitioning
CREATE OR REPLACE TABLE silver.customer
PARTITION BY DATE(updated_at)
AS
SELECT
    SAFE_CAST(customer_id AS INT64)                         AS customer_id,
    INITCAP(TRIM(customer_name))                            AS customer_name,
    LOWER(TRIM(email))                                      AS email,
    COALESCE(NULLIF(TRIM(city), ''), 'Unknown')             AS city,
    COALESCE(NULLIF(TRIM(state), ''), 'Unknown')            AS state_province,
    country,
    loyalty_tier,
    CAST(updated_at AS TIMESTAMP)                           AS updated_at,
    CURRENT_TIMESTAMP()                                     AS silver_loaded_at
FROM (
    SELECT *,
        ROW_NUMBER() OVER (
            PARTITION BY customer_id
            ORDER BY updated_at DESC
        ) AS rn
    FROM bronze.crm_customer
    WHERE customer_id IS NOT NULL
)
WHERE rn = 1;

9 Deduplication Patterns Beginner

Deduplication is one of the most important Silver-layer operations and one of the most frequently implemented incorrectly. The ROW_NUMBER() window function pattern is the standard approach — it assigns a rank to each row within a group of duplicates and keeps only the highest-ranked one.

-- Standard deduplication pattern: keep the most recent record per natural key
-- Works on: Snowflake, Fabric, BigQuery, Redshift, Synapse, Databricks

SELECT *
FROM (
    SELECT
        *,
        ROW_NUMBER() OVER (
            PARTITION BY customer_id     -- group duplicates by natural key
            ORDER BY updated_at DESC     -- rank by most recently updated first
        ) AS _row_num
    FROM bronze.crm_customer
    WHERE customer_id IS NOT NULL
) AS ranked
WHERE _row_num = 1;                      -- keep only the most recent version
-- Alternative: QUALIFY syntax (Snowflake and BigQuery -- cleaner for large queries)
SELECT *
FROM bronze.crm_customer
WHERE customer_id IS NOT NULL
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY customer_id
    ORDER BY updated_at DESC
) = 1;

Deduplication ordering matters. ORDER BY updated_at DESC keeps the most recently updated record. If your source does not have a reliable updated_at column, use _ingested_at DESC to keep the most recently ingested version. Never use ORDER BY (SELECT NULL) — this picks an arbitrary row and produces non-deterministic results that will differ across runs.

10 JSON Flattening — Snowflake and BigQuery Intermediate

API payloads, CDC streams, and event data frequently arrive as nested JSON. Flattening JSON into columns is a common Silver operation. Each platform has its own syntax for navigating and extracting nested structures.

Snowflake JSON Flattening — VARIANT Column

-- Snowflake: extract from VARIANT column (raw JSON stored as VARIANT in Bronze)
SELECT
    raw_data:id::STRING                     AS event_id,
    raw_data:user_id::BIGINT                AS user_id,
    raw_data:event_type::STRING             AS event_type,
    raw_data:payload:device::STRING         AS device,
    raw_data:payload:os_version::STRING     AS os_version,
    raw_data:payload:screen_name::STRING    AS screen_name,
    raw_data:metadata:session_id::STRING    AS session_id,
    raw_data:timestamp::TIMESTAMP_NTZ       AS event_timestamp
FROM bronze.crm_events;

-- Snowflake LATERAL FLATTEN: explode array fields into rows
-- Use when one JSON record contains an array that should become multiple Silver rows
SELECT
    e.raw_data:order_id::STRING             AS order_id,
    item.value:product_id::STRING           AS product_id,
    item.value:quantity::INT                AS quantity,
    item.value:unit_price::DECIMAL(18,2)    AS unit_price
FROM bronze.ecommerce_orders e,
LATERAL FLATTEN(input => e.raw_data:line_items) AS item;
-- Each element of line_items array becomes its own row in Silver

BigQuery JSON Flattening — Nested STRUCT and ARRAY

-- BigQuery: unnest nested repeated fields (ARRAY of STRUCT)
SELECT
    order_id,
    customer_id,
    order_date,
    item.product_id,
    item.quantity,
    item.unit_price
FROM bronze.ecommerce_orders,
UNNEST(line_items) AS item;   -- UNNEST explodes ARRAY into rows

-- BigQuery: extract from JSON STRING column
SELECT
    JSON_VALUE(raw_json, '$.event_id')          AS event_id,
    JSON_VALUE(raw_json, '$.user_id')           AS user_id,
    JSON_VALUE(raw_json, '$.payload.device')    AS device,
    JSON_VALUE(raw_json, '$.payload.os')        AS os_version,
    CAST(JSON_VALUE(raw_json, '$.timestamp')
        AS TIMESTAMP)                           AS event_timestamp
FROM bronze.api_events;

11 Incremental Silver with MERGE — All Platforms Intermediate

The MERGE statement is the production pattern for incremental Silver updates. It handles both new records (insert) and changed records (update) in a single atomic operation, making Silver refreshes efficient and idempotent.

Snowflake MERGE

-- Snowflake: incremental Silver MERGE
-- Processes only rows from Bronze ingested after the last Silver load
MERGE INTO silver.customer AS tgt
USING (
    SELECT
        CAST(customer_id AS BIGINT)         AS customer_id,
        INITCAP(TRIM(customer_name))        AS customer_name,
        LOWER(TRIM(email))                  AS email,
        COALESCE(city, 'Unknown')           AS city,
        state                               AS state_province,
        country,
        loyalty_tier,
        CAST(updated_at AS TIMESTAMP_NTZ)   AS updated_at
    FROM bronze.crm_customer
    WHERE _ingested_at > (SELECT MAX(_ingested_at) FROM silver.customer)
    QUALIFY ROW_NUMBER() OVER (
        PARTITION BY customer_id ORDER BY updated_at DESC
    ) = 1
) AS src
ON tgt.customer_id = src.customer_id

WHEN MATCHED AND (
       tgt.customer_name  <> src.customer_name
    OR tgt.email          <> src.email
    OR tgt.city           <> src.city
    OR tgt.loyalty_tier   <> src.loyalty_tier
) THEN UPDATE SET
    tgt.customer_name  = src.customer_name,
    tgt.email          = src.email,
    tgt.city           = src.city,
    tgt.state_province = src.state_province,
    tgt.loyalty_tier   = src.loyalty_tier,
    tgt.updated_at     = src.updated_at

WHEN NOT MATCHED THEN INSERT (
    customer_id, customer_name, email, city,
    state_province, country, loyalty_tier, updated_at
)
VALUES (
    src.customer_id, src.customer_name, src.email, src.city,
    src.state_province, src.country, src.loyalty_tier, src.updated_at
);

Fabric Warehouse MERGE — T-SQL

-- Fabric Warehouse: T-SQL MERGE (same syntax as SQL Server)
MERGE silver.Customer AS tgt
USING (
    SELECT
        CAST(customer_id AS BIGINT)     AS CustomerID,
        INITCAP(customer_name)          AS CustomerName,
        LOWER(email)                    AS Email,
        COALESCE(city, 'Unknown')       AS City,
        state                           AS StateProvince,
        country                         AS Country,
        loyalty_tier                    AS LoyaltyTier,
        CAST(updated_at AS DATETIME2)   AS UpdatedAt
    FROM bronze.crm_customer
    WHERE _ingested_at > (SELECT MAX(UpdatedAt) FROM silver.Customer)
) AS src
ON tgt.CustomerID = src.CustomerID
WHEN MATCHED THEN UPDATE SET
    tgt.CustomerName  = src.CustomerName,
    tgt.Email         = src.Email,
    tgt.City          = src.City,
    tgt.LoyaltyTier   = src.LoyaltyTier,
    tgt.UpdatedAt     = src.UpdatedAt
WHEN NOT MATCHED THEN INSERT (
    CustomerID, CustomerName, Email, City,
    StateProvince, Country, LoyaltyTier, UpdatedAt
)
VALUES (
    src.CustomerID, src.CustomerName, src.Email, src.City,
    src.StateProvince, src.Country, src.LoyaltyTier, src.UpdatedAt
);

BigQuery MERGE

-- BigQuery: DML MERGE
MERGE silver.customer AS tgt
USING (
    SELECT
        SAFE_CAST(customer_id AS INT64)     AS customer_id,
        INITCAP(TRIM(customer_name))        AS customer_name,
        LOWER(TRIM(email))                  AS email,
        COALESCE(city, 'Unknown')           AS city,
        state                               AS state_province,
        country,
        loyalty_tier,
        CAST(updated_at AS TIMESTAMP)       AS updated_at
    FROM bronze.crm_customer
    WHERE _ingested_at > (
        SELECT MAX(_ingested_at) FROM silver.customer
    )
    QUALIFY ROW_NUMBER() OVER (
        PARTITION BY customer_id ORDER BY updated_at DESC
    ) = 1
) AS src
ON tgt.customer_id = src.customer_id
WHEN MATCHED THEN UPDATE SET
    customer_name  = src.customer_name,
    email          = src.email,
    city           = src.city,
    loyalty_tier   = src.loyalty_tier,
    updated_at     = src.updated_at
WHEN NOT MATCHED THEN INSERT ROW;

12 Surrogate Key Generation Intermediate

Surrogate keys are generated during the transition from Silver to Gold. They must be stable — the same natural key always maps to the same surrogate key — and they must be integers for optimal join performance in columnar warehouses. The three common approaches are identity/autoincrement columns, sequence objects, and hash-based keys.

-- Option 1: ROW_NUMBER() -- simple, deterministic for initial load
-- Use for initial Gold population; does not work for incremental additions
SELECT
    ROW_NUMBER() OVER (ORDER BY customer_id)  AS customer_key,
    customer_id                               AS customer_natural_key,
    customer_name,
    email,
    city,
    state_province,
    country,
    loyalty_tier
FROM silver.customer;

-- Option 2: Identity / Autoincrement -- warehouse manages key assignment
-- Use when inserting incrementally -- warehouse assigns the next key automatically

-- Snowflake: AUTOINCREMENT
CREATE TABLE gold.dim_customer (
    customer_key        BIGINT AUTOINCREMENT PRIMARY KEY,
    customer_natural_key VARCHAR(100) NOT NULL,
    customer_name       VARCHAR(200),
    email               VARCHAR(200)
);

-- Fabric / SQL Server: IDENTITY
CREATE TABLE gold.DimCustomer (
    CustomerKey         BIGINT IDENTITY(1,1) PRIMARY KEY,
    CustomerNaturalKey  VARCHAR(100) NOT NULL,
    CustomerName        VARCHAR(200),
    Email               VARCHAR(200)
);

-- BigQuery: GENERATE_UUID() for string keys (no native integer sequences)
-- Or use ROW_NUMBER() at query time for integer keys
SELECT
    GENERATE_UUID()     AS customer_key,
    customer_id         AS customer_natural_key,
    customer_name,
    email
FROM silver.customer;

-- Option 3: dbt surrogate key -- hash of natural key columns
-- Produces a consistent string key across all platforms
-- Use dbt_utils package: dbt_utils.generate_surrogate_key([])
SELECT
    {{ dbt_utils.generate_surrogate_key(['customer_id']) }} AS customer_key,
    customer_id    AS customer_natural_key,
    customer_name,
    email
FROM {{ ref('stg_crm_customer') }}

Integer surrogate keys outperform hash/UUID keys. Columnar warehouse join performance on integer keys is consistently faster than on string hash keys. When storage and query performance matter — and they always do at scale — use integer surrogate keys (IDENTITY or AUTOINCREMENT) rather than dbt’s hash-based surrogate key for Gold dimensional tables. Use dbt surrogate keys for intermediate models where convenience matters more than peak performance.

13 Late-Arriving Dimensions — Preview Advanced

Late-arriving dimensions occur when a fact record arrives in Silver before the corresponding dimension record exists. If you attempt to join a fact to a dimension on a natural key that does not yet exist in Gold, the join fails and the fact row is either dropped or produces a null surrogate key — both are wrong.

The complete solution — including the unknown dimension row pattern, fact load with late-arriving handling, and surrogate key backfill — is covered in full in Part 6. The pattern to be aware of now is that your Silver transformation should pass natural keys through to Gold unchanged. The Gold layer handles the surrogate key lookup, and the unknown key pattern handles the case where the dimension does not yet exist.

-- Silver: pass natural keys through -- do not attempt surrogate key lookup in Silver
-- Silver's job is cleaning, not dimensional modeling

-- CORRECT Silver model: natural keys only
SELECT
    order_id,
    customer_id,    -- natural key -- Gold will look this up in DimCustomer
    product_id,     -- natural key -- Gold will look this up in DimProduct
    order_date,
    quantity,
    unit_price
FROM bronze.ecommerce_orders
WHERE order_id IS NOT NULL;

-- INCORRECT: attempting surrogate key lookup in Silver
-- If dim_customer is not yet populated this join silently drops rows
SELECT
    o.order_id,
    c.customer_key,     -- DO NOT do this in Silver
    o.quantity
FROM bronze.ecommerce_orders o
JOIN gold.dim_customer c ON c.customer_natural_key = o.customer_id;  -- WRONG in Silver

14 Workshops

Novice

Build Your First Silver Model

  • Load Bronze CRM customer data (CSV or JSON)
  • Apply INITCAP, LOWER, and null defaults
  • Deduplicate using ROW_NUMBER()
  • Save as Silver table
  • Compare row counts: Bronze vs Silver
  • Verify no duplicate customer_id values exist in Silver

Intermediate

Build a dbt Incremental Silver Pipeline

  • Create a dbt project with the three-layer folder structure
  • Define a source for your Bronze table
  • Build a staging model with deduplication and type casting
  • Add not_null and unique tests
  • Convert to incremental model with updated_at watermark
  • Run dbt run, add new Bronze rows, run again — verify only new rows processed

Advanced

Build dbt Incremental Model with Merge Strategy

  • Create an incremental model with incremental_strategy = 'merge'
  • Set on_schema_change = 'append_new_columns'
  • Add a new column to your Bronze source and confirm Silver absorbs it
  • Use EXPLAIN to verify the generated MERGE SQL is correct
  • Test --full-refresh and confirm it rebuilds from scratch
  • Add a custom SQL test that validates no revenue-negative rows exist

References

Up next → Part 6: Transformations Part 2: SCD Types, Fact Loading, Spark, and Merge Optimization — SCD Type 1 and Type 2 in depth, Spark Delta MERGE, fact grain enforcement, late-arriving dimensions, audit tables, and performance-tuning for merge-heavy workloads.

Discover more from SQLYARD

Subscribe to get the latest posts sent to your email.

Leave a Reply

Discover more from SQLYARD

Subscribe now to keep reading and get access to the full archive.

Continue reading