Part 1 — Dimensional Modeling and Warehouse Design

Part 1 — Dimensional Modeling and Warehouse Design – SQLYARD
Deep Technical Series — Building a Modern Data Warehouse and Lakehouse

Part 1 — Dimensional Modeling and Warehouse Design


Part 1 of 14 — Deep Technical Series: Building a Modern Data Warehouse and Lakehouse. View the full series index (Part 0) for the complete table of contents and recommended reading order.

Dimensional modeling is the backbone of analytical systems. Whether your platform of choice is Snowflake, Fabric, BigQuery, Databricks, Redshift, Synapse, or a lakehouse built on Delta or Iceberg, dimensional modeling gives you the ability to scale analytics, deliver consistent metrics, simplify downstream transformations, and keep your data warehouse predictable and stable.

This part goes deep into warehouse design, grain definition, facts and dimensions, SCD Type 2, surrogate keys, conformed dimensions, and the bus matrix. Everything you build later in this series depends on the foundation laid here. The code examples work across all major platforms — the schema definitions follow standard SQL that runs on Snowflake, Fabric Warehouse, BigQuery, Redshift, Synapse, and Databricks SQL.

Compatibility: DDL examples use ANSI SQL patterns that work across Snowflake, Fabric Warehouse, BigQuery, Redshift, Synapse, and Databricks SQL. Platform-specific syntax differences are noted inline. The dimensional model you build here becomes your Gold layer in Parts 3, 5, and 6.

1 Why Dimensional Modeling Still Matters Beginner

Modern platforms promote the idea that table formats, compute engines, or storage architectures solve analytical complexity. In reality, architectural decisions only work well when the underlying data model is correct. A poorly modeled Gold layer will be slow, inconsistent, and hard to govern regardless of which platform it runs on.

Dimensional modeling — pioneered by Ralph Kimball and documented in the Kimball Group’s modeling techniques — provides a set of design principles that have proven durable across three decades of platform changes. They remain the standard for analytical data modeling because they solve real problems that alternative approaches do not.

Dimensional modeling provides:

  • Predictable join paths that BI tools can navigate efficiently
  • Stable business metrics that do not change when source systems change
  • Clear grain definitions that prevent double-counting and aggregation errors
  • Conformed dimensions that allow unified reporting across subject areas
  • Historical accuracy through SCD Type 2 — the ability to answer “what was true at the time?”
  • High performance in columnar warehouses — star schemas are optimized for the engines Snowflake, BigQuery, and Fabric use internally
  • A structured foundation for semantic layers, dbt models, and BI tools

This part teaches you how to design the dimensional model that sits on top of your Bronze and Silver layers and forms your Gold layer — the analytics-ready data that feeds your warehouse, lakehouse SQL endpoints, and semantic models.

2 Core Concepts: Facts, Dimensions, and Grain Beginner

Dimensional modeling has two main structures: fact tables and dimension tables. Every other concept in this part builds from these two.

Dimensions — Describing Business Entities

Dimensions describe the who, what, where, and when of business events. They contain descriptive attributes and serve as the lookup tables that give context to numeric measurements in fact tables.

  • Customer — name, email, city, loyalty tier, segment
  • Product — name, category, subcategory, unit cost, brand
  • Date — full date, year, quarter, month, week, day of week, fiscal period
  • Store — name, region, district, address, store type
  • Employee — name, department, manager, hire date, role

Dimensions contain attributes, support historical tracking through SCD Type 2, and always use a surrogate key as the primary key — never the source system’s natural key.

Facts — Measuring Business Events

Facts record measurements about business events. They contain numeric measures and foreign keys that point to dimension tables.

  • FctOrderLine — quantity, unit price, extended amount, discount
  • FctSubscription — monthly recurring revenue, plan amount, billing cycle
  • FctGameSession — session duration, score, events, level reached
  • FctWebEvent — page views, clicks, time on page, bounced flag
  • FctInvoice — invoice amount, tax, payment amount, days to pay

Facts are narrow by design — they contain numeric measures and foreign keys, not descriptive text. The descriptive context lives in dimensions. This separation is what makes columnar warehouses fast: they scan the narrow fact table and only join to the dimensions they need.

Grain — The Most Important Design Decision

The grain defines exactly what a single row in a fact table represents. It must be defined before writing any transformations, building any warehouse layer, or creating any BI reports. Getting the grain wrong means every aggregation built on top of it is potentially wrong.

Define grain before anything else. Every question about a fact table — which measures belong here, which dimensions it joins to, how rows are inserted — depends on the grain. If two people on your team have different mental models of what a fact row represents, every downstream metric will be inconsistent.

Examples of well-defined grains:

  • One row per order line item (not per order — per line item)
  • One row per game match round per player
  • One row per subscription per billing period
  • One row per web event (click, page view, or form submission)
  • One row per daily store inventory snapshot per product

3 Designing Dimensions Beginner

Dimensions must be designed with four properties in mind: they must be surrogate-keyed, richly attributed, conformed across fact tables, and capable of tracking history when attributes change over time.

DimCustomer — The Universal Pattern

The schema below is the standard dimensional customer design. It works across Fabric Warehouse, Snowflake, BigQuery, Redshift, Synapse, SQL Server, and Databricks SQL. The same structure becomes a Delta or Iceberg table in your Gold layer on a lakehouse.

-- DimCustomer: standard SCD Type 2 dimension
-- Works across: Snowflake, Fabric Warehouse, BigQuery, Redshift, Synapse, Databricks SQL
CREATE TABLE DimCustomer (
    CustomerKey        BIGINT         NOT NULL,  -- surrogate key (IDENTITY/AUTOINCREMENT)
    CustomerNaturalKey VARCHAR(100)   NOT NULL,  -- source system ID (CRM ID, User ID)
    FullName           VARCHAR(200)   NOT NULL,
    Email              VARCHAR(200),
    City               VARCHAR(100),
    StateProvince      VARCHAR(100),
    Country            VARCHAR(100),
    LoyaltyTier        VARCHAR(50),
    -- SCD Type 2 tracking columns
    EffectiveFrom      DATE           NOT NULL,
    EffectiveTo        DATE           NOT NULL,  -- '9999-12-31' = currently active
    IsCurrent          TINYINT        NOT NULL,  -- 1 = active, 0 = historical
    CONSTRAINT PK_DimCustomer PRIMARY KEY (CustomerKey)
);

Key design decisions explained:

  • CustomerKey — the surrogate key. Every fact table that references a customer uses this column. Never the natural key. This insulates the warehouse from source system key changes.
  • CustomerNaturalKey — the stable source system identifier. Used during ETL to match incoming records to existing dimension rows.
  • EffectiveFrom / EffectiveTo — the validity range for this version of the record. Active rows have EffectiveTo = '9999-12-31'.
  • IsCurrent — a convenience flag for joining from fact tables. Always join facts to dimensions using AND IsCurrent = 1 unless you specifically need historical analysis.

DimDate — The Required Dimension

Every fact table joins to a date dimension. Never store a raw date in a fact table and skip the date dimension — you lose the ability to filter by fiscal year, quarter, week, or any calendar attribute without recalculating them on every query.

-- DimDate: calendar dimension
-- Populate once from a date spine generator script
-- Covers every date your warehouse will ever query
CREATE TABLE DimDate (
    DateKey         INT          NOT NULL,  -- YYYYMMDD integer key (e.g. 20250115)
    FullDate        DATE         NOT NULL,
    DayOfWeek       TINYINT      NOT NULL,  -- 1=Monday through 7=Sunday
    DayName         VARCHAR(10)  NOT NULL,
    DayOfMonth      TINYINT      NOT NULL,
    DayOfYear       SMALLINT     NOT NULL,
    WeekOfYear      TINYINT      NOT NULL,
    MonthNumber     TINYINT      NOT NULL,
    MonthName       VARCHAR(10)  NOT NULL,
    Quarter         TINYINT      NOT NULL,
    QuarterName     VARCHAR(6)   NOT NULL,  -- 'Q1', 'Q2', etc.
    Year            SMALLINT     NOT NULL,
    IsWeekend       TINYINT      NOT NULL,
    IsHoliday       TINYINT      NOT NULL,
    FiscalYear      SMALLINT,               -- populate if fiscal calendar differs
    FiscalQuarter   TINYINT,
    CONSTRAINT PK_DimDate PRIMARY KEY (DateKey)
);

Generate DimDate once and forget it. Populate it with every date from your earliest data through at least 10 years into the future. A date spine generator script that inserts one row per calendar day takes seconds to run and should never need to be rebuilt. Keep the surrogate key as an integer in YYYYMMDD format — it is human-readable and extremely fast to join on.

4 Designing Fact Tables Beginner

Facts must always follow the grain and contain only what belongs at that grain: foreign keys to dimensions, numeric measures, and the natural key of the transaction for auditing. No descriptive attributes. No business logic applied to measures inside the fact table itself — that belongs in the semantic layer.

FctOrderLine — Standard Transaction Fact

-- FctOrderLine: one row per order line item
-- Grain: one line item per product per order
-- Foreign keys reference surrogate keys in dimension tables
CREATE TABLE FctOrderLine (
    OrderLineKey    BIGINT          NOT NULL,  -- surrogate key
    OrderNaturalKey VARCHAR(100)    NOT NULL,  -- source order ID for audit trail
    OrderDateKey    INT             NOT NULL,  -- FK to DimDate
    CustomerKey     BIGINT          NOT NULL,  -- FK to DimCustomer
    ProductKey      BIGINT          NOT NULL,  -- FK to DimProduct
    StoreKey        BIGINT          NOT NULL,  -- FK to DimStore
    -- Measures -- numeric values only
    Quantity        INT             NOT NULL,
    UnitPrice       DECIMAL(18,2)   NOT NULL,
    DiscountAmount  DECIMAL(18,2)   NOT NULL DEFAULT 0,
    ExtendedAmount  DECIMAL(18,2)   NOT NULL,  -- Quantity * UnitPrice - DiscountAmount
    CONSTRAINT PK_FctOrderLine PRIMARY KEY (OrderLineKey),
    CONSTRAINT FK_OrderLine_Date     FOREIGN KEY (OrderDateKey) REFERENCES DimDate(DateKey),
    CONSTRAINT FK_OrderLine_Customer FOREIGN KEY (CustomerKey)  REFERENCES DimCustomer(CustomerKey),
    CONSTRAINT FK_OrderLine_Product  FOREIGN KEY (ProductKey)   REFERENCES DimProduct(ProductKey),
    CONSTRAINT FK_OrderLine_Store    FOREIGN KEY (StoreKey)     REFERENCES DimStore(StoreKey)
);

Why fact tables stay narrow. Analytical engines — Snowflake, Fabric Warehouse, BigQuery, Databricks — are columnar. They scan only the columns you select. A narrow fact table with 8–12 columns scans dramatically faster than a wide one with 50+ columns. If you are tempted to add descriptive text to a fact table, that attribute belongs in a dimension instead.

Computing Derived Measures

Keep computed measures simple in the fact table — basic arithmetic only. Complex business logic (gross margin, churn rate, lifetime value) belongs in the semantic layer as DAX measures, LookML measures, or dbt Metrics, not in the Gold table itself. This keeps the Gold layer stable when business rules change.

-- Simple derived measure is acceptable in Gold
-- ExtendedAmount is deterministic arithmetic on stored columns
ExtendedAmount = Quantity * UnitPrice - DiscountAmount

-- Complex business logic belongs in the SEMANTIC LAYER, not Gold
-- Example: gross_margin_pct = (ExtendedAmount - CostAmount) / ExtendedAmount
-- This changes when cost accounting rules change -- keep it in DAX/LookML/dbt Metrics

5 Slowly Changing Dimensions — SCD Type 2 Intermediate

SCD Type 2 is the most important dimension management technique in data warehousing. It tracks historical attribute changes by creating a new row each time a dimension attribute changes, rather than overwriting the existing row. This gives you the ability to answer historical questions accurately — “What was this customer’s loyalty tier at the time they placed this order?” — even years after the attribute changed.

Use SCD Type 2 when you need to answer questions like:

  • What was the customer’s region when they first subscribed?
  • Which product category did this item belong to at the time of the sale?
  • What store district was processing this transaction before the reorganization?

SCD Type 2 MERGE — Works Across All Major Warehouses

-- SCD Type 2 MERGE pattern
-- Close the current row when attributes change, insert a new current row
-- Works on: Snowflake, Fabric Warehouse, BigQuery, Redshift, Synapse
-- (Databricks Delta MERGE covered in Part 6)

MERGE INTO DimCustomer AS tgt
USING StgCustomer AS src
    ON tgt.CustomerNaturalKey = src.CustomerID
    AND tgt.IsCurrent = 1  -- only match against the current active row

-- When a match exists AND attributes have changed: close the current row
WHEN MATCHED AND (
        tgt.FullName      <> src.FullName
     OR tgt.Email         <> src.Email
     OR tgt.City          <> src.City
     OR tgt.StateProvince <> src.StateProvince
     OR tgt.LoyaltyTier   <> src.LoyaltyTier
)
THEN UPDATE SET
    tgt.EffectiveTo = CAST(GETDATE() AS DATE),
    tgt.IsCurrent   = 0

-- When no match exists (new customer): insert as current row
WHEN NOT MATCHED BY TARGET THEN
INSERT (
    CustomerNaturalKey, FullName, Email, City,
    StateProvince, Country, LoyaltyTier,
    EffectiveFrom, EffectiveTo, IsCurrent
)
VALUES (
    src.CustomerID,
    src.FullName,
    src.Email,
    src.City,
    src.State,
    src.Country,
    src.LoyaltyTier,
    CAST(GETDATE() AS DATE),
    CAST('9999-12-31' AS DATE),
    1
);
-- Note: This MERGE closes changed rows but does NOT insert the new version.
-- A second INSERT step is needed to add the new current row after closing the old one.
-- See Part 6 for the complete two-step SCD2 pattern and the Databricks Delta approach.

Why the MERGE only closes rows. Most SQL MERGE implementations cannot both close a row and insert a new version in the same statement because the new row’s surrogate key must be generated after the close. The standard pattern is: MERGE to close changed rows, then INSERT the new current versions. Part 6 covers the complete pattern including surrogate key generation and the Databricks Delta MERGE that handles both steps.

Platform-Specific MERGE References

6 dbt Implementation of SCD Type 2 Intermediate

dbt’s incremental model pattern handles SCD Type 2 in a way that is portable across all supported warehouses. The model identifies changed rows by joining the staging source to the current dimension table, then inserts the new versions. dbt handles the incremental refresh and the unique key enforcement.

-- dbt incremental model for DimCustomer SCD Type 2
-- File: models/gold/dim_customer.sql
{{ config(
    materialized  = 'incremental',
    unique_key    = 'customer_natural_key',
    on_schema_change = 'sync_all_columns'
) }}

WITH staged AS (
    SELECT * FROM {{ ref('stg_crm_customer') }}
),

-- Identify rows that are new or have changed attributes
changes AS (
    SELECT s.*
    FROM staged s
    LEFT JOIN {{ this }} t
      ON  s.customer_id = t.customer_natural_key
      AND t.is_current  = 1
    WHERE t.customer_natural_key IS NULL   -- new customer
       OR s.customer_name <> t.full_name   -- name changed
       OR s.email         <> t.email       -- email changed
       OR s.city          <> t.city        -- city changed
)

SELECT
    {{ dbt_utils.generate_surrogate_key(['customer_id']) }} AS customer_key,
    customer_id     AS customer_natural_key,
    customer_name   AS full_name,
    email,
    city,
    state           AS state_province,
    country,
    updated_at      AS effective_from,
    CAST('9999-12-31' AS DATE)  AS effective_to,
    1               AS is_current
FROM changes

dbt packages for SCD2: The dbt_utils package provides generate_surrogate_key() which creates a consistent hash-based surrogate key from natural key columns. For full Type 2 with row closing, the dbt snapshots feature handles the complete open/close lifecycle automatically — recommended for production SCD2 implementations. Snapshots are covered in depth in Part 5.

7 Conformed Dimensions Intermediate

A conformed dimension is a dimension table that is shared across multiple fact tables. DimCustomer, DimDate, DimProduct, and DimRegion are examples of dimensions that typically appear in many different subject areas — orders, subscriptions, support tickets, web events. When these dimensions are conformed — meaning they use the same surrogate keys, the same attribute definitions, and the same grain — you can join fact tables from different subject areas through the shared dimension.

Conformed dimensions are what make enterprise reporting possible. Without them, every subject area has its own version of “customer” with different keys, different attribute names, and different historical tracking logic — and unified reporting across subject areas becomes an engineering project rather than a query.

-- Unified reporting across subject areas using conformed dimensions
-- This query combines order revenue with game session counts
-- across the same customer and date dimensions
SELECT
    d.MonthName,
    d.Year,
    SUM(o.ExtendedAmount)    AS Revenue,
    COUNT(s.SessionKey)      AS GameSessions,
    COUNT(DISTINCT o.CustomerKey) AS ActiveBuyers
FROM FctOrderLine   o
JOIN FctGameSession s ON  o.CustomerKey  = s.CustomerKey
                      AND o.OrderDateKey = s.SessionDateKey
JOIN DimDate        d ON  o.OrderDateKey = d.DateKey
GROUP BY d.MonthName, d.Year
ORDER BY d.Year, MIN(d.MonthNumber);

This query only works correctly if DimDate and DimCustomer are conformed — if both fact tables use the same date keys and the same customer surrogate keys. If each fact table defined its own version of these dimensions, this join would be meaningless or impossible.

8 The Bus Matrix Intermediate

The bus matrix is a planning and documentation tool that maps every fact table to every dimension it joins to. It becomes the architectural blueprint for your warehouse — the single document that shows how all subject areas relate to each other and which dimensions are shared.

Build the bus matrix before writing any DDL. It forces every design decision to be made explicitly and prevents the common mistake of discovering halfway through implementation that two fact tables use incompatible versions of the same dimension.

Fact Table Date Customer Product Store Region Platform Arena
FctOrderLine
FctSubscription
FctGameSession
FctMatchEvent
FctWebEvent

The checkmarks show which dimensions each fact uses. Every checked dimension must be conformed — same surrogate key, same grain, same attribute definitions — across every fact table that uses it. Kimball’s bus architecture documentation covers this in depth.

9 From Dimensional Model to the Lakehouse Gold Layer Intermediate

The dimensional model you design in this part becomes the blueprint for your Gold layer — regardless of which platform you implement it on. The schemas are identical. Only the physical implementation details differ by platform.

PlatformGold Layer ImplementationNotes
SnowflakeManaged tables with clustering keysMicro-partitioning handles physical layout automatically. Add cluster keys on large tables. Covered in Part 7.
Microsoft FabricDelta tables via Lakehouse or Warehouse tablesLakehouse uses Delta format. Warehouse uses columnstore. Both expose SQL endpoint. Covered in Parts 7 and 8.
DatabricksDelta Lake tables in Gold schemaZ-order on join keys. Unity Catalog governs access. ACID transactions on all writes. Covered in Part 8.
BigQueryPartitioned, clustered tablesPartition by date, cluster by customer key. External tables for lake data. Covered in Part 7.
RedshiftDistribution and sort key tablesDistribution key on highest-cardinality join column. Sort key on date. Covered in Part 7.
AWS IcebergIceberg tables via Glue CatalogHidden partitioning, multi-engine access via Athena, EMR, Spark. Covered in Part 8.

Design once, implement everywhere. The bus matrix, the dimension schemas, the fact grain, and the surrogate key strategy you define here are platform-independent. Parts 7 and 8 show you how to implement the physical details on each platform.

10 Workshops

Novice

Build Your First Star Schema

  • Create DimCustomer with surrogate key
  • Generate a DimDate table using a calendar script
  • Create a staging table of orders
  • Build FctOrderLine with foreign keys
  • Run a revenue by month query joining both tables
  • Confirm grain: one row per order line item

Intermediate

Create a Conformed Dimension

  • Build a second fact table (FctGameSession)
  • Reuse DimCustomer and DimDate from above
  • Create DimPlatform as a new dimension
  • Draw the bus matrix for both fact tables
  • Write a unified query joining both facts through shared dimensions
  • Validate row counts and grain consistency

Advanced

Implement SCD Type 2 and Rebuild a Fact

  • Add SCD2 columns to DimCustomer
  • Introduce historical changes in staging data
  • Implement the two-step MERGE and INSERT SCD2 pattern
  • Rebuild FctOrderLine with correct surrogate key lookups
  • Run historical queries comparing revenue against changing customer attributes
  • Validate that old fact rows still point to the correct historical dimension version

References

Up next → Part 2: Choosing the Architecture: Warehouse, Lake, or Lakehouse — Snowflake vs Fabric vs Databricks vs BigQuery vs Synapse vs Redshift, decision framework, hybrid patterns, and when each architecture fits.

Discover more from SQLYARD

Subscribe to get the latest posts sent to your email.

Leave a Reply

Discover more from SQLYARD

Subscribe now to keep reading and get access to the full archive.

Continue reading