PART 8 — Data Quality, Testing, Validation, and Observability in Modern Data Platforms

Deep Technical Series: Building a Modern Data Warehouse and Data Lake

A data warehouse or lakehouse is only as strong as the quality of the data flowing through it. Even the best dimensional model, semantic layer, and orchestrator cannot overcome unreliable data. Modern analytics teams rely on automated testing, validation, freshness checks, anomaly detection, and observability tools to guarantee trust in every metric and report.

This part defines the full data quality stack, explains how to integrate testing across the Bronze, Silver, and Gold layers, and shows how platforms like dbt, Great Expectations, Soda, and Monte Carlo implement quality controls in real production systems.


Why Data Quality and Observability Matter

Data quality problems create:

  • Incorrect KPIs
  • Broken dashboards
  • Mismatched revenue numbers
  • Lost trust from stakeholders
  • Failed SLA commitments
  • Costly manual investigations

Modern teams rely on data observability platforms to detect, monitor, and prevent these issues.

Monte Carlo summarizes this challenge clearly in its definition of data observability (Monte Carlo Data Observability), noting that today’s pipelines require proactive detection, not reactive debugging.


Where Data Quality Fits in the Medallion Architecture

Data quality checks belong at every layer:

Bronze (Raw Data)

  • Schema drift detection
  • Ingestion completeness
  • API failure capture
  • File count and size expectations
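The Bronze checks above can be sketched in plain Python. The function name, file-count threshold, and minimum-size rule below are illustrative assumptions, not tied to any particular platform:

```python
# Hypothetical Bronze-layer ingestion check: verify that a batch landed
# the expected number of files and that none of them are empty.

def check_ingestion(file_sizes: list[int], expected_count: int,
                    min_bytes: int = 1) -> list[str]:
    """Return a list of problems found; an empty list means the load looks complete."""
    problems = []
    if len(file_sizes) < expected_count:
        problems.append(f"expected {expected_count} files, found {len(file_sizes)}")
    for i, size in enumerate(file_sizes):
        if size < min_bytes:
            problems.append(f"file {i} is empty ({size} bytes)")
    return problems
```

An orchestrator task would call this after ingestion and fail the run if the returned list is non-empty.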

Silver (Cleaned Data)

  • Null and type validation
  • Deduplication checks
  • Business rule enforcement
  • Key constraints
  • Conformance checks

Gold (Dimensional Models)

  • Surrogate key mapping validation
  • Fact table grain checks
  • SCD accuracy
  • Relationship integrity
  • KPI reasonableness checks

dbt describes this layering pattern directly in its testing guide (dbt Testing Fundamentals).
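One of the Gold checks above, fact table grain validation, can be sketched framework-free: group fact rows by the declared grain columns and flag any combination that appears more than once. The row and column names here are hypothetical:

```python
from collections import Counter

def grain_violations(rows: list[dict], grain: tuple[str, ...]) -> list[tuple]:
    """Return grain-key combinations that appear more than once in the fact table."""
    counts = Counter(tuple(row[col] for col in grain) for row in rows)
    return [key for key, n in counts.items() if n > 1]
```

A fact table declared at (order_id, line) grain should return an empty list; any result indicates duplicate loads or a wrong grain declaration.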


Core Components of Data Quality

1. Schema Tests

Ensure column structure matches expectations.

Includes:

  • Data types
  • Column existence
  • Enum domain validation
  • Nested structure validation (JSON)

Great Expectations provides YAML-driven schema validation (Great Expectations Documentation).
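As a minimal, framework-independent sketch of schema drift detection, an expected column-name → type mapping can be diffed against the observed schema; the dictionaries used below are illustrative:

```python
def schema_drift(expected: dict[str, str], actual: dict[str, str]) -> dict[str, list[str]]:
    """Compare expected vs. observed column-name -> type mappings and report drift."""
    return {
        "missing": sorted(set(expected) - set(actual)),          # columns dropped upstream
        "unexpected": sorted(set(actual) - set(expected)),       # columns added upstream
        "type_changed": sorted(c for c in expected.keys() & actual.keys()
                               if expected[c] != actual[c]),     # silent type changes
    }
```

In Bronze, any non-empty entry would typically raise an alert before the data is promoted to Silver.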


2. Column-Level Tests

Common tests include:

  • not_null
  • unique
  • accepted_values
  • relationships
  • type tests
  • length constraints

dbt offers these tests built in (dbt Built-In Tests).


3. Freshness Checks

Detect staleness in:

  • Streaming sources
  • CDC streams
  • Batch extracts
  • Warehouse tables

dbt monitors source freshness out of the box (dbt Freshness Tests).

Soda provides continuous data monitoring (Soda Core).


4. Row Count and Volume Checks

Ensures:

  • Expected daily file counts
  • No unexpected drop-offs
  • No sudden spikes

Monte Carlo automates this using machine learning (Monte Carlo Volume Monitoring).


5. Referential Integrity Tests

Ensures foreign keys in fact tables map to valid dimension surrogate keys.

BigQuery’s documentation outlines integrity strategies for large analytical tables (BigQuery Best Practices).
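Conceptually, a referential integrity test is an anti-join: find fact foreign keys with no matching dimension surrogate key. A plain-Python sketch (in practice this runs as SQL in the warehouse):

```python
def orphan_keys(fact_keys: list, dim_keys: list) -> list:
    """Fact foreign keys that have no matching dimension surrogate key."""
    return sorted(set(fact_keys) - set(dim_keys))
```

A non-empty result means orphan fact rows exist, usually because the dimension load ran late or the surrogate key pipeline dropped rows.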


6. Anomaly Detection

Detects:

  • Surges in null values
  • Unexpected duplicates
  • Negative values in financial fields
  • Metric drift (e.g., daily revenue unexpectedly drops by 90 percent)

Soda Cloud provides anomaly tests (Soda Cloud).

Monte Carlo applies ML-based anomaly detection (Monte Carlo Anomaly Detection).
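Commercial tools use ML models for this, but the metric-drift case above (daily revenue dropping 90 percent) can be illustrated with a simple median-deviation rule; the 50 percent threshold below is an arbitrary assumption, not a recommendation:

```python
import statistics

def metric_drift(history: list[float], today: float, threshold: float = 0.5) -> bool:
    """Flag today's value if it deviates from the historical median by more
    than `threshold`, expressed as a fraction (0.5 = 50 percent)."""
    median = statistics.median(history)
    if median == 0:
        return today != 0
    return abs(today - median) / abs(median) > threshold
```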


7. Business Rule Testing

Examples:

  • Subscription must have an end_date > start_date
  • Order quantity must be positive
  • Premium plan must have a price higher than the basic plan

These rules are enforced in Silver or Gold using SQL or dbt tests.
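The first two example rules can be sketched as a small rule table; the row fields and error messages are illustrative, and in a dbt project each rule would instead be a custom SQL test returning the offending rows:

```python
from datetime import date

# Each rule returns an error message for a failing row, or None if the row passes.
RULES = [
    lambda r: "end_date must be after start_date"
        if r["end_date"] <= r["start_date"] else None,
    lambda r: "quantity must be positive"
        if r["quantity"] <= 0 else None,
]

def failed_rules(row: dict) -> list[str]:
    """Collect the messages of every rule the row violates."""
    return [msg for rule in RULES if (msg := rule(row)) is not None]
```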


dbt Testing in Depth

dbt is the most widely adopted testing framework in ELT systems (dbt Tests Documentation).

dbt supports:

  • Schema tests
  • Custom SQL tests
  • Data quality gates in CI/CD
  • Freshness
  • Relationships
  • Surrogate key validity

dbt Schema Test Example

version: 2

models:
  - name: dim_customer
    columns:
      - name: customer_id
        tests:
          - not_null
          - unique
      - name: email
        tests:
          - not_null

dbt Relationship Test

- name: fct_order_line
  columns:
    - name: customer_key
      tests:
        - relationships:
            to: ref('dim_customer')
            field: customer_key

This ensures no orphan fact rows exist in Gold.


dbt Custom SQL Test

SELECT *
FROM dim_product
WHERE price < 0

dbt considers a test failed if any rows are returned.


Great Expectations for Data Lakes and Lakehouses

Great Expectations integrates directly with:

  • Databricks
  • Snowflake
  • BigQuery
  • Redshift
  • S3
  • ADLS

Reference: Great Expectations Documentation.

Example Expectation Suite

expect_table_row_count_to_be_between:
  min_value: 1000

expect_column_values_to_not_be_null:
  column: customer_id

expect_column_values_to_match_regex:
  column: email
  regex: "^.+@.+$"

Great Expectations is ideal for file-based testing in Bronze and Silver.


Soda for Observability and Testing

Soda focuses on:

  • Continuous monitoring
  • Alerts
  • Anomaly detection
  • Metric observability
  • Freshness tracking

Reference: Soda Core.

Example Soda Test

checks:
  - row_count > 0
  - missing_count(customer_id) = 0
  - avg(extended_amount) > 0

Soda Cloud integrates easily with warehouses and lakehouses.


Monte Carlo for Data Observability

Monte Carlo is a leading data observability platform for enterprise systems (Monte Carlo Data Observability).

It provides:

  • Automated anomaly detection
  • Freshness monitoring
  • Volume tracking
  • Schema drift alerts
  • Lineage visualization
  • Incident management

Monte Carlo monitors pipelines end to end, not just tests.


Putting It All Together in the Medallion Architecture

Layer    Data Quality Focus
------   ------------------
Bronze   File completeness, schema drift, freshness
Silver   Null tests, type validation, relationships, duplicates
Gold     Metric correctness, grain validation, surrogate key integrity

dbt tests run in Silver and Gold; Great Expectations validates Bronze and Silver; Soda and Monte Carlo cover all layers.


Quality Gates in Orchestration

Quality gates should block downstream processes.

Example:

  • If Bronze file count < expected, block Silver
  • If SCD2 dimension fails surrogate key test, block fact loads
  • If daily revenue deviates 80 percent from median, alert BI team
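The gating pattern above can be sketched as a single function the orchestrator calls between stages: it raises on any failed check, which is what causes Airflow or a Databricks Workflow to mark the task failed and skip downstream tasks. The check names are hypothetical:

```python
class QualityGateError(Exception):
    """Raised to stop downstream tasks when a quality check fails."""

def quality_gate(checks: dict[str, bool]) -> None:
    """Raise if any named check failed; `checks` maps check name -> pass/fail.
    An orchestrator treats the exception as a task failure and blocks downstream."""
    failed = [name for name, ok in checks.items() if not ok]
    if failed:
        raise QualityGateError(f"blocking downstream: failed checks {failed}")
```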

Airflow and Databricks Workflows can both stop pipelines based on test failures (Airflow Operators, Databricks Workflow Monitoring).


Workshops


Novice Workshop: dbt Testing

  1. Create a dbt project
  2. Add not_null and unique tests for DimCustomer
  3. Add a relationship test for FctOrderLine → DimCustomer
  4. Run dbt test
  5. Review failures

Intermediate Workshop: Great Expectations + Lakehouse

  1. Create an expectation suite
  2. Validate Bronze files for schema drift
  3. Add Silver-level null and regex tests
  4. Trigger expectations as part of pipeline runs

Advanced Workshop: Full Observability Pipeline

Tools: Airflow, dbt Cloud, Monte Carlo, and Fabric

Steps:

  1. Run Bronze → Silver pipeline in Airflow
  2. Add dbt tests to block Gold on failure
  3. Configure Monte Carlo to monitor table freshness and volume
  4. Add ML-based anomaly detection for revenue metrics
  5. Trigger Fabric semantic model refresh only after all checks pass

This replicates enterprise-grade quality and observability.


References

dbt Tests Documentation
https://docs.getdbt.com/docs/build/tests

dbt Freshness Tests
https://docs.getdbt.com/docs/build/sources#freshness

Great Expectations Documentation
https://docs.greatexpectations.io/docs/

Soda Core
https://docs.soda.io/soda-core/

Soda Cloud
https://docs.soda.io/soda-cloud/

Monte Carlo Data Observability
https://www.montecarlodata.com/data-observability/

Monte Carlo Platform Monitoring
https://www.montecarlodata.com/platform/

Apache Airflow Documentation
https://airflow.apache.org/docs/

Databricks Workflow Monitoring
https://docs.databricks.com/workflows/monitoring-events.html

BigQuery Best Practices
https://cloud.google.com/bigquery/docs/best-practices-performance-overview

