Deep Technical Series: Building a Modern Data Warehouse and Data Lake
A data warehouse or lakehouse is only as strong as the quality of the data flowing through it. Even the best dimensional model, semantic layer, and orchestrator cannot overcome unreliable data. Modern analytics teams rely on automated testing, validation, freshness checks, anomaly detection, and observability tools to guarantee trust in every metric and report.
This part defines the full data quality stack, explains how to integrate testing across the Bronze, Silver, and Gold layers, and shows how platforms like dbt, Great Expectations, Soda, and Monte Carlo implement quality controls in real production systems.
Why Data Quality and Observability Matter
Data quality problems create:
- Incorrect KPIs
- Broken dashboards
- Mismatched revenue numbers
- Lost trust from stakeholders
- Failed SLA commitments
- Costly manual investigations
Modern teams rely on data observability platforms to detect, monitor, and prevent these issues.
Monte Carlo summarizes this challenge clearly in their definition of data observability (Monte Carlo Data Observability), noting that today's pipelines require proactive detection, not reactive debugging.
Where Data Quality Fits in the Medallion Architecture
Data quality checks belong at every layer:
Bronze (Raw Data)
- Schema drift detection
- Ingestion completeness
- API failure capture
- File count and size expectations
Silver (Cleaned Data)
- Null and type validation
- Deduplication checks
- Business rule enforcement
- Key constraints
- Conformance checks
Gold (Dimensional Models)
- Surrogate key mapping validation
- Fact table grain checks
- SCD accuracy
- Relationship integrity
- KPI reasonableness checks
dbt describes this layering pattern directly in their testing guide (dbt Testing Fundamentals).
Core Components of Data Quality
1. Schema Tests
Ensure column structure matches expectations.
Includes:
- Data types
- Column existence
- Enum domain validation
- Nested structure validation (JSON)
Great Expectations provides YAML-driven schema validation (Great Expectations Documentation).
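Expressed in the same simplified YAML style used later in this article (real expectation suites are typically stored as JSON), schema checks might look like the sketch below; `customer_id` is a hypothetical column:

```yaml
expect_column_to_exist:
  column: customer_id
expect_column_values_to_be_of_type:
  column: customer_id
  type_: "INTEGER"
```

Both expectation names come from the Great Expectations expectation vocabulary.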
2. Column-Level Tests
Common tests include:
- not_null
- unique
- accepted_values
- relationships
- type tests
- length constraints
dbt offers these tests built in (dbt Built-In Tests).
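For example, an accepted_values test constrains a column to a fixed domain. A sketch using a hypothetical stg_order model and order_status column:

```yaml
version: 2
models:
  - name: stg_order
    columns:
      - name: order_status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'completed', 'returned']
```

Any row with a status outside this list causes the test to fail.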
3. Freshness Checks
Detect staleness in:
- Streaming sources
- CDC streams
- Batch extracts
- Warehouse tables
dbt manages source freshness (dbt Freshness Tests), and Soda provides continuous data monitoring (Soda Core).
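In dbt, freshness thresholds are declared on sources. A sketch assuming a hypothetical raw_sales source with a _loaded_at ingestion timestamp:

```yaml
version: 2
sources:
  - name: raw_sales              # hypothetical source name
    loaded_at_field: _loaded_at  # hypothetical ingestion timestamp column
    freshness:
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    tables:
      - name: orders
```

Running `dbt source freshness` then warns at 12 hours of staleness and errors at 24.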
4. Row Count and Volume Checks
Ensures:
- Expected daily file counts
- No unexpected drop-offs
- No sudden spikes
Monte Carlo automates this using machine learning (Monte Carlo Volume Monitoring).
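A simple version of a volume check can also be written as a dbt-style SQL test. This sketch assumes a hypothetical raw_orders table with a _loaded_at timestamp; returned rows mean the test fails:

```sql
-- Sketch: flag a drop-off when today's row count falls below
-- half the trailing daily average (hypothetical table/column names)
with daily as (
    select cast(_loaded_at as date) as load_date,
           count(*) as row_cnt
    from raw_orders
    group by 1
)
select *
from daily
where load_date = current_date
  and row_cnt < 0.5 * (
        select avg(row_cnt)
        from daily
        where load_date < current_date
      )
```

Platforms like Monte Carlo learn these thresholds automatically instead of hard-coding them.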
5. Referential Integrity Tests
Ensures foreign keys in fact tables map to valid dimension surrogate keys.
BigQuery's documentation outlines integrity strategies for large analytical tables (BigQuery Best Practices).
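An orphan-key check is straightforward to express in SQL. Using the fct_order_line and dim_customer names from later in this article, returned rows are violations:

```sql
-- Sketch: fact rows whose surrogate key has no matching dimension row
select f.customer_key
from fct_order_line f
left join dim_customer d
  on f.customer_key = d.customer_key
where d.customer_key is null
```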
6. Anomaly Detection
Detects:
- Surges in null values
- Unexpected duplicates
- Negative values in financial fields
- Metric drift (e.g., daily revenue unexpectedly drops by 90 percent)
Soda Cloud provides anomaly tests (Soda Cloud), and Monte Carlo applies ML-based anomaly detection (Monte Carlo Anomaly Detection).
7. Business Rule Testing
Examples:
- Subscription must have an end_date > start_date
- Order quantity must be positive
- Premium plan must have a price higher than the basic plan
These rules are enforced in Silver or Gold using SQL or dbt tests.
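Written as a dbt singular test, the first rule might look like this sketch (dim_subscription is a hypothetical model name; any returned row is a violation):

```sql
-- Sketch: subscriptions whose end_date does not follow start_date
select *
from {{ ref('dim_subscription') }}
where end_date <= start_date
```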
dbt Testing in Depth
dbt is one of the most widely adopted testing frameworks in ELT systems (dbt Tests Documentation).
dbt supports:
- Schema tests
- Custom SQL tests
- Data quality gates in CI/CD
- Freshness
- Relationships
- Surrogate key validity
dbt Schema Test Example

```yaml
version: 2
models:
  - name: dim_customer
    columns:
      - name: customer_id
        tests:
          - not_null
          - unique
      - name: email
        tests:
          - not_null
```
dbt Relationship Test

```yaml
- name: fct_order_line
  columns:
    - name: customer_key
      tests:
        - relationships:
            to: ref('dim_customer')
            field: customer_key
```
This ensures no orphan fact rows exist in Gold.
dbt Custom SQL Test

```sql
SELECT *
FROM dim_product
WHERE price < 0
```
dbt considers a test failed if any rows are returned.
Great Expectations for Data Lakes and Lakehouses
Great Expectations integrates directly with:
- Databricks
- Snowflake
- BigQuery
- Redshift
- S3
- ADLS
Reference:
Great Expectations Documentation.
Example Expectation Suite
```yaml
expect_table_row_count_to_be_between:
  min_value: 1000
expect_column_values_to_not_be_null:
  column: customer_id
expect_column_values_to_match_regex:
  column: email
  regex: "^.+@.+$"
```
Great Expectations is ideal for file-based testing in Bronze and Silver.
Soda for Observability and Testing
Soda focuses on:
- Continuous monitoring
- Alerts
- Anomaly detection
- Metric observability
- Freshness tracking
Reference:
Soda Core.
Example Soda Test
```yaml
checks:
  - row_count > 0
  - missing_count(customer_id) = 0
  - avg(extended_amount) > 0
```
Soda Cloud integrates easily with warehouses and lakehouses.
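Freshness tracking can live in the same SodaCL file. A sketch assuming an `updated_at` column on dim_customer:

```yaml
checks for dim_customer:
  - freshness(updated_at) < 1d
```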
Monte Carlo for Data Observability
Monte Carlo is a leading data observability platform for enterprise systems (Monte Carlo Data Observability).
It provides:
- Automated anomaly detection
- Freshness monitoring
- Volume tracking
- Schema drift alerts
- Lineage visualization
- Incident management
Monte Carlo monitors pipelines end to end, not just tests.
Putting It All Together in the Medallion Architecture
| Layer | Data Quality Focus |
|---|---|
| Bronze | File completeness, schema drift, freshness |
| Silver | Null tests, type validation, relationships, duplicates |
| Gold | Metric correctness, grain validation, surrogate key integrity |
dbt tests run in Silver and Gold; Great Expectations validates Bronze and Silver; Soda and Monte Carlo cover all layers.
Quality Gates in Orchestration
Quality gates should block downstream processes.
Example:
- If Bronze file count < expected, block Silver
- If SCD2 dimension fails surrogate key test, block fact loads
- If daily revenue deviates more than 80 percent from the median, alert the BI team
Airflow and Databricks Workflows can both stop pipelines based on test failures (Airflow Operators and Databricks Workflow Monitoring).
Workshops
Novice Workshop: dbt Testing
- Create a dbt project
- Add not_null and unique tests for DimCustomer
- Add a relationship test for FctOrderLine → DimCustomer
- Run `dbt test`
- Review failures
Intermediate Workshop: Great Expectations + Lakehouse
- Create an expectation suite
- Validate Bronze files for schema drift
- Add Silver-level null and regex tests
- Trigger expectations as part of pipeline runs
Advanced Workshop: Full Observability Pipeline
Tools: Airflow, dbt Cloud, Monte Carlo, and Fabric
Steps:
- Run Bronze → Silver pipeline in Airflow
- Add dbt tests to block Gold on failure
- Configure Monte Carlo to monitor table freshness and volume
- Add ML-based anomaly detection for revenue metrics
- Trigger Fabric semantic model refresh only after all checks pass
This replicates enterprise-grade quality and observability.
References
dbt Tests Documentation
https://docs.getdbt.com/docs/build/tests
dbt Freshness Tests
https://docs.getdbt.com/docs/build/sources#freshness
Great Expectations Documentation
https://docs.greatexpectations.io/docs/
Soda Core
https://docs.soda.io/soda-core/
Soda Cloud
https://docs.soda.io/soda-cloud/
Monte Carlo Data Observability
https://www.montecarlodata.com/data-observability/
Monte Carlo Platform Monitoring
https://www.montecarlodata.com/platform/
Apache Airflow Documentation
https://airflow.apache.org/docs/
Databricks Workflow Monitoring
https://docs.databricks.com/workflows/monitoring-events.html
BigQuery Best Practices
https://cloud.google.com/bigquery/docs/best-practices-performance-overview