Part 13 — Security, Governance, Lineage, and Compliance
Deep Technical Series — Building a Modern Data Warehouse and Lakehouse

Part 13 of 14 — Deep Technical Series: Building a Modern Data Warehouse and Lakehouse. Series index (Part 0) · ← Part 12: Performance Engineering

A modern data warehouse or lakehouse must be secure, governed, traceable, and compliant. Without a strong governance and security framework, data becomes a liability rather than an asset — untrusted, misused, and non-compliant with regulatory requirements that carry real financial and reputational consequences.

This part covers the complete governance stack: identity and access management, role-based and row-level security, column-level masking, data classification and tagging, lineage tracking, audit logging, encryption, and regulatory compliance. Every concept is illustrated with working code across Unity Catalog, Microsoft Purview, Snowflake, BigQuery, and Fabric.

How this connects to the series: Governance is not a final step applied after the platform is built — it is a design constraint that must be considered from the beginning. Access control decisions made in Parts 7 and 8 (who can query Gold tables), lineage from the transformation pipeline in Parts 5 and 6, and the security model applied to the semantic layer in Part 9 all feed into the governance architecture defined here. This part brings it all together into a coherent, auditable framework.

1 Governance Principles — Security, Trust, and Compliance (Beginner)

Data governance is the framework of policies, processes, and technical controls that ensure data is used correctly, protected appropriately, and handled in compliance with regulations. For data engineering teams, governance translates into concrete technical implementations: who can access what, what sensitive data looks like to unauthorized users, where data came from, and who did what with it.

The Five Governance Pillars

  • Access control — only authorized users and systems can access the data they need for their role. Enforced through RBAC, RLS, and CLS.
  • Data protection — sensitive data is masked, tokenized, or encrypted for unauthorized viewers. PII is never exposed to users who do not have a business need for it.
  • Lineage and cataloging — every data asset is documented, every transformation is traceable, and every downstream dependency is known.
  • Auditing — every query, every access, every schema change is logged and immutable. When something goes wrong, you can answer “who saw what, and when?”
  • Compliance — data handling meets the requirements of applicable regulations: GDPR for EU personal data, HIPAA for US healthcare data, PCI DSS for payment card data, SOC 2 for service organization controls.

Governance applied after the fact is dramatically harder than governance built in from the start. Retrofitting row-level security onto a data warehouse that has been serving unsecured reports for two years means updating every report, every semantic model, and every connection. Design access control, masking policies, and lineage tracking as part of your initial architecture — not as an afterthought when a compliance audit arrives.

2 Identity and Access Management (Beginner)

Every governance capability depends on a reliable, centralized identity system. When a user queries a Gold table, the warehouse must know who they are, what groups they belong to, and what they are authorized to access. All of this flows from the identity provider (IdP).

Platform                     | Identity Provider                   | Integration Method
Microsoft Fabric / Azure SQL | Microsoft Entra ID (AAD)            | Native — no configuration needed
Databricks                   | Entra ID, Okta, or any SAML 2.0 IdP | SCIM provisioning to Unity Catalog
Snowflake                    | Okta, AAD, any SAML 2.0 IdP         | SSO via SAML or OAuth
BigQuery                     | Google Cloud Identity or Workspace  | IAM native integration
Redshift                     | AWS IAM, Okta, AAD via federation   | IAM role mapping or SAML federation
Synapse                      | Microsoft Entra ID                  | Native — same as Fabric

Use groups, not individual users, for access control. Grant permissions to security groups (Data Engineers, Analysts, Finance Team, Regional Managers) rather than to individual users. When someone joins or leaves a team, you update their group membership in the identity provider — all their data permissions follow automatically. Managing permissions per user does not scale and leads to access control drift over time.
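
A minimal sketch of the pattern (Unity Catalog syntax; the `finance-team` group is illustrative and assumed to be SCIM-synced from the IdP):

-- Grant once to the group; membership changes happen in the IdP, not in SQL
GRANT USE CATALOG ON CATALOG prod TO `finance-team`;
GRANT USE SCHEMA, SELECT ON SCHEMA prod.gold TO `finance-team`;
-- When someone joins or leaves Finance, update the Entra ID / Okta group --
-- no further GRANT or REVOKE statements are needed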

3 Role-Based Access Control Across Platforms (Intermediate)

RBAC assigns permissions to roles, and users inherit permissions through role membership. The goal is the principle of least privilege: every user and system has access to exactly what they need for their function and nothing more. A business analyst needs SELECT on Gold tables. A data engineer needs SELECT and INSERT on Bronze, Silver, and Gold. A pipeline service account needs INSERT on specific target tables only.

Databricks Unity Catalog RBAC

-- Unity Catalog: three-level namespace with granular permission control
-- Permissions are inherited downward: catalog → schema → table

-- Environment separation: separate catalogs for dev/test/prod
CREATE CATALOG IF NOT EXISTS prod  COMMENT 'Production analytics data';
CREATE CATALOG IF NOT EXISTS dev   COMMENT 'Development and testing';

-- Grant catalog-level access to groups (synced from Entra ID / Okta via SCIM)
GRANT USE CATALOG ON CATALOG prod TO `data-engineers`;
GRANT USE CATALOG ON CATALOG prod TO `analysts`;
GRANT USE CATALOG ON CATALOG prod TO `bi-team`;

-- Schema-level access: engineers get full access, analysts read-only on Gold only
GRANT USE SCHEMA, SELECT, MODIFY ON SCHEMA prod.bronze TO `data-engineers`;
GRANT USE SCHEMA, SELECT, MODIFY ON SCHEMA prod.silver TO `data-engineers`;
GRANT USE SCHEMA, SELECT, MODIFY ON SCHEMA prod.gold   TO `data-engineers`;

GRANT USE SCHEMA, SELECT ON SCHEMA prod.gold TO `analysts`;
-- Analysts cannot see Bronze or Silver -- only Gold and semantic views

-- Service account for pipeline automation: write access to specific tables only
-- (in Unity Catalog, MODIFY covers INSERT, UPDATE, and DELETE)
GRANT USE CATALOG ON CATALOG prod TO `pipeline-svc`;
GRANT USE SCHEMA ON SCHEMA prod.gold TO `pipeline-svc`;
GRANT SELECT, MODIFY ON TABLE prod.gold.dim_customer   TO `pipeline-svc`;
GRANT SELECT, MODIFY ON TABLE prod.gold.fct_order_line TO `pipeline-svc`;

-- Verify permissions
SHOW GRANTS ON SCHEMA prod.gold;
SHOW GRANTS ON TABLE prod.gold.fct_order_line;

Snowflake RBAC

-- Snowflake: hierarchical role model
-- Best practice: functional roles (analyst_role, engineer_role) granted to user roles

-- Create functional roles
CREATE ROLE IF NOT EXISTS analyst_role;
CREATE ROLE IF NOT EXISTS engineer_role;
CREATE ROLE IF NOT EXISTS pipeline_role;

-- Grant data access to functional roles
GRANT USAGE ON DATABASE analytics TO ROLE analyst_role;
GRANT USAGE ON SCHEMA analytics.gold TO ROLE analyst_role;
GRANT SELECT ON ALL TABLES IN SCHEMA analytics.gold TO ROLE analyst_role;
GRANT SELECT ON FUTURE TABLES IN SCHEMA analytics.gold TO ROLE analyst_role;

GRANT USAGE ON DATABASE analytics TO ROLE engineer_role;
GRANT USAGE ON ALL SCHEMAS IN DATABASE analytics TO ROLE engineer_role;
GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN DATABASE analytics TO ROLE engineer_role;

-- Grant roles to users (or to SSO groups via SCIM)
GRANT ROLE analyst_role  TO USER alice;
GRANT ROLE engineer_role TO USER bob;

-- Grant access to virtual warehouses
GRANT USAGE ON WAREHOUSE bi_warehouse  TO ROLE analyst_role;
GRANT USAGE ON WAREHOUSE etl_warehouse TO ROLE engineer_role;

-- Verify effective permissions
SHOW GRANTS TO ROLE analyst_role;
SHOW GRANTS TO USER alice;

BigQuery IAM

# BigQuery: IAM roles via Google Cloud IAM
# Applied at project, dataset, or table level

# Dataset-level access: bq add-iam-policy-binding works on tables and views,
# so dataset-level grants are edited on the dataset's access list instead:
bq show --format=prettyjson project:gold > dataset.json
# ...add {"role": "READER", "groupByEmail": "data-analysts@company.com"}
#    to the "access" array in dataset.json, then:
bq update --source dataset.json project:gold

# Table-level access (more granular):
bq add-iam-policy-binding \
    --member="serviceAccount:pipeline@project.iam.gserviceaccount.com" \
    --role="roles/bigquery.dataEditor" \
    "project:gold.fct_order_line"

# BigQuery predefined roles:
# bigquery.dataViewer  -- SELECT only
# bigquery.dataEditor  -- SELECT, INSERT, UPDATE, DELETE
# bigquery.admin       -- full control including schema management
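
BigQuery also accepts these grants as SQL DCL statements, which keeps access control in version-controlled scripts. A hedged sketch (dataset and group names are illustrative):

-- Equivalent dataset-level grant expressed as BigQuery SQL
GRANT `roles/bigquery.dataViewer`
ON SCHEMA `project.gold`
TO 'group:data-analysts@company.com';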

4 Row-Level Security (Intermediate)

Row-level security (RLS) restricts which rows a user can see within a table. A regional sales manager sees only rows for their region. A country-level analyst sees only their country’s data. RLS must be enforced at the data layer — not in dashboards or application code — so that every access path respects the security policy.

Snowflake Row Access Policies

-- Snowflake: row access policy using a mapping table
-- More flexible than hard-coded conditions -- users/regions can be managed in SQL

-- Create a policy mapping table: user → allowed regions
CREATE TABLE governance.user_region_access (
    username    VARCHAR(200) NOT NULL,
    region      VARCHAR(100) NOT NULL
);

INSERT INTO governance.user_region_access VALUES
    ('alice@company.com', 'Northeast'),
    ('alice@company.com', 'Southeast'),
    ('bob@company.com',   'West');

-- Create the row access policy
CREATE OR REPLACE ROW ACCESS POLICY rls_region
AS (region_column VARCHAR) RETURNS BOOLEAN ->
    -- Admins see everything; other users see only their allowed regions
    IS_ROLE_IN_SESSION('analyst_admin')
    OR EXISTS (
        SELECT 1
        FROM governance.user_region_access
        WHERE username = CURRENT_USER()
        AND   region   = region_column
    );

-- Apply policy to the fact table
ALTER TABLE gold.fct_order_line
ADD ROW ACCESS POLICY rls_region ON (region);

-- Now every query on fct_order_line is automatically filtered
-- alice@company.com sees Northeast + Southeast rows only
-- bob@company.com sees West rows only
-- analyst_admin role sees all rows

-- Verify the policy is attached to the table
SELECT policy_name, policy_kind, ref_column_name
FROM TABLE(INFORMATION_SCHEMA.POLICY_REFERENCES(
    REF_ENTITY_NAME   => 'gold.fct_order_line',
    REF_ENTITY_DOMAIN => 'TABLE'
));

Databricks Unity Catalog Row Filters

-- Databricks: row filter functions in Unity Catalog
-- Applied at the table level -- cannot be bypassed regardless of query method

-- Create a row filter function
-- (mirrors the Snowflake policy above: admins see everything, other users
--  see only the regions mapped to them -- multiple regions per user allowed)
CREATE OR REPLACE FUNCTION prod.governance.filter_by_region(region STRING)
RETURNS BOOLEAN
LANGUAGE SQL
RETURN
    IS_MEMBER('data-admin-group')   -- admins see everything
    OR EXISTS (
        SELECT 1
        FROM prod.governance.user_region_map
        WHERE username       = CURRENT_USER()
        AND   allowed_region = region
    );
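
-- The mapping table referenced above is assumed to exist; a minimal shape:
CREATE TABLE IF NOT EXISTS prod.governance.user_region_map (
    username        STRING NOT NULL,
    allowed_region  STRING NOT NULL
);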

-- Apply the row filter to the Gold fact table
ALTER TABLE prod.gold.fct_order_line
SET ROW FILTER prod.governance.filter_by_region ON (region);

-- Verify row filter is active
DESCRIBE EXTENDED prod.gold.fct_order_line;
-- Look for: Row Filter section showing the function and column it applies to

5 Column-Level Security and Data Masking (Intermediate)

Column-level security (CLS) controls which columns users can see. Data masking goes further — it allows unauthorized users to see a column exists but shows them a masked or replaced value rather than the real one. This is critical for PII (personally identifiable information) like email addresses, phone numbers, Social Security numbers, and payment card data.

Snowflake Dynamic Data Masking

-- Snowflake: dynamic masking policy -- same column, different values per role
-- Authorized users see the real value; others see a masked version

-- Create a masking policy for email addresses
CREATE OR REPLACE MASKING POLICY mask_email
AS (val STRING) RETURNS STRING ->
    CASE
        -- PII analysts see full email
        WHEN IS_ROLE_IN_SESSION('pii_analyst_role') THEN val
        -- Other analysts see partial email: first char + *** + @domain.com
        WHEN IS_ROLE_IN_SESSION('analyst_role') THEN
            CONCAT(SUBSTR(val, 1, 1), '***@', SPLIT_PART(val, '@', 2))
        -- All others see completely masked value
        ELSE '***@***.***'
    END;

-- Apply masking to the email column in the dimension table
ALTER TABLE gold.dim_customer
MODIFY COLUMN email SET MASKING POLICY mask_email;

-- Test masking behavior:
-- As analyst_role:      a***@company.com    (partial masking)
-- As pii_analyst_role:  alice@company.com   (real value)
-- As other roles:       ***@***.***         (full masking)
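
The same POLICY_REFERENCES function used to verify row access policies also reports masking policies, so one query confirms everything attached to a table:

-- List all policies (masking and row access) attached to dim_customer
SELECT policy_name, policy_kind, ref_column_name
FROM TABLE(INFORMATION_SCHEMA.POLICY_REFERENCES(
    REF_ENTITY_NAME   => 'gold.dim_customer',
    REF_ENTITY_DOMAIN => 'TABLE'
));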

Databricks Unity Catalog Column Masks

-- Databricks: column masking via Unity Catalog
-- Same approach as row filters -- function applied at table level

CREATE OR REPLACE FUNCTION prod.governance.mask_email(email STRING)
RETURNS STRING
LANGUAGE SQL
RETURN
    CASE
        WHEN IS_MEMBER('pii-access-group') THEN email
        ELSE CONCAT(SUBSTR(email, 1, 1), '***@', SPLIT_PART(email, '@', 2))
    END;

-- Apply column mask to the dimension table
ALTER TABLE prod.gold.dim_customer
ALTER COLUMN email SET MASK prod.governance.mask_email;

-- Column mask applies to all queries on this table regardless of access path
-- Direct SQL, Power BI, Tableau, Spark -- all see the masked value

BigQuery Column-Level Security with Policy Tags

# BigQuery: column-level security via Data Catalog policy tags
# 1. Create a policy tag taxonomy in Data Catalog
# 2. Tag sensitive columns with the appropriate policy tag
# 3. Grant data catalog fine-grained reader role to authorized groups

# Create taxonomy and tags (shown with gcloud for illustration; taxonomies and
# policy tags are addressed by numeric resource IDs in practice, and many teams
# manage them via the console or Terraform -- verify exact commands in the docs):
gcloud data-catalog taxonomies create "PII_Data" \
    --location=us \
    --description="Tags for personally identifiable information"

gcloud data-catalog taxonomies policy-tags create \
    --taxonomy="PII_Data" \
    --location=us \
    --display-name="EmailAddress" \
    --description="Personal email addresses"

# Apply policy tag to column in BigQuery schema (via console or API):
# Column: email in table project.gold.dim_customer
# Tag: PII_Data/EmailAddress

# Grant fine-grained reader to authorized group:
gcloud data-catalog taxonomies policy-tags add-iam-policy-binding \
    "PII_Data/EmailAddress" \
    --location=us \
    --member="group:pii-analysts@company.com" \
    --role="roles/datacatalog.categoryFineGrainedReader"

# Without the fine-grained reader role, the column returns NULL for that user
# With the role, the full value is returned

6 Data Classification and Tagging (Beginner)

Data classification identifies which data is sensitive, what type of sensitivity it represents (PII, PHI, financial, confidential), and what handling rules apply. Tags attach metadata to tables and columns that can be used to enforce policies, generate compliance reports, and communicate data sensitivity to consumers.

Databricks Unity Catalog Tags

-- Databricks: apply tags to schemas, tables, and columns for classification

-- Tag the entire schema as containing PII
ALTER SCHEMA prod.gold
SET TAGS ('data_sensitivity' = 'contains_pii', 'data_domain' = 'customer');

-- Tag specific PII columns in the dimension table
ALTER TABLE prod.gold.dim_customer
ALTER COLUMN email
SET TAGS ('pii_type' = 'email', 'gdpr_subject' = 'true', 'ccpa_subject' = 'true');

ALTER TABLE prod.gold.dim_customer
ALTER COLUMN customer_name
SET TAGS ('pii_type' = 'full_name', 'gdpr_subject' = 'true');

-- Search for all PII columns across the entire Unity Catalog
SELECT
    catalog_name,
    schema_name,
    table_name,
    column_name,
    tag_name,
    tag_value
FROM system.information_schema.column_tags
WHERE tag_name = 'pii_type'
ORDER BY catalog_name, schema_name, table_name, column_name;

Snowflake Object Tagging

-- Snowflake: create and apply governance tags

-- Create a tag for PII classification
CREATE OR REPLACE TAG governance.pii_type
    ALLOWED_VALUES 'email', 'full_name', 'phone', 'ssn', 'credit_card', 'address';

-- Apply tags to sensitive columns
ALTER TABLE gold.dim_customer
MODIFY COLUMN email     SET TAG governance.pii_type = 'email';
ALTER TABLE gold.dim_customer
MODIFY COLUMN customer_name SET TAG governance.pii_type = 'full_name';

-- Query all tagged PII columns for compliance reporting
SELECT
    object_schema,
    object_name,
    column_name,
    tag_value AS pii_classification
FROM TABLE(INFORMATION_SCHEMA.TAG_REFERENCES_ALL_COLUMNS(
    'gold.dim_customer', 'table'
))
WHERE tag_name = 'PII_TYPE';
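
For an account-wide PII inventory rather than a single table, the ACCOUNT_USAGE view exposes the same tag data (with up to a couple of hours of latency):

-- Account-wide PII inventory across all databases
SELECT object_database, object_schema, object_name, column_name, tag_value
FROM SNOWFLAKE.ACCOUNT_USAGE.TAG_REFERENCES
WHERE tag_name = 'PII_TYPE'
ORDER BY object_database, object_schema, object_name;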

7 Lineage Tracking — Where Did the Data Come From? (Intermediate)

Data lineage answers the questions that every organization faces during an audit, a data quality incident, or a regulatory review: Where did this number come from? Which upstream sources feed this report? If the Bronze CRM data had an error last Tuesday, which Gold tables and dashboards were affected? Without lineage, answering these questions requires days of manual investigation. With lineage, the answer is a graph query.

Types of Lineage

  • Table lineage — which tables read from which other tables across transformations
  • Column lineage — which source column contributed to which target column
  • Pipeline lineage — which pipeline jobs transformed which tables
  • Dashboard lineage — which reports and dashboards depend on which semantic models and tables

Databricks Unity Catalog Lineage

-- Databricks: Unity Catalog automatically captures lineage for SQL queries
-- and notebook cell executions -- no manual configuration required

-- View table-level lineage for the Gold fact table
SELECT *
FROM system.access.table_lineage
WHERE target_table_full_name = 'prod.gold.fct_order_line'
ORDER BY event_time DESC
LIMIT 20;
-- Shows: which tables were read to produce fct_order_line, timestamps, job IDs

-- View column-level lineage
SELECT
    source_table_full_name,
    source_column_name,
    target_table_full_name,
    target_column_name
FROM system.access.column_lineage
WHERE target_table_full_name = 'prod.gold.fct_order_line'
AND   target_column_name     = 'extended_amount';
-- Shows: extended_amount came from silver.orders.quantity * silver.orders.unit_price
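
The same system tables also answer the downstream question (impact analysis from a source table outward) by reversing the filter:

-- Impact analysis: everything that was built from a Silver table
SELECT DISTINCT target_table_full_name
FROM system.access.table_lineage
WHERE source_table_full_name = 'prod.silver.orders'
AND   target_table_full_name IS NOT NULL;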

dbt Lineage Graph

# dbt: automatic lineage from the model dependency graph
# Compiling the project writes the lineage manifest

dbt compile --target prod
# Generates target/manifest.json containing full lineage graph

# View lineage in dbt docs:
dbt docs generate --target prod
dbt docs serve
# Opens browser UI showing full DAG: sources → staging → intermediate → marts

# Query lineage programmatically from the manifest:
import json

with open("target/manifest.json") as f:
    manifest = json.load(f)

# Find all models that depend on stg_crm_customer
target_model = "model.analytics.stg_crm_customer"
dependents = [
    node for node, data in manifest["nodes"].items()
    if target_model in data.get("depends_on", {}).get("nodes", [])
]
print("Models downstream of stg_crm_customer:")
for dep in dependents:
    print(f"  {dep}")

Microsoft Purview Lineage

# Microsoft Purview: enterprise lineage across all Azure data services
# Automatically scans: Azure SQL, Fabric, Synapse, ADF, Power BI

# After connecting Purview to your data sources via the Purview portal:
# Purview → Data map → Sources → Add source → Azure SQL / Fabric / Synapse

# Purview discovers assets and builds lineage automatically from:
# - ADF pipeline activities (Copy Activity shows source → sink)
# - Power BI refresh lineage (which dataset feeds which report)
# - Synapse pipelines
# - Fabric Data Factory pipelines

# Query Purview lineage via REST API (endpoint shape follows the Purview Data
# Map REST API; verify the api-version and typeName against current docs):
import requests

purview_account = "your-purview-account"
access_token = "<Entra ID bearer token>"  # e.g. via azure-identity DefaultAzureCredential

response = requests.get(
    f"https://{purview_account}.purview.azure.com/datamap/api/lineage"
    f"/uniqueAttribute/type/azure_sql_table",
    params={
        "attr:qualifiedName": "mssql://server/db/schema/table",
        "direction": "BOTH",
        "api-version": "2023-09-01",
    },
    headers={"Authorization": f"Bearer {access_token}"},
)
lineage = response.json()
# Returns upstream and downstream lineage for the specified table

8 Audit Logging (Intermediate)

Audit logs record who accessed what data, when, and from where. They are the evidence layer for compliance frameworks, security incident investigations, and governance audits. All major platforms maintain audit logs automatically — the data engineering team’s responsibility is to ensure logs are retained for the required period and monitored for anomalies.

-- Snowflake: query history from ACCOUNT_USAGE
-- Latency up to ~45 minutes; retained for 365 days
-- (for column-level access auditing, see ACCESS_HISTORY, Enterprise Edition+)

SELECT
    query_id,
    user_name,
    role_name,
    query_type,
    query_text,
    start_time,
    end_time,
    bytes_scanned,
    rows_produced
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE start_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())
AND CONTAINS(LOWER(query_text), 'dim_customer')  -- queries touching customer data
ORDER BY start_time DESC
LIMIT 100;

-- Find users who accessed PII tables in the last 30 days
SELECT
    user_name,
    COUNT(*) AS query_count,
    SUM(bytes_scanned) AS bytes_scanned,
    MIN(start_time) AS first_access,
    MAX(start_time) AS last_access
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE start_time >= DATEADD('day', -30, CURRENT_TIMESTAMP())
AND CONTAINS(LOWER(query_text), 'email')
GROUP BY user_name
ORDER BY query_count DESC;

-- Databricks: Unity Catalog audit logs via system tables
SELECT
    event_time,
    user_identity.email        AS user_email,
    service_name,
    action_name,
    request_params,
    response.status_code       AS status
FROM system.access.audit
WHERE event_time >= CURRENT_TIMESTAMP - INTERVAL 7 DAYS
AND action_name IN ('commandSubmit', 'createTable', 'dropTable', 'grantPrivilege')
ORDER BY event_time DESC
LIMIT 100;

-- Detect potential data exfiltration: unusually large reads by individual users
-- (fields available in request_params vary by action; some workspaces will need
--  the query-history system table for bytes scanned -- adjust to your schema)
SELECT
    user_identity.email        AS user_email,
    DATE(event_time)           AS query_date,
    COUNT(*)                   AS query_count,
    SUM(CAST(request_params['bytes_read'] AS BIGINT)) / POW(1024, 3) AS gb_read
FROM system.access.audit
WHERE event_time >= CURRENT_TIMESTAMP - INTERVAL 30 DAYS
AND action_name = 'commandSubmit'
GROUP BY 1, 2
HAVING gb_read > 10    -- flag users reading more than 10 GB in a day
ORDER BY gb_read DESC;

9 Encryption at Rest and in Transit (Beginner)

All major cloud data platforms encrypt data at rest and in transit by default. The data engineering team’s responsibility is to understand the encryption model used, confirm that encryption is enabled and verified, and for regulated industries, ensure that customer-managed keys (CMK) are used when required by compliance frameworks.

Platform                     | At Rest                                    | In Transit   | Customer-Managed Keys
Snowflake                    | AES-256 (default), Tri-Secret Secure (CMK) | TLS 1.2+     | Tri-Secret Secure with AWS KMS / Azure Key Vault
Microsoft Fabric / Azure SQL | AES-256 (TDE default)                      | TLS 1.2+     | Azure Key Vault (BYOK)
Databricks                   | Cloud provider default + DBFS encryption   | TLS 1.2+     | AWS KMS / Azure Key Vault / GCP KMS
BigQuery                     | AES-256 (default)                          | TLS 1.2+     | Cloud KMS CMEK
Redshift                     | AES-256 (optional, enable explicitly)      | SSL required | AWS KMS or HSM

Amazon Redshift encryption is not enabled by default. Unlike other platforms, Redshift requires explicit enablement of encryption at cluster creation time. Enable it for all production clusters — it cannot be enabled on an existing unencrypted cluster without migrating to a new one. This is a common compliance gap found during AWS security audits.

10 Regulatory Compliance Frameworks (Intermediate)

Regulatory compliance frameworks define what organizations must do to legally handle specific types of data. Understanding which frameworks apply to your organization and what they require technically is essential for designing a compliant data platform.

Framework | Applies To                                                 | Key Technical Requirements
GDPR      | Organizations processing EU residents’ personal data      | Right to erasure (delete PII on request), data minimization, purpose limitation, breach notification within 72 hours, data processing records
CCPA      | Businesses collecting California residents’ personal data | Right to deletion, right to opt out of data sale, disclosure of data categories collected, privacy policy requirements
HIPAA     | US healthcare organizations and their business associates | PHI encryption at rest and in transit, access controls, audit logs for PHI access, Business Associate Agreements (BAA) with cloud providers
PCI DSS   | Organizations handling payment card data                  | Cardholder data encrypted or tokenized, access controls on cardholder data environment, quarterly vulnerability scans, audit logs retained 1 year
SOC 2     | Service organizations storing customer data               | Security, availability, processing integrity, confidentiality, and privacy controls — audited annually by a third-party CPA

GDPR Right to Erasure — Technical Implementation

-- GDPR: implement right to erasure (right to be forgotten)
-- When a customer requests deletion of their personal data,
-- all PII must be removed from Bronze, Silver, Gold, and audit logs

-- Step 1: Identify all tables containing the customer's data
-- Use the lineage graph from Section 7 to find all affected tables

-- Step 2: Delete or anonymize PII in each layer

-- Gold: update dimension to anonymize PII (preserve fact rows for analytics)
UPDATE gold.dim_customer
SET
    customer_name  = 'ANONYMIZED',
    email          = CONCAT('anon_', customer_natural_key, '@deleted.invalid'),
    city           = NULL,
    state_province = NULL,
    country        = NULL
WHERE customer_natural_key = :subject_customer_id;

-- Silver: delete the customer record (rebuilds from Bronze on next pipeline run)
DELETE FROM silver.customer
WHERE customer_id = :subject_customer_id;

-- Bronze: anonymize in place (deleting raw records would break pipeline
-- replay and the audit trail); shown here with Snowflake VARIANT functions
-- Best practice: mark the record as deleted and anonymize its PII fields
UPDATE bronze.crm_customer
SET
    raw_data = OBJECT_DELETE(
        OBJECT_INSERT(raw_data, 'customer_name', 'ANONYMIZED', true),
        'email'
    ),
    gdpr_deleted_at = CURRENT_TIMESTAMP()
WHERE raw_data:id::STRING = :subject_customer_id;

-- Step 3: Log the erasure request for compliance documentation
INSERT INTO governance.erasure_requests (
    request_id, subject_id, requested_at, completed_at, tables_affected, status
)
VALUES (
    :request_id, :subject_customer_id,
    :requested_at, CURRENT_TIMESTAMP(),
    'bronze.crm_customer, silver.customer, gold.dim_customer',
    'COMPLETED'
);
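
It is worth adding a verification pass confirming that no recoverable PII remains before the request is treated as closed. A hedged sketch, reusing the table and column names from the steps above:

-- Step 4 (verification): confirm nothing identifiable survived the erasure
SELECT 'gold.dim_customer' AS location, COUNT(*) AS exposed_rows
FROM gold.dim_customer
WHERE customer_natural_key = :subject_customer_id
AND   (customer_name <> 'ANONYMIZED' OR email NOT LIKE 'anon_%@deleted.invalid')
UNION ALL
SELECT 'silver.customer', COUNT(*)
FROM silver.customer
WHERE customer_id = :subject_customer_id;
-- Expect exposed_rows = 0 in every layer before setting status = 'COMPLETED'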

11 End-to-End Governance Architecture (Intermediate)

A complete governance system integrates all the components covered in this part into a coherent, auditable framework. This is the reference architecture for a production data platform that can pass compliance audits, respond to security incidents, and maintain stakeholder trust.

Layer           | Component                               | Implementation
Identity        | Centralized IdP with SSO                | Entra ID / Okta + SCIM to Unity Catalog / Snowflake
Access Control  | Role-based permissions by group         | Unity Catalog / Snowflake RBAC on Bronze/Silver/Gold schemas
Row Security    | Row-level filtering by user attribute   | Snowflake row access policies / Databricks row filter functions
Column Security | PII masking by role                     | Snowflake dynamic masking / Databricks column masks / BigQuery policy tags
Classification  | PII/PHI column tagging                  | Unity Catalog tags / Snowflake object tags / Purview classification
Lineage         | End-to-end data lineage                 | Unity Catalog auto-lineage + dbt manifest + Purview for cross-service
Audit           | Immutable access logs                   | Snowflake ACCOUNT_USAGE / Databricks system.access.audit / CloudTrail
Encryption      | At rest and in transit                  | Platform defaults + CMK for regulated data (HIPAA, PCI)
Compliance      | GDPR/CCPA erasure + HIPAA/PCI controls  | Erasure request table + anonymization procedures + BAAs with cloud providers

Start with RBAC and column masking — they deliver the most governance value with the least implementation complexity. Correct RBAC prevents unauthorized access. Column masking protects PII from users who do not need it. Both are platform-native features requiring no additional infrastructure. Once these are in place, add row-level security for multi-tenant scenarios, then invest in lineage and tagging as the platform matures.

12 Workshops

Novice

Apply RBAC and RLS

  • Create two roles: analyst_role and engineer_role
  • Grant SELECT on Gold to analyst_role
  • Grant full access on all schemas to engineer_role
  • Create a table with a region column
  • Apply a row access policy restricting rows by current user
  • Test: connect as both roles and verify different row sets are returned

Intermediate

Build Lineage with dbt and Purview

  • Build dbt models for DimCustomer, DimProduct, and FctOrders
  • Run dbt docs generate and explore the lineage graph
  • Connect Microsoft Purview to your Azure SQL or Fabric source
  • Run a Purview scan and view the discovered assets
  • Validate column-level lineage end-to-end: source → Bronze → Silver → Gold

Advanced

Full Governance Implementation

  • Enable Unity Catalog and define catalogs, schemas, and tables
  • Apply PII tags to email, name, and phone columns
  • Create column masking policies showing partial email to analysts
  • Apply row filters by region using a security mapping table
  • Query system.access.audit to verify all accesses are logged
  • Implement the GDPR erasure procedure for a test customer ID
  • Verify: anonymized customer still has fact rows but no recoverable PII

Up next → Part 14: Full End-to-End Project and Architecture Blueprint — Build the complete enterprise analytics platform from ingestion to BI dashboard, incorporating every concept from Parts 0 through 13.
