Microsoft Fabric: The Complete Guide — OneLake, Lakehouse, Pipelines, Medallion Architecture, and the Full Workshop

1. Introduction — What Is Microsoft Fabric and Why Should You Care?

For two decades, building a modern data platform meant assembling a collection of separately licensed, separately configured, separately monitored services. You had Azure Data Factory for ingestion, Azure Data Lake Storage Gen2 for raw storage, Azure Synapse Analytics for the warehouse, Azure Databricks for Spark engineering, Power BI Premium for reporting, and Azure Purview for governance. Each one was excellent at its job. But together they created an operational puzzle that required specialist expertise to wire up, maintain, and govern.[1]

Microsoft Fabric changes that equation. Launched in general availability in November 2023, Fabric is an end-to-end analytics platform delivered as a single SaaS product. It unifies data engineering, data warehousing, real-time analytics, data science, and business intelligence under one roof, all sharing a single logical data lake called OneLake.[2]

Think of it this way: if Azure were a set of best-in-class tools purchased from different shops, Fabric is those same tools redesigned to work together from a single factory. The data doesn’t need to move. The security model is shared. Billing is consolidated. And every engineer, analyst, and data scientist is working on the same copy of the truth.

Who this guide is for
This post is written for a wide audience — SQL Server DBAs exploring cloud-native analytics, data engineers new to Fabric, architects evaluating the platform, and advanced practitioners looking for a single comprehensive reference. The workshop section is structured in progressive stages so you can stop at whichever level matches your current needs.

2. Why Fabric? The Business and Technical Case

Unified Storage — One Copy of the Truth

Every Fabric workload — Spark notebooks, T-SQL queries, Power BI reports, real-time streams — reads from and writes to the same OneLake. There is no copying data between a lake, a staging area, and a warehouse. When a data engineer transforms a Delta table in a notebook, that same table is immediately visible to a SQL developer in the Warehouse and to a Power BI report through Direct Lake.[3]

Open Formats by Default

OneLake stores everything in Delta Parquet format — the same open format used by Apache Spark, Databricks, and the broader data engineering ecosystem. You are never locked into a proprietary binary format that requires migration tools to escape.[3]

SaaS Simplicity

Fabric is delivered as software-as-a-service. You buy capacity (F-SKUs measured in Capacity Units), assign workspaces to that capacity, and start building. There is no infrastructure to provision, no Spark clusters to size, no storage accounts to configure. The platform manages the compute layer automatically.[1]

Lower Total Cost of Ownership

Organizations replacing five or six separately licensed Azure services with a single Fabric capacity often see significant reductions in cost, operational overhead, and time-to-delivery. We covered this in depth in the Fabric Cost Analysis post on SQLYARD.

Microsoft 365 Integration

Fabric is built into the Microsoft 365 trust boundary. Every workspace is backed by Entra ID. Sensitivity labels from Microsoft Purview Information Protection flow automatically through the platform.[4]

3. Why NOT Fabric? Honest Trade-offs

No platform is the right answer for every situation. Here is an honest look at scenarios where Fabric may not be the best choice.

Read this before you commit
These are real limitations. Understanding them before starting a project will save you significant pain later.
| Scenario | Why Fabric May Not Be the Right Fit | Alternative to Consider |
| --- | --- | --- |
| Heavy custom Spark engineering with fine-grained cluster control | Fabric Spark is managed — you cannot pin specific Spark versions, install arbitrary native libraries, or use custom Docker images. | Azure Databricks |
| Strict multi-cloud or cloud-agnostic requirements | Fabric is Microsoft-first. While OneLake supports S3 shortcuts, the control plane is Azure. | Databricks on AWS/GCP |
| Ultra-low latency OLTP workloads (<5ms response) | Fabric Warehouse and Lakehouse SQL endpoints are analytical engines, not OLTP databases. | Azure SQL Database, SQL Server |
| Legacy SSIS packages you cannot rewrite | Fabric does not run .dtsx files natively. | Azure Data Factory SSIS Integration Runtime |
| Strict on-premises data residency with no cloud allowance | Fabric is cloud-only with fixed regional anchoring per capacity. | SQL Server on-prem with SSAS, SSRS |
| Budget too small for a paid capacity | Trial capacities expire and free tiers are limited. | Azure Synapse Analytics Serverless |

For more on the Databricks vs Fabric decision, the SQLYARD comparison post walks through this in detail.

4. Every Major Fabric Component Explained

Fabric is organized into workloads — logical groupings of related capabilities. Below is every major workload and the key item types within each one.[1]

OneLake

The single logical data lake for your entire Fabric tenant. All data stored by any Fabric item lives in OneLake in Delta Parquet format. Supports Shortcuts to external data sources (ADLS, S3, Google Cloud Storage, Dataverse) without copying data.

Lakehouse

A data store combining a data lake with database capabilities (Delta tables, SQL analytics endpoint). The primary compute surface for data engineers. Supports Spark notebooks, pipelines, and SQL read access via an automatically generated SQL Analytics Endpoint.

Data Warehouse

A fully transactional T-SQL engine running over OneLake. Supports DML (INSERT, UPDATE, DELETE), DDL, views, stored procedures, and cross-database queries. Ideal for SQL Server DBAs and DW teams.

Data Factory (Pipelines)

The orchestration and ingestion engine. Supports 200+ native connectors, Copy Data activities, conditional branching, scheduling, and triggers. The modern replacement for SSIS in Fabric workflows.

Dataflow Gen2

A Power Query-based low-code ETL tool. Enables analysts and business users to connect to sources, apply transformations, and land data into Lakehouses or Warehouses without writing code.

Notebooks

Browser-based interactive code environments supporting Python (PySpark), Scala, SQL, and R. Directly attached to a Lakehouse. Support cell-level execution, magic commands, built-in visualization, and MLflow integration.

Semantic Models

Formerly called “datasets” in Power BI. A metadata and calculation layer between your data and reports. Can use Direct Lake, Import, or DirectQuery modes. Supports DAX measures, hierarchies, and relationships.

Power BI Reports & Dashboards

Interactive visualization layer. Built in Power BI Desktop or in-browser. Tightly integrated with Semantic Models and Direct Lake for fast analytics.

Real-Time Intelligence

Covers KQL Databases, Eventstreams for ingesting streaming data, and the Real-Time Hub. Designed for event-driven workloads — IoT telemetry, log analytics, clickstream data — where freshness matters in seconds.

Data Science

Notebooks with integrated MLflow experiment tracking, model registry, and scoring. Data scientists work on the same lake data as engineers. Publish predictions back to Delta tables or the Warehouse for BI consumption.

Mirroring

Replicates data from external operational databases (Azure SQL, Cosmos DB, Snowflake, Databricks Unity Catalog) into OneLake in near real-time. The mirrored data is immediately available as Delta tables.

Microsoft Purview (Governance)

Provides data catalog, lineage tracking, sensitivity labels, and policy enforcement across the entire Fabric tenant. Every Fabric item is automatically catalogued with full data lineage from source to report.

5. OneLake — The Foundation of Everything

OneLake is the most important concept in all of Microsoft Fabric. If you understand OneLake, everything else makes sense.[5]

The “OneDrive for Data” Analogy

Microsoft describes OneLake as “OneDrive for data.”[5] Just as OneDrive gives you a single place for all your documents that every Microsoft 365 app can access, OneLake gives your organization a single place for all your data that every Fabric workload can access. You create one workspace, one Lakehouse, and every tool — Spark, T-SQL, Power BI, Data Science — reads the same files. There is no copying data between systems because there are no separate systems.

Technical Architecture

OneLake is built on Azure Data Lake Storage Gen2.[5] It inherits ADLS Gen2’s hierarchical namespace, security model, and API compatibility. Existing tools that speak ADLS Gen2 — Azure Databricks, Azure Synapse, custom ADLS SDK applications — can connect directly to OneLake without modification.

The storage hierarchy is: Tenant → Workspace → Item (Lakehouse/Warehouse) → Tables or Files.
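This hierarchy maps directly onto the `abfss://` URIs used later in the workshop. As a small illustration (the helper function and its name are invented for this post, not a Fabric API), the path for any table can be assembled from the workspace, item, and table names:

```python
# Illustrative helper: build the OneLake abfss URI for a table.
# The URI pattern matches the paths used in the workshop section;
# workspace/item/table names are placeholders for your own.

def onelake_table_path(workspace: str, item: str, table: str,
                       item_type: str = "Lakehouse") -> str:
    return (f"abfss://{workspace}@onelake.dfs.fabric.microsoft.com/"
            f"{item}.{item_type}/Tables/{table}")

path = onelake_table_path("fabric_workshop", "lakehouse_silver", "silver_sales_orders")
print(path)
```

Any ADLS Gen2-aware client can then read that URI directly, which is exactly what the Spark code in Stage 3 does.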

Delta Parquet — The Universal Format

All tabular data in OneLake is stored in Delta Parquet format. Delta Lake adds a transaction log (_delta_log) on top of Parquet files that enables ACID transactions, versioning, time travel, and schema enforcement. This is why a Spark notebook, a T-SQL query, and a Power BI Direct Lake model can all read the same table without any conversion step.[5]

Why Delta Parquet matters
Before Fabric you often had data in CSV files in a data lake, a copy in a SQL warehouse table, and another copy imported into Power BI’s VertiPaq engine. Three copies, three potential inconsistencies, three places to update. Delta Parquet in OneLake collapses this to one copy that every engine reads natively.
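To make the transaction-log idea concrete, here is a deliberately simplified pure-Python sketch of how a Delta reader resolves a table version. It is not the real Delta implementation (the actual protocol also has checkpoints, metadata, and protocol actions), but it shows why versioning and time travel fall out of the log design:

```python
import json
import os
import tempfile

# Each commit in _delta_log is a JSON-lines file of "add"/"remove" file
# actions. Replaying the commits in order yields the set of live Parquet
# files; stopping the replay early is time travel to an older version.

def live_files(delta_log_dir, version=None):
    commits = sorted(f for f in os.listdir(delta_log_dir) if f.endswith(".json"))
    if version is not None:
        commits = commits[:version + 1]  # time travel: stop at that commit
    files = set()
    for name in commits:
        with open(os.path.join(delta_log_dir, name)) as fh:
            for line in fh:
                action = json.loads(line)
                if "add" in action:
                    files.add(action["add"]["path"])
                elif "remove" in action:
                    files.discard(action["remove"]["path"])
    return files

# Tiny demo log: commit 0 adds part-0; commit 1 replaces it with part-1.
log = tempfile.mkdtemp()
with open(os.path.join(log, "00000000000000000000.json"), "w") as fh:
    fh.write(json.dumps({"add": {"path": "part-0.parquet"}}) + "\n")
with open(os.path.join(log, "00000000000000000001.json"), "w") as fh:
    fh.write(json.dumps({"remove": {"path": "part-0.parquet"}}) + "\n")
    fh.write(json.dumps({"add": {"path": "part-1.parquet"}}) + "\n")

print(live_files(log))              # current version: {'part-1.parquet'}
print(live_files(log, version=0))   # time travel to v0: {'part-0.parquet'}
```

Because every engine resolves the table through the same log, Spark, T-SQL, and Direct Lake all see a consistent snapshot of the one copy.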

OneLake Shortcuts

A Shortcut is a reference to data stored somewhere else — another Lakehouse, an external Azure Data Lake, an Amazon S3 bucket, a Google Cloud Storage bucket, or Dataverse.[5] Shortcuts behave like symbolic links. They appear as a folder in your Lakehouse but contain no actual data copy. When Spark or SQL queries a Shortcut, the storage layer transparently fetches the data from the source with intelligent caching to reduce egress costs.

V-Order Optimization

When Fabric writes Delta tables it can apply V-Order — a Microsoft-developed Parquet write-time optimization that sorts, compresses, and encodes data specifically for the VertiPaq engine used by Power BI. V-Order writes are slightly slower but reads are dramatically faster, often eliminating the need for explicit pre-aggregation of large tables.
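V-Order can be toggled per Spark session. The snippet below is a sketch rather than a definitive reference: the exact property name has varied across Fabric runtime versions (the one shown is the form documented for earlier runtimes), so confirm it against the Fabric documentation for your runtime before relying on it.

```python
# Sketch: toggle V-Order writes for the current Fabric Spark session.
# Property name is runtime-version dependent -- verify before use.
spark.conf.set("spark.sql.parquet.vorder.enabled", "true")

# Delta writes from this session now emit V-Ordered Parquet, which the
# VertiPaq engine can scan faster through Direct Lake.
df_silver.write.mode("overwrite").format("delta").saveAsTable("silver_sales_orders")
```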

6. Lakehouse — Where Your Data Lives and Transforms

A Lakehouse in Microsoft Fabric combines the flexibility and scale of a data lake with the querying convenience of a database.[6]

The Two Storage Areas

  • Files area: Raw object storage. Land CSV, Parquet, JSON, images, binary blobs — anything. This is your Bronze landing zone. Files here are not queryable via SQL until converted to Delta tables.
  • Tables area: Delta tables registered in the metastore. Any Delta table written here is automatically visible to Spark notebooks, SQL queries, and Power BI. This is where Silver and Gold layer data lives.

SQL Analytics Endpoint

Every Lakehouse automatically generates a SQL Analytics Endpoint — a read-only T-SQL connection point that exposes all Delta tables in the Tables area.[6] You can connect SQL Server Management Studio, Azure Data Studio, or any TDS-compatible tool directly to this endpoint and query your lake data without writing a single line of Spark.

Lakehouse vs. Warehouse — the key distinction
Both can store and serve Delta data over SQL. The Lakehouse is Spark-primary with a SQL endpoint added on top. The Warehouse is SQL-primary with a Delta backend. If your team writes PySpark, use a Lakehouse. If your team writes T-SQL, use a Warehouse. For most organizations you will use both — Lakehouse for engineering, Warehouse for serving.

7. Data Warehouse — T-SQL at Lake Scale

The Fabric Data Warehouse is a fully transactional T-SQL engine that separates compute from storage and stores data in Delta Parquet format on OneLake.[7] Every table in your Warehouse is also readable by Spark notebooks and other OneLake-connected tools.

Key Capabilities

  • Full DML support: INSERT, UPDATE, DELETE, MERGE
  • DDL: CREATE TABLE, CREATE VIEW, CREATE PROCEDURE, CREATE SCHEMA
  • Cross-database queries against other Warehouses and Lakehouses in the same workspace
  • Result set caching for repeated analytical queries
  • Data clustering for physical data organization (see SQLYARD post on Data Clustering)
  • COPY INTO from OneLake, ADLS, and external storage
  • OPENROWSET for ad-hoc file querying without loading

When to Choose Warehouse Over Lakehouse

| Use Case | Lakehouse | Warehouse |
| --- | --- | --- |
| PySpark / Scala transformations | ✅ Primary tool | ❌ Not applicable |
| T-SQL DML (UPDATE/DELETE/MERGE) | ❌ Read-only endpoint | ✅ Full support |
| Stored procedures and business logic in SQL | ❌ Not supported | ✅ Supported |
| Row-level security via SQL roles | Limited | ✅ Full RBAC |
| Serving Gold layer to SQL-heavy BI tools | Acceptable | ✅ Preferred |
| Cross-database join to multiple sources | Via Spark | ✅ Native T-SQL syntax |

The recommended pattern at SQLYARD
Use a Lakehouse for Bronze and Silver layers. Use a Warehouse for your Gold serving layer. Both share the same underlying OneLake data — there is no ETL between them, just a SQL cross-database reference.

8. Pipelines and Data Factory — Orchestrating the Flow

Fabric Pipelines are the orchestration engine for everything that needs to run in sequence, on a schedule, or conditionally. If you have an SSIS or Azure Data Factory background, Pipelines will feel immediately familiar.[8]

Core Pipeline Activities

  • Copy Data: Ingests data from 200+ supported connectors (SQL Server, Oracle, Salesforce, REST APIs, SFTP, SAP, SharePoint, etc.) into a Lakehouse or Warehouse.
  • Notebook activity: Executes a Fabric Notebook as a pipeline step. Pass parameters to drive dynamic behavior.
  • Dataflow Gen2 activity: Runs a Dataflow Gen2 as an orchestrated step.
  • Stored Procedure activity: Calls a stored procedure in a Warehouse for post-load transformations or data quality checks.
  • Get Metadata / If Condition / ForEach / Until: Control flow activities for conditional branching and loops.
  • Fail activity: Explicitly fail a pipeline with a custom error message when a validation check fails.

SSIS vs. Pipelines: The Migration Reality

Fabric does not run .dtsx SSIS packages natively. Three practical migration paths:

  1. Rebuild as Pipelines + Dataflow Gen2 — the recommended long-term path.
  2. Lift to Azure Data Factory SSIS Integration Runtime — run existing packages in Azure while gradually rebuilding.
  3. Hybrid: SSIS lands files to OneLake-connected storage, Fabric pipelines ingest and process from there.

Scheduling and Triggers

Pipelines can be triggered by a schedule, a storage event (a new file lands in OneLake), or called from another pipeline. A common production pattern is a daily scheduled pipeline running the full Bronze → Silver → Gold chain with email notification on failure.

9. Dataflow Gen2 — Low-Code Transformations

Dataflow Gen2 is a Power Query-based low-code ETL tool built into Fabric. Analysts and business users can connect to sources, apply transformation steps using a familiar Excel-like interface, and land results in a Lakehouse or Warehouse — without writing code.[9]

When to Use Dataflow Gen2

  • Connecting to Excel, SharePoint lists, or CSV files with cleaning and reshaping requirements
  • Flattening nested JSON or XML structures without Spark
  • Business analyst-owned transformations that don’t need PySpark
  • Rapid prototyping of ingestion logic before formalizing in notebooks
  • Teams with Power BI dataflow experience migrating to Fabric
When NOT to use Dataflow Gen2
For large-scale transformations over tens or hundreds of millions of rows, PySpark notebooks are more efficient. Dataflow Gen2 is best suited to moderate-volume, business-analyst-friendly transformations.

10. Notebooks and Spark — Code-First Engineering

Fabric Notebooks are browser-based interactive coding environments attached to a Lakehouse, running on managed Apache Spark.[10] They support Python (PySpark), Scala, SQL magic commands (%%sql), and R.

PySpark Patterns You Will Use Every Day

# Read a CSV from the Files/bronze area
df_raw = spark.read.option("header", True).option("inferSchema", True).csv("Files/bronze/sales/raw/")

from pyspark.sql.functions import col, trim, to_date, when, lit

df_silver = (
    df_raw
    .withColumn("order_date", to_date(col("order_date"), "yyyy-MM-dd"))
    .withColumn("customer_name", trim(col("customer_name")))
    .withColumn("status", when(col("status").isNull(), lit("Unknown")).otherwise(col("status")))
    .dropDuplicates(["order_id"])
    .filter(col("order_id").isNotNull())
)

df_silver.write.mode("overwrite").format("delta").saveAsTable("silver_sales_orders")

spark.sql("OPTIMIZE silver_sales_orders ZORDER BY (order_date)")
spark.sql("VACUUM silver_sales_orders RETAIN 168 HOURS")
OPTIMIZE and VACUUM — do not skip these
OPTIMIZE compacts small Parquet files and applies Z-ordering to speed up range queries. VACUUM removes old file versions no longer needed. Always run these after bulk writes to production tables.

Notebook Parameters

When a Notebook is run from a Pipeline, you can pass parameters to control its behavior — the date partition to process, the source table name, or a processing mode flag. Toggle the Parameters cell marker on the first code cell and declare your variables there.
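As a minimal illustration (the variable names here are invented for this example), a parameters cell is just ordinary Python assignments that serve as defaults; when the notebook runs from a pipeline, base parameters with matching names replace these values at run time:

```python
# Contents of the cell toggled as the "Parameters" cell.
# Defaults apply to interactive runs; a pipeline Notebook activity
# overrides any of them by passing a base parameter of the same name.
process_date = "2024-01-01"    # date partition to process
source_table = "bronze_sales"  # which Bronze table to read
full_reload = False            # True = ignore watermark, reprocess everything

print(f"Processing {source_table} for {process_date} (full_reload={full_reload})")
```

Downstream cells then use these variables to build paths and filters, which is what makes one notebook reusable across dates and sources.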

11. Direct Lake — BI Without Import Refresh

Direct Lake is one of the most technically significant innovations in the entire Fabric platform.[11]

The Problem: Import vs. DirectQuery

  • Import mode copies data into Power BI’s VertiPaq engine. Reports are fast but data goes stale between scheduled refreshes. Large datasets hit memory limits.
  • DirectQuery mode queries the source live on every report interaction. Always fresh but slow for complex analytical models at scale.

Direct Lake: The Best of Both

Direct Lake is a third mode unique to Fabric. The VertiPaq engine reads Delta Parquet files in OneLake directly — not via a database query, and not via a pre-loaded import. The result is import-speed query performance on always-current lake data, with no scheduled refresh required.[11]

Direct Lake limitations to know
Direct Lake falls back to DirectQuery mode automatically if it encounters features it cannot handle natively — very complex DAX, certain relationship types, or row-level security scenarios. At lower SKUs (F2, F4) there are limits on rows and columns that can be framed into memory simultaneously.

For a full deep dive, see Part 2: Direct Lake in Microsoft Fabric on SQLYARD.

12. Real-Time Intelligence — Streaming and Event Data

For IoT sensors, financial feeds, application logs, and user clickstreams, you need data fresh in seconds. This is where Real-Time Intelligence in Fabric comes in.[12]

Key Real-Time Components

  • Eventstream: An ingestion pipeline for streaming data. Connect to Azure Event Hubs, IoT Hub, Kafka, or custom REST sources. Apply real-time transformations and route to a KQL Database or Lakehouse table.
  • KQL Database (Kusto): A time-series optimized analytical store supporting Kusto Query Language. Ingest millions of events per second with sub-second query latency.
  • Real-Time Hub: A tenant-wide catalog of all streaming data sources. Discover, connect to, and share streams across workspaces without duplicating setup.
  • Activator: A rules-based alerting engine. Define conditions on streaming data and trigger Power Automate flows, Teams messages, or pipeline runs automatically.

See Part 3: Real-Time Semantic Models on SQLYARD for deeper coverage.

13. The Medallion Architecture — Deeply Explained

The Medallion Architecture is the most important design pattern for organizing data in Microsoft Fabric. Microsoft recommends it as the standard approach for Fabric lakehouses.[13]

The Core Idea

Data quality degrades at the source and improves with each transformation. The Medallion model formalizes this by dividing your data lake into three distinct quality layers, each named after a precious metal.

🥉 Bronze — Raw Data

Data in its original format exactly as it arrived from the source. No transformation. No cleansing. No schema enforcement. This is your immutable archive and source of truth for reprocessing.

🥈 Silver — Cleansed & Conformed

Validated, deduplicated, standardized data. Nulls handled. Data types enforced. Dates normalized. Business keys validated. Cross-source entities resolved to a single key. Delta tables. Still row-level, not yet aggregated.

🥇 Gold — Business-Ready

Aggregated, modeled, and business-contextualized data. Star schema dimension and fact tables. Pre-computed KPIs. Subject-area data marts. Every table in Gold has a clear business owner and a defined refresh SLA.

Why Three Layers Instead of One?

Collapsing everything into a single layer creates four recurring failure modes:

  • No recovery path: If the transformation logic had a bug and you only have one layer, the original data is overwritten and gone.
  • Schema coupling: Every time the source system changes its schema, the report breaks.
  • Mixing concerns: Raw ingestion logic, cleansing rules, and analytical aggregation code all in one place makes it impossible to isolate and fix issues.
  • No audit trail: Regulatory requirements often require you to prove what the raw data looked like before transformation.

Three Recommended Physical Patterns

Pattern 1 — Three Lakehouses (Pure Lake)
Create a separate Lakehouse for Bronze, Silver, and Gold, each in its own workspace. Business users query the Gold Lakehouse via its SQL Analytics Endpoint. Best for Spark-heavy teams.
Pattern 2 — Two Lakehouses + One Warehouse (Recommended)
Bronze and Silver as Lakehouses, Gold as a Warehouse. Full T-SQL capability in the Gold serving layer while preserving Spark flexibility for engineering. Most SQL Server DBAs will feel at home here. See the SQLYARD Medallion post for full walkthrough.
Pattern 3 — Single Lakehouse with Folder Structure
A single Lakehouse with /Files/bronze, /Files/silver, and /Tables/gold. Simpler for small teams but does not provide layer-level governance or separate access control. Recommended for proof-of-concept only.

ACID Compliance and the Medallion Model

Delta Lake’s ACID transaction support means each write to a Silver or Gold Delta table is a transaction. If a notebook fails mid-write, the table is not corrupted — the incomplete transaction is rolled back and the previous version remains intact.[13]

Medallion Layer Reference

| Attribute | Bronze | Silver | Gold |
| --- | --- | --- | --- |
| Data quality | Raw, unvalidated | Validated, conformed | Aggregated, modeled |
| Format | Files (CSV, Parquet, JSON) or Delta | Delta tables | Delta tables |
| Schema enforcement | None | Enforced | Enforced, curated |
| Who writes it | Pipelines (raw ingest) | Notebooks, Dataflow Gen2 | Notebooks, Warehouse T-SQL |
| Who reads it | Data engineers only | Engineers, data scientists | Analysts, BI reports, business users |
| Reprocessable? | Is the source | Yes, from Bronze | Yes, from Silver |
| Fabric item type | Lakehouse (Files area) | Lakehouse (Tables area) | Lakehouse or Warehouse |

14. Hands-On Workshop — Beginner to Advanced

This workshop builds a complete Medallion lakehouse from scratch. Complete stages in order — each depends on the previous one. You will finish with a working Bronze → Silver → Gold pipeline connected to a Power BI Direct Lake report.

Prerequisites
A Microsoft Fabric trial (free at app.fabric.microsoft.com) or paid Fabric capacity. A Microsoft Entra ID account. SQL familiarity assumed; Python experience helpful for advanced stages.

Stage 1 — Beginner: Set Up Your Workspace and Lakehouse

Step 1.1 — Create a Workspace
Go to app.fabric.microsoft.com. Click Workspaces → New Workspace. Name it fabric_workshop. Assign a Fabric capacity or select Trial. Click Apply.
Step 1.2 — Create Three Lakehouses
Click New Item → Lakehouse. Create lakehouse_bronze, lakehouse_silver, and lakehouse_gold.
Step 1.3 — Explore the Lakehouse Explorer
Open lakehouse_bronze. The left pane shows Tables (Delta tables) and Files (raw object storage). The SQL Analytics Endpoint icon in the top bar switches to a T-SQL view of the same Lakehouse.
Step 1.4 — Upload a Sample CSV (Bronze Landing)
Create a sales CSV with columns: order_id, order_date, customer_id, customer_name, product, quantity, amount, country. In lakehouse_bronze, click Files → … → New subfolder, create sales/raw. Upload your CSV there.

Stage 2 — Beginner/Intermediate: First Pipeline and Bronze Ingestion

Step 2.1 — Create a Pipeline
Click New Item → Data pipeline. Name it pipeline_bronze_ingest.
Step 2.2 — Add a Copy Data Activity
Click Add pipeline activity → Copy data. Configure your source connector. Set the destination to lakehouse_bronze → Files/sales/raw/.
Step 2.3 — Schedule the Pipeline
Click the Schedule button. Enable daily schedule at your desired time. Save. Bronze ingestion is now automated.

Stage 3 — Intermediate: Silver Transformation with PySpark

Step 3.1 — Create a Notebook
Click New Item → Notebook. Name it nb_bronze_to_silver. Add lakehouse_bronze as the default Lakehouse in the left pane.
Step 3.2 — Write the Bronze → Silver Transform
from pyspark.sql.functions import col, trim, to_date, when, lit, upper
from pyspark.sql.types import DecimalType

bronze_path = "Files/sales/raw/"
df_raw = spark.read.option("header", True).option("inferSchema", True).csv(bronze_path)
print(f"Bronze row count: {df_raw.count()}")

df_silver = (
    df_raw
    .withColumn("order_date", to_date(col("order_date"), "yyyy-MM-dd"))
    .withColumn("customer_name", trim(col("customer_name")))
    .withColumn("product", trim(col("product")))
    .withColumn("country", upper(trim(col("country"))))
    .withColumn("quantity", when(col("quantity").isNull(), lit(0)).otherwise(col("quantity").cast("int")))
    .withColumn("amount", col("amount").cast(DecimalType(18,2)))
    .dropDuplicates(["order_id"])
    .filter(col("order_id").isNotNull())
)

silver_table_path = "abfss://fabric_workshop@onelake.dfs.fabric.microsoft.com/lakehouse_silver.Lakehouse/Tables/silver_sales_orders"

df_silver.write.mode("overwrite").option("overwriteSchema", "true").format("delta").save(silver_table_path)
spark.sql(f"OPTIMIZE delta.`{silver_table_path}` ZORDER BY (order_date, customer_id)")
print("Silver write complete.")
Step 3.3 — Validate in SQL Analytics Endpoint
Open lakehouse_silver. Click the SQL Analytics Endpoint icon and run:
SELECT TOP 10 * FROM silver_sales_orders ORDER BY order_date DESC;

Stage 4 — Intermediate: Gold Layer with Star Schema

Step 4.1 — Create the Gold Notebook
Create notebook nb_silver_to_gold. Add both lakehouse_silver and lakehouse_gold in the left pane.
Step 4.2 — Build Dimension and Fact Tables
df_sales = spark.read.format("delta").load(
    "abfss://fabric_workshop@onelake.dfs.fabric.microsoft.com/lakehouse_silver.Lakehouse/Tables/silver_sales_orders"
)
gold_base = "abfss://fabric_workshop@onelake.dfs.fabric.microsoft.com/lakehouse_gold.Lakehouse/Tables/"

# Dim Customer
df_dim_customer = df_sales.select("customer_id","customer_name","country").dropDuplicates(["customer_id"]).withColumnRenamed("customer_id","customer_key")
df_dim_customer.write.mode("overwrite").format("delta").save(gold_base + "dim_customer")

# Dim Date
from pyspark.sql.functions import col, year, month, dayofmonth, date_format, quarter
df_dim_date = (df_sales.select("order_date").dropDuplicates()
    .withColumn("year", year("order_date"))
    .withColumn("month", month("order_date"))
    .withColumn("quarter", quarter("order_date"))
    .withColumn("month_name", date_format("order_date","MMMM"))
    .withColumn("year_month", date_format("order_date","yyyy-MM")))
df_dim_date.write.mode("overwrite").format("delta").save(gold_base + "dim_date")

# Fact Sales
df_fact_sales = df_sales.select(
    col("order_id").cast("string"),
    col("order_date"),
    col("customer_id").alias("customer_key"),
    col("product"),
    col("quantity").cast("int"),
    col("amount").cast("decimal(18,2)"))
df_fact_sales.write.mode("overwrite").format("delta").save(gold_base + "fact_sales")
spark.sql(f"OPTIMIZE delta.`{gold_base}fact_sales` ZORDER BY (order_date, customer_key)")
print("Gold layer complete.")

Stage 5 — Intermediate/Advanced: Master Pipeline

Step 5.1 — Create the Master Pipeline
Create pipeline pipeline_master_etl.
Step 5.2 — Chain Notebook Activities
Add three Notebook activities in sequence: (1) nb_bronze_to_silver, (2) nb_silver_to_gold on success of step 1, (3) optional data quality check. Connect with success arrows so each step only runs if the previous succeeded.
Step 5.3 — Add Error Handling
Add a Fail activity on each notebook’s failure path. Configure email notifications in Monitoring settings so pipeline owners are alerted on failure.

Stage 6 — Advanced: Direct Lake Semantic Model and Power BI

Step 6.1 — Create a Semantic Model from Gold
Open lakehouse_gold SQL Analytics Endpoint. Click New semantic model. Select fact_sales, dim_customer, dim_date. Confirm. Fabric creates the model in Direct Lake mode.
Step 6.2 — Define Relationships
In the Model view create: fact_sales[customer_key] → dim_customer[customer_key] (Many-to-One) and fact_sales[order_date] → dim_date[order_date] (Many-to-One).
Step 6.3 — Add DAX Measures
Total Revenue = SUM(fact_sales[amount])
Order Count = DISTINCTCOUNT(fact_sales[order_id])
Avg Order Value = DIVIDE([Total Revenue], [Order Count])
Revenue YTD = TOTALYTD([Total Revenue], dim_date[order_date])
Step 6.4 — Build and Publish the Report
Click New report. Add a Matrix visual with year_month on rows, country on columns, Total Revenue as values. Add a Revenue YTD line chart. Publish to your workspace.

Stage 7 — Advanced: Incremental Processing and Watermarking

Instead of reprocessing all of Bronze on every run, persist a watermark timestamp and load only the rows ingested since the last successful run:

from pyspark.sql.functions import col, lit, max as spark_max
from datetime import datetime

watermark_table = "abfss://fabric_workshop@onelake.dfs.fabric.microsoft.com/lakehouse_silver.Lakehouse/Tables/etl_watermark"

try:
    df_watermark = spark.read.format("delta").load(watermark_table)
    last_processed = df_watermark.filter(col("table_name") == "silver_sales_orders").select("last_updated_ts").collect()[0][0]
except Exception:
    # First run: the watermark table does not exist yet, so process everything
    last_processed = datetime(2000, 1, 1)

print(f"Last processed: {last_processed}")

df_new = spark.read.format("delta").load("abfss://.../lakehouse_bronze.Lakehouse/Tables/bronze_sales").filter(col("ingested_ts") > lit(last_processed))

# transform_to_silver is the Stage 3 cleansing logic, wrapped as a function
df_new_silver = transform_to_silver(df_new)

from delta.tables import DeltaTable
silver_dt = DeltaTable.forPath(spark, "abfss://.../lakehouse_silver.Lakehouse/Tables/silver_sales_orders")
silver_dt.alias("target").merge(df_new_silver.alias("source"), "target.order_id = source.order_id").whenMatchedUpdateAll().whenNotMatchedInsertAll().execute()

new_max_ts = df_new_silver.select(spark_max("ingested_ts")).collect()[0][0]
# Caution: mode("overwrite") replaces the entire watermark table. If you track
# watermarks for more than one table, MERGE the single row instead.
spark.createDataFrame([("silver_sales_orders", new_max_ts)], ["table_name","last_updated_ts"]).write.mode("overwrite").format("delta").save(watermark_table)
print(f"Watermark updated to: {new_max_ts}")

15. Governance, Domains, and Security

An analytics platform without governance is a liability. Fabric provides a layered governance model that maps onto how most enterprises are already organized.[4]

Tenant-Level Administration

The Fabric Admin portal controls tenant-wide policies: which workload types are enabled, who can create workspaces, external sharing permissions, and data residency settings. For enterprise deployments, restrict workspace creation to approved groups and require capacity assignment review before any new workspace goes live.

Workspaces and Roles

Every workspace has four built-in roles: Admin, Member, Contributor, and Viewer. For Medallion deployments a common pattern is: data engineering team has Contributor on Bronze and Silver; BI team has Viewer on Silver and Contributor on Gold; data consumers have Viewer on Gold only.

Domains

Domains are a logical grouping mechanism above workspaces. A Finance domain might contain Finance Bronze, Silver, Gold, and Reports workspaces. Domains enable domain-level administrators, sensitivity policies, and discovery in the OneLake Catalog — the foundation of a data mesh architecture in Fabric.

Microsoft Purview Integration

Every Fabric item is automatically catalogued in Microsoft Purview. Sensitivity labels propagate automatically from Lakehouse tables to any Power BI report consuming them. Lineage graphs show exactly how data flows from source through Bronze, Silver, Gold, and into reports.

Row-Level Security

In the Warehouse, RLS is implemented via T-SQL security predicates — the same mechanism used in SQL Server. In semantic models, RLS is defined using DAX filter expressions. Note the interaction with Direct Lake: RLS defined in the semantic model is enforced natively, while RLS defined at the SQL endpoint causes Direct Lake queries to fall back to DirectQuery mode so the predicates are still applied.
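As a sketch of the Warehouse-side predicate mechanism, assuming a hypothetical dbo.fact_sales table with a SalesRegion column and a hypothetical dbo.RegionAccess mapping table (neither appears earlier in this guide), the policy uses the same T-SQL pattern as SQL Server:

```sql
-- Hypothetical mapping table: which user may see which region
-- dbo.RegionAccess(UserName sysname, SalesRegion varchar(50))

-- Inline predicate function: returns a row only when the caller
-- is mapped to the row's region
CREATE FUNCTION dbo.fn_region_predicate (@SalesRegion varchar(50))
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN
    SELECT 1 AS allowed
    FROM dbo.RegionAccess
    WHERE SalesRegion = @SalesRegion
      AND UserName = USER_NAME();
GO

-- Bind the predicate to the fact table; filtering then happens
-- automatically for every query against dbo.fact_sales
CREATE SECURITY POLICY dbo.SalesRegionFilter
ADD FILTER PREDICATE dbo.fn_region_predicate(SalesRegion)
ON dbo.fact_sales
WITH (STATE = ON);
```

Once the policy is enabled, users querying the table see only the rows their RegionAccess entries permit; no per-report filtering logic is needed.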

16. Cost and Capacity Planning

Fabric is billed by Fabric Capacity Units (CUs). Every workload — Spark, Warehouse queries, pipelines, Power BI refreshes — consumes CUs from the assigned capacity.[14]

Key Cost Principles

  • Capacity smoothing: Short peaks above capacity are allowed and smoothed over the following 24 hours rather than causing immediate throttling.
  • Pause and resume: Pause capacities when not in use — you are not charged for paused capacity. Automate with the Fabric REST API or Azure Automation.
  • OneLake storage: Billed separately per GB at low per-TB rates. Storage is rarely the dominant cost.
  • Reservation discounts: 1-year and 3-year commitments reduce CU costs by 40–60% vs pay-as-you-go.
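The pause-and-resume automation mentioned above can be sketched against the Azure Resource Manager REST API, which exposes suspend and resume actions on Microsoft.Fabric/capacities resources. The subscription, resource group, and capacity names below are hypothetical, and the api-version is an assumption to verify against current documentation:

```python
import urllib.request

API_VERSION = "2023-11-01"  # assumed; verify against current ARM docs

def capacity_action_url(subscription_id: str, resource_group: str,
                        capacity_name: str, action: str) -> str:
    """Build the ARM URL for a Fabric capacity 'suspend' or 'resume' action."""
    return (
        "https://management.azure.com"
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.Fabric/capacities/{capacity_name}"
        f"/{action}?api-version={API_VERSION}"
    )

def pause_capacity(bearer_token: str, sub: str, rg: str, name: str) -> int:
    """POST the suspend action with an Azure AD token; returns the HTTP status."""
    req = urllib.request.Request(
        capacity_action_url(sub, rg, name, "suspend"),
        method="POST",
        headers={"Authorization": f"Bearer {bearer_token}"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

# No live call here; just show the URL that would be posted
url = capacity_action_url("00000000-0000-0000-0000-000000000000",
                          "rg-analytics", "fabric-cap-01", "suspend")
print(url)
```

Scheduling this from Azure Automation (pause after business hours, resume each morning) is one of the simplest levers for cutting capacity spend on non-production workloads.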
SKU     CUs    Recommended for
F2      2      Proof of concept, individual developers
F4      4      Small team development
F8      8      Small production workloads
F16     16     Medium production analytics
F32     32     Full Direct Lake features enabled
F64     64     Enterprise analytics, Direct Lake at full performance
F128+   128+   Large-scale enterprise, multiple heavy workloads

For a detailed cost breakdown and right-sizing strategies, see the Fabric Cost Analysis post on SQLYARD.[17]

17. Final Thoughts

Microsoft Fabric represents the most significant shift in the Microsoft data platform since Azure launched. It is not an incremental improvement over Synapse or ADF — it is a ground-up rethinking of how analytics infrastructure should work: unified storage, shared compute, open formats, and a single governance model spanning every workload from raw ingestion to published report.

The Medallion Architecture gives Fabric deployments the structural discipline they need to stay maintainable as they grow. Bronze keeps your raw data safe. Silver makes it technically trustworthy. Gold makes it business-useful. When something breaks — and things always break in data — you have a clear recovery path at every layer.

For SQL Server and Azure SQL DBAs, Fabric is not a replacement for your existing skills — it is an extension of them. The Warehouse speaks T-SQL. The SQL Analytics Endpoint in the Lakehouse speaks T-SQL. Direct Lake removes the import/refresh complexity that made large Power BI deployments painful. Your SQL expertise is directly transferable and immediately valuable.

If you are just getting started: create a trial, build a workspace, drop a CSV into a Lakehouse, write three lines of PySpark, and query the result in the SQL Analytics Endpoint. The entire loop takes under 30 minutes and everything you learn scales directly to an enterprise deployment.
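That quickstart loop can be sketched in a few PySpark lines run inside a Fabric notebook, where the `spark` session is predefined; the file path and table name below are hypothetical:

```python
# Assumes a CSV already uploaded to the Lakehouse Files area
df = spark.read.option("header", True).csv("Files/raw/sales.csv")

# Saving as a managed table writes Delta into the Lakehouse Tables area,
# which makes it immediately queryable from the SQL Analytics Endpoint
df.write.mode("overwrite").format("delta").saveAsTable("sales_quickstart")

print(f"Loaded {df.count()} rows into sales_quickstart")
```

Switching to the SQL Analytics Endpoint, the same table is then available to plain T-SQL, for example SELECT COUNT(*) FROM sales_quickstart.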

18. Fabric Data Agent — Conversational AI Over Your Data

The Fabric Data Agent is one of the most practical AI features in the entire Fabric platform. Announced at Microsoft Ignite 2025 and currently in preview, it lets any user — technical or not — ask plain-English questions about data stored in OneLake and receive accurate, governed answers without writing a single line of SQL, DAX, or KQL.[21]

What Problem Does It Solve?

In most organizations, getting an answer from data still requires either writing a query, building a report, or filing a ticket with the BI team. Executives, operations managers, and sales leads who need quick answers are blocked by the need for technical skills they don’t have.

At their core, Fabric data agents are interoperable AI agents that connect your enterprise data to the agentic web — think of them as virtual analysts, each serving as an expert on a specific domain of data. A user can simply ask “What were total sales in California last quarter?” and the agent writes the query, executes it against the right data source, and returns a readable answer with the underlying query visible for transparency.

How It Works

The Fabric data agent uses large language models (LLMs) to let users interact with their data in natural language. It is built on the Azure OpenAI Assistants API and operates as an agent: it plans, queries, and answers rather than returning a single completion. When a question is submitted:

  1. The agent parses the natural language question
  2. It identifies the most relevant data source using the user’s own credentials and permissions
  3. It generates the appropriate query — T-SQL, DAX, or KQL depending on the source
  4. It executes the query and returns a plain-English answer alongside the query itself

Critically, the agent never bypasses security. It operates entirely within the requesting user’s existing row-level security, column-level security, and workspace permissions. If a user cannot see certain data directly, the agent cannot surface it either.

Supported Data Sources

A Fabric data agent supports up to five data sources in any combination, including lakehouses, warehouses, KQL databases, Power BI semantic models, ontologies, and Microsoft Graph. This means a single agent can answer questions that span your Gold lakehouse, your Warehouse, and your existing Power BI semantic models simultaneously — without the user knowing or caring which system the answer came from.

Setting Up a Data Agent

Step 1 — Create the Agent

In your workspace, click New Item and search for Fabric data agent. Give it a descriptive name that reflects its domain (e.g., agent_sales_analytics).

Step 2 — Connect Data Sources

The OneLake catalog opens automatically. Select up to five data sources — your Gold Lakehouse, Warehouse, or Power BI semantic models. Add each one individually. Only tables the agent owner has permission to access will appear.

Step 3 — Write Agent Instructions

This is the most important step. Write plain-English instructions that tell the agent what it covers and how to interpret your data:

    You are a sales analytics assistant for SQLYARD Corp.
    You have access to sales orders, customer data, and product tables.
    - "Revenue" always means the sum of the amount column in fact_sales
    - "Active customer" means a customer with at least one order in the last 90 days
    - Always filter to the current fiscal year unless the user specifies otherwise
    - Do not return individual customer PII — aggregate only

Step 4 — Add Example Questions

Provide 5–10 sample questions with their expected SQL/DAX to train the agent’s reasoning. The more specific your examples, the more accurately it responds to real questions.

Step 5 — Test, Refine, and Publish

Test with real questions from business users. Refine instructions based on where it gets things wrong. Once satisfied, click Publish. Share with colleagues — they only need Read permission on the underlying data sources, not workspace access.

Integration with Copilot, Teams, and Copilot Studio

Once published, the Data Agent can be surfaced in multiple places. Business users can query it directly from Copilot in Power BI without switching applications. Developers can connect it to Microsoft Copilot Studio to build custom organizational copilots that call the agent as part of a broader workflow. It also supports the Model Context Protocol (MCP), enabling interoperability with other AI agents across the Microsoft ecosystem.

Current preview limitations to know
  • Maximum five data sources per agent
  • Best accuracy with approximately 25 or fewer tables
  • Currently supports queries in English only
  • Read-only — no write-back or data modification
  • Does not perform advanced analytics, machine learning, or causal inference — it retrieves and summarizes structured data
  • Agent usage consumes Fabric capacity CUs — monitor usage to avoid throttling other workloads
Best use cases right now
  • Executive KPI summaries without navigating dashboards
  • Sales and operations teams getting quick ad-hoc answers
  • Reducing BI team ticket volume for simple data questions
  • Embedding conversational data access in Teams or internal portals

19. Fabric IQ — The Semantic Intelligence Layer (Preview)

Fabric IQ is the newest and most ambitious addition to the Microsoft Fabric platform. Announced at Microsoft Ignite 2025, Fabric IQ is a unified intelligence platform powered by semantic understanding and agentic AI — and it represents a fundamental shift in what Fabric is designed to do. If the rest of Fabric is about storing, moving, and querying data, Fabric IQ is about giving that data meaning.[22]

The Problem Fabric IQ Solves

Even the best-engineered data platform has a blind spot: it stores data but not meaning. Your Gold lakehouse has a customer_id column, but it doesn’t know that in your business a “customer” is different from a “prospect,” that “revenue” excludes returns, or that an “active asset” means something specific to your operations team. That semantic knowledge lives in people’s heads, in documentation, and in the inconsistent definitions scattered across reports and spreadsheets. Without that grounding, AI cannot reason about cascading effects, constraints, or objectives. It cannot make decisions you can trust because it lacks context, policy, and meaning.

Fabric IQ solves this by giving you a place to formally define that business meaning — and making it available to every tool, report, and AI agent across your Fabric environment.

Key Components of Fabric IQ

Ontology (Preview)

At the core of Fabric IQ is the Ontology item, which introduces the semantic foundation that connects people, processes, systems, actions, rules, and data into a unified ontology. You define your business entities (Customer, Product, Asset, Order), their properties, and the relationships between them. You then bind those entities to actual tables in your Lakehouses, Warehouses, and semantic models. Once bound, every AI agent, report, and query that touches those tables understands what they represent in business terms — not just raw column names.

Fabric Graph

Once your ontology is defined, Fabric automatically builds a navigable Fabric Graph — a visual, queryable representation of how your business entities relate to each other. This is not a diagram tool; it is a live, queryable knowledge graph. AI agents can traverse it to understand multi-hop relationships: a supply chain disruption affecting a supplier affects components, which affects products, which affects orders — the graph makes that chain of reasoning explicit and automated.

Operations Agent (Preview)

Operations Agents in Fabric continuously monitor your business in real time, reason over live conditions, evaluate trade-offs, and automatically take actions to advance desired business outcomes. They use the ontology definitions, rules, and actions as their playbook. For example: an Operations Agent monitoring a logistics network can detect when delivery constraints are violated, evaluate alternative routes against business rules, and trigger the appropriate workflow — without a human in the loop.

Plan (Preview)

Plan removes the need for separate planning tools or spreadsheet-based workflows, allowing organizations to bring goals, plans, and actual results together on shared semantic models. Finance teams can build budgets, forecasts, targets, and scenario models directly on top of governed Fabric data. Projections write back securely and flow immediately through connected Power BI reports — no export to Excel, no version control nightmare, no stale numbers.

How Fabric IQ Connects to Everything Else

Fabric IQ enables the unification of analytical and operational data by combining data from various sources across OneLake — like lakehouses, eventhouses, and Power BI semantic models — into a single consistent model. The ontology sits above all of it as a shared semantic layer. The Data Agent (Section 18) becomes dramatically more accurate when grounded in an ontology — instead of guessing what columns mean from their names, it reads the business definitions you have formally declared.

A practical way to think about Fabric IQ
OneLake is where your data lives. The Medallion Architecture is how it is organized. Fabric IQ is the layer that gives it meaning. Without Fabric IQ, your Gold tables are clean and queryable but still require humans to interpret what the numbers represent. With Fabric IQ, the platform itself understands what a “Customer,” “Revenue,” or “Active Asset” means — and every AI agent, report, and downstream system inherits that understanding automatically.

Who Should Start Exploring Fabric IQ Now?

Fabric IQ is in preview and most suitable for organizations that:

  • Already have a mature Medallion lakehouse with clean Gold layer tables
  • Are dealing with inconsistent business definitions across teams and reports
  • Want to build AI agents that reason about business context, not just query data
  • Have planning or budgeting processes still running in disconnected Excel spreadsheets
  • Are evaluating agentic AI use cases that require grounded, governed decision-making
Preview status — what that means
Both Fabric IQ and the Data Agent are currently in preview. Preview features are available under supplemental terms of use and are not yet covered by standard SLAs. Features, pricing, and availability may change before general availability. Both are included in existing Microsoft Fabric SKU subscriptions at no additional licensing cost during the preview period.

For the official documentation, start at Microsoft Learn — What is Fabric IQ? and Microsoft Learn — Fabric Data Agent.

References

  1. Microsoft Learn — What is Microsoft Fabric?
  2. Microsoft Fabric Blog — General Availability (November 2023)
  3. Microsoft Learn — OneLake, the OneDrive for data
  4. Microsoft Learn — Governance and Compliance in Microsoft Fabric
  5. Microsoft Learn — OneLake Overview and Architecture
  6. Microsoft Learn — What is a Lakehouse in Microsoft Fabric?
  7. Microsoft Learn — What is Data Warehousing in Microsoft Fabric?
  8. Microsoft Learn — Data Factory in Microsoft Fabric
  9. Microsoft Learn — Dataflow Gen2 Overview
  10. Microsoft Learn — Fabric Notebooks
  11. Microsoft Learn — Direct Lake Overview
  12. Microsoft Learn — Real-Time Intelligence in Microsoft Fabric
  13. Microsoft Learn — Implement Medallion Lakehouse Architecture in Fabric
  14. Microsoft Learn — Microsoft Fabric Licenses and Capacity SKUs
  15. SQLYARD — Building a Microsoft Fabric Medallion Lakehouse + Warehouse
  16. SQLYARD — From Ingestion to Insights
  17. SQLYARD — Fabric Cost Analysis Explained
  18. Microsoft Learn — Delta Optimization and V-Order in Fabric
  19. Microsoft Learn — Lakehouse and Warehouse: Better Together
  20. Microsoft Learn — Lakehouse End-to-End Scenario Tutorial
  21. Microsoft Learn — Fabric Data Agent Overview
  22. Microsoft Learn — What is Fabric IQ (Preview)?
