From Clean Slate to Queryable Model with PySpark, Pipelines, and Direct Lake
Introduction
When you stand up a new analytics stack, the medallion architecture in Microsoft Fabric gives you a strong foundation. You land raw data in Bronze, standardize in Silver, and model or aggregate in Gold. Fabric lets you use Delta tables natively, wire ingestion through pipelines or dataflows, and serve analytics through either a Warehouse or Direct Lake semantic model.
This guide walks through everything a new team needs to set up Fabric from scratch:
- Building a workspace
- Getting data into Fabric (Excel, Parquet, Pipelines, or SSIS migration)
- Creating a Lakehouse and medallion structure
- Transforming with PySpark
- Adding a serving layer (Warehouse or Direct Lake)
- Orchestrating and scheduling the entire flow
Architecture at a Glance
- OneLake: single logical data lake for your organization. Shortcuts act like symbolic links to data stored elsewhere, letting you unify data without copying it.
- Lakehouse: stores files and Delta tables. Ideal for Bronze/Silver/Gold layers.
- Warehouse vs Lakehouse for serving: use Warehouse for T-SQL workloads and RBAC; use Direct Lake for BI speed at scale.
- Pipelines & Dataflows: modern ingestion and orchestration, replacing legacy SSIS.
Getting Data Into Fabric (Bronze Layer)
One of the first steps in a medallion build is deciding how to ingest data. Microsoft Fabric offers flexible paths depending on the source type.
1) Excel / CSV Uploads (Fast Start)
- From your Lakehouse, click Upload under “Files.”
- Drop `.xlsx` or `.csv` files into `Files/bronze/<source>/raw`.
- Once uploaded, the data is immediately available to notebooks and pipelines.
- You can also schedule periodic replacement using a pipeline.
👉 Best for: low-volume business files, early proofs of concept.
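Before uploading a business file, it can save a round trip to sanity-check its header locally. A minimal sketch using Python's standard `csv` module; the file name and column names are hypothetical, not part of any Fabric API:

```python
import csv

def check_csv_header(path, expected_columns):
    """Return True if the CSV's first row contains every expected column."""
    with open(path, newline="", encoding="utf-8") as f:
        header = next(csv.reader(f))
    return all(col in header for col in expected_columns)

# Example: write a tiny sample file and validate it before upload
with open("sales_sample.csv", "w", newline="", encoding="utf-8") as f:
    f.write("order_id,customer_name,order_date,total_amount\n"
            "1,Acme,2024-01-01 10:00:00,99.50\n")

print(check_csv_header("sales_sample.csv", ["order_id", "order_date"]))  # True
```

A check like this catches renamed or missing columns before they break a downstream notebook.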
2) Parquet (Preferred for Performance)
- Export data from the source system as Parquet.
- Upload directly or land through a pipeline.
- Or create a shortcut if Parquet is already stored in Azure or OneLake.
👉 Why Parquet: it’s columnar and works seamlessly with Delta tables, making transformations faster and storage cheaper.
3) Pipelines (Automated Ingestion)
- In the workspace, New → Pipeline.
- Add a Copy Data activity.
- Source: Excel, CSV, Parquet, SQL Database, or another supported connector.
- Sink: Lakehouse → Files.
- Schedule the pipeline to run daily or on demand.
👉 Think of Pipelines as the modern replacement for SSIS data flows.
4) Dataflow Gen2 (Excel-Friendly)
- Use a Power Query–style interface to connect to Excel, SharePoint, SQL, and other sources.
- Apply cleaning steps and land the result directly in the Lakehouse.
- Can be chained into a pipeline later.
👉 Best for: analysts and business teams without heavy coding needs.
5) Migrating SSIS Packages (If Needed)
- Fabric doesn’t run SSIS `.dtsx` packages directly.
- Options:
- Lift to Azure Data Factory SSIS IR, or
- Rebuild logic as Fabric Pipelines/Dataflows.
- SSIS can still land files into OneLake or Azure Storage for Fabric to pick up.
Recommended Bronze Ingestion Pattern
| Source Type | Recommended Tool | Pros | Notes |
|---|---|---|---|
| Excel/CSV (manual) | Upload / Dataflow Gen2 | Simple, quick POC | Not ideal for high volume |
| Parquet | Shortcut / Pipeline | Fast, efficient | Preferred for steady workloads |
| SQL or API | Pipeline (Copy Data) | Automated, reliable | Supports scheduling and chaining |
| SSIS (legacy) | Rebuild as Pipeline | Modern replacement | Long-term migration path |
✅ Pro Tip: Always keep a clean folder structure like `/bronze/source_name/raw`. Even for manual uploads, a consistent layout makes downstream transforms simpler.
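One way to keep that layout consistent is to generate paths from a small helper instead of hand-typing them in every notebook. A sketch in plain Python; the function names are hypothetical:

```python
import posixpath

def bronze_path(source: str) -> str:
    """Raw landing folder for a given source system."""
    return posixpath.join("/bronze", source, "raw")

def silver_path(subject_area: str) -> str:
    """Standardized layer folder for a subject area."""
    return posixpath.join("/silver", subject_area)

def gold_path(mart: str) -> str:
    """Serving-layer folder for a data mart."""
    return posixpath.join("/gold", mart)

print(bronze_path("sales"))  # /bronze/sales/raw
```

Centralizing the convention in one place means a rename touches one function, not every notebook.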
Step-by-Step Workshop
1) Create a Fabric Workspace and Lakehouse
- Create a workspace in Fabric and assign capacity.
- New → Lakehouse and name it (e.g., `lakehouse_analytics`).
- The Lakehouse has Files and Tables areas. Delta tables appear automatically in “Tables.”
2) Wire Up Sources with OneLake Shortcuts
If your data is already in storage:
- New → Shortcut in Lakehouse.
- Pick Azure Data Lake, another Lakehouse, or external storage.
- Shortcut acts like a folder with zero duplication.
3) Create Bronze / Silver / Gold Structure
In “Files,” create folders:
```
/bronze/<source>/raw/
/silver/<subject_area>/
/gold/<mart_name>/
```
This structure keeps the pipeline organized and predictable.
4) Transform Data with PySpark
Bronze → Silver Example:
```python
from pyspark.sql.functions import col, trim, to_timestamp

bronze_path = "Files/bronze/sales/raw/"
df_raw = spark.read.option("header", True).csv(bronze_path)

df_silver = (
    df_raw
    .withColumn("order_ts", to_timestamp(col("order_date"), "yyyy-MM-dd HH:mm:ss"))
    .withColumn("customer_name", trim(col("customer_name")))
    .dropDuplicates()
)

# Create the schema once so saveAsTable has somewhere to land
spark.sql("CREATE SCHEMA IF NOT EXISTS silver")

df_silver.write.mode("overwrite").format("delta").saveAsTable("silver.sales_orders")

# Co-locate rows by timestamp to speed up time-range queries
spark.sql("OPTIMIZE silver.sales_orders ZORDER BY (order_ts)")
```
5) Silver → Gold Example
```python
# Create the schema once so the gold tables have somewhere to land
spark.sql("CREATE SCHEMA IF NOT EXISTS gold")

spark.sql("""
CREATE OR REPLACE TABLE gold.dim_customer AS
SELECT DISTINCT customer_id, customer_name, COALESCE(country, 'Unknown') AS country
FROM silver.sales_orders
""")

spark.sql("""
CREATE OR REPLACE TABLE gold.fact_sales AS
SELECT
  CAST(order_id AS BIGINT)            AS order_id,
  customer_id,
  order_ts,
  CAST(total_amount AS DECIMAL(18,2)) AS total_amount
FROM silver.sales_orders
""")
```
6) Add Serving Layer: Warehouse or Direct Lake
Warehouse:
- New → Warehouse
- Load Gold data into native Warehouse tables, or query the Lakehouse Gold tables cross-database through the SQL analytics endpoint.
- Expose stable views for downstream apps.
Direct Lake:
- Create a Power BI semantic model that connects directly to your Gold Delta tables.
- Ideal for large, read-heavy models with minimal latency.
7) Orchestrate with Pipelines
- Create a pipeline that runs in sequence:
- Bronze ingestion (Dataflow Gen2 or Copy Data)
- Silver transform notebook
- Gold model notebook
- Refresh semantic model or Warehouse
- Schedule nightly or hourly refresh.
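The sequence above is a strict chain: each step should run only if the previous one succeeded. The real orchestration lives in the Fabric pipeline, but the dependency shape can be sketched in plain Python with stub steps (all names here are hypothetical):

```python
def run_sequence(steps):
    """Run named steps in order; an exception in any step stops the chain."""
    completed = []
    for name, step in steps:
        step()                # in Fabric, each step is a pipeline activity
        completed.append(name)
    return completed

steps = [
    ("bronze_ingest", lambda: None),          # Dataflow Gen2 or Copy Data
    ("silver_transform", lambda: None),       # PySpark notebook
    ("gold_model", lambda: None),             # PySpark notebook
    ("refresh_semantic_model", lambda: None), # semantic model / Warehouse refresh
]

print(run_sequence(steps))
```

In the pipeline itself, this maps to chaining activities with "on success" dependencies rather than running them in parallel.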
8) Governance and OneLake Domains
- Group datasets and pipelines into domains.
- Use shortcuts to share data products across domains without duplication.
9) Optional: Local Dev Containers
While Fabric runs Spark as a managed service, a local PySpark container is useful for:
- Authoring transformations offline
- Unit testing
- dbt-style structure before migrating to Fabric notebooks
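Much of the cleaning logic doesn't actually depend on Spark, so it can be unit-tested locally before being ported into a notebook. A hypothetical sketch of testing a trim-and-dedup rule (mirroring the Bronze → Silver step) in plain Python:

```python
def clean_rows(rows):
    """Trim customer names and drop exact duplicates, preserving order.

    Each row is a (customer_id, customer_name) tuple -- a stand-in for
    the trim() + dropDuplicates() logic in the PySpark notebook.
    """
    seen = set()
    cleaned = []
    for customer_id, name in rows:
        normalized = (customer_id, name.strip())
        if normalized not in seen:
            seen.add(normalized)
            cleaned.append(normalized)
    return cleaned

sample = [(1, "  Acme "), (1, "Acme"), (2, "Globex")]
print(clean_rows(sample))  # [(1, 'Acme'), (2, 'Globex')]
```

Once the rule is pinned down by tests like this, translating it to `trim(col(...))` and `dropDuplicates()` in the notebook is mechanical.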
Performance Checklist
- Use Parquet and Delta as your ingestion and storage format.
- Partition and OPTIMIZE large tables.
- Use Gold star schemas for BI.
- Pick Direct Lake for analytics, Warehouse for T-SQL-heavy apps.
- Keep pipelines modular and scheduled.
Final Thoughts
The medallion model in Microsoft Fabric gives you a clean, modular foundation:
- Bronze for landing data
- Silver for validation and standardization
- Gold for dimensional modeling and BI
Start simple — Excel or Parquet into Bronze — then layer in PySpark, Pipelines, and Direct Lake. This pattern scales smoothly from small projects to enterprise-wide data platforms, and it’s much easier to govern than one big “catch-all” table.