From Clean Slate to Queryable Model with PySpark, Pipelines, and Direct Lake
Introduction
When you stand up a new analytics stack, the medallion architecture in Microsoft Fabric gives you a strong foundation. You land raw data in Bronze, standardize in Silver, and model or aggregate in Gold. Fabric lets you use Delta tables natively, wire ingestion through pipelines or dataflows, and serve analytics through either a Warehouse or Direct Lake semantic model.
This guide walks through everything a new team needs to set up Fabric from scratch:
- Building a workspace
- Getting data into Fabric (Excel, Parquet, Pipelines, or SSIS migration)
- Creating a Lakehouse and medallion structure
- Transforming with PySpark
- Adding a serving layer (Warehouse or Direct Lake)
- Orchestrating and scheduling the entire flow
Architecture at a Glance
- OneLake: single logical data lake for your organization. Shortcuts act like symbolic links to data stored elsewhere, letting you unify data without copying it.
- Lakehouse: stores files and Delta tables. Ideal for Bronze/Silver/Gold layers.
- Warehouse vs Lakehouse for serving: use Warehouse for T-SQL workloads and RBAC; use Direct Lake for BI speed at scale.
- Pipelines & Dataflows: modern ingestion and orchestration, replacing legacy SSIS.
Getting Data Into Fabric (Bronze Layer)
One of the first steps in a medallion build is deciding how to ingest data. Microsoft Fabric offers flexible paths depending on the source type.
1) Excel / CSV Uploads (Fast Start)
- From your Lakehouse, click Upload under “Files.”
- Drop `.xlsx` or `.csv` files into `Files/bronze/<source>/raw`.
- Once uploaded, the data is immediately available to notebooks and pipelines.
- You can also schedule periodic replacement using a pipeline.
👉 Best for: low-volume business files, early proofs of concept.
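Before uploading a business file, it can save a round trip to sanity-check its header locally. A minimal sketch using Python's standard `csv` module; the file name and column names are hypothetical, not part of any Fabric API:

```python
import csv

def check_csv_header(path, expected_columns):
    """Return True if the CSV's first row contains every expected column."""
    with open(path, newline="", encoding="utf-8") as f:
        header = next(csv.reader(f))
    return all(col in header for col in expected_columns)

# Example: write a tiny sample file and validate it before upload
with open("sales_sample.csv", "w", newline="", encoding="utf-8") as f:
    f.write("order_id,customer_name,order_date,total_amount\n"
            "1,Acme,2024-01-01 10:00:00,99.50\n")

print(check_csv_header("sales_sample.csv", ["order_id", "order_date"]))  # True
```

A check like this catches renamed or missing columns before they break a downstream notebook.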
2) Parquet (Preferred for Performance)
- Export data from the source system as Parquet.
- Upload directly or land through a pipeline.
- Or create a shortcut if Parquet is already stored in Azure or OneLake.
👉 Why Parquet: it’s columnar and works seamlessly with Delta tables, making transformations faster and storage cheaper.
3) Pipelines (Automated Ingestion)
- In the workspace, New → Pipeline.
- Add a Copy Data activity.
- Source: Excel, CSV, Parquet, SQL Database, or another supported connector.
- Sink: Lakehouse → Files.
- Schedule the pipeline to run daily or on demand.
👉 Think of Pipelines as the modern replacement for SSIS data flows.
4) Dataflow Gen2 (Excel-Friendly)
- Use a Power Query–style interface to connect to Excel, SharePoint, SQL, and other sources.
- Apply cleaning steps and land the result directly in the Lakehouse.
- Can be chained into a pipeline later.
👉 Best for: analysts and business teams without heavy coding needs.
5) Migrating SSIS Packages (If Needed)
- Fabric doesn’t run SSIS `.dtsx` packages directly.
- Options:
- Lift to Azure Data Factory SSIS IR, or
- Rebuild logic as Fabric Pipelines/Dataflows.
- SSIS can still land files into OneLake or Azure Storage for Fabric to pick up.
Recommended Bronze Ingestion Pattern
| Source Type | Recommended Tool | Pros | Notes |
|---|---|---|---|
| Excel/CSV (manual) | Upload / Dataflow Gen2 | Simple, quick POC | Not ideal for high volume |
| Parquet | Shortcut / Pipeline | Fast, efficient | Preferred for steady workloads |
| SQL or API | Pipeline (Copy Data) | Automated, reliable | Supports scheduling and chaining |
| SSIS (legacy) | Rebuild as Pipeline | Modern replacement | Long-term migration path |
✅ Pro Tip: Always keep a clean folder structure like `/bronze/source_name/raw`. Even for manual uploads, a consistent layout makes downstream transforms simpler.
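One way to keep that layout consistent is to generate paths from a small helper instead of hand-typing them in every notebook. A sketch in plain Python; the function names are hypothetical:

```python
import posixpath

def bronze_path(source: str) -> str:
    """Raw landing folder for a given source system."""
    return posixpath.join("/bronze", source, "raw")

def silver_path(subject_area: str) -> str:
    """Standardized layer folder for a subject area."""
    return posixpath.join("/silver", subject_area)

def gold_path(mart: str) -> str:
    """Serving-layer folder for a data mart."""
    return posixpath.join("/gold", mart)

print(bronze_path("sales"))  # /bronze/sales/raw
```

Centralizing the convention in one place means a rename touches one function, not every notebook.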
Step-by-Step Workshop
1) Create a Fabric Workspace and Lakehouse
- Create a workspace in Fabric and assign capacity.
- New → Lakehouse and name it (e.g., `lakehouse_analytics`).
- The Lakehouse has Files and Tables areas. Delta tables appear automatically in “Tables.”
2) Wire Up Sources with OneLake Shortcuts
If your data is already in storage:
- New → Shortcut in Lakehouse.
- Pick Azure Data Lake, another Lakehouse, or external storage.
- Shortcut acts like a folder with zero duplication.
3) Create Bronze / Silver / Gold Structure
In “Files,” create folders:
```
/bronze/<source>/raw/
/silver/<subject_area>/
/gold/<mart_name>/
```
This structure keeps the pipeline organized and predictable.
4) Transform Data with PySpark
Bronze → Silver Example:
```python
from pyspark.sql.functions import col, trim, to_timestamp

bronze_path = "Files/bronze/sales/raw/"
df_raw = spark.read.option("header", True).csv(bronze_path)

df_silver = (
    df_raw
    .withColumn("order_ts", to_timestamp(col("order_date"), "yyyy-MM-dd HH:mm:ss"))
    .withColumn("customer_name", trim(col("customer_name")))
    .dropDuplicates()
)

# Create the schema once so saveAsTable has somewhere to land
spark.sql("CREATE SCHEMA IF NOT EXISTS silver")

df_silver.write.mode("overwrite").format("delta").saveAsTable("silver.sales_orders")

# Co-locate rows by timestamp to speed up time-range queries
spark.sql("OPTIMIZE silver.sales_orders ZORDER BY (order_ts)")
```
5) Silver → Gold Example
```python
# Create the schema once so the gold tables have somewhere to land
spark.sql("CREATE SCHEMA IF NOT EXISTS gold")

spark.sql("""
CREATE OR REPLACE TABLE gold.dim_customer AS
SELECT DISTINCT customer_id, customer_name, COALESCE(country, 'Unknown') AS country
FROM silver.sales_orders
""")

spark.sql("""
CREATE OR REPLACE TABLE gold.fact_sales AS
SELECT
  CAST(order_id AS BIGINT)            AS order_id,
  customer_id,
  order_ts,
  CAST(total_amount AS DECIMAL(18,2)) AS total_amount
FROM silver.sales_orders
""")
```
6) Add Serving Layer: Warehouse or Direct Lake
Warehouse:
- New → Warehouse
- Load Gold data into native Warehouse tables, or query the Lakehouse Gold tables cross-database through the SQL analytics endpoint.
- Expose stable views for downstream apps.
Direct Lake:
- Create a Power BI semantic model that connects directly to your Gold Delta tables.
- Ideal for large, read-heavy models with minimal latency.
7) Orchestrate with Pipelines
- Create a pipeline that runs in sequence:
- Bronze ingestion (Dataflow Gen2 or Copy Data)
- Silver transform notebook
- Gold model notebook
- Refresh semantic model or Warehouse
- Schedule nightly or hourly refresh.
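The sequence above is a strict chain: each step should run only if the previous one succeeded. The real orchestration lives in the Fabric pipeline, but the dependency shape can be sketched in plain Python with stub steps (all names here are hypothetical):

```python
def run_sequence(steps):
    """Run named steps in order; an exception in any step stops the chain."""
    completed = []
    for name, step in steps:
        step()                # in Fabric, each step is a pipeline activity
        completed.append(name)
    return completed

steps = [
    ("bronze_ingest", lambda: None),          # Dataflow Gen2 or Copy Data
    ("silver_transform", lambda: None),       # PySpark notebook
    ("gold_model", lambda: None),             # PySpark notebook
    ("refresh_semantic_model", lambda: None), # semantic model / Warehouse refresh
]

print(run_sequence(steps))
```

In the pipeline itself, this maps to chaining activities with "on success" dependencies rather than running them in parallel.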
8) Governance and OneLake Domains
- Group datasets and pipelines into domains.
- Use shortcuts to share data products across domains without duplication.
9) Optional: Local Dev Containers
While Fabric runs Spark as a managed service, a local PySpark container is useful for:
- Authoring transformations offline
- Unit testing
- dbt-style structure before migrating to Fabric notebooks
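Much of the cleaning logic doesn't actually depend on Spark, so it can be unit-tested locally before being ported into a notebook. A hypothetical sketch of testing a trim-and-dedup rule (mirroring the Bronze → Silver step) in plain Python:

```python
def clean_rows(rows):
    """Trim customer names and drop exact duplicates, preserving order.

    Each row is a (customer_id, customer_name) tuple -- a stand-in for
    the trim() + dropDuplicates() logic in the PySpark notebook.
    """
    seen = set()
    cleaned = []
    for customer_id, name in rows:
        normalized = (customer_id, name.strip())
        if normalized not in seen:
            seen.add(normalized)
            cleaned.append(normalized)
    return cleaned

sample = [(1, "  Acme "), (1, "Acme"), (2, "Globex")]
print(clean_rows(sample))  # [(1, 'Acme'), (2, 'Globex')]
```

Once the rule is pinned down by tests like this, translating it to `trim(col(...))` and `dropDuplicates()` in the notebook is mechanical.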
Performance Checklist
- Use Parquet and Delta as your ingestion and storage format.
- Partition and OPTIMIZE large tables.
- Use Gold star schemas for BI.
- Pick Direct Lake for analytics, Warehouse for T-SQL-heavy apps.
- Keep pipelines modular and scheduled.
Final Thoughts
The medallion model in Microsoft Fabric gives you a clean, modular foundation:
- Bronze for landing data
- Silver for validation and standardization
- Gold for dimensional modeling and BI
Start simple — Excel or Parquet into Bronze — then layer in PySpark, Pipelines, and Direct Lake. This pattern scales smoothly from small projects to enterprise-wide data platforms, and it’s much easier to govern than one big “catch-all” table.