A practical guide to data virtualization in SQL Server from novice to expert

Introduction

In 2025, a lot of organizations are still running SQL Server on-prem. Some by choice. Some by regulation. Some because moving 50 TB of data is not a weekend project.

The good news is this. You do not need to move everything to the cloud to modernize your data access.

PolyBase still exists for a reason.

It gives you a way to query data where it lives using external tables, starting with on-prem SQL Server and expanding to cloud only when it makes sense.

This guide starts exactly where most real environments start. On-prem first. Cloud second.

Part 1: The On-Prem PolyBase Approach (Primary Path)

When on-prem PolyBase makes sense

You should start with PolyBase on-prem if:

Your core system is SQL Server on-prem
You have file shares, NAS, or local storage
You need read-only access to external data
You want to avoid ETL duplication
You are modernizing without replatforming

This is the most common real-world scenario.

What PolyBase actually does (on-prem)

PolyBase allows SQL Server to create external tables that reference data stored outside the database.

From SQL Server’s perspective, it looks like this:

SELECT * FROM ExternalSales;

From reality’s perspective:

The data might be a CSV on a file share
Or Parquet files on local storage
Or data in a data lake
Or another database

SQL Server reads it at query time.

No import.
No staging.
No duplication.

Supported on-prem sources in 2025

On-prem SQL Server 2022 and 2025 can read:

Local file system paths
Network file shares
CSV
Parquet
Delta
JSON
Other SQL databases via ODBC
Hadoop compatible storage

This works on Windows and Linux.

Step-by-Step: On-Prem PolyBase Setup (Novice Friendly)

Step 1: Confirm PolyBase is installed

SELECT SERVERPROPERTY('IsPolyBaseInstalled');

If you get 1, you’re good.

Step 2: Create the database master key

			
CREATE MASTER KEY
ENCRYPTION BY PASSWORD = 'StrongPassword!123';

This is required for credentials even on-prem.

Step 3: Create an external data source (file share example)

			
CREATE EXTERNAL DATA SOURCE LocalFiles
WITH (
    TYPE = HADOOP,
    LOCATION = 'file:///data/sales'
);

		

Yes, PolyBase still uses the Hadoop abstraction even locally. That is normal.

Step 4: Define the file format

			
CREATE EXTERNAL FILE FORMAT ParquetFormat
WITH (
    FORMAT_TYPE = PARQUET
);

Use Parquet whenever possible. CSV is slow and fragile.

Step 5: Create the external table

			
CREATE EXTERNAL TABLE ExternalSales
(
    OrderID INT,
    OrderDate DATE,
    Revenue DECIMAL(12,2)
)
WITH (
    LOCATION = '/2025/',
    DATA_SOURCE = LocalFiles,
    FILE_FORMAT = ParquetFormat
);

		

You now have a virtual table.

Step 6: Query it like a normal table

			
SELECT *
FROM ExternalSales
WHERE OrderDate >= '2025-01-01';

SQL Server pushes filters to the external source when possible, which is key for performance.

Performance Rules for On-Prem PolyBase (Important)

PolyBase is not OLTP. Treat it like an analytics engine.

Always do this

Partition files by date or region
Use Parquet or Delta
Filter early
Cache hot data locally
Use views to control access

Never do this

SELECT * in apps
Row-by-row lookups
High concurrency
Updates or deletes
Transactional workloads

Can You Index External Tables?

No. External tables cannot be indexed.

Instead, use this pattern:

			
SELECT *
INTO SalesCache
FROM ExternalSales
WHERE OrderDate >= DATEADD(day, -30, GETDATE());

Then index the cache:

CREATE INDEX IX_SalesCache_Date ON SalesCache(OrderDate);

This is the correct production pattern.

Common On-Prem Use Cases

1. File share analytics

Finance, HR, exports, logs, vendor drops

2. Pre-ingestion validation

Query data before loading it

3. Hybrid joins

Join local SQL tables with external files

4. Gradual modernization

Modern data access without migration

Part 2: Cloud Alternatives (If You Are Not On-Prem)

If you are not on-prem, or you are planning to migrate, here are the modern equivalents.

Option 1: Azure SQL + External Tables

Azure SQL Database and Managed Instance support external tables using:

Azure Blob Storage
ADLS Gen2
Parquet
Delta

This gives you the same pattern without server management.

Best for:

Cloud-first teams
Small to medium workloads
SaaS style apps

Option 2: Azure Synapse Serverless

If your workload is analytical:

No servers
Pay per query
Native data lake access
Better parallelism

Example:

			
SELECT *
FROM OPENROWSET(
    BULK 'https://storage.dfs.core.windows.net/sales/*.parquet',
    FORMAT='PARQUET'
) AS s;

		

Workshop: Build an On-Prem PolyBase Pipeline (45–60 Minutes)

Goal

By the end of this workshop, you will have:

A working PolyBase external table
Partitioned external files
A local indexed cache
A scheduled refresh job
A secure, production-ready view layer

This is the exact pattern used in real production environments.

Part 1: Setup (10 minutes)

Step 1: Enable and verify PolyBase

PolyBase is installed during SQL setup, but always verify.

SELECT SERVERPROPERTY('IsPolyBaseInstalled') AS IsPolyBaseInstalled;

If it returns 0, PolyBase is not installed and must be added via setup.

Step 2: Create a database for the lab

			
CREATE DATABASE PolyBaseLab;
GO
USE PolyBaseLab;
GO

Step 3: Create a master key

This is required for credentials even if you are using local files.

			
CREATE MASTER KEY
ENCRYPTION BY PASSWORD = 'StrongPassword!123';
GO

Step 4: Prepare your on-prem data folder

Create this folder on the SQL Server:

C:\polybase\data\sales\year=2025\month=01\

Put Parquet files inside it, for example:

sales_2025_01.parquet

Each file should contain:

OrderID
OrderDate
Revenue
Region

Partitioning by folder is important later.

Step 5: Create an external data source

			
CREATE EXTERNAL DATA SOURCE LocalSalesFiles
WITH (
    TYPE = HADOOP,
    LOCATION = 'file:///C:/polybase/data/sales'
);
GO

		

This tells SQL Server where the files live.

Step 6: Create a file format

			
CREATE EXTERNAL FILE FORMAT ParquetFormat
WITH (
    FORMAT_TYPE = PARQUET
);
GO

		

Never use CSV unless you absolutely have to.

Part 2: External Table and Querying (10 minutes)

Step 1: Create the external table

			
CREATE EXTERNAL TABLE ExternalSales
(
    OrderID INT,
    OrderDate DATE,
    Revenue DECIMAL(12,2),
    Region VARCHAR(50)
)
WITH (
    LOCATION = '/',
    DATA_SOURCE = LocalSalesFiles,
    FILE_FORMAT = ParquetFormat
);
GO

		

This maps the file structure to SQL columns.

Step 2: Query the external table

			
SELECT TOP 10 *
FROM ExternalSales;

You should see data immediately. If not, check:

File permissions
Folder path
File format

Step 3: Filter data (predicate pushdown)

			
SELECT *
FROM ExternalSales
WHERE OrderDate >= '2025-01-01';

SQL Server pushes this filter to the file scan, which is critical for performance.

Part 3: Partition Files and Measure Performance (10 minutes)

Step 1: Why partitioning matters

PolyBase reads files, not tables. If you put all data in one folder, SQL scans everything.

Correct structure:

			
sales/
 └── year=2025/
     └── month=01/
     └── month=02/
     └── month=03/

		

Step 2: Query a single partition

			
SELECT *
FROM ExternalSales
WHERE OrderDate >= '2025-02-01'
  AND OrderDate < '2025-03-01';

Only the February folder is read.

Step 3: Measure performance

			
SET STATISTICS TIME ON;
SET STATISTICS IO ON;
SELECT SUM(Revenue)
FROM ExternalSales
WHERE OrderDate >= '2025-01-01';

		

Compare this before and after partitioning to see the impact.

Part 4: Cache Hot Data, Index It, and Refresh (15 minutes)

External tables cannot be indexed. This is how you fix that.

Step 1: Create a cache table

			
SELECT *
INTO SalesCache
FROM ExternalSales
WHERE OrderDate >= DATEADD(day, -30, GETDATE());

This creates a real SQL table.

Step 2: Add indexes

			
CREATE INDEX IX_SalesCache_OrderDate
ON SalesCache(OrderDate);
CREATE INDEX IX_SalesCache_Region
ON SalesCache(Region);

Now your queries are fast.

Step 3: Build a refresh stored procedure

			
CREATE OR ALTER PROCEDURE RefreshSalesCache
AS
BEGIN
    SET NOCOUNT ON;
    TRUNCATE TABLE SalesCache;
    INSERT INTO SalesCache
    SELECT *
    FROM ExternalSales
    WHERE OrderDate >= DATEADD(day, -30, GETDATE());
END;
GO

		

Step 4: Schedule it (SQL Agent)

Create a SQL Agent Job:

Step: Execute T-SQL
Command:

EXEC RefreshSalesCache;

Schedule:

Every night at 1 AM
Or every 4 hours if needed

This is now production-ready.

Part 5: Views, Security, and Productionizing (10 minutes)

Step 1: Create a view abstraction

Never expose external tables directly.

			
CREATE VIEW vSales
AS
SELECT OrderID, OrderDate, Revenue, Region
FROM SalesCache;
GO

		

This lets you swap sources later without breaking users.

Step 2: Secure access

Create a role:

			
CREATE ROLE ReportingUsers;
GRANT SELECT ON vSales TO ReportingUsers;

Add users to the role. Do NOT grant access to ExternalSales.

Step 3: Production checklist

Before going live, verify:

External tables are read-only
Cache table is indexed
Refresh job is monitored
Files are partitioned
Views are used for access
Permissions are least-privilege
No apps query external tables directly

Final Architecture (What You Built)

External files remain in storage
PolyBase provides virtualization
Cache table provides performance
Indexes provide speed
Job provides freshness
Views provide safety
Roles provide security

This is the correct PolyBase production pattern.

Final Thoughts

PolyBase in 2025 is not obsolete. It is a tool for the right problem.

If you are on-prem, it is still the fastest way to modernize without moving data.
If you are in the cloud, newer services may be better.
If you mix both, PolyBase becomes the bridge.

The mistake is not using PolyBase.
The mistake is using it for the wrong workload.

References

Microsoft Learn: PolyBase Overview
https://learn.microsoft.com/sql/relational-databases/polybase/

Microsoft Learn: CREATE EXTERNAL TABLE
https://learn.microsoft.com/sql/t-sql/statements/create-external-table-transact-sql

Microsoft Learn: External Data Sources
https://learn.microsoft.com/sql/t-sql/statements/create-external-data-source-transact-sql

Microsoft Learn: PolyBase Performance
https://learn.microsoft.com/sql/relational-databases/polybase/polybase-performance

SQL Server Documentation
https://learn.microsoft.com/sql/sql-server/

Discover more from SQLYARD

Subscribe to get the latest posts sent to your email.

Introduction

Part 1: The On-Prem PolyBase Approach (Primary Path)

When on-prem PolyBase makes sense

What PolyBase actually does (on-prem)

Supported on-prem sources in 2025

Step-by-Step: On-Prem PolyBase Setup (Novice Friendly)

Step 1: Confirm PolyBase is installed

Step 2: Create the database master key

Step 3: Create an external data source (file share example)

Step 4: Define the file format

Step 5: Create the external table

Step 6: Query it like a normal table

Performance Rules for On-Prem PolyBase (Important)

Always do this

Never do this

Can You Index External Tables?

Common On-Prem Use Cases

1. File share analytics

2. Pre-ingestion validation

3. Hybrid joins

4. Gradual modernization

Part 2: Cloud Alternatives (If You Are Not On-Prem)

Option 1: Azure SQL + External Tables

Option 2: Azure Synapse Serverless

Workshop: Build an On-Prem PolyBase Pipeline (45–60 Minutes)

Goal

Part 1: Setup (10 minutes)

Step 1: Enable and verify PolyBase

Step 2: Create a database for the lab

Step 3: Create a master key

Step 4: Prepare your on-prem data folder

Step 5: Create an external data source

Step 6: Create a file format

Part 2: External Table and Querying (10 minutes)

Step 1: Create the external table

Step 2: Query the external table

Step 3: Filter data (predicate pushdown)

Part 3: Partition Files and Measure Performance (10 minutes)

Step 1: Why partitioning matters

Step 2: Query a single partition

Step 3: Measure performance

Part 4: Cache Hot Data, Index It, and Refresh (15 minutes)

Step 1: Create a cache table

Step 2: Add indexes

Step 3: Build a refresh stored procedure

Step 4: Schedule it (SQL Agent)

Part 5: Views, Security, and Productionizing (10 minutes)

Step 1: Create a view abstraction

Step 2: Secure access

Step 3: Production checklist

Final Architecture (What You Built)

Final Thoughts

References

Share this:

Like this:

Related

Discover more from SQLYARD

Related Posts

Leave a ReplyCancel reply

Discover more from SQLYARD