A practical guide to data virtualization in SQL Server from novice to expert
Introduction
In 2025, a lot of organizations are still running SQL Server on-prem. Some by choice. Some by regulation. Some because moving 50 TB of data is not a weekend project.
The good news is this. You do not need to move everything to the cloud to modernize your data access.
PolyBase still exists for a reason.
It gives you a way to query data where it lives using external tables, starting with on-prem SQL Server and expanding to cloud only when it makes sense.
This guide starts exactly where most real environments start. On-prem first. Cloud second.
Part 1: The On-Prem PolyBase Approach (Primary Path)
When on-prem PolyBase makes sense
You should start with PolyBase on-prem if:
- Your core system is SQL Server on-prem
- You have file shares, NAS, or local storage
- You need read-only access to external data
- You want to avoid ETL duplication
- You are modernizing without replatforming
This is the most common real-world scenario.
What PolyBase actually does (on-prem)
PolyBase allows SQL Server to create external tables that reference data stored outside the database.
From SQL Server’s perspective, it looks like this:
SELECT * FROM ExternalSales;
From reality’s perspective:
- The data might be a CSV on a file share
- Or Parquet files on local storage
- Or data in a data lake
- Or another database
SQL Server reads it at query time.
No import.
No staging.
No duplication.
Supported on-prem sources in 2025
On-prem SQL Server 2022 and 2025 can read:
- Local file system paths
- Network file shares
- CSV
- Parquet
- Delta
- JSON
- Other SQL databases via ODBC
- Hadoop compatible storage
This works on Windows and Linux.
Step-by-Step: On-Prem PolyBase Setup (Novice Friendly)
Step 1: Confirm PolyBase is installed
SELECT SERVERPROPERTY('IsPolyBaseInstalled');
If you get 1, you’re good.
Step 2: Create the database master key
CREATE MASTER KEYENCRYPTION BY PASSWORD = 'StrongPassword!123';
This is required for credentials even on-prem.
Step 3: Create an external data source (file share example)
CREATE EXTERNAL DATA SOURCE LocalFilesWITH ( TYPE = HADOOP, LOCATION = 'file:///data/sales');
Yes, PolyBase still uses the Hadoop abstraction even locally. That is normal.
Step 4: Define the file format
CREATE EXTERNAL FILE FORMAT ParquetFormatWITH ( FORMAT_TYPE = PARQUET);
Use Parquet whenever possible. CSV is slow and fragile.
Step 5: Create the external table
CREATE EXTERNAL TABLE ExternalSales( OrderID INT, OrderDate DATE, Revenue DECIMAL(12,2))WITH ( LOCATION = '/2025/', DATA_SOURCE = LocalFiles, FILE_FORMAT = ParquetFormat);
You now have a virtual table.
Step 6: Query it like a normal table
SELECT *FROM ExternalSalesWHERE OrderDate >= '2025-01-01';
SQL Server pushes filters to the external source when possible, which is key for performance.
Performance Rules for On-Prem PolyBase (Important)
PolyBase is not OLTP. Treat it like an analytics engine.
Always do this
- Partition files by date or region
- Use Parquet or Delta
- Filter early
- Cache hot data locally
- Use views to control access
Never do this
- SELECT * in apps
- Row-by-row lookups
- High concurrency
- Updates or deletes
- Transactional workloads
Can You Index External Tables?
No. External tables cannot be indexed.
Instead, use this pattern:
SELECT *INTO SalesCacheFROM ExternalSalesWHERE OrderDate >= DATEADD(day, -30, GETDATE());
Then index the cache:
CREATE INDEX IX_SalesCache_Date ON SalesCache(OrderDate);
This is the correct production pattern.
Common On-Prem Use Cases
1. File share analytics
Finance, HR, exports, logs, vendor drops
2. Pre-ingestion validation
Query data before loading it
3. Hybrid joins
Join local SQL tables with external files
4. Gradual modernization
Modern data access without migration
Part 2: Cloud Alternatives (If You Are Not On-Prem)
If you are not on-prem, or you are planning to migrate, here are the modern equivalents.
Option 1: Azure SQL + External Tables
Azure SQL Database and Managed Instance support external tables using:
- Azure Blob Storage
- ADLS Gen2
- Parquet
- Delta
This gives you the same pattern without server management.
Best for:
- Cloud-first teams
- Small to medium workloads
- SaaS style apps
Option 2: Azure Synapse Serverless
If your workload is analytical:
- No servers
- Pay per query
- Native data lake access
- Better parallelism
Example:
SELECT *FROM OPENROWSET( BULK 'https://storage.dfs.core.windows.net/sales/*.parquet', FORMAT='PARQUET') AS s;
Workshop: Build an On-Prem PolyBase Pipeline (45–60 Minutes)
Goal
By the end of this workshop, you will have:
- A working PolyBase external table
- Partitioned external files
- A local indexed cache
- A scheduled refresh job
- A secure, production-ready view layer
This is the exact pattern used in real production environments.
Part 1: Setup (10 minutes)
Step 1: Enable and verify PolyBase
PolyBase is installed during SQL setup, but always verify.
SELECT SERVERPROPERTY('IsPolyBaseInstalled') AS IsPolyBaseInstalled;
If it returns 0, PolyBase is not installed and must be added via setup.
Step 2: Create a database for the lab
CREATE DATABASE PolyBaseLab;GOUSE PolyBaseLab;GO
Step 3: Create a master key
This is required for credentials even if you are using local files.
CREATE MASTER KEYENCRYPTION BY PASSWORD = 'StrongPassword!123';GO
Step 4: Prepare your on-prem data folder
Create this folder on the SQL Server:
C:\polybase\data\sales\year=2025\month=01\
Put Parquet files inside it, for example:
sales_2025_01.parquet
Each file should contain:
- OrderID
- OrderDate
- Revenue
- Region
Partitioning by folder is important later.
Step 5: Create an external data source
CREATE EXTERNAL DATA SOURCE LocalSalesFilesWITH ( TYPE = HADOOP, LOCATION = 'file:///C:/polybase/data/sales');GO
This tells SQL Server where the files live.
Step 6: Create a file format
CREATE EXTERNAL FILE FORMAT ParquetFormatWITH ( FORMAT_TYPE = PARQUET);GO
Never use CSV unless you absolutely have to.
Part 2: External Table and Querying (10 minutes)
Step 1: Create the external table
CREATE EXTERNAL TABLE ExternalSales( OrderID INT, OrderDate DATE, Revenue DECIMAL(12,2), Region VARCHAR(50))WITH ( LOCATION = '/', DATA_SOURCE = LocalSalesFiles, FILE_FORMAT = ParquetFormat);GO
This maps the file structure to SQL columns.
Step 2: Query the external table
SELECT TOP 10 *FROM ExternalSales;
You should see data immediately. If not, check:
- File permissions
- Folder path
- File format
Step 3: Filter data (predicate pushdown)
SELECT *FROM ExternalSalesWHERE OrderDate >= '2025-01-01';
SQL Server pushes this filter to the file scan, which is critical for performance.
Part 3: Partition Files and Measure Performance (10 minutes)
Step 1: Why partitioning matters
PolyBase reads files, not tables. If you put all data in one folder, SQL scans everything.
Correct structure:
sales/ └── year=2025/ └── month=01/ └── month=02/ └── month=03/
Step 2: Query a single partition
SELECT *FROM ExternalSalesWHERE OrderDate >= '2025-02-01' AND OrderDate < '2025-03-01';
Only the February folder is read.
Step 3: Measure performance
SET STATISTICS TIME ON;SET STATISTICS IO ON;SELECT SUM(Revenue)FROM ExternalSalesWHERE OrderDate >= '2025-01-01';
Compare this before and after partitioning to see the impact.
Part 4: Cache Hot Data, Index It, and Refresh (15 minutes)
External tables cannot be indexed. This is how you fix that.
Step 1: Create a cache table
SELECT *INTO SalesCacheFROM ExternalSalesWHERE OrderDate >= DATEADD(day, -30, GETDATE());
This creates a real SQL table.
Step 2: Add indexes
CREATE INDEX IX_SalesCache_OrderDateON SalesCache(OrderDate);CREATE INDEX IX_SalesCache_RegionON SalesCache(Region);
Now your queries are fast.
Step 3: Build a refresh stored procedure
CREATE OR ALTER PROCEDURE RefreshSalesCacheASBEGIN SET NOCOUNT ON; TRUNCATE TABLE SalesCache; INSERT INTO SalesCache SELECT * FROM ExternalSales WHERE OrderDate >= DATEADD(day, -30, GETDATE());END;GO
Step 4: Schedule it (SQL Agent)
Create a SQL Agent Job:
- Step: Execute T-SQL
- Command:
EXEC RefreshSalesCache;
Schedule:
- Every night at 1 AM
- Or every 4 hours if needed
This is now production-ready.
Part 5: Views, Security, and Productionizing (10 minutes)
Step 1: Create a view abstraction
Never expose external tables directly.
CREATE VIEW vSalesASSELECT OrderID, OrderDate, Revenue, RegionFROM SalesCache;GO
This lets you swap sources later without breaking users.
Step 2: Secure access
Create a role:
CREATE ROLE ReportingUsers;GRANT SELECT ON vSales TO ReportingUsers;
Add users to the role. Do NOT grant access to ExternalSales.
Step 3: Production checklist
Before going live, verify:
- External tables are read-only
- Cache table is indexed
- Refresh job is monitored
- Files are partitioned
- Views are used for access
- Permissions are least-privilege
- No apps query external tables directly
Final Architecture (What You Built)
- External files remain in storage
- PolyBase provides virtualization
- Cache table provides performance
- Indexes provide speed
- Job provides freshness
- Views provide safety
- Roles provide security
This is the correct PolyBase production pattern.
Final Thoughts
PolyBase in 2025 is not obsolete. It is a tool for the right problem.
If you are on-prem, it is still the fastest way to modernize without moving data.
If you are in the cloud, newer services may be better.
If you mix both, PolyBase becomes the bridge.
The mistake is not using PolyBase.
The mistake is using it for the wrong workload.
References
Microsoft Learn: PolyBase Overview
https://learn.microsoft.com/sql/relational-databases/polybase/
Microsoft Learn: CREATE EXTERNAL TABLE
https://learn.microsoft.com/sql/t-sql/statements/create-external-table-transact-sql
Microsoft Learn: External Data Sources
https://learn.microsoft.com/sql/t-sql/statements/create-external-data-source-transact-sql
Microsoft Learn: PolyBase Performance
https://learn.microsoft.com/sql/relational-databases/polybase/polybase-performance
SQL Server Documentation
https://learn.microsoft.com/sql/sql-server/
Discover more from SQLYARD
Subscribe to get the latest posts sent to your email.


