PolyBase in 2025: On-Prem First, Cloud When You Need It

A practical guide to data virtualization in SQL Server from novice to expert

Introduction

In 2025, a lot of organizations are still running SQL Server on-prem. Some by choice. Some by regulation. Some because moving 50 TB of data is not a weekend project.

The good news is this. You do not need to move everything to the cloud to modernize your data access.

PolyBase still exists for a reason.

It gives you a way to query data where it lives using external tables, starting with on-prem SQL Server and expanding to cloud only when it makes sense.

This guide starts exactly where most real environments start. On-prem first. Cloud second.


Part 1: The On-Prem PolyBase Approach (Primary Path)

When on-prem PolyBase makes sense

You should start with PolyBase on-prem if:

  • Your core system is SQL Server on-prem
  • You have file shares, NAS, or local storage
  • You need read-only access to external data
  • You want to avoid ETL duplication
  • You are modernizing without replatforming

This is the most common real-world scenario.


What PolyBase actually does (on-prem)

PolyBase allows SQL Server to create external tables that reference data stored outside the database.

From SQL Server’s perspective, it looks like this:

SELECT * FROM ExternalSales;

From reality’s perspective:

  • The data might be a CSV on a file share
  • Or Parquet files on local storage
  • Or data in a data lake
  • Or another database

SQL Server reads it at query time.

No import.
No staging.
No duplication.


Supported on-prem sources in 2025

On-prem SQL Server 2022 and 2025 can read:

  • Local file system paths
  • Network file shares
  • CSV
  • Parquet
  • Delta
  • JSON
  • Other SQL databases via ODBC
  • Hadoop compatible storage

This works on Windows and Linux.


Step-by-Step: On-Prem PolyBase Setup (Novice Friendly)

Step 1: Confirm PolyBase is installed

SELECT SERVERPROPERTY('IsPolyBaseInstalled');

If you get 1, you’re good.


Step 2: Create the database master key

CREATE MASTER KEY
ENCRYPTION BY PASSWORD = 'StrongPassword!123';

This is required for credentials even on-prem.


Step 3: Create an external data source (file share example)

CREATE EXTERNAL DATA SOURCE LocalFiles
WITH (
TYPE = HADOOP,
LOCATION = 'file:///data/sales'
);

Yes, PolyBase still uses the Hadoop abstraction even locally. That is normal.


Step 4: Define the file format

CREATE EXTERNAL FILE FORMAT ParquetFormat
WITH (
FORMAT_TYPE = PARQUET
);

Use Parquet whenever possible. CSV is slow and fragile.


Step 5: Create the external table

CREATE EXTERNAL TABLE ExternalSales
(
OrderID INT,
OrderDate DATE,
Revenue DECIMAL(12,2)
)
WITH (
LOCATION = '/2025/',
DATA_SOURCE = LocalFiles,
FILE_FORMAT = ParquetFormat
);

You now have a virtual table.


Step 6: Query it like a normal table

SELECT *
FROM ExternalSales
WHERE OrderDate >= '2025-01-01';

SQL Server pushes filters to the external source when possible, which is key for performance.


Performance Rules for On-Prem PolyBase (Important)

PolyBase is not OLTP. Treat it like an analytics engine.

Always do this

  • Partition files by date or region
  • Use Parquet or Delta
  • Filter early
  • Cache hot data locally
  • Use views to control access

Never do this

  • SELECT * in apps
  • Row-by-row lookups
  • High concurrency
  • Updates or deletes
  • Transactional workloads

Can You Index External Tables?

No. External tables cannot be indexed.

Instead, use this pattern:

SELECT *
INTO SalesCache
FROM ExternalSales
WHERE OrderDate >= DATEADD(day, -30, GETDATE());

Then index the cache:

CREATE INDEX IX_SalesCache_Date ON SalesCache(OrderDate);

This is the correct production pattern.


Common On-Prem Use Cases

1. File share analytics

Finance, HR, exports, logs, vendor drops

2. Pre-ingestion validation

Query data before loading it

3. Hybrid joins

Join local SQL tables with external files

4. Gradual modernization

Modern data access without migration


Part 2: Cloud Alternatives (If You Are Not On-Prem)

If you are not on-prem, or you are planning to migrate, here are the modern equivalents.


Option 1: Azure SQL + External Tables

Azure SQL Database and Managed Instance support external tables using:

  • Azure Blob Storage
  • ADLS Gen2
  • Parquet
  • Delta

This gives you the same pattern without server management.

Best for:

  • Cloud-first teams
  • Small to medium workloads
  • SaaS style apps

Option 2: Azure Synapse Serverless

If your workload is analytical:

  • No servers
  • Pay per query
  • Native data lake access
  • Better parallelism

Example:




SELECT *
FROM OPENROWSET(
BULK 'https://storage.dfs.core.windows.net/sales/*.parquet',
FORMAT='PARQUET'
) AS s;

Workshop: Build an On-Prem PolyBase Pipeline (45–60 Minutes)

Goal

By the end of this workshop, you will have:

  • A working PolyBase external table
  • Partitioned external files
  • A local indexed cache
  • A scheduled refresh job
  • A secure, production-ready view layer

This is the exact pattern used in real production environments.


Part 1: Setup (10 minutes)

Step 1: Enable and verify PolyBase

PolyBase is installed during SQL setup, but always verify.

SELECT SERVERPROPERTY('IsPolyBaseInstalled') AS IsPolyBaseInstalled;

If it returns 0, PolyBase is not installed and must be added via setup.


Step 2: Create a database for the lab

CREATE DATABASE PolyBaseLab;
GO
USE PolyBaseLab;
GO

Step 3: Create a master key

This is required for credentials even if you are using local files.

CREATE MASTER KEY
ENCRYPTION BY PASSWORD = 'StrongPassword!123';
GO

Step 4: Prepare your on-prem data folder

Create this folder on the SQL Server:

C:\polybase\data\sales\year=2025\month=01\

Put Parquet files inside it, for example:

sales_2025_01.parquet

Each file should contain:

  • OrderID
  • OrderDate
  • Revenue
  • Region

Partitioning by folder is important later.


Step 5: Create an external data source

CREATE EXTERNAL DATA SOURCE LocalSalesFiles
WITH (
TYPE = HADOOP,
LOCATION = 'file:///C:/polybase/data/sales'
);
GO

This tells SQL Server where the files live.


Step 6: Create a file format

CREATE EXTERNAL FILE FORMAT ParquetFormat
WITH (
FORMAT_TYPE = PARQUET
);
GO

Never use CSV unless you absolutely have to.


Part 2: External Table and Querying (10 minutes)

Step 1: Create the external table

CREATE EXTERNAL TABLE ExternalSales
(
OrderID INT,
OrderDate DATE,
Revenue DECIMAL(12,2),
Region VARCHAR(50)
)
WITH (
LOCATION = '/',
DATA_SOURCE = LocalSalesFiles,
FILE_FORMAT = ParquetFormat
);
GO

This maps the file structure to SQL columns.


Step 2: Query the external table

SELECT TOP 10 *
FROM ExternalSales;

You should see data immediately. If not, check:

  • File permissions
  • Folder path
  • File format

Step 3: Filter data (predicate pushdown)

SELECT *
FROM ExternalSales
WHERE OrderDate >= '2025-01-01';

SQL Server pushes this filter to the file scan, which is critical for performance.


Part 3: Partition Files and Measure Performance (10 minutes)

Step 1: Why partitioning matters

PolyBase reads files, not tables. If you put all data in one folder, SQL scans everything.

Correct structure:

sales/
└── year=2025/
└── month=01/
└── month=02/
└── month=03/

Step 2: Query a single partition

SELECT *
FROM ExternalSales
WHERE OrderDate >= '2025-02-01'
AND OrderDate < '2025-03-01';

Only the February folder is read.


Step 3: Measure performance

SET STATISTICS TIME ON;
SET STATISTICS IO ON;
SELECT SUM(Revenue)
FROM ExternalSales
WHERE OrderDate >= '2025-01-01';

Compare this before and after partitioning to see the impact.


Part 4: Cache Hot Data, Index It, and Refresh (15 minutes)

External tables cannot be indexed. This is how you fix that.


Step 1: Create a cache table

SELECT *
INTO SalesCache
FROM ExternalSales
WHERE OrderDate >= DATEADD(day, -30, GETDATE());

This creates a real SQL table.


Step 2: Add indexes

CREATE INDEX IX_SalesCache_OrderDate
ON SalesCache(OrderDate);
CREATE INDEX IX_SalesCache_Region
ON SalesCache(Region);

Now your queries are fast.


Step 3: Build a refresh stored procedure

CREATE OR ALTER PROCEDURE RefreshSalesCache
AS
BEGIN
SET NOCOUNT ON;
TRUNCATE TABLE SalesCache;
INSERT INTO SalesCache
SELECT *
FROM ExternalSales
WHERE OrderDate >= DATEADD(day, -30, GETDATE());
END;
GO

Step 4: Schedule it (SQL Agent)

Create a SQL Agent Job:

  • Step: Execute T-SQL
  • Command:
EXEC RefreshSalesCache;

Schedule:

  • Every night at 1 AM
  • Or every 4 hours if needed

This is now production-ready.


Part 5: Views, Security, and Productionizing (10 minutes)

Step 1: Create a view abstraction

Never expose external tables directly.

CREATE VIEW vSales
AS
SELECT OrderID, OrderDate, Revenue, Region
FROM SalesCache;
GO

This lets you swap sources later without breaking users.


Step 2: Secure access

Create a role:

CREATE ROLE ReportingUsers;
GRANT SELECT ON vSales TO ReportingUsers;

Add users to the role. Do NOT grant access to ExternalSales.


Step 3: Production checklist

Before going live, verify:

  • External tables are read-only
  • Cache table is indexed
  • Refresh job is monitored
  • Files are partitioned
  • Views are used for access
  • Permissions are least-privilege
  • No apps query external tables directly

Final Architecture (What You Built)

  • External files remain in storage
  • PolyBase provides virtualization
  • Cache table provides performance
  • Indexes provide speed
  • Job provides freshness
  • Views provide safety
  • Roles provide security

This is the correct PolyBase production pattern.

Final Thoughts

PolyBase in 2025 is not obsolete. It is a tool for the right problem.

If you are on-prem, it is still the fastest way to modernize without moving data.
If you are in the cloud, newer services may be better.
If you mix both, PolyBase becomes the bridge.

The mistake is not using PolyBase.
The mistake is using it for the wrong workload.


References

Microsoft Learn: PolyBase Overview
https://learn.microsoft.com/sql/relational-databases/polybase/

Microsoft Learn: CREATE EXTERNAL TABLE
https://learn.microsoft.com/sql/t-sql/statements/create-external-table-transact-sql

Microsoft Learn: External Data Sources
https://learn.microsoft.com/sql/t-sql/statements/create-external-data-source-transact-sql

Microsoft Learn: PolyBase Performance
https://learn.microsoft.com/sql/relational-databases/polybase/polybase-performance

SQL Server Documentation
https://learn.microsoft.com/sql/sql-server/


Discover more from SQLYARD

Subscribe to get the latest posts sent to your email.

Leave a Reply

Discover more from SQLYARD

Subscribe now to keep reading and get access to the full archive.

Continue reading