PolyBase on SQL Server 2019 & 2022: from setup to real-world data virtualization

This post focuses on PolyBase for on-prem (or IaaS) SQL Server—not Synapse or Fabric. We’ll cover what PolyBase is, how “virtual data” works, what changed from 2019 to 2022, when it helps DBAs, developers, data-warehouse, and data-science teams, and how to set it up with copy-paste T-SQL examples.

⚠️ Important: PolyBase is great for accessing and loading/moving data, but it’s not a batch-processing engine. Think of it as data virtualization and smart ingestion, not a full ETL framework. For deeper dives, community expert Kevin Feasel has a trove of posts on PolyBase and pushdown behavior.


Where PolyBase fits (vs other options)

| Technology | Best Use Cases | Limitations / Watch-outs |
| --- | --- | --- |
| PolyBase | Query external data in place; load/move data efficiently (CSV, Parquet, RDBMS, S3 in 2022). Great for virtualization and selective ELT. | Not a batch ETL engine; limited to supported sources; scale-out retired in 2022. |
| SSIS | Complex ETL pipelines, transformations, workflows. | Windows-only; heavy setup/maintenance. |
| Azure Data Factory (ADF) | Cloud ETL/ELT at scale; integrate SaaS/cloud data sources. | Cloud service only; latency depends on integration runtime. |
| Linked Servers | Quick ad-hoc joins across SQL Servers. | Poor performance for large sets; limited pushdown; security headaches. |

What PolyBase is (and isn’t)

PolyBase lets SQL Server query external data with T-SQL via external data sources and external tables—across other SQL Server instances, Oracle/Teradata/MongoDB via ODBC, and file sources like CSV/Parquet. You can join external tables with local ones, and the engine will try to push filters/projections to the source (“predicate pushdown”) so less data flows over the wire. Think of it as data virtualization to explore or augment without copying everything first.
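To make that concrete, an external table joins to local tables like any other object. This is a hedged sketch: dbo.Customers is a hypothetical local table, and dbo.Orders_ext is the external table defined in Example A below.

```sql
-- Join a local customer table to a remote orders table via PolyBase.
-- SQL Server pushes the OrderDate filter to the source where possible,
-- then performs the join locally on the reduced row set.
SELECT c.CustomerName,
       SUM(o.Amount) AS Total2025
FROM dbo.Customers  AS c
JOIN dbo.Orders_ext AS o            -- external table: remote data, no copy
  ON o.CustomerID = c.CustomerID
WHERE o.OrderDate >= '2025-01-01'   -- explicit predicate: pushdown-friendly
GROUP BY c.CustomerName;
```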

2019 vs. 2022 at a glance

  • SQL Server 2019: connectors for Hadoop/HDFS and Azure Blob Storage, plus scale-out groups for distributed reads across multiple PolyBase nodes.
  • SQL Server 2022: adds S3-compatible object storage and CETAS; retires the Hadoop connectors and scale-out groups, so PolyBase runs scale-up only.

Practical takeaway: use PolyBase to query in place, then ingest surgically (ELT) once you know the slice is valuable. CETAS (CREATE EXTERNAL TABLE AS SELECT) is handy for exporting results to files when you do decide to land them.
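As an illustration, CETAS on SQL Server 2022 lands a query result as Parquet files in a single statement. This is a sketch: the target folder is a hypothetical name, the data source and file format reuse the objects from Example D later in this post, and 'allow polybase export' must be enabled first.

```sql
-- One-time: allow PolyBase to write (export) external files.
EXEC sp_configure 'allow polybase export', 1;
RECONFIGURE;

-- Land the valuable slice as Parquet via CETAS.
CREATE EXTERNAL TABLE ext.Sales2025_Landed
WITH (
  LOCATION    = '/curated/sales/2025/',  -- hypothetical target folder
  DATA_SOURCE = S3Sales,                 -- defined in Example D below
  FILE_FORMAT = ParquetFF
)
AS
SELECT OrderDate, ProductId, Qty, Amount
FROM ext.Sales_Parquet
WHERE OrderDate >= '2025-01-01';
```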


How “virtual data” works (novice → advanced)

PolyBase uses a few building blocks:

  1. Credential (optional): stores secrets for the remote source.
  2. External data source: where the data lives (SQL/ODBC, Hadoop-style endpoints, S3 in 2022).
  3. External file format (for files): CSV or Parquet definitions.
  4. External table: a SQL schema mapped onto the external data. Query it with normal T-SQL.

Predicate pushdown is the big performance trick: clear WHERE filters and column projections are pushed to the source; not every expression can be pushed, so favor explicit predicates. Microsoft documents pushdown computations and how to tell if pushdown occurred.
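Here are two logically equivalent filters against the Orders_ext table from Example A. The first can be remoted to the source; wrapping the column in a function typically blocks pushdown and forces every row across the wire before filtering locally.

```sql
-- Pushdown-friendly: plain predicate on the external column.
SELECT OrderID, Amount
FROM dbo.Orders_ext
WHERE OrderDate >= '2025-01-01';

-- Pushdown-hostile: the function around OrderDate usually cannot be
-- remoted, so filtering happens locally after fetching all rows.
SELECT OrderID, Amount
FROM dbo.Orders_ext
WHERE YEAR(OrderDate) = 2025;
```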


When to use PolyBase (and when not to)

  • Great for:
    • Virtualizing external data for exploration or reporting.
    • Targeted ingestion (ELT) from external databases or files.
    • Hybrid DW patterns: keep “hot” facts local; leave “cold” history in files/S3; join via views.
  • Not optimal for: heavy iterative batch transformations or complex pipelines entirely inside SQL—use SSIS/ADF/Spark for the heavy lifting and let PolyBase feed those pipelines or pull slices efficiently.

Setup (Windows or Linux)

Install PolyBase Query Service for External Data using SQL Server setup, then configure external objects with T-SQL. On Linux, install the PolyBase packages and enable the service similarly.

Verify the feature & services

  • Ensure the PolyBase service is installed and running.
  • Confirm SQL Server network protocols (TCP/Shared Memory) are enabled; they’re required for PolyBase components to talk to the engine.
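A quick T-SQL sanity check covers both points; SERVERPROPERTY and the sp_configure option below are documented instance-level switches.

```sql
-- 1 = the PolyBase feature is installed on this instance.
SELECT SERVERPROPERTY('IsPolyBaseInstalled') AS IsPolyBaseInstalled;

-- Enable PolyBase at the instance level (required once after install).
EXEC sp_configure 'polybase enabled', 1;
RECONFIGURE;
```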

Example A: SQL Server → SQL Server (via ODBC)

CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'Strong#Passw0rd!';

CREATE DATABASE SCOPED CREDENTIAL RemoteSqlCred
WITH IDENTITY = 'sql_user', SECRET = 'P@ssw0rd!';

CREATE EXTERNAL DATA SOURCE RemoteSql
WITH (
  LOCATION = 'sqlserver://ProdSQL01.contoso.local',
  CONNECTION_OPTIONS = 'Driver={ODBC Driver 18 for SQL Server};TrustServerCertificate=Yes',
  CREDENTIAL = RemoteSqlCred
);

CREATE EXTERNAL TABLE dbo.Orders_ext
(
  OrderID     INT,
  CustomerID  INT,
  OrderDate   DATE,
  Amount      DECIMAL(18,2)
)
WITH ( DATA_SOURCE = RemoteSql, SCHEMA_NAME='dbo', OBJECT_NAME='Orders' );

SELECT OrderID, Amount
FROM dbo.Orders_ext
WHERE OrderDate >= '2025-01-01';
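Once a slice proves valuable, ingestion is a plain INSERT…SELECT against the external table. A sketch, where dbo.Orders_2025 is a hypothetical local landing table with matching columns:

```sql
-- Land only the hot slice locally; the WHERE clause is pushed to the source.
INSERT INTO dbo.Orders_2025 (OrderID, CustomerID, OrderDate, Amount)
SELECT OrderID, CustomerID, OrderDate, Amount
FROM dbo.Orders_ext
WHERE OrderDate >= '2025-01-01';
```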

Example B: Oracle/Teradata/anything via ODBC

CREATE DATABASE SCOPED CREDENTIAL OraCred
WITH IDENTITY='ORA_USER', SECRET='P@ss!';

CREATE EXTERNAL DATA SOURCE OracleHR
WITH ( LOCATION = 'odbc://oracle-host:1521',   -- generic ODBC connector; host/port are placeholders
       CONNECTION_OPTIONS = 'DSN=Oracle19c_DSN',
       CREDENTIAL = OraCred );

CREATE EXTERNAL TABLE hr.Employees_ext
(
  EMPNO  INT,
  ENAME  NVARCHAR(50),
  DEPTNO INT
)
WITH ( DATA_SOURCE = OracleHR, SCHEMA_NAME='HR', OBJECT_NAME='EMPLOYEES' );

Example C: Files in a data lake (CSV/Parquet via Hadoop endpoints — SQL Server 2019; the Hadoop connector is retired in 2022)

CREATE EXTERNAL FILE FORMAT CsvFF
WITH ( FORMAT_TYPE = DELIMITEDTEXT,
       FORMAT_OPTIONS ( FIELD_TERMINATOR = ',', STRING_DELIMITER = '"', FIRST_ROW = 2 ) );

CREATE EXTERNAL DATA SOURCE Lake
WITH ( LOCATION = 'hdfs://namenode:8020' );

CREATE EXTERNAL TABLE ext.SalesCsv
(
  OrderDate  DATE,
  ProductId  INT,
  Qty        INT,
  Amount     DECIMAL(18,2)
)
WITH ( LOCATION = '/raw/sales/*.csv', DATA_SOURCE = Lake, FILE_FORMAT = CsvFF );

SELECT ProductId, SUM(Amount)
FROM ext.SalesCsv
WHERE OrderDate >= '2025-01-01'
GROUP BY ProductId;

Example D: SQL Server 2022 → S3 Parquet

CREATE DATABASE SCOPED CREDENTIAL S3Cred
WITH IDENTITY = 'S3 Access Key',                -- must be this literal string for S3
     SECRET   = '<AccessKeyID>:<SecretAccessKey>';

CREATE EXTERNAL DATA SOURCE S3Sales
WITH ( LOCATION = 's3://s3.amazonaws.com:443/my-bucket',  -- s3://<endpoint>:<port>/<bucket>
       CREDENTIAL = S3Cred );

CREATE EXTERNAL FILE FORMAT ParquetFF
WITH ( FORMAT_TYPE = PARQUET );

CREATE EXTERNAL TABLE ext.Sales_Parquet
(
  OrderDate  DATE,
  ProductId  INT,
  Qty        INT,
  Amount     DECIMAL(18,2)
)
WITH ( DATA_SOURCE = S3Sales, LOCATION = '/sales/2025/', FILE_FORMAT = ParquetFF );

SELECT TOP (20) *
FROM ext.Sales_Parquet
WHERE Amount > 1000;
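SQL Server 2022 also allows ad-hoc reads over the same S3 data source without creating an external table, via OPENROWSET. A sketch; exact option support varies by build:

```sql
-- Ad-hoc exploration: no external table needed.
SELECT TOP (20) *
FROM OPENROWSET(
       BULK '/sales/2025/',
       DATA_SOURCE = 'S3Sales',
       FORMAT = 'PARQUET') AS s
WHERE s.Amount > 1000;
```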

Viewing and validating setup

Use SSMS, Azure Data Studio, or VS Code with the MSSQL extension to browse External Resources. For metadata checks:

SELECT * FROM sys.external_data_sources;
SELECT * FROM sys.external_tables;
SELECT * FROM sys.external_file_formats;

Check execution plans to confirm predicate pushdown occurred.
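Runtime activity can also be inspected through the PolyBase DMVs. Hedged: the available columns differ across versions, so adjust the select lists to your build.

```sql
-- Distributed requests submitted through PolyBase.
SELECT execution_id, status, start_time, end_time
FROM sys.dm_exec_distributed_requests
ORDER BY start_time DESC;

-- Per-step external work (reads against the external source).
SELECT execution_id, step_index, status, bytes_processed
FROM sys.dm_exec_external_work;
```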


Design patterns that work

  • Virtualize first, ingest later → start with external tables, ingest slices that prove valuable.
  • Favor Parquet → columnar format + pushdown reduces bytes scanned.
  • Pushdown-friendly SQL → explicit WHERE filters, avoid wrapping columns in functions.
  • Views as contracts → abstract external complexity for BI/science users.
  • Heavy joins → stage into local tables with indexes.
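The "views as contracts" and staging patterns can be sketched together; the view, staging table, and schema names here are hypothetical, built over the external table from Example D.

```sql
-- Contract view: BI/science users get a stable shape, not PolyBase details.
CREATE VIEW rpt.Sales
AS
SELECT OrderDate, ProductId, Qty, Amount
FROM ext.Sales_Parquet;        -- external table from Example D
GO

-- Heavy joins: stage the slice locally, then index it.
SELECT OrderDate, ProductId, Qty, Amount
INTO dbo.SalesStage
FROM rpt.Sales
WHERE OrderDate >= '2025-01-01';

CREATE CLUSTERED INDEX IX_SalesStage ON dbo.SalesStage (ProductId, OrderDate);
```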

Performance & reliability checklist

  • Network + source limits drive throughput.
  • Scale-out is gone in 2022 → design for scale-up and partitioning.
  • Credentials → use DB-scoped credentials with least privilege.
  • Stats & types → align with external schemas; stage for predictable plans.
  • Ops → wrap external queries in procs; monitor latency, row counts, timeouts.
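Wrapping external queries in a stored procedure gives one place for parameters, error handling, and monitoring hooks. A minimal sketch against the external table from Example A:

```sql
CREATE PROCEDURE dbo.usp_GetRemoteOrders
  @FromDate date
AS
BEGIN
  SET NOCOUNT ON;
  BEGIN TRY
    SELECT OrderID, CustomerID, OrderDate, Amount
    FROM dbo.Orders_ext              -- external table from Example A
    WHERE OrderDate >= @FromDate;    -- pushdown-friendly parameter
  END TRY
  BEGIN CATCH
    -- Rethrow so callers see remote/source failures explicitly.
    THROW;
  END CATCH
END;
```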

How DBAs, developers, DW, and data-science teams benefit

  • DBAs: fewer copies, governed access, ingestion predictability.
  • Developers: one T-SQL surface across sources.
  • DW teams: hybrid virtualize-then-ingest patterns.
  • Data science: curated views accessible from notebooks and BI.

Summary

PolyBase (2019/2022) turns SQL Server into a data-virtualization hub. In 2019 you had scale-out groups; in 2022 you gain S3 support and operate scale-up only. Use PolyBase to query externally with pushdown, then land only what matters (ELT) with INSERT…SELECT or CETAS. It’s ideal for access and targeted loading/movement, not for heavy batch transformations.


Final thoughts

Start small: set up one external source and one external table, wrap it in a view, and demo a real win. Measure latency and row counts. If reused heavily, ingest and index; if niche, keep it virtual and skip the ETL. For deeper patterns, Kevin Feasel’s blog is the go-to resource—his PolyBase articles cover pushdown limits, patterns, and pitfalls.

