Python has become the “second keyboard” for database people. Whether you manage SQL Server or Postgres on-prem, or you move data across Azure and AWS, Python sits in the middle: connecting to databases, transforming data, orchestrating jobs, testing pipelines, and even powering serverless tasks. This guide explains where Python fits, why it’s worth your time, what to watch out for, and then walks you through a practical, step-by-step workshop to go from beginner to confident practitioner. I’ve aimed this at DBAs and data engineers who want actionable code and deployment patterns, not hype.
The short answer
- Why learn Python: It speaks SQL fluently, has first-class drivers, and a mature ecosystem for data processing and orchestration. It scales from ad-hoc admin scripts to production ETL in Glue or Data Factory.
- Where it fits:
- Admin and automation: health checks, DDL drift alerts, backup reports.
- Data movement and shaping: ingest, transform, validate, write-back.
- Orchestration: schedule and monitor pipelines.
- Cloud: serverless jobs and managed ETL.
- Key tools: DB-API drivers (pyodbc for SQL Server, psycopg2 for Postgres), SQLAlchemy for portable connections, pandas for read_sql and to_sql, Airflow for scheduling, Azure Data Factory or AWS Glue for managed pipelines.
Core concepts you should actually know
1) Python itself
If you’re new, skim the official tutorial for the language basics: functions, modules, virtual environments, and packaging. Keep a recent Python (3.12 or newer) on Windows and Linux; 3.14 is now out.
2) Database connectivity in Python
- DB-API 2.0 is the common contract that drivers follow. You’ll see .connect(), .cursor(), .execute(), and .commit() everywhere.
- SQL Server: use pyodbc with the Microsoft ODBC Driver 18+; connection strings can live in env vars or DSNs.
- PostgreSQL: use psycopg2; open a connection, get a cursor, execute, commit.
- SQLAlchemy sits on top and gives you a consistent URL format across backends: dialect+driver://user:pass@host:port/db. It also gives you connection pooling and a modern 2.x workflow.
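The same four calls work against any compliant driver. A minimal sketch of the DB-API pattern, using the stdlib sqlite3 module as a stand-in because it follows the same contract (pyodbc also uses ? placeholders; psycopg2 uses %s):

```python
import sqlite3

# The DB-API 2.0 contract: connect() -> cursor() -> execute() -> commit()
cn = sqlite3.connect(":memory:")
cur = cn.cursor()
cur.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
# Placeholders keep values out of the SQL string (no injection risk)
cur.execute("INSERT INTO orders (id, amount) VALUES (?, ?)", (1, 19.99))
cn.commit()
cur.execute("SELECT amount FROM orders WHERE id = ?", (1,))
amount = cur.fetchone()[0]
print(amount)  # 19.99
```

Swap sqlite3 for pyodbc or psycopg2 and the shape of the code stays the same; only the connection arguments and placeholder style change.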
3) DataFrames and SQL
- pandas can pull result sets directly into a DataFrame with read_sql and persist them back with to_sql. These calls ride on SQLAlchemy or a DB-API connection. Understand chunking and if_exists for large writes.
- I/O in pandas spans CSV, Parquet, SQL, and more; this matters when you bridge files and databases.
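Chunking applies on the read side too: passing chunksize to read_sql returns an iterator of DataFrames instead of one large frame, so memory stays bounded. A small sketch with an in-memory sqlite3 database standing in for a real source:

```python
import sqlite3
import pandas as pd

cn = sqlite3.connect(":memory:")
cn.executescript("""
    CREATE TABLE sales (id INTEGER, amount REAL);
    INSERT INTO sales VALUES (1, 10.0), (2, 20.0), (3, 30.0), (4, 40.0), (5, 50.0);
""")

# chunksize=2 yields DataFrames of up to 2 rows at a time
total = 0.0
for chunk in pd.read_sql("SELECT id, amount FROM sales", cn, chunksize=2):
    total += chunk["amount"].sum()
print(total)  # 150.0
```

The same pattern works unchanged against a SQLAlchemy engine pointing at SQL Server or Postgres.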
4) Orchestration and managed ETL
- Apache Airflow: define DAGs in Python to run extracts, transforms, and quality checks. Good for self-managed or containerized setups.
- Azure Data Factory (ADF): use the Python SDK to create factories, pipelines, and linked services programmatically. Useful for infra-as-code patterns.
- AWS Glue: author PySpark jobs, streaming jobs, or Python shell jobs with autoscaling workers. Great when you want serverless Spark without running clusters.
When Python is the right tool vs when it isn’t
Use Python when
- You need to query multiple databases, shape data, and land it elsewhere on a schedule.
- You want database-aware transformations but don’t want to write it all in T-SQL or PL/pgSQL.
- You’re building portable pipelines that can run locally, in Docker, or in managed clouds.
Consider alternatives when
- The job is purely set-based and stays inside one database. Native SQL often beats pulling millions of rows through Python.
- You’re doing heavy aggregations that fit neatly in a single SQL statement or a materialized view.
- You already have db-native scheduling and auditing that your team understands better than Airflow or Glue.
Pros
- Rich libraries, fast iteration, strong community docs, and cross-platform runtimes.
Cons
- Another runtime to patch and secure.
- Poorly written code can over-extract, miss indexes, or ignore set-based options.
- Operational sprawl if you mix Airflow, ADF, and Glue without a plan.
Common on-prem and cloud patterns
On-prem
- Admin scripts: inventory, capacity, index drift, backup verification, row-count reconciliation.
- Batch ETL: connect to SQL Server via pyodbc, read to pandas, transform, write to staging, call stored procs.
- Orchestration: Airflow on Kubernetes or a single VM to coordinate SQL jobs and Python steps.
Cloud
- Azure: ADF pipelines call serverless compute, notebooks, or Databricks; you can build ADF assets with the Python SDK.
- AWS: Glue PySpark jobs handle large transforms with managed Spark; Python shell jobs cover lighter tasks.
Security and reliability basics
- Secrets: never hardcode passwords in notebooks or scripts. Use environment variables, secret scopes, or cloud secret stores. SQLAlchemy URLs support externalized creds.
- Least privilege: create scoped SQL logins or roles for ETL.
- Network: prefer private endpoints, VNET/VPC integration, and TLS.
- Idempotency: design pipelines so retries don’t duplicate data.
- Chunking: when using to_sql, set chunksize and consider staging tables to avoid long-running locks.
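The idempotency and chunking points combine naturally in a partition-replace pattern: scope each run to a key (a load date here) and delete-then-insert inside one transaction, so a retry overwrites rather than duplicates. A sketch with sqlite3 as a stand-in; the table and column names are illustrative:

```python
import sqlite3

cn = sqlite3.connect(":memory:")
cn.execute("CREATE TABLE daily_sales (load_date TEXT, amount REAL)")

def load_day(cn, load_date, amounts):
    # Delete-then-insert in a single transaction: a retry replaces the
    # partition instead of appending duplicate rows
    with cn:
        cn.execute("DELETE FROM daily_sales WHERE load_date = ?", (load_date,))
        cn.executemany(
            "INSERT INTO daily_sales (load_date, amount) VALUES (?, ?)",
            [(load_date, amt) for amt in amounts],
        )

load_day(cn, "2025-01-01", [10.0, 20.0])
load_day(cn, "2025-01-01", [10.0, 20.0])  # simulated retry: no duplicates
count = cn.execute("SELECT COUNT(*) FROM daily_sales").fetchone()[0]
print(count)  # 2
```

On SQL Server or Postgres the same idea applies with an explicit transaction around the DELETE and the bulk insert.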
Practical code you’ll reuse
Connect to SQL Server with pyodbc
import pyodbc
cn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=tcp:myserver.database.windows.net,1433;"
    "Database=SalesDW;Uid=etl_user;Pwd={MYSECRET};Encrypt=yes;TrustServerCertificate=no;"
)
cur = cn.cursor()
cur.execute("SELECT COUNT(*) FROM dbo.FactSales")
print(cur.fetchone()[0])
Quickstart and driver notes are in Microsoft’s docs.
Connect to PostgreSQL with psycopg2
import os
import psycopg2

cn = psycopg2.connect(
    dbname="prod", user=os.getenv("PGUSER"),
    password=os.getenv("PGPASSWORD"), host="10.1.2.3", port=5432
)
with cn, cn.cursor() as cur:
    cur.execute("SELECT NOW()")
    print(cur.fetchone())
See the psycopg2 basic usage guide.
Use SQLAlchemy URLs and engines
from sqlalchemy import create_engine, text
engine = create_engine("mssql+pyodbc:///?odbc_connect=Driver%3DODBC+Driver+18+for+SQL+Server;Server%3Dtcp%3Amyserver.database.windows.net;Database%3DSalesDW;Uid%3Detl_user;Pwd%3Dsecret;Encrypt%3Dyes")
with engine.begin() as conn:
    rows = conn.execute(text("SELECT TOP (5) * FROM dbo.DimProduct")).fetchall()
    print(rows)
SQLAlchemy 2.x engine and URL format details are in the SQLAlchemy documentation.
Round-trip with pandas
import pandas as pd
from sqlalchemy import create_engine
engine = create_engine("postgresql+psycopg2://etl_user:secret@10.1.2.3:5432/prod")
df = pd.read_sql("SELECT id, order_date, amount FROM sales.orders WHERE order_date >= CURRENT_DATE - INTERVAL '7 day'", engine)
df["amount_usd"] = df["amount"].astype(float)
# write to a staging table in chunks, then MERGE in SQL
df.to_sql("orders_staging", engine, if_exists="replace", index=False, chunksize=10_000)
See the read_sql and to_sql references for parameters and caveats.
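The “then MERGE in SQL” step from the comment can be completed with one set-based statement. A sketch of the staging-to-target upsert using INSERT ... ON CONFLICT (Postgres syntax, demonstrated here with sqlite3, which shares it; on SQL Server you would write a MERGE instead). Table names are illustrative:

```python
import sqlite3

cn = sqlite3.connect(":memory:")
cn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL);
    CREATE TABLE orders_staging (id INTEGER, amount REAL);
    INSERT INTO orders VALUES (1, 10.0);
    INSERT INTO orders_staging VALUES (1, 15.0), (2, 20.0);
""")

# The set-based upsert runs inside the database, not row-by-row in Python.
# (WHERE true disambiguates the SQLite parser; Postgres accepts it too.)
with cn:
    cn.execute("""
        INSERT INTO orders (id, amount)
        SELECT id, amount FROM orders_staging WHERE true
        ON CONFLICT (id) DO UPDATE SET amount = excluded.amount
    """)

rows = cn.execute("SELECT id, amount FROM orders ORDER BY id").fetchall()
print(rows)  # [(1, 15.0), (2, 20.0)]
```

Python handled transport; the merge itself stayed set-based in the database, which is the division of labor this article recommends.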
A minimal Airflow DAG to schedule a SQL extract
from datetime import datetime
from airflow import DAG
from airflow.decorators import task
with DAG("nightly_extract", start_date=datetime(2025, 1, 1), schedule="0 2 * * *", catchup=False):
    @task()
    def extract():
        import pandas as pd
        from sqlalchemy import create_engine

        eng = create_engine("mssql+pyodbc:///?odbc_connect=...")
        df = pd.read_sql("SELECT * FROM dbo.Orders WHERE OrderDate >= CAST(GETDATE() AS DATE)", eng)
        df.to_csv("/opt/airflow/data/orders.csv", index=False)

    extract()
Airflow loads DAGs from Python files, and the TaskFlow API simplifies authoring.
Step-by-step workshop: from zero to production-ready patterns
Goal: connect to SQL Server and Postgres, move a dataset with pandas, schedule it with Airflow locally, then see how it maps to Azure Data Factory and AWS Glue.
Prereqs
- Python 3.12+ installed; create a virtual environment.
- Packages: pip install pyodbc psycopg2-binary sqlalchemy pandas
- Access to a SQL Server and a Postgres instance with a read login.
- Optional: Docker Desktop for Airflow quick start.
Reference: Python docs and the official tutorial if you need a refresher on venvs.
Lab 1 — Connect and query
- SQL Server: write a tiny script using pyodbc to count rows in a table. Confirm TLS and the ODBC Driver 18 in your connection string.
- Postgres: write the same script with psycopg2. Use a parameterized query with placeholders to avoid SQL injection.
- Upgrade to SQLAlchemy: replace raw driver code with an Engine and text() queries. Note how connection pooling and transactions are handled for you.
Checkpoint: save both connection URLs in environment variables, not in code; the URL format is documented by SQLAlchemy.
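The checkpoint in code: build the engine from an environment variable so the same script runs anywhere the variable is set. The sketch below uses an in-memory SQLite URL standing in for your real mssql+pyodbc or postgresql+psycopg2 URL; the variable name ETL_DB_URL is illustrative:

```python
import os
from sqlalchemy import create_engine, text

# In production the variable comes from the shell, container, or secret
# store; setdefault here only makes the sketch self-contained
os.environ.setdefault("ETL_DB_URL", "sqlite+pysqlite:///:memory:")

engine = create_engine(os.environ["ETL_DB_URL"])
with engine.begin() as conn:
    value = conn.execute(text("SELECT 1")).scalar_one()
print(value)  # 1
```

Swapping backends now means changing one environment variable, not editing code.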
Lab 2 — Move and shape data with pandas
- Use pd.read_sql() to pull a 7-day slice from Orders into a DataFrame.
- Add simple transforms: type casts, null handling, and a derived column.
- Persist to a staging table with to_sql(chunksize=10000). Decide on if_exists='replace' vs 'append'.
- In the database, complete the load with a stored procedure that runs a MERGE into the target table. Keep Python focused on transport and light transforms; let the database handle set-based merges.
Lab 3 — Schedule with Airflow locally
- Stand up Airflow with docker-compose or your preferred local deploy.
- Create a DAG file nightly_extract.py using the TaskFlow API.
- Test the task in the UI and confirm the CSV or table lands as expected.
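One habit worth adopting in Lab 3: keep the transform logic in a plain function that the DAG task merely calls, so it can be unit-tested without a scheduler. A minimal sketch with pandas; the function and column names are illustrative:

```python
import pandas as pd

def clean_orders(df: pd.DataFrame) -> pd.DataFrame:
    # Pure function: no connections, no Airflow context, easy to test
    out = df.dropna(subset=["amount"]).copy()
    out["amount"] = out["amount"].astype(float)
    return out

# This doubles as a unit test you can run under pytest, no Airflow needed
sample = pd.DataFrame({"id": [1, 2, 3], "amount": ["10.5", None, "20.0"]})
result = clean_orders(sample)
print(len(result))  # 2
```

The DAG task then becomes a thin wrapper: extract, call clean_orders, load.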
Why Airflow: Python-defined DAGs, backfills, retries, SLAs, and rich operators for databases and clouds. Airflow loads DAGs from Python source files placed in its DAGs folder.
Lab 4 — Map to cloud services
Azure Data Factory route:
- Use the Python SDK to create a factory, linked services (SQL Server, Blob), and a pipeline that copies from SQL to Blob. This gives you IaC for your data factory.
AWS Glue route:
- Create a Glue job that reads from your source with Spark, transforms with PySpark, and writes Parquet to S3. Choose a worker type and number of workers appropriate for data size.
Operational tips for DBAs
- Performance: push heavy joins and aggregations down to SQL when feasible; fetch only required columns; use server-side filtering.
- Bulk loads: for very large writes, prefer database-native bulk APIs or stage files and COPY/BULK INSERT. to_sql is great, but know its limits; use chunksize and batch commits.
- Testing: add unit tests for transforms and data contracts.
- Observability: log row counts, durations, and target table names; emit metrics that your NOC understands.
- Security: keep secrets in env vars or secret stores; never embed passwords in notebooks.
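The observability tip can be a small reusable wrapper instead of ad-hoc print statements. A stdlib-only sketch; the log format and names are assumptions to adapt to whatever your monitoring ingests:

```python
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
log = logging.getLogger("etl")

def logged_load(target_table, load_fn):
    # Wrap any load step: emit target table, row count, and duration
    start = time.monotonic()
    rows = load_fn()
    elapsed = time.monotonic() - start
    log.info("loaded table=%s rows=%d duration_s=%.2f", target_table, rows, elapsed)
    return rows

# Stand-in load step returning a row count; in practice this would be
# your to_sql or bulk-insert call
rows = logged_load("orders_staging", lambda: 1234)
print(rows)  # 1234
```

Every pipeline step then emits the same three fields, which makes dashboards and alerts trivial to build.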
What good looks like in production
- Pipelines are idempotent and support retries.
- Ownership is clear: database change windows vs pipeline windows.
- One orchestration plane per environment to avoid split-brain scheduling.
- CI/CD lints code, runs tests, and deploys DAGs or ADF/Glue assets as artifacts.
Final thoughts
You don’t need to replace SQL with Python. You’ll get the most value when you let each do what it does best: Python for connectivity, orchestration, and non-SQL transforms; the database for set-based operations and ACID storage. Start small with one real job you already run by hand. Wrap it in Python, schedule it, observe it, and iterate.
Inline references used above
- Official Python docs and tutorial.
- pandas read_sql, to_sql, and the I/O guide.
- SQLAlchemy 2.x URL format, engines, and unified tutorial.
- SQL Server connectivity with pyodbc (Microsoft Learn).
- PostgreSQL connectivity with psycopg2.
- Airflow concepts and tutorials.
- Azure Data Factory Python SDK and quickstart.
- AWS Glue Python and job configuration.