🧊 From SQL to PySpark: A Complete Guide to Apache Iceberg Tables (Novice to Expert)

Introduction

For years, relational databases like SQL Server, Oracle, and PostgreSQL have been the backbone of data analytics. But as organizations began collecting petabytes of data, it became clear that data warehouses alone couldn’t scale efficiently or flexibly. That’s where the data lakehouse emerged—a modern architecture that combines the best of both worlds: the low-cost, flexible storage of data lakes with the structure and reliability of data warehouses.

At the heart of this shift are open table formats like Apache Iceberg, Delta Lake, and Hudi. These technologies make data lakes behave more like databases, supporting ACID transactions, schema evolution, and time travel while working seamlessly with engines like Spark, Trino, Flink, and Snowflake.

This guide walks through how Iceberg works, why it matters, how to use it with Python, PySpark, and SQL, and how to set it up from scratch.


What Is Apache Iceberg?

Apache Iceberg is an open table format built to manage large analytical datasets. It acts as a metadata layer on top of files like Parquet or ORC in a data lake.

Key Capabilities

  • ACID transactions — multiple writers without corrupting data
  • Schema evolution — add or rename columns without rewriting tables
  • Partition evolution — change partitioning strategies over time
  • Hidden partitioning — query without worrying about partition keys
  • Time travel — query data as it existed in the past
  • Multi-engine support — Spark, Trino, Flink, Snowflake, Dremio, DuckDB

Where Parquet stores just raw data, Iceberg provides structure, consistency, and version control.
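Hidden partitioning is easier to see with a concrete transform. Iceberg's `days()` transform derives a partition value (whole days since the Unix epoch) from a timestamp column, so queries filter on the timestamp itself and never mention the partition column. A stdlib-only sketch of that derivation (illustrative, not the actual Iceberg implementation):

```python
from datetime import datetime, timezone

def days_transform(ts: datetime) -> int:
    """Mimic Iceberg's days() partition transform:
    whole days elapsed since the Unix epoch (UTC)."""
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    return (ts - epoch).days

# Two events on the same calendar day land in the same partition.
a = days_transform(datetime(2025, 10, 13, 12, 0, tzinfo=timezone.utc))
b = days_transform(datetime(2025, 10, 13, 23, 59, tzinfo=timezone.utc))
print(a, a == b)
```

Because the engine knows the transform, it can prune partitions from a `WHERE created_at >= ...` predicate without the user ever writing the derived value.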


Why Iceberg Instead of Just Parquet or Delta Lake?

| Feature | Parquet | Delta Lake | Hudi | Iceberg |
|---|---|---|---|---|
| Schema Evolution | ⚠️ Basic | ✅ | ✅ | ✅ Advanced |
| Partition Evolution | ❌ | ❌ | ❌ | ✅ |
| Time Travel | ❌ | ✅ | ⚠️ | ✅ |
| Multi-Engine Compatibility | ✅ (plain files) | ⚠️ Tied to Databricks | ⚠️ | ✅ Excellent |
| Snapshots | ❌ | ✅ | ✅ | ✅ |

Iceberg stands out for multi-engine interoperability and its clean, open standard—a major advantage if you want flexibility without vendor lock-in.


Working with Iceberg in Python

You can interact with Iceberg using PyIceberg, a Python library that exposes Iceberg’s metadata and lets you build workflows without needing Spark for every task.

from pyiceberg.catalog import load_catalog

# Load a REST or Hive catalog
catalog = load_catalog("my_catalog", uri="http://localhost:8181")

# Load a table
table = catalog.load_table("analytics.sales")

# Inspect metadata
print("Table schema:", table.schema())
print("Snapshots:")
for snap in table.snapshots():
    print(snap.snapshot_id, snap.timestamp_ms)

Use cases for PyIceberg include:

  • Metadata auditing
  • Automation scripts
  • Lightweight pipeline orchestration
  • Snapshot management
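As an illustration of the snapshot-management use case, here is a stdlib-only sketch that flags snapshots older than a retention cutoff (the snapshot records here are hypothetical; in a real script they would come from `table.snapshots()`):

```python
from datetime import datetime, timedelta, timezone

def expired_snapshots(snapshots, retention_days=30, now=None):
    """Return snapshot IDs committed before the retention cutoff.
    `snapshots` is a list of (snapshot_id, timestamp_ms) pairs."""
    now = now or datetime.now(timezone.utc)
    cutoff_ms = (now - timedelta(days=retention_days)).timestamp() * 1000
    return [sid for sid, ts_ms in snapshots if ts_ms < cutoff_ms]

# Hypothetical snapshot metadata: (snapshot_id, commit time in epoch ms).
now = datetime(2025, 10, 13, tzinfo=timezone.utc)
snaps = [
    (101, int(datetime(2025, 8, 1, tzinfo=timezone.utc).timestamp() * 1000)),
    (102, int(datetime(2025, 10, 10, tzinfo=timezone.utc).timestamp() * 1000)),
]
print(expired_snapshots(snaps, retention_days=30, now=now))
```

The IDs this returns are the candidates you would pass to an expiration procedure or log for an audit report.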

Working with Iceberg in PySpark

This is where Iceberg really shines. With PySpark, you can treat Iceberg tables like regular SQL tables.

Creating an Iceberg Table

spark.sql("""
CREATE TABLE lakehouse.sales (
    order_id BIGINT,
    amount DOUBLE,
    created_at TIMESTAMP
)
USING iceberg
PARTITIONED BY (days(created_at))
""")

Inserting Data

from pyspark.sql.functions import to_timestamp

data = [(1, 100.0, "2025-10-13 12:00:00"),
        (2, 250.0, "2025-10-13 13:30:00")]
df = (spark.createDataFrame(data, ["order_id", "amount", "created_at"])
      .withColumn("created_at", to_timestamp("created_at")))
df.writeTo("lakehouse.sales").append()

Querying Data

df = spark.read.table("lakehouse.sales")
df.show()

Time Travel

# Find previous snapshots via the table's history metadata table
history = spark.sql("SELECT snapshot_id, made_current_at FROM lakehouse.sales.history")
history.show()

# Query at snapshot
df = spark.read.option("snapshot-id", "123456789").table("lakehouse.sales")
df.show()

This lets you build robust ETL pipelines without losing transactional integrity.
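Under the hood, timestamp-based time travel resolves to the most recent snapshot committed at or before the requested time. A stdlib-only sketch of that resolution rule (illustrative only, not Iceberg's implementation):

```python
def snapshot_as_of(snapshots, as_of_ms):
    """Pick the latest snapshot committed at or before `as_of_ms`.
    `snapshots` is a list of (snapshot_id, timestamp_ms) pairs."""
    eligible = [s for s in snapshots if s[1] <= as_of_ms]
    if not eligible:
        return None  # the table did not exist yet at that time
    return max(eligible, key=lambda s: s[1])[0]

# Hypothetical history: three snapshots committed at 1s, 2s, 3s.
history = [(1, 1_000), (2, 2_000), (3, 3_000)]
print(snapshot_as_of(history, 2_500))  # -> 2
print(snapshot_as_of(history, 500))   # -> None
```

Snapshot-ID-based time travel, as in the example above, skips this resolution step and reads the named snapshot directly.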


Using SQL to Query Iceberg

Iceberg integrates with SQL engines like Trino, Spark SQL, and Snowflake. Example queries:

SELECT order_id, amount
FROM lakehouse.sales
WHERE created_at >= DATE '2025-10-01';

Time travel:

SELECT *
FROM lakehouse.sales
VERSION AS OF 123456789;

View history:

SELECT * FROM lakehouse.sales.history;

Best Practices for Using Iceberg

  1. Partition intelligently — Partition by date or key fields to improve performance without over-sharding.
  2. Use snapshot expiration — Iceberg keeps snapshots. Expire old ones regularly to manage storage.
  3. Catalog management — Use AWS Glue, Hive Metastore, or REST Catalog for production setups.
  4. Consistent writer pattern — Schedule ETL writes rather than having multiple uncontrolled writers.
  5. Schema discipline — Schema evolution is supported, but plan ahead to avoid unnecessary churn.
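For best practice 3, a production catalog is usually wired up through Spark configuration. A sketch of an AWS Glue catalog setup (the catalog name `glue` and the S3 warehouse path are placeholders; substitute your own):

```
spark.sql.catalog.glue=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.glue.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog
spark.sql.catalog.glue.warehouse=s3://my-bucket/warehouse
spark.sql.catalog.glue.io-impl=org.apache.iceberg.aws.s3.S3FileIO
```

With this in `spark-defaults.conf`, tables become addressable as `glue.<database>.<table>` from any Spark job in the cluster.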

Advanced Use Cases (Expert Level)

  • Upserts and merges with MERGE INTO syntax:
MERGE INTO lakehouse.sales t
USING staging.sales s
ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET amount = s.amount
WHEN NOT MATCHED THEN INSERT *
  • Partition Evolution: Start partitioning by day, then switch to month later without rewriting old data.
  • Streaming + Batch: Iceberg works well with Spark Structured Streaming for real-time pipelines.
  • Cross-Engine Analytics: Use Spark to write data and Trino to query it with the same snapshot.
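The MERGE semantics above can be sketched in plain Python: matched keys are updated, unmatched keys are inserted (hypothetical in-memory rows, not Spark's implementation):

```python
def merge_into(target, staging):
    """Simulate MERGE INTO keyed on order_id:
    matched rows take the staged amount; new rows are inserted."""
    merged = dict(target)           # order_id -> amount
    for order_id, amount in staging.items():
        merged[order_id] = amount   # UPDATE if matched, INSERT otherwise
    return merged

target = {1: 100.0, 2: 250.0}
staging = {2: 300.0, 3: 75.0}
print(merge_into(target, staging))  # {1: 100.0, 2: 300.0, 3: 75.0}
```

The real MERGE does this transactionally: the whole upsert commits as a single new snapshot, so readers never see a half-applied merge.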

Alternatives to Iceberg

While Iceberg is a strong default choice, you might also look at:

  • Delta Lake (best with Databricks)
  • Apache Hudi (great for streaming ingestion and CDC)
  • DuckDB or Polars with Parquet (lightweight but less feature-rich)

Why choose Iceberg?

  • Open standard
  • Vendor neutral
  • Excellent performance at scale
  • Works across multiple query engines

🧪 Hands-On Setup Guide: PySpark + Iceberg (Local)

Here’s a simple way to spin up Iceberg locally for development.

1. Install Prerequisites

pip install pyspark pyiceberg

Download the Iceberg Spark runtime jar (optional if you use `--packages` in the next step, which fetches it automatically):

wget https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-spark-runtime-3.4_2.12/1.5.0/iceberg-spark-runtime-3.4_2.12-1.5.0.jar

2. Start PySpark with Iceberg

pyspark \
  --packages org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.5.0 \
  --conf spark.sql.catalog.lakehouse=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.lakehouse.type=hadoop \
  --conf spark.sql.catalog.lakehouse.warehouse=./warehouse

3. Create and Query an Iceberg Table

spark.sql("""
CREATE TABLE lakehouse.sales (
    order_id BIGINT,
    amount DOUBLE,
    created_at TIMESTAMP
)
USING iceberg
PARTITIONED BY (days(created_at))
""")

from pyspark.sql.functions import to_timestamp

data = [(1, 500.0, "2025-10-13 10:00:00")]
df = (spark.createDataFrame(data, ["order_id", "amount", "created_at"])
      .withColumn("created_at", to_timestamp("created_at")))
df.writeTo("lakehouse.sales").append()

spark.sql("SELECT * FROM lakehouse.sales").show()

🚀 Scaling Up (Production Recommendations)

  • Use a distributed catalog (AWS Glue or REST) instead of local Hadoop.
  • Store data in S3, ADLS, or GCS rather than local file systems.
  • Use Spark, Trino, or Flink as query engines.
  • Schedule snapshot expiration:
CALL lakehouse.system.expire_snapshots(table => 'sales', older_than => TIMESTAMP '2025-01-01 00:00:00');

📌 Summary

Apache Iceberg bridges the gap between data lakes and data warehouses.
It provides the structure and transactional consistency of a warehouse with the scalability and flexibility of a lake. When combined with PySpark and SQL, you can:

  • Build scalable ETL pipelines
  • Evolve schemas without downtime
  • Query historical data with time travel
  • Use multiple engines on the same table

Whether you’re coming from a SQL Server world or building next-generation pipelines, Iceberg gives you a future-proof foundation.


💬 Final Thoughts

If your organization is storing large volumes of data in a lake and you want ACID, time travel, and schema flexibility, Iceberg is worth adopting. It’s open, fast, and battle-tested in production by companies like Netflix, Apple, and LinkedIn.

Start small with a local PySpark setup, then scale up with AWS Glue and S3 for real workloads. If your data strategy depends on flexibility and interoperability, Iceberg is a smart investment.

