Week 6 - Cloud and Azure Essentials
In Weeks 3 and 4, your pipeline wrote to SQLite. That works locally, but SQLite is a file-based database: it does not support concurrent access and cannot run as a shared service. In the cloud, you need a real database server. Azure Database for PostgreSQL is a managed service that runs Postgres without you managing servers.
Your teacher has created a shared Postgres server for the class. You will connect to it, create tables in your own database, and write pipeline output there.
By the end of this chapter, you should be able to connect to Azure Postgres from Python, create tables, insert rows, and understand basic security and access patterns.
In Week 3 you used SQLite because it requires no setup: it is a single file on disk. SQLite is perfect for learning SQL, prototyping, and single-user pipelines. It is embedded in Python's standard library, runs everywhere, and powers billions of devices (every smartphone, every browser).
But SQLite has limits. It supports only one writer at a time. It cannot accept network connections, so other services cannot reach it. And when your container stops, the file is gone unless you mount a volume.
PostgreSQL is the next step up. It is a full client-server database: it runs as a separate process (or managed service), accepts connections over the network, and handles multiple users writing at the same time. PostgreSQL has been in active development since 1996 and is consistently ranked as the most popular database among professional developers in Stack Overflow's annual survey.
| Feature | SQLite | PostgreSQL |
|---|---|---|
| Architecture | Embedded (library, no server) | Client-server (runs as a service) |
| Concurrency | Single writer at a time | Many concurrent readers and writers |
| Network access | Local file only | TCP connections from anywhere |
| Data types | Flexible (any column accepts any type) | Strict types, enforced at insert |
| Scale | Small to medium datasets | Terabytes, millions of rows per second |
| Managed cloud option | None (it is a file) | Azure, AWS RDS, Google Cloud SQL |
| Best for | Prototyping, local pipelines, embedded apps | Production pipelines, multi-user, cloud |
Both SQLite and PostgreSQL are relational databases: they store data in tables with rows and columns, and you query them with SQL. There are also non-relational (NoSQL) databases like MongoDB and Redis that are designed for different access patterns, but relational databases remain the standard for structured data in pipelines. For the full history of how databases evolved from flat files to managed services, see History of Cloud Computing.
The SQL you learned with SQLite carries over directly. CREATE TABLE, INSERT, SELECT, ON CONFLICT all work the same way. The main differences are the connection setup and the fact that Postgres enforces types more strictly.
<aside> 📘 Core program connection: You learned relational database fundamentals and SQL in the Core program. Here you apply the same ideas to a managed Postgres service. Refresher: https://www.notion.so/hackyourfuture/Introduction-to-databases-and-persistence-2fb50f64ffc981e683e0ddf822b57be5
</aside>
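To make "stricter types" concrete, here is a small sketch. SQLite happily stores a string in an INTEGER column; Postgres rejects the equivalent insert at write time (the error shown is an example of what psycopg2 raises):

import sqlite3

# SQLite's flexible typing: the INTEGER column accepts a string without complaint
lite = sqlite3.connect(":memory:")
lite.execute("CREATE TABLE t (n INTEGER)")
lite.execute("INSERT INTO t VALUES ('not a number')")  # succeeds

# Postgres enforces the column type at insert time. The same insert through
# psycopg2 fails with an error along these lines:
#   cur.execute("INSERT INTO t VALUES (%s)", ("not a number",))
#   -> invalid input syntax for type integer: "not a number"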
When you install Postgres on your own machine, you are responsible for everything: updates, backups, disk space, security patches, and monitoring. A managed database service like Azure Database for PostgreSQL handles all of that for you. You get a running Postgres server with automatic backups, patching, and monitoring. You focus on writing SQL; Azure handles the infrastructure.
This is the same pattern as other managed services in this track: Azure Blob Storage manages your files, Azure Container Apps manages your containers, and Azure Database for PostgreSQL manages your database.
To connect to a database, you need a connection string: a URL that contains the host, port, database name, username, and password.
Azure Postgres requires SSL by default, which encrypts the connection between your code and the database. A typical connection string looks like:
postgresql://user:password@host:5432/dbname?sslmode=require
The parts:
- postgresql:// -- the protocol (like https:// for the web)
- user:password -- authentication credentials
- host:5432 -- the server address and port (5432 is Postgres's default)
- dbname -- the specific database on the server
- sslmode=require -- enforce encrypted connections

<aside> ⚠️ Connection strings contain passwords. Never commit them to git. Always use environment variables, as you learned in Week 2 and Week 5.
</aside>
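If your credentials arrive as separate settings rather than as one URL, you can assemble the string in Python. A minimal sketch, assuming hypothetical PG_* environment variables; quote_plus URL-encodes passwords that contain special characters like @ or /:

import os
from urllib.parse import quote_plus

# Hypothetical variable names -- adapt them to whatever your setup provides
user = os.environ["PG_USER"]
password = quote_plus(os.environ["PG_PASSWORD"])  # URL-encode special characters
host = os.environ["PG_HOST"]
dbname = os.environ["PG_DBNAME"]

url = f"postgresql://{user}:{password}@{host}:5432/{dbname}?sslmode=require"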
To talk to PostgreSQL from Python, you use psycopg2, the most widely used Python adapter for PostgreSQL. It translates Python function calls into the PostgreSQL wire protocol so you can run SQL queries directly from your code.
Install it with pip:
pip install psycopg2-binary
<aside>
💡 psycopg2-binary includes a bundled Postgres client library so it installs without extra system dependencies. This is fine for learning and local development. In production Docker images, teams often use psycopg2 (without -binary) with the system libpq package for better security update coverage.
</aside>
Connect and run a query:
import psycopg2
conn = psycopg2.connect(
host="your-server.postgres.database.azure.com",
dbname="weather",
user="pipeline_user",
password="<PASSWORD>", # use the credentials your teacher provided
sslmode="require",
)
cur = conn.cursor()
cur.execute("SELECT version()")
print(cur.fetchone())
cur.close()
conn.close()
You can also connect using a connection string from an environment variable, which is the pattern your pipeline will use:
import os
import psycopg2
conn = psycopg2.connect(os.environ["POSTGRES_URL"])
This is cleaner and matches how you will pass configuration to your container in Chapter 5.
<aside>
💡 Your teacher will provide the POSTGRES_URL value. Set it as an environment variable before running the code: export POSTGRES_URL="postgresql://user:password@host:5432/dbname?sslmode=require". In Chapter 5 you will see how to retrieve it from Azure Key Vault.
</aside>
Once you have the connection string set as an environment variable, try connecting:
<aside>
⌨️ Hands on: Connect to the shared Postgres server using the credentials your teacher provided. Run SELECT version() to verify the connection works and see which Postgres version is running.
</aside>
The SQL for creating tables is nearly identical to SQLite. The main difference is that Postgres has stricter and richer data types:
def create_weather_table(conn):
"""Create the weather_readings table if it does not exist."""
cur = conn.cursor()
cur.execute("""
CREATE TABLE IF NOT EXISTS weather_readings (
id SERIAL PRIMARY KEY,
station TEXT NOT NULL,
timestamp TIMESTAMPTZ NOT NULL,
temperature_c DOUBLE PRECISION NOT NULL,
humidity_pct INTEGER NOT NULL,
ingested_at TIMESTAMPTZ DEFAULT NOW(),
UNIQUE(station, timestamp)
)
""")
conn.commit()
cur.close()
Key differences from SQLite:
- SERIAL instead of INTEGER PRIMARY KEY AUTOINCREMENT -- Postgres auto-generates IDs with SERIAL
- TIMESTAMPTZ instead of TEXT for timestamps -- Postgres has native timestamp types with timezone support
- DOUBLE PRECISION instead of REAL -- same concept, different name
- NOW() instead of CURRENT_TIMESTAMP -- both work, but NOW() is more common in Postgres

Inserting rows works the same as in SQLite, but Postgres uses %s as the placeholder instead of ?:
def insert_readings(conn, readings):
"""Insert weather readings into the database."""
cur = conn.cursor()
for r in readings:
cur.execute(
"""
INSERT INTO weather_readings (station, timestamp, temperature_c, humidity_pct)
VALUES (%s, %s, %s, %s)
ON CONFLICT (station, timestamp) DO UPDATE SET
temperature_c = EXCLUDED.temperature_c,
humidity_pct = EXCLUDED.humidity_pct,
ingested_at = NOW()
""",
(r["station"], r["timestamp"], r["temperature_c"], r["humidity_pct"]),
)
conn.commit()
cur.close()
The upsert pattern (ON CONFLICT ... DO UPDATE SET) works the same way as in SQLite. EXCLUDED is a special table that refers to the row you proposed for insertion, so the DO UPDATE clause can copy its values into the existing row.
<aside>
⚠️ Postgres uses %s for parameterized queries, not ? like SQLite. This is a psycopg2 convention. The security principle is the same: never use f-strings or string formatting for SQL.
</aside>
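To see why the rule matters, compare the two styles with a deliberately hostile value. A fragment, assuming an open cursor cur:

station = "Copenhagen'; DROP TABLE weather_readings; --"  # hostile input

# UNSAFE: string formatting splices the value into the SQL text itself,
# so the quote and semicolon can change the meaning of the statement
cur.execute(f"SELECT * FROM weather_readings WHERE station = '{station}'")

# SAFE: the value travels separately from the SQL, so it can never become SQL
cur.execute("SELECT * FROM weather_readings WHERE station = %s", (station,))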
In Week 3 you learned to use with statements for SQLite connections. The same pattern works with psycopg2 and is even more important here: a Postgres connection is a network resource. If your script crashes before calling conn.close(), the connection stays open on the server. Enough leaked connections and the server runs out of slots, blocking everyone.
The with statement guarantees cleanup even if an exception occurs:
import os
import psycopg2
with psycopg2.connect(os.environ["POSTGRES_URL"]) as conn:
with conn.cursor() as cur:
cur.execute("""
INSERT INTO weather_readings (station, timestamp, temperature_c, humidity_pct)
VALUES (%s, %s, %s, %s)
""", ("Copenhagen", "2024-01-15T10:00:00Z", 3.2, 82))
conn.commit()
There is one important difference from SQLite: psycopg2's with conn does not close the connection when the block ends. It commits on success and rolls back on exception, but the connection stays open. If you want the connection closed automatically, wrap it in a helper:
from contextlib import closing
with closing(psycopg2.connect(os.environ["POSTGRES_URL"])) as conn:
with conn.cursor() as cur:
cur.execute("SELECT COUNT(*) FROM weather_readings")
print(cur.fetchone()[0])
conn.commit()
# conn is now closed
closing() from the standard library calls conn.close() when the block exits. This is the safest pattern for scripts that run once and stop (like a pipeline job). For your assignment, use this pattern so your container does not leave dangling connections on the shared server.
<aside> 💡 Summary of the patterns: with conn commits on success and rolls back on exception, but leaves the connection open; with conn.cursor() as cur closes the cursor when the block ends; closing(conn) closes the connection itself. A run-once script should combine all three.
</aside>
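Putting the pieces together for your pipeline, here is a sketch of a run-once save step that reuses the create_weather_table and insert_readings functions defined earlier:

import os
from contextlib import closing
import psycopg2

def save_to_postgres(readings):
    """Create the table if needed, upsert the readings, and always close the connection."""
    with closing(psycopg2.connect(os.environ["POSTGRES_URL"])) as conn:
        create_weather_table(conn)
        insert_readings(conn, readings)
    # closing() guarantees conn.close() here, even if an exception occurred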
Your teacher has created a dedicated database user for your pipeline with only the permissions it needs (creating tables, inserting rows, querying data). This is called the principle of least privilege: give each user or service the minimum access it needs to do its job.
Why does this matter? If your pipeline's credentials leak, an attacker can only do what the pipeline user is allowed to do. If that user cannot drop databases or read other teams' data, the damage is limited.
In production, you would create separate roles per service:
pipeline_user → INSERT, SELECT, CREATE TABLE on pipeline database
dashboard_user → SELECT only (read-only for reporting)
admin → Full access (used only for maintenance)
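For illustration only, an administrator might create the read-only role with statements along these lines. The role name comes from the list above; POSTGRES_ADMIN_URL and the password are placeholders, and on the class server your teacher manages this:

import os
from contextlib import closing
import psycopg2

with closing(psycopg2.connect(os.environ["POSTGRES_ADMIN_URL"])) as conn:
    with conn.cursor() as cur:
        cur.execute("CREATE ROLE dashboard_user LOGIN PASSWORD '<PASSWORD>'")
        cur.execute("GRANT CONNECT ON DATABASE weather TO dashboard_user")
        cur.execute("GRANT USAGE ON SCHEMA public TO dashboard_user")
        cur.execute("GRANT SELECT ON ALL TABLES IN SCHEMA public TO dashboard_user")
    conn.commit()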
<aside> ⚠️ Do not use the admin user for applications. Your teacher has created a least-privilege user for your pipeline.
</aside>
After inserting rows, verify them with a query:
cur = conn.cursor()
# Count rows
cur.execute("SELECT COUNT(*) FROM weather_readings")
count = cur.fetchone()[0]
print(f"Total rows: {count}")
# Sample recent rows
cur.execute("""
SELECT station, timestamp, temperature_c
FROM weather_readings
ORDER BY ingested_at DESC
LIMIT 5
""")
for row in cur.fetchall():
print(row)
cur.close()
psql is the official command-line client for PostgreSQL. It ships with every Postgres installation and is the tool that database administrators and data engineers reach for first when they need to inspect a database quickly.
Why use psql instead of Python? Running a query from Python means editing and rerunning a script; psql lets you poke around interactively, one query at a time.
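For example, assuming POSTGRES_URL is exported as before, you can open an interactive session straight from your terminal:

psql "$POSTGRES_URL"

Inside the session, \dt lists your tables, \d weather_readings shows the table's columns and types, any statement ending in a semicolon runs immediately (try SELECT COUNT(*) FROM weather_readings;), and \q quits.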