Week 6 - Cloud and Azure Essentials
In Weeks 3 and 4, your pipeline wrote to SQLite. That works locally, but SQLite is a file-based database: it does not support concurrent access and cannot run as a shared service. In the cloud, you need a real database server. Azure Database for PostgreSQL is a managed service that runs Postgres without you managing servers.
Your teacher has created a shared Postgres server for the class. You will connect to it, create tables in your own database, and write pipeline output there.
By the end of this chapter, you should be able to connect to Azure Postgres from Python, create tables, insert rows, and understand basic security and access patterns.
In Week 3 you used SQLite because it requires no setup: it is a single file on disk. SQLite is perfect for learning SQL, prototyping, and single-user pipelines. It is embedded in Python's standard library, runs everywhere, and powers billions of devices (every smartphone, every browser).
But SQLite has limits. It supports only one writer at a time. It cannot accept network connections, so other services cannot reach it. And when your container stops, the file is gone unless you mount a volume.
PostgreSQL is the next step up. It is a full client-server database: it runs as a separate process (or managed service), accepts connections over the network, and handles multiple users writing at the same time. PostgreSQL has been in active development since 1996 and is consistently ranked as the most popular database among professional developers in Stack Overflow's annual survey.
| Feature | SQLite | PostgreSQL |
|---|---|---|
| Architecture | Embedded (library, no server) | Client-server (runs as a service) |
| Concurrency | Single writer at a time | Many concurrent readers and writers |
| Network access | Local file only | TCP connections from anywhere |
| Data types | Flexible (any column accepts any type) | Strict types, enforced at insert |
| Scale | Small to medium datasets | Terabytes, millions of rows per second |
| Managed cloud option | None (it is a file) | Azure, AWS RDS, Google Cloud SQL |
| Best for | Prototyping, local pipelines, embedded apps | Production pipelines, multi-user, cloud |
Both SQLite and PostgreSQL are relational databases: they store data in tables with rows and columns, and you query them with SQL. There are also non-relational (NoSQL) databases like MongoDB and Redis that are designed for different access patterns, but relational databases remain the standard for structured data in pipelines. For the full history of how databases evolved from flat files to managed services, see History of Cloud Computing.
The SQL you learned with SQLite carries over directly. CREATE TABLE, INSERT, SELECT, ON CONFLICT all work the same way. The main differences are the connection setup and the fact that Postgres enforces types more strictly.
<aside> 📘 Core program connection: You learned relational database fundamentals and SQL in the Core program. Here you apply the same ideas to a managed Postgres service. Refresher: https://www.notion.so/hackyourfuture/Introduction-to-databases-and-persistence-2fb50f64ffc981e683e0ddf822b57be5
</aside>
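To make "stricter types" concrete, here is a small sketch. SQLite happily stores a string in an INTEGER column; Postgres rejects the equivalent insert at write time (the error shown is an example of what psycopg2 raises):

import sqlite3

# SQLite's flexible typing: the INTEGER column accepts a string without complaint
lite = sqlite3.connect(":memory:")
lite.execute("CREATE TABLE t (n INTEGER)")
lite.execute("INSERT INTO t VALUES ('not a number')")  # succeeds

# Postgres enforces the column type at insert time. The same insert through
# psycopg2 fails with an error along these lines:
#   cur.execute("INSERT INTO t VALUES (%s)", ("not a number",))
#   -> invalid input syntax for type integer: "not a number"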
When you install Postgres on your own machine, you are responsible for everything: updates, backups, disk space, security patches, and monitoring. A managed database service like Azure Database for PostgreSQL handles all of that for you. You get a running Postgres server with automatic backups, patching, and monitoring. You focus on writing SQL; Azure handles the infrastructure.
This is the same pattern as other managed services in this track: Azure Blob Storage manages your files, Azure Container Apps manages your containers, and Azure Database for PostgreSQL manages your database.
To connect to a database, you need a connection string: a URL that contains the host, port, database name, username, and password.
Azure Postgres requires SSL by default, which encrypts the connection between your code and the database. A typical connection string looks like:
postgresql://user:password@host:5432/dbname?sslmode=require
The parts:
- postgresql:// -- the protocol (like https:// for the web)
- user:password -- authentication credentials
- host:5432 -- the server address and port (5432 is Postgres's default)
- dbname -- the specific database on the server
- sslmode=require -- enforce encrypted connections

<aside> ⚠️ Connection strings contain passwords. Never commit them to git. Always use environment variables, as you learned in Week 2 and Week 5.
</aside>
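If your credentials arrive as separate settings rather than as one URL, you can assemble the string in Python. A minimal sketch, assuming hypothetical PG_* environment variables; quote_plus URL-encodes passwords that contain special characters like @ or /:

import os
from urllib.parse import quote_plus

# Hypothetical variable names -- adapt them to whatever your setup provides
user = os.environ["PG_USER"]
password = quote_plus(os.environ["PG_PASSWORD"])  # URL-encode special characters
host = os.environ["PG_HOST"]
dbname = os.environ["PG_DBNAME"]

url = f"postgresql://{user}:{password}@{host}:5432/{dbname}?sslmode=require"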
To talk to PostgreSQL from Python, you use psycopg2, the most widely used Python adapter for PostgreSQL. It translates Python function calls into the PostgreSQL wire protocol so you can run SQL queries directly from your code.
Install it with pip:
pip install psycopg2-binary
<aside>
💡 psycopg2-binary includes a bundled Postgres client library so it installs without extra system dependencies. This is fine for learning and local development. In production Docker images, teams often use psycopg2 (without -binary) with the system libpq package for better security update coverage.
</aside>
Connect and run a query:
import psycopg2
conn = psycopg2.connect(
host="your-server.postgres.database.azure.com",
dbname="weather",
user="pipeline_user",
password="<PASSWORD>", # use the credentials your teacher provided
sslmode="require",
)
cur = conn.cursor()
cur.execute("SELECT version()")
print(cur.fetchone())
cur.close()
conn.close()
You can also connect using a connection string from an environment variable, which is the pattern your pipeline will use:
import os
import psycopg2
conn = psycopg2.connect(os.environ["POSTGRES_URL"])
This is cleaner and matches how you will pass configuration to your container in Chapter 5.
<aside>
💡 Your teacher will provide the POSTGRES_URL value. Set it as an environment variable before running the code: export POSTGRES_URL="postgresql://user:password@host:5432/dbname?sslmode=require". In Chapter 5 you will see how to retrieve it from Azure Key Vault.
</aside>
Once you have the connection string set as an environment variable, try connecting:
<aside>
⌨️ Hands on: Connect to the shared Postgres server using the credentials your teacher provided. Run SELECT version() to verify the connection works and see which Postgres version is running.
</aside>
The SQL for creating tables is nearly identical to SQLite. The main difference is that Postgres has stricter and richer data types:
def create_weather_table(conn):
"""Create the weather_readings table if it does not exist."""
cur = conn.cursor()
cur.execute("""
CREATE TABLE IF NOT EXISTS weather_readings (
id SERIAL PRIMARY KEY,
station TEXT NOT NULL,
timestamp TIMESTAMPTZ NOT NULL,
temperature_c DOUBLE PRECISION NOT NULL,
humidity_pct INTEGER NOT NULL,
ingested_at TIMESTAMPTZ DEFAULT NOW(),
UNIQUE(station, timestamp)
)
""")
conn.commit()
cur.close()
Key differences from SQLite:
- SERIAL instead of INTEGER PRIMARY KEY AUTOINCREMENT -- Postgres auto-generates IDs with SERIAL
- TIMESTAMPTZ instead of TEXT for timestamps -- Postgres has native timestamp types with timezone support
- DOUBLE PRECISION instead of REAL -- same concept, different name
- NOW() instead of CURRENT_TIMESTAMP -- both work, but NOW() is more common in Postgres

Inserting rows works the same as in SQLite, but Postgres uses %s as the placeholder instead of ?:
def insert_readings(conn, readings):
"""Insert weather readings into the database."""
cur = conn.cursor()
for r in readings:
cur.execute(
"""
INSERT INTO weather_readings (station, timestamp, temperature_c, humidity_pct)
VALUES (%s, %s, %s, %s)
ON CONFLICT (station, timestamp) DO UPDATE SET
temperature_c = EXCLUDED.temperature_c,
humidity_pct = EXCLUDED.humidity_pct,
ingested_at = NOW()
""",
(r["station"], r["timestamp"], r["temperature_c"], r["humidity_pct"]),
)
conn.commit()
cur.close()
The upsert pattern (ON CONFLICT ... DO UPDATE SET) works the same way as in SQLite. EXCLUDED is a special table that refers to the row you proposed for insertion, so the DO UPDATE clause can copy its values into the existing row.
<aside>
⚠️ Postgres uses %s for parameterized queries, not ? like SQLite. This is a psycopg2 convention. The security principle is the same: never use f-strings or string formatting for SQL.
</aside>
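To see why the rule matters, compare the two styles with a deliberately hostile value. A fragment, assuming an open cursor cur:

station = "Copenhagen'; DROP TABLE weather_readings; --"  # hostile input

# UNSAFE: string formatting splices the value into the SQL text itself,
# so the quote and semicolon can change the meaning of the statement
cur.execute(f"SELECT * FROM weather_readings WHERE station = '{station}'")

# SAFE: the value travels separately from the SQL, so it can never become SQL
cur.execute("SELECT * FROM weather_readings WHERE station = %s", (station,))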
In Week 3 you learned to use with statements for SQLite connections. The same pattern works with psycopg2 and is even more important here: a Postgres connection is a network resource. If your script crashes before calling conn.close(), the connection stays open on the server. Enough leaked connections and the server runs out of slots, blocking everyone.
The with statement guarantees cleanup even if an exception occurs:
import os
import psycopg2
with psycopg2.connect(os.environ["POSTGRES_URL"]) as conn:
with conn.cursor() as cur:
cur.execute("""
INSERT INTO weather_readings (station, timestamp, temperature_c, humidity_pct)
VALUES (%s, %s, %s, %s)
""", ("Copenhagen", "2024-01-15T10:00:00Z", 3.2, 82))
conn.commit()
There is one important difference from SQLite: psycopg2's with conn does not close the connection when the block ends. It commits on success and rolls back on exception, but the connection stays open. If you want the connection closed automatically, wrap it in a helper:
from contextlib import closing
with closing(psycopg2.connect(os.environ["POSTGRES_URL"])) as conn:
with conn.cursor() as cur:
cur.execute("SELECT COUNT(*) FROM weather_readings")
print(cur.fetchone()[0])
conn.commit()
# conn is now closed
closing() from the standard library calls conn.close() when the block exits. This is the safest pattern for scripts that run once and stop (like a pipeline job). For your assignment, use this pattern so your container does not leave dangling connections on the shared server.
<aside> 💡 Summary of the patterns: with conn commits on success and rolls back on exception, but leaves the connection open; with conn.cursor() as cur closes the cursor when the block ends; closing(conn) closes the connection itself. A run-once script should combine all three.
</aside>
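Putting the pieces together for your pipeline, here is a sketch of a run-once save step that reuses the create_weather_table and insert_readings functions defined earlier:

import os
from contextlib import closing
import psycopg2

def save_to_postgres(readings):
    """Create the table if needed, upsert the readings, and always close the connection."""
    with closing(psycopg2.connect(os.environ["POSTGRES_URL"])) as conn:
        create_weather_table(conn)
        insert_readings(conn, readings)
    # closing() guarantees conn.close() here, even if an exception occurred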
Your teacher has created a dedicated database user for your pipeline with only the permissions it needs (creating tables, inserting rows, querying data). This is called the principle of least privilege: give each user or service the minimum access it needs to do its job.
Why does this matter? If your pipeline's credentials leak, an attacker can only do what the pipeline user is allowed to do. If that user cannot drop databases or read other teams' data, the damage is limited.
In production, you would create separate roles per service:
pipeline_user → INSERT, SELECT, CREATE TABLE on pipeline database
dashboard_user → SELECT only (read-only for reporting)
admin → Full access (used only for maintenance)
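For illustration only, an administrator might create the read-only role with statements along these lines. The role name comes from the list above; POSTGRES_ADMIN_URL and the password are placeholders, and on the class server your teacher manages this:

import os
from contextlib import closing
import psycopg2

with closing(psycopg2.connect(os.environ["POSTGRES_ADMIN_URL"])) as conn:
    with conn.cursor() as cur:
        cur.execute("CREATE ROLE dashboard_user LOGIN PASSWORD '<PASSWORD>'")
        cur.execute("GRANT CONNECT ON DATABASE weather TO dashboard_user")
        cur.execute("GRANT USAGE ON SCHEMA public TO dashboard_user")
        cur.execute("GRANT SELECT ON ALL TABLES IN SCHEMA public TO dashboard_user")
    conn.commit()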
<aside> ⚠️ Do not use the admin user for applications. Your teacher has created a least-privilege user for your pipeline.
</aside>
After inserting rows, verify them with a query:
cur = conn.cursor()
# Count rows
cur.execute("SELECT COUNT(*) FROM weather_readings")
count = cur.fetchone()[0]
print(f"Total rows: {count}")
# Sample recent rows
cur.execute("""
SELECT station, timestamp, temperature_c
FROM weather_readings
ORDER BY ingested_at DESC
LIMIT 5
""")
for row in cur.fetchall():
print(row)
cur.close()
psql is the official command-line client for PostgreSQL. It ships with every Postgres installation and is the tool that database administrators and data engineers reach for first when they need to inspect a database quickly.
Why use psql instead of Python? Running a query from Python means editing and rerunning a script; psql lets you poke around interactively, one query at a time.
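For example, assuming POSTGRES_URL is exported as before, you can open an interactive session straight from your terminal:

psql "$POSTGRES_URL"

Inside the session, \dt lists your tables, \d weather_readings shows the table's columns and types, any statement ending in a semicolon runs immediately (try SELECT COUNT(*) FROM weather_readings;), and \q quits.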