Week 6 - Cloud and Azure Essentials

History of Cloud Computing

This page is optional. It covers the history behind the cloud services you use this week. If you prefer to focus on the hands-on work, skip this and come back when you are curious.

Managing the cloud: portals, APIs, and CLIs

Background for Chapter 2: Azure CLI and the Portal

The earliest way most teams managed cloud services (mid-2000s) was the web portal: click a button, get a VM. That worked for small setups, but clicking through forms does not scale. You cannot reproduce a portal session, share it with a colleague, or run it in a CI pipeline.

Underneath the portals, though, the APIs came first: Amazon's EC2 API (2006) let you create and destroy VMs with plain HTTP calls. Every cloud service you use today, from storage to databases to container jobs, is ultimately an HTTP API under the hood.

CLIs came next as wrappers around those APIs. The AWS CLI launched in 2013. Google Cloud followed with gcloud. Microsoft released the Azure CLI (az) in 2017, rewriting an earlier Node.js-based tool in Python for better cross-platform support. The az CLI is open source and translates every command into REST API calls (you can see them with --debug).

The pattern across all clouds is the same: <cli> <service> <action> --flags. Learn one, and the others feel familiar.
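
As a quick illustration, here is roughly the same action (listing your storage) on each of the three CLIs. This is a sketch: the exact flags and output formats differ per provider, and each command assumes you are already logged in.

```bash
# Same pattern, three clouds: <cli> <service> <action> --flags
az storage account list --output table   # Azure
aws s3 ls                                # AWS
gcloud storage ls                        # Google Cloud

# Add --debug to any az command to watch the underlying REST API calls:
az storage account list --debug
```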

Infrastructure as Code (IaC) tools like Terraform (2014) and Bicep (Azure, 2020) took this further: define your entire infrastructure in files, version them in git, and deploy with a single command. You will work with Bicep later in this track.

Object storage: from file servers to S3

Background for Chapter 3: Azure Blob Storage

Before cloud storage, companies ran their own file servers and NAS (Network Attached Storage) devices. You bought hardware, set up RAID arrays for redundancy, managed backups, and worried about disk failures. Scaling meant buying more hardware and connecting it to the network. If a disk died, you restored from tape backups (if they worked).

Amazon launched S3 (Simple Storage Service) in March 2006, and it changed how companies think about storing data. S3 introduced the idea that storage could be a service: you upload objects (files), the provider handles replication, durability, and scaling. You pay only for what you use. S3 stores data across multiple devices in multiple facilities, achieving 99.999999999% (eleven nines) durability, meaning you are more likely to lose data to a meteor strike than to a storage failure.

S3 was one of the first AWS services ever launched and is widely considered the starting point of modern cloud computing. It grew from Amazon's internal need to store massive amounts of data reliably for their retail operations. If you want to see how it evolved over the years, check out this AWS S3 history and timeline.

Every major cloud provider now offers the same concept under different names:

| Provider | Service | Launched |
| --- | --- | --- |
| AWS | S3 | 2006 |
| Google Cloud | Cloud Storage | 2010 |
| Microsoft Azure | Blob Storage | 2010 |

The APIs differ, but the mental model is identical: you put objects in, you get objects out. Objects are immutable (you replace the whole thing, not edit it in place), and the storage scales automatically. This "write once, read many times" pattern is exactly what data pipelines need.
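
A sketch of that put/get model with the az CLI (the account and container names here are hypothetical, and you need to be authenticated, for example via az login):

```bash
# Put an object in (uploading again under the same name replaces it whole):
az storage blob upload \
  --account-name mystorageacct \
  --container-name raw-data \
  --name sales.csv \
  --file ./sales.csv \
  --auth-mode login

# Get the object back out:
az storage blob download \
  --account-name mystorageacct \
  --container-name raw-data \
  --name sales.csv \
  --file ./sales-copy.csv \
  --auth-mode login
```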

Why object storage is so cheap: Object storage uses commodity hardware with built-in redundancy. Your file is automatically replicated across multiple disks (and optionally across regions). Because there is no file system overhead (no directories, no locks, no in-place edits), providers can offer storage at fractions of a cent per gigabyte per month.

Databases: from flat files to managed Postgres

Background for Chapter 4: Azure PostgreSQL Databases

The history of databases is a story of moving from "you manage everything" to "you write SQL, the platform handles the rest."

Flat files (1960s-70s): The earliest data storage was sequential files on tape. Programs read through entire files to find records. No querying, no indexing, no structure beyond what the programmer imposed.

Relational databases (1970s-80s): Edgar Codd published his relational model in 1970 at IBM. The idea: store data in tables with rows and columns, and use a declarative language (SQL) to query it. Oracle (1979), IBM DB2 (1983), and Microsoft SQL Server (1989) commercialized this model. These databases ran on expensive proprietary hardware and cost thousands of dollars in licenses.

Open source databases (1990s-2000s): MySQL (1995) and PostgreSQL (1996) made relational databases free and accessible. PostgreSQL started as an academic project at UC Berkeley (originally called "Postgres," short for "post-Ingres") and has been in continuous development for 30 years. It is consistently ranked as the most admired database in Stack Overflow's annual developer survey.

SQLite (2000): SQLite took a different approach: an embedded database that runs inside your application as a library, not as a separate server. No setup, no configuration, no network. It stores everything in a single file. SQLite is the most widely deployed database in the world: it runs on every smartphone, in every web browser, and in Python's standard library. You used it in Weeks 3 and 4.

Managed database services (2010s): Running a production database means handling backups, failover, patching, monitoring, and scaling. Cloud providers started offering managed versions: Amazon launched RDS (Relational Database Service) in 2009, supporting MySQL and later PostgreSQL. Google followed with Cloud SQL. Microsoft launched Azure Database for PostgreSQL in 2017. With a managed service, you get a running Postgres server with automatic backups, patching, and monitoring. You focus on writing SQL; the platform handles the infrastructure.
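
To make the contrast concrete, provisioning a managed Postgres server is a single CLI call. A sketch with hypothetical names; anything you omit (region, SKU, admin credentials) is prompted for or filled with defaults:

```bash
# Azure provisions the server, storage, backups, and patching for you:
az postgres flexible-server create \
  --resource-group my-rg \
  --name my-pipeline-db

# From then on, your job is SQL, not server administration:
psql "host=my-pipeline-db.postgres.database.azure.com dbname=postgres user=<admin> sslmode=require"
```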

PostgreSQL at scale today: OpenAI runs PostgreSQL as the primary database behind ChatGPT. In their blog post Scaling PostgreSQL, they describe running clusters with hundreds of millions of rows, thousands of queries per second, and terabytes of data. Rather than switching to a specialized NoSQL database, they invested in tuning Postgres: connection pooling, read replicas, partitioned tables, and careful index management. Their conclusion: PostgreSQL scales further than most people expect.

| Era | Example | You manage |
| --- | --- | --- |
| Flat files | CSV on tape | Everything |
| Self-hosted RDBMS | PostgreSQL on your server | Hardware, OS, backups, Postgres |
| Managed RDBMS | Azure Database for PostgreSQL | Schema, queries, permissions |
| Embedded | SQLite | Nothing (it is a file) |

Cloud pricing: from CapEx to pay-per-use

Background for Chapter 6: Cost Awareness

Before the cloud, running infrastructure meant capital expenditure (CapEx): you bought servers, storage arrays, and network gear upfront, then depreciated them over 3-5 years. If you needed more capacity, you placed a hardware order and waited weeks. If you overestimated, the extra hardware sat idle. Either way, the money was spent.

Cloud computing introduced operational expenditure (OpEx): pay only for what you use, when you use it. Amazon's EC2 launched in 2006 with per-hour billing. Over time, providers moved to finer granularity: per-minute (Google Compute Engine, 2013), per-second (AWS and Azure, 2017), and sub-second for functions: AWS Lambda (2014) billed per 100 milliseconds at launch and per millisecond today. The trend is clear: the smaller the billing unit, the less you pay for idle time.

Reserved instances and commitments added a middle ground. Cloud providers noticed that many workloads run 24/7 and offered discounts (30-70%) for committing to 1-3 years of usage. This brings back some CapEx-style thinking: you trade flexibility for lower prices. Azure calls these Reservations; AWS offers Reserved Instances and, more recently, Savings Plans.

Free tiers emerged as a way to attract developers. Azure offers 180,000 vCPU-seconds/month for Container Apps (enough for small pipelines to run for free), 5 GB of blob storage, and 750 hours of B1ms compute for the first 12 months. AWS and Google Cloud have similar programs. These free tiers are why your Week 6 Container App Jobs cost nothing, even though they run real compute.

| Pricing model | How it works | Best for |
| --- | --- | --- |
| Pay-as-you-go | Per second/hour, no commitment | Experiments, variable workloads |
| Reserved instances | 1-3 year commitment, 30-70% discount | Steady-state production |
| Free tier | Limited free usage per month | Learning, small pipelines |
| Spot/preemptible | Spare capacity at 60-90% discount, can be interrupted | Batch jobs that can retry |

The key insight for data engineers: your pipeline runs for 30 seconds and then stops. With per-second billing and a free tier, that costs nothing. But the Postgres server backing it runs 24/7 and costs ~€13/month even when idle. Understanding which resources bill continuously versus on-demand is the core of Chapter 6.
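
A back-of-the-envelope check, assuming the job gets 1 vCPU (the actual allocation depends on how you configure the job):

```bash
# One 30-second run per day, 30 days a month, at 1 vCPU:
echo "$(( 30 * 30 )) vCPU-seconds per month"   # 900 -- far below the 180,000 free grant
```

Even a job running every hour would use about 21,600 vCPU-seconds, still comfortably inside the free tier. The Postgres server has no such idle state to scale down to, which is why it dominates the bill.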

From bare metal to serverless

Background for Chapter 5: Azure Container Apps Jobs

Computing infrastructure has gone through four major stages. Each stage handed more responsibility to the platform and less to you.

1. Bare metal

You buy or rent a physical server. You install the OS, runtime, and your code. You handle hardware failures, security patches, and networking. The machine sits in a datacenter running 24/7 whether you use it or not.

For decades, this was the only option. Companies built entire teams around managing racks of servers.

2. Virtual machines (VMs)

Virtualization lets one physical machine run multiple operating systems. Instead of buying hardware, you rent a VM from a cloud provider.

Amazon launched EC2 in 2006, making it possible to rent a VM by the hour. Microsoft followed with Azure Virtual Machines. You still manage the OS and runtime, but you skip the hardware. Cloud VMs can be created and destroyed on demand, which is a big improvement over owning hardware.

3. Containers

Docker (2013) introduced lightweight isolation: package your code, dependencies, and runtime into an image that runs the same everywhere. In Week 5, you built a Docker image for your pipeline.

Containers are smaller and faster to start than a full VM, but they still need a VM (or a physical machine) underneath to run on. Google open-sourced Kubernetes (K8s) in 2014, an orchestration system for running containers across many VMs, but operating a K8s cluster is complex work in itself.

4. Serverless

You upload your code (or a container image), define a trigger, and the platform handles everything else: provisioning, execution, teardown. You pay only for the seconds your code runs.

AWS introduced Lambda in 2014 (run a function, pay per invocation). Microsoft followed with Azure Functions in 2016. Managed container services like AWS Fargate, Google Cloud Run, and Azure Container Apps (2022) apply the same idea to whole containers; Cloud Run and Container Apps run on Kubernetes under the hood but hide that complexity from you.

The term "serverless" is misleading: servers still exist, but the cloud provider manages them. You never see or configure those servers.

Managed vs serverless

In practice, most cloud infrastructure falls somewhere between "fully managed" and "serverless," and the distinction matters: a managed service (such as Azure Database for PostgreSQL) runs and bills continuously, even when idle, while a serverless service (such as a Container Apps Job) scales to zero and bills only while it runs.

Many real-world architectures combine both: a serverless Container App Job (runs for 30 seconds, costs nothing) writing to a managed Postgres database (runs 24/7, costs ~€13/month). Understanding which parts of your pipeline are serverless and which are managed is key to estimating costs (see Chapter 6).

Comparison

| | Bare metal | VM | Container on a VM | Serverless |
| --- | --- | --- | --- | --- |
| You manage | Everything | OS, runtime, code | Docker, runtime, code | Code and config |
| Runs | 24/7 | 24/7 (or on-demand) | When you start it | Only when triggered |
| Cost | Fixed hardware | Per hour | Per hour (VM underneath) | Per second of execution |
| Start-up time | Days/weeks | Minutes | Seconds | Seconds |

<aside> 🎬 Animation: the four steps, from bare metal to serverless

</aside>

Each step removed a layer of infrastructure you had to manage yourself. In this week's chapters, you work at step 4: you provide a container image, and Azure Container Apps handles the rest.

The three major cloud providers

Three providers dominate the cloud market. Their services overlap significantly: if you learn to deploy containers, manage databases, and use object storage on one, the concepts transfer directly to the others.

| Provider | CLI | Object storage | Managed PostgreSQL | Serverless containers | Known for |
| --- | --- | --- | --- | --- | --- |
| AWS | aws | S3 | RDS | Fargate | First to market, largest global share |
| Google Cloud | gcloud | Cloud Storage | Cloud SQL | Cloud Run | Data- and ML-heavy workloads |
| Microsoft Azure | az | Blob Storage | Azure Database for PostgreSQL | Container Apps | Enterprise, Microsoft integration |

AWS was first and has the largest global market share. Azure dominates in enterprise, particularly in Europe and the Netherlands, where existing Microsoft contracts (Office 365, Active Directory) make adoption straightforward. GCP is popular in data-heavy and ML-focused organizations, partly because of BigQuery and TensorFlow.

This course uses Azure because of its strong position in Dutch business. The skills you build (working with CLIs, deploying containers, connecting to managed databases, estimating costs) apply across all three providers. Each cloud has its own naming conventions and gotchas, but the underlying patterns are the same.

Extra reading