Week 5 - Containers & CI/CD

Intro: Containers and CI/CD

You have built ingestion and transformation pipelines that work on your machine. The next step is to make them work reliably on any machine: your teammate's laptop, a CI runner, or Azure. That is where containers and CI/CD come in.

By the end of this chapter, you should understand why containerization matters for data engineering, what CI/CD actually does for your pipeline, and how the rest of Week 5 fits together.

This is your first hands-on contact with Azure, and the ramp is deliberately gentle: you spend most of the week on the local Docker and CI skills you are already building, and there is very little new cloud material to learn. You sign in with az acr login, then push a finished image to a registry your teacher has already set up.

Concepts

The "works on my machine" problem

A pipeline can fail outside your laptop for small, invisible reasons: a different Python version, a missing system dependency, or a library upgrade. These issues are expensive in data engineering because a broken pipeline means missing data, late reports, and angry stakeholders.

<aside> 💡 A reproducible pipeline is not just code: it is code plus the environment that runs it.

</aside>
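To make those invisible differences visible, a pipeline can print a short fingerprint of its environment at startup. The sketch below is only an illustration: the function name `environment_fingerprint` and the package names `pandas` and `requests` are assumptions, not part of the course material. Comparing the output on your laptop and on a CI runner often shows the mismatch directly.

```python
import platform
from importlib import metadata


def environment_fingerprint(packages):
    """Collect the details that most often differ between machines."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is missing on this machine.
            versions[name] = "not installed"
    return {
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "os": platform.system(),
        "machine": platform.machine(),
        "packages": versions,
    }


if __name__ == "__main__":
    # "pandas" and "requests" are example names; list your own dependencies.
    for key, value in environment_fingerprint(["pandas", "requests"]).items():
        print(f"{key}: {value}")
```

If two machines print different fingerprints, that difference is the first place to look when a pipeline passes on one and fails on the other.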

Images and containers

An image is a template: an immutable snapshot of your app and its dependencies. You build it with docker build. It stays on disk until you delete it.

A container is a running instance of that image. You start it with docker run. When the container's main process exits, the container stops, but the image remains. So you build an image once, then run it anywhere as a container.

Containers are lighter than virtual machines (VMs). A VM runs a full operating system; a container shares the host machine's kernel (the core of the operating system that manages CPU, memory, and devices) and only isolates your app and its dependencies. That makes containers faster to start and easier to reproduce across machines.

<aside> 🤓 Curious Geek: Containers vs virtual machines

Want to understand why sharing the kernel makes containers lighter than VMs? The Docker docs explain the difference in "What is a container?".

</aside>

Containers in data engineering

In data engineering, reproducibility is critical. A pipeline that ran last month might fail today because someone upgraded pandas or the Python version changed. Containers pin the exact Python version, library versions, and system dependencies. When you run the same image in local dev, CI, and production, you get the same behavior everywhere.

<aside> 💡 Containers ensure identical Python, libraries, and dependencies in local, CI, and production environments.

</aside>
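An image only reproduces the same libraries if the versions it installs are pinned. As a rough sketch of what "pinned" means in practice, the hypothetical helper below (`find_unpinned` is an illustrative name, and the version numbers are made up) flags requirement lines that do not use an exact `==` version. It is deliberately simple and ignores edge cases such as URL-based requirements.

```python
def find_unpinned(requirement_lines):
    """Return requirement lines that do not pin an exact version with '=='."""
    unpinned = []
    for raw in requirement_lines:
        # Drop trailing comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        # Skip blank lines and pip options such as "-r other.txt".
        if not line or line.startswith("-"):
            continue
        if "==" not in line:
            unpinned.append(line)
    return unpinned


if __name__ == "__main__":
    # Illustrative contents of a requirements file.
    example = ["pandas==2.2.2", "requests>=2.0", "# dev tools", "ruff"]
    print(find_unpinned(example))
```

A check like this could run in CI so that an unpinned dependency is caught before the image is built.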

# Build an image from a Dockerfile
docker build -t weather-pipeline:1.0 .

# Run it with environment variables
docker run --rm -e API_KEY="redacted" weather-pipeline:1.0
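Inside the container, the pipeline reads the value passed with `-e` from its environment. The sketch below assumes the variable is called `API_KEY`, as in the command above; the function name `load_api_key` and the error message are illustrative. Failing fast with a clear message is a design choice here, so that a missing secret stops the run immediately instead of surfacing later as a confusing API error.

```python
import os


def load_api_key(env=None):
    """Read API_KEY from the environment and fail fast if it is missing."""
    # Accepting a mapping makes the function easy to test without real secrets.
    env = os.environ if env is None else env
    api_key = env.get("API_KEY", "").strip()
    if not api_key:
        raise RuntimeError(
            "API_KEY is not set; pass it with: docker run -e API_KEY=..."
        )
    return api_key


if __name__ == "__main__":
    try:
        load_api_key()
        print("API key loaded")  # never print the key itself
    except RuntimeError as error:
        print(f"Configuration error: {error}")
```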

CI/CD for data projects

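Continuous Integration (CI) runs automated checks on every push. Continuous Deployment (CD) moves verified builds into production. For data pipelines, CI/CD typically includes the checks and steps covered this week: linting with ruff, unit tests with pytest, building a container image, and pushing it to a registry.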

In a web app, broken CI usually means a failed deployment. In a data pipeline, it can mean corrupt or missing data that goes unnoticed and affects decisions.
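As a small illustration of a check that catches silent data problems, here is a pytest-style test for a transformation step. Everything in it is hypothetical: the function `clean_readings`, the `temp_f` field, and the sample rows are assumptions made for the example, not code from the course pipeline.

```python
def to_celsius(fahrenheit):
    """Convert a Fahrenheit temperature to Celsius."""
    return (fahrenheit - 32) * 5 / 9


def clean_readings(rows):
    """Keep rows that have a temperature and convert it to Celsius."""
    cleaned = []
    for row in rows:
        if row.get("temp_f") is None:
            # Drop rows with a missing reading instead of passing None along.
            continue
        cleaned.append(
            {"city": row["city"], "temp_c": round(to_celsius(row["temp_f"]), 1)}
        )
    return cleaned


def test_clean_readings_drops_missing_and_converts():
    rows = [
        {"city": "Amsterdam", "temp_f": 50.0},
        {"city": "Utrecht", "temp_f": None},
    ]
    assert clean_readings(rows) == [{"city": "Amsterdam", "temp_c": 10.0}]
```

If someone later changes the conversion formula or stops filtering missing readings, this test fails in CI before the bad data reaches a report.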

<aside> 📘 Core program connection: In the Core program Systems week you covered CI, CD, and continuous deployment as a connected workflow. This week applies that same model to Python data pipelines. Refresh here: Core Program - Continuous integration

</aside>

If you are unsure what to automate first, you can use an LLM to draft a CI checklist.

<aside> 💡 Using AI to help: Ask an LLM to suggest a minimal CI pipeline for a Python data project. (⚠️ Ensure no PII or sensitive company data is included!)

</aside>

Use the suggestion as a starting point, then adapt it to your pipeline.

Containers and data pipelines

<aside> 📘 Core program connection: You already covered Docker basics in the Core program: what containers are, Dockerfiles, and docker run. In the Docker Fundamentals chapter later this week we go deeper on what matters for data pipelines specifically: lean base images, layer caching for fast CI builds, and tagging images for a registry. Refresh here: Core Program - Docker

</aside>

The key difference in data engineering: a broken container means a missed pipeline run, not just a failed page load. Reproducibility is non-negotiable.

The container delivery flow

A typical delivery flow looks like this:

  1. Write or update pipeline code.
  2. Run tests locally.
  3. Build a container image.
  4. Push the image to a registry.
  5. Deploy the image to a cloud service.

Steps 1 and 2 use skills you already have from earlier in the track. In Week 2 you wrote unit tests with pytest (Week 2 - Testing with pytest) and set up ruff for linting (Week 2 - Linting and Formatting). CI runs those same checks automatically.
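The middle of the flow (steps 2 to 4) can be sketched as an ordered list of commands. This is an assumption-heavy illustration rather than the course's actual setup: the registry name `myregistry` is hypothetical, the image name and tag reuse the earlier example, and step 5 (deployment) is left out. By default the script only prints the commands; set `dry_run=False` to execute them, which requires ruff, pytest, Docker, and the Azure CLI to be installed.

```python
import subprocess


def delivery_commands(image, tag, registry):
    """Return the commands for steps 2-4 as argument lists, in order."""
    local_ref = f"{image}:{tag}"
    # Azure Container Registry login servers have the form <name>.azurecr.io.
    remote_ref = f"{registry}.azurecr.io/{image}:{tag}"
    return [
        ["ruff", "check", "."],                     # step 2: lint
        ["pytest"],                                 # step 2: unit tests
        ["docker", "build", "-t", local_ref, "."],  # step 3: build the image
        ["az", "acr", "login", "--name", registry], # step 4: sign in
        ["docker", "tag", local_ref, remote_ref],   # step 4: tag for the registry
        ["docker", "push", remote_ref],             # step 4: push
    ]


def run_delivery(commands, dry_run=True):
    """Print each command; run it too when dry_run is False."""
    for command in commands:
        print("$", " ".join(command))
        if not dry_run:
            # check=True stops the flow at the first failing step.
            subprocess.run(command, check=True)


if __name__ == "__main__":
    # "myregistry" is a placeholder; use the registry your teacher set up.
    run_delivery(delivery_commands("weather-pipeline", "1.0", "myregistry"))
```

Stopping at the first failure is the important property: an image is never built or pushed from code that did not pass its checks.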

<aside> 📘 Core program connection: In the Core program you covered unit testing in JavaScript (Core Program - Unit testing). Python's pytest follows the same pattern: write test functions, run them with a command, and read the pass/fail summary. The testing concepts transfer directly.

</aside>

With that context, map the delivery flow to your own pipeline.

<aside> ⌨️ Hands on: Write down the five steps above and map them to your Week 3 ingestion pipeline. Which steps are new for you?

</aside>

That mapping makes it clear where containers and CI/CD add value.

<aside> 📘 Core program connection: CI pipelines trigger on git push and pull requests, using the same branch-and-merge workflow you practiced in the Core program. If the git branching model feels rusty, review it here: Core Program - Git branches

</aside>

Containers themselves have a longer history than Docker suggests.

<aside> 🤓 Curious Geek: Docker wasn't the first

Docker popularized containers, but the idea existed earlier. LXC (Linux Containers) and Solaris Zones provided process and filesystem isolation before Docker. Docker made the experience simple and built the image-and-registry workflow we use today. On Linux, the underlying mechanisms are namespaces (isolation) and cgroups (resource limits). For a deeper dive into the namespace types Docker uses, see Red Hat: The 7 most used Linux namespaces.

For the full arc from chroot (1979) to Kubernetes, see the optional History of Containers and CI/CD page.

</aside>

Exercises

  1. Explain, in one paragraph, why a data pipeline that works locally might fail in CI.
  2. List three CI checks that are useful for your pipeline. For each, explain what bug it would catch.
  3. Sketch the delivery flow from code commit to Azure deployment for this week.

Knowledge Check

<aside> 🚀 Try it in the widget: Interactive Quiz: Introduction to Containers and CI/CD

</aside>

https://lasse.be/simple-hyf-teach-widget/mcq.html?bank=week_5_ch1_intro_containers_cicd&embed=1

If images, containers, and the difference from virtual machines felt abstract, this crash course walks through it from the ground up.

<aside> 🎬 Struggling with this concept? Watch this beginner-friendly video:

Docker Crash Course - For Absolute Beginners | NeuralNine

</aside>

https://www.youtube.com/watch?v=XQNv0SRB0OM

Extra reading

<aside> 📚 For full courses, videos, and community resources on containers and CI/CD, see the optional Going Further page.

</aside>

Ready to apply the concepts? Containerize a working pipeline end to end in the first practice exercise.

<aside> ⌨️ Hands on: Practice with Exercise 1: Minimal Pipeline to Container, where you take a short Python script and containerize it from scratch using the delivery flow from this chapter.

</aside>


Next up: Dependency Management, where you pin exact Python dependencies with requirements.txt or uv so the image you build later installs the same packages on your laptop, in CI, and in production.


The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0 *https://hackyourfuture.net/*


Built with ❤️ by the HackYourFuture community · Thank you, contributors
