1. Introduction to Containers and CI/CD
You have built ingestion and transformation pipelines that work on your machine. The next step is to make them work reliably on any machine: your teammate's laptop, a CI runner, or Azure. That is where containers and CI/CD come in.
By the end of this chapter, you should understand why containerization matters for data engineering, what CI/CD actually does for your pipeline, and how the rest of Week 5 fits together.
A pipeline can fail outside your laptop for small, invisible reasons: a different Python version, a missing system dependency, or a library upgrade. These issues are expensive in data engineering because a broken pipeline means missing data, late reports, and angry stakeholders.
<aside> 💡 A reproducible pipeline is not just code: it is code plus the environment that runs it.
</aside>
An image is a template: an immutable snapshot of your app and its dependencies. You build it with docker build. It stays on disk until you delete it.
A container is a running instance of that image. You start it with docker run. When it exits, the container stops, but the image remains. So you build an image once, then run it anywhere as a container.
Containers are lighter than virtual machines (VMs). A VM runs a full operating system; a container shares the host machine's kernel (the core of the operating system that manages CPU, memory, and devices) and only isolates your app and its dependencies. That makes containers faster to start and easier to reproduce across machines.
<aside> 🤓 Curious Geek: Containers vs virtual machines
Want to understand why sharing the kernel makes containers lighter than VMs? The Docker docs explain the difference in What is a container?.
</aside>
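You can see kernel sharing directly. On a Linux host with Docker installed, the kernel version reported inside a container matches the host's, because the container never boots its own operating system (the `alpine` image here is just a convenient small example):

```shell
# Kernel version on the host
uname -r

# Kernel version inside an Alpine container: identical to the host's,
# because the container shares the host kernel rather than running its own OS
docker run --rm alpine uname -r
```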
In data engineering, reproducibility is critical. A pipeline that ran last month might fail today because someone upgraded pandas or the Python version changed. Containers pin the exact Python version, library versions, and system dependencies. When you run the same image in local dev, CI, and production, you get the same behavior everywhere.
<aside> 💡 Containers ensure identical Python, libraries, and dependencies in local, CI, and production environments.
</aside>
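In practice, that pinning happens in the Dockerfile. The sketch below is illustrative, not the chapter's actual file; the file names, versions, and entry point are assumptions:

```dockerfile
# Pin the exact interpreter by choosing a specific base image tag
FROM python:3.11-slim

WORKDIR /app

# Pin library versions in requirements.txt, e.g. pandas==2.2.2
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the pipeline code and define the default command
COPY . .
CMD ["python", "pipeline.py"]
```

Because every version is fixed in the image, rebuilding it next month produces the same environment as today.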
```shell
# Build an image from a Dockerfile
docker build -t weather-pipeline:1.0 .

# Run it with environment variables
docker run --rm -e API_KEY="redacted" weather-pipeline:1.0
```
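Inside the container, your pipeline code picks up that environment variable with the standard library. A minimal sketch (the function name is illustrative):

```python
import os

def get_api_key() -> str:
    # Read the key injected via `docker run -e API_KEY=...`
    key = os.environ.get("API_KEY")
    if not key:
        raise RuntimeError(
            "API_KEY is not set; pass it with: docker run -e API_KEY=... weather-pipeline:1.0"
        )
    return key
```

Failing fast with a clear message when the variable is missing saves you from a pipeline that silently runs with no credentials.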
Continuous Integration (CI) runs automated checks on every push. Continuous Deployment (CD) moves verified builds into production. For data pipelines, CI/CD typically includes:
- linting and running unit tests on every push,
- building the Docker image to prove it still builds,
- deploying or scheduling the verified image.
In a web app, broken CI usually means a failed deployment. In a data pipeline, it can mean corrupt or missing data that goes unnoticed and affects decisions.
If you are unsure what to automate first, you can use an LLM to draft a CI checklist.
<aside> 💡 Using AI to help: Ask an LLM to suggest a minimal CI pipeline for a Python data project. (⚠️ Ensure no PII or sensitive company data is included!)
</aside>
Use the suggestion as a starting point, then adapt it to your pipeline.
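As a concrete starting point, such a checklist might translate into a workflow like the sketch below (GitHub Actions syntax; the file path, tool choices, and step names are assumptions, not the chapter's actual setup):

```yaml
# .github/workflows/ci.yml -- minimal CI sketch for a Python data project
name: ci
on: [push]

jobs:
  checks:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt
      - run: ruff check .                           # lint
      - run: pytest                                 # unit tests
      - run: docker build -t weather-pipeline:ci .  # prove the image builds
```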
<aside>
📘 Core program connection: You already covered Docker basics in the Core program: what containers are, Dockerfiles, and docker run. In the Docker Fundamentals chapter later this week we go deeper on what matters for data pipelines specifically: lean base images, layer caching for fast CI builds, and tagging images for a registry. Refresh here: Continuous integration
</aside>
The key difference in data engineering: a broken container means a missed pipeline run, not just a failed page load. Reproducibility is non-negotiable.
A typical delivery flow looks like this:
1. Write and commit your pipeline code.
2. Push it; CI runs lint and tests automatically.
3. CI builds a Docker image from your Dockerfile.
4. The image is pushed to a registry.
5. The image is deployed and scheduled to run in production.
<aside> ⌨️ Hands on: Write down the five steps above and map them to your Week 3 ingestion pipeline. Which steps are new for you?
</aside>
That mapping makes it clear where containers and CI/CD add value.
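Sketched as commands, the build-and-ship part of that flow might look like this (the registry URL and tags are placeholders):

```shell
# Build the image after tests pass
docker build -t weather-pipeline:1.1 .

# Tag and push it to a registry (placeholder registry URL)
docker tag weather-pipeline:1.1 myregistry.azurecr.io/weather-pipeline:1.1
docker push myregistry.azurecr.io/weather-pipeline:1.1

# Run the pushed image anywhere that can reach the registry
docker run --rm myregistry.azurecr.io/weather-pipeline:1.1
```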
<aside> 🤓 Curious Geek: Docker wasn't the first
Docker popularized containers, but the idea existed earlier. LXC (Linux Containers) and Solaris Zones provided process and filesystem isolation before Docker. Docker made the experience simple and built the image-and-registry workflow we use today. On Linux, the underlying mechanism is namespaces (isolation) and cgroups (resource limits). For a deeper dive, see Red Hat: What are Linux namespaces and cgroups?.
</aside>
The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0

Found a mistake or have a suggestion? Let us know in the feedback form.