1. Introduction to Containers and CI/CD
You have built ingestion and transformation pipelines that work on your machine. The next step is to make them work reliably on any machine: your teammate's laptop, a CI runner, or Azure. That is where containers and CI/CD come in.
By the end of this chapter, you should understand why containerization matters for data engineering, what CI/CD actually does for your pipeline, and how the rest of Week 5 fits together.
A pipeline can fail outside your laptop for small, invisible reasons: a different Python version, a missing system dependency, or a library upgrade. These issues are expensive in data engineering because a broken pipeline means missing data, late reports, and angry stakeholders.
<aside> 💡 A reproducible pipeline is not just code: it is code plus the environment that runs it.
</aside>
An image is a template: an immutable snapshot of your app and its dependencies. You build it with docker build. It stays on disk until you delete it.
A container is a running instance of that image. You start it with docker run. When it exits, the container stops, but the image remains. So you build an image once, then run it anywhere as a container.
Containers are lighter than virtual machines (VMs). A VM runs a full operating system; a container shares the host machine's kernel (the core of the operating system that manages CPU, memory, and devices) and only isolates your app and its dependencies. That makes containers faster to start and easier to reproduce across machines.
<aside> 🤓 Curious Geek: Containers vs virtual machines
Want to understand why sharing the kernel makes containers lighter than VMs? The Docker docs explain the difference in What is a container?.
</aside>
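You can see kernel sharing directly. On a Linux host with Docker installed, the kernel version reported inside a container matches the host's, because the container never boots its own operating system (the `alpine` image here is just a convenient small example):

```shell
# Kernel version on the host
uname -r

# Kernel version inside an Alpine container: identical to the host's,
# because the container shares the host kernel rather than running its own OS
docker run --rm alpine uname -r
```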
In data engineering, reproducibility is critical. A pipeline that ran last month might fail today because someone upgraded pandas or the Python version changed. Containers pin the exact Python version, library versions, and system dependencies. When you run the same image in local dev, CI, and production, you get the same behavior everywhere.
<aside> 💡 Containers ensure identical Python, libraries, and dependencies in local, CI, and production environments.
</aside>
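In practice, that pinning happens in the Dockerfile. The sketch below is illustrative, not the chapter's actual file; the file names, versions, and entry point are assumptions:

```dockerfile
# Pin the exact interpreter by choosing a specific base image tag
FROM python:3.11-slim

WORKDIR /app

# Pin library versions in requirements.txt, e.g. pandas==2.2.2
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the pipeline code and define the default command
COPY . .
CMD ["python", "pipeline.py"]
```

Because every version is fixed in the image, rebuilding it next month produces the same environment as today.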
```shell
# Build an image from a Dockerfile
docker build -t weather-pipeline:1.0 .

# Run it with environment variables
docker run --rm -e API_KEY="redacted" weather-pipeline:1.0
```
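Inside the container, your pipeline code picks up that environment variable with the standard library. A minimal sketch (the function name is illustrative):

```python
import os

def get_api_key() -> str:
    # Read the key injected via `docker run -e API_KEY=...`
    key = os.environ.get("API_KEY")
    if not key:
        raise RuntimeError(
            "API_KEY is not set; pass it with: docker run -e API_KEY=... weather-pipeline:1.0"
        )
    return key
```

Failing fast with a clear message when the variable is missing saves you from a pipeline that silently runs with no credentials.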
Continuous Integration (CI) runs automated checks on every push. Continuous Deployment (CD) moves verified builds into production. For data pipelines, CI/CD typically includes:
- linting and running unit tests on every push,
- building the Docker image to prove it still builds,
- deploying or scheduling the verified image.
In a web app, broken CI usually means a failed deployment. In a data pipeline, it can mean corrupt or missing data that goes unnoticed and affects decisions.
If you are unsure what to automate first, you can use an LLM to draft a CI checklist.
<aside> 💡 Using AI to help: Ask an LLM to suggest a minimal CI pipeline for a Python data project. (⚠️ Ensure no PII or sensitive company data is included!)
</aside>
Use the suggestion as a starting point, then adapt it to your pipeline.
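As a concrete starting point, such a checklist might translate into a workflow like the sketch below (GitHub Actions syntax; the file path, tool choices, and step names are assumptions, not the chapter's actual setup):

```yaml
# .github/workflows/ci.yml -- minimal CI sketch for a Python data project
name: ci
on: [push]

jobs:
  checks:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt
      - run: ruff check .                           # lint
      - run: pytest                                 # unit tests
      - run: docker build -t weather-pipeline:ci .  # prove the image builds
```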
<aside>
📘 Core program connection: You already covered Docker basics in the Core program: what containers are, Dockerfiles, and docker run. In the Docker Fundamentals chapter later this week we go deeper on what matters for data pipelines specifically: lean base images, layer caching for fast CI builds, and tagging images for a registry. Refresh here: Continuous integration
</aside>
The key difference in data engineering: a broken container means a missed pipeline run, not just a failed page load. Reproducibility is non-negotiable.
A typical delivery flow looks like this:
1. Write and commit your pipeline code.
2. Push it; CI runs lint and tests automatically.
3. CI builds a Docker image from your Dockerfile.
4. The image is pushed to a registry.
5. The image is deployed and scheduled to run in production.
<aside> ⌨️ Hands on: Write down the five steps above and map them to your Week 3 ingestion pipeline. Which steps are new for you?
</aside>
That mapping makes it clear where containers and CI/CD add value.
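Sketched as commands, the build-and-ship part of that flow might look like this (the registry URL and tags are placeholders):

```shell
# Build the image after tests pass
docker build -t weather-pipeline:1.1 .

# Tag and push it to a registry (placeholder registry URL)
docker tag weather-pipeline:1.1 myregistry.azurecr.io/weather-pipeline:1.1
docker push myregistry.azurecr.io/weather-pipeline:1.1

# Run the pushed image anywhere that can reach the registry
docker run --rm myregistry.azurecr.io/weather-pipeline:1.1
```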
<aside> 🤓 Curious Geek: Docker wasn't the first
Docker popularized containers, but the idea existed earlier. LXC (Linux Containers) and Solaris Zones provided process and filesystem isolation before Docker. Docker made the experience simple and built the image-and-registry workflow we use today. On Linux, the underlying mechanism is namespaces (isolation) and cgroups (resource limits). For a deeper dive, see Red Hat: What are Linux namespaces and cgroups?.
</aside>
The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0

Found a mistake or have a suggestion? Let us know in the feedback form.