1. Introduction to Containers and CI/CD
You have built ingestion and transformation pipelines that work on your machine. The next step is to make them work reliably on any machine: your teammate's laptop, a CI runner, or Azure. That is where containers and CI/CD come in.
By the end of this chapter, you should understand why containerization matters for data engineering, what CI/CD actually does for your pipeline, and how the rest of Week 5 fits together.
- Why "it works on my machine" is a real problem in data engineering
- What containers are: isolated, reproducible environments for running code
- Containers vs virtual machines: lightweight isolation vs full OS emulation
- What CI/CD means: Continuous Integration and Continuous Deployment
- How CI/CD applies to data projects: automated testing, linting, and deployment of pipelines
- The container workflow: build image, test locally, push to registry, deploy to cloud
- Overview of Docker, GitHub Actions, and Azure Container Registry
- How this week connects to later Azure deployment in Week 6
A pipeline can fail outside your laptop for small, invisible reasons: a different Python version, a missing system dependency, or a library upgrade. These issues are expensive in data engineering because a broken pipeline means missing data, late reports, and angry stakeholders.
<aside> 💡 A reproducible pipeline is not just code: it is code plus the environment that runs it.
</aside>
An image is a template: an immutable snapshot of your app and its dependencies. You build it with docker build. It stays on disk until you delete it.
A container is a running instance of that image. You start it with docker run. When it exits, the container stops, but the image remains. So you build an image once, then run it anywhere as a container.
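As a sketch, a minimal Dockerfile for a pipeline like the weather example might look like this (the file and script names, and the Python version, are illustrative, not course requirements):

```dockerfile
# Pin the exact Python version the pipeline was developed against
FROM python:3.12-slim

WORKDIR /app

# Install pinned dependencies first so Docker can cache this layer
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the pipeline code and define how the container runs it
COPY . .
CMD ["python", "pipeline.py"]
```

Building this with docker build produces the image; docker run starts a container from that image.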
Containers are lighter than virtual machines (VMs). A VM runs a full operating system; a container shares the host machine's kernel (the core of the operating system that manages CPU, memory, and devices) and only isolates your app and its dependencies. That makes containers faster to start and easier to reproduce across machines.
<aside> 🤓 Curious Geek: Containers vs virtual machines
Want to understand why sharing the kernel makes containers lighter than VMs? The Docker docs explain the difference in What is a container?.
</aside>
In data engineering, reproducibility is critical. A pipeline that ran last month might fail today because someone upgraded pandas or the Python version changed. Containers pin the exact Python version, library versions, and system dependencies. When you run the same image in local dev, CI, and production, you get the same behavior everywhere.
<aside> 💡 Containers ensure identical Python, libraries, and dependencies in local, CI, and production environments.
</aside>
```shell
# Build an image from a Dockerfile
docker build -t weather-pipeline:1.0 .

# Run it with environment variables
docker run --rm -e API_KEY="redacted" weather-pipeline:1.0
```
Continuous Integration (CI) runs automated checks on every push. Continuous Deployment (CD) moves verified builds into production. For data pipelines, CI/CD typically includes:
- Linting and formatting checks
- Unit tests for transforms and validators
- A build step that creates a container image
- A push step that publishes the image to a registry
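As an illustrative sketch of those checks in GitHub Actions (the workflow file name, tool choices such as ruff and pytest, and versions are assumptions, not course requirements; the registry push step is omitted here):

```yaml
# .github/workflows/ci.yml (illustrative)
name: CI

on: [push]

jobs:
  checks:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -r requirements.txt
      - run: ruff check .   # linting and formatting
      - run: pytest         # unit tests for transforms and validators
      - run: docker build -t weather-pipeline:${{ github.sha }} .
```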
In a web app, broken CI usually means a failed deployment. In a data pipeline, it can mean corrupt or missing data that goes unnoticed and affects decisions.
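This is why unit tests for validators earn their keep: they catch corrupt data before it ships. The validator below is a hypothetical sketch (the function name and the temp_c column are not from this course):

```python
def validate_temperatures(records):
    """Return only records with a plausible temperature reading."""
    valid = []
    for record in records:
        temp = record.get("temp_c")
        # Reject missing readings and physically implausible values
        if temp is not None and -90 <= temp <= 60:
            valid.append(record)
    return valid

# A CI test run (e.g. via pytest) would call this with known-bad input:
records = [{"temp_c": 21.5}, {"temp_c": None}, {"temp_c": 999}]
assert validate_temperatures(records) == [{"temp_c": 21.5}]
```

If a refactor accidentally stops filtering out null readings, this test fails in CI instead of silently corrupting downstream reports.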
If you are unsure what to automate first, you can use an LLM to draft a CI checklist.
<aside> 💡 Using AI to help: Ask an LLM to suggest a minimal CI pipeline for a Python data project. (⚠️ Ensure no PII or sensitive company data is included!)
</aside>
Use the suggestion as a starting point, then adapt it to your pipeline.
<aside>
📘 Core program connection: You already covered Docker basics in the Core program: what containers are, Dockerfiles, and docker run. In the Docker Fundamentals chapter later this week we go deeper on what matters for data pipelines specifically: lean base images, layer caching for fast CI builds, and tagging images for a registry. Refresh here: Continuous integration
</aside>
The key difference in data engineering: a broken container means a missed pipeline run, not just a failed page load. Reproducibility is non-negotiable.
A typical delivery flow looks like this:
1. Write or update pipeline code.
2. Run tests locally.
3. Build a container image.
4. Push the image to a registry.
5. Deploy the image to a cloud service.
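As an illustrative sketch, the last three steps might look like this on the command line. The registry name myregistry, the resource group my-rg, and the image name are hypothetical, and these commands require Docker and the Azure CLI; treat them as a shape to adapt, not something to run as-is:

```shell
# Build a container image
docker build -t weather-pipeline:1.0 .

# Tag and push the image to a registry (Azure Container Registry here)
az acr login --name myregistry
docker tag weather-pipeline:1.0 myregistry.azurecr.io/weather-pipeline:1.0
docker push myregistry.azurecr.io/weather-pipeline:1.0

# Deploy the image to a cloud service (one option: Azure Container Instances)
az container create --resource-group my-rg --name weather-pipeline \
  --image myregistry.azurecr.io/weather-pipeline:1.0
```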
<aside> ⌨️ Hands on: Write down the five steps above and map them to your Week 3 ingestion pipeline. Which steps are new for you?
</aside>
That mapping makes it clear where containers and CI/CD add value.
<aside> 🤓 Curious Geek: Docker wasn't the first
Docker popularized containers, but the idea existed earlier. LXC (Linux Containers) and Solaris Zones provided process and filesystem isolation before Docker. Docker made the experience simple and built the image-and-registry workflow we use today. On Linux, the underlying mechanism is namespaces (isolation) and cgroups (resource limits). For a deeper dive, see Red Hat: What are Linux namespaces and cgroups?.
</aside>
- Explain, in one paragraph, why a data pipeline that works locally might fail in CI.
- List three CI checks that are useful for your pipeline. For each, explain what bug it would catch.
- Sketch the delivery flow from code commit to Azure deployment for this week.
- What is the difference between a Docker image and a container?
- Why do containers reduce the "works on my machine" problem for data pipelines?
- What is the difference between CI and CD in this week's workflow?
- Which step in the container delivery flow would you automate first, and why?
- Docker Get Started: official tutorial for Docker basics
- GitHub Actions documentation: guides for building CI/CD workflows
- The Twelve-Factor App, Factor X (Dev/prod parity): the principle behind why containers matter for data pipelines
The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0

Found a mistake or have a suggestion? Let us know in the feedback form.