Week 5 - Containers and CI/CD for Data Projects

Overview

1. Introduction to Containers and CI/CD

2. Dependency Management

3. Docker Fundamentals

4. Azure Container Registry

5. Python CI Pipeline

6. Practice

7. Assignment

8. Gotchas & Pitfalls



Dependency Management


Your pipeline is only as reliable as its dependencies. A pipeline that runs today can break tomorrow if a library updates or a teammate installs a different version. Dependency management solves that by making your Python environment reproducible.

Topics

This chapter compares requirements.txt and uv, then shows how to manage dependencies with either option. In Week 1 you learned the basics of virtual environments, package installs, and lock files. See: Python Setup

- Why dependency management matters: reproducibility across environments

- requirements.txt: creating, pinning versions, and installing with pip

- uv: pyproject.toml, uv.lock, and virtual environment management

- Choosing one approach and using it consistently across the project

- Pinning exact versions vs version ranges: trade-offs

- Separating production and development dependencies

- Common pitfalls: conflicting versions, missing transitive dependencies

- How dependency files feed into your Dockerfile

- Keeping dependencies up to date: manual review and Dependabot

Concepts

requirements.txt vs uv

A package manager decides how your project installs external libraries. In this chapter, the practical choice is between the classic pip plus requirements.txt workflow and the modern uv workflow with pyproject.toml and uv.lock.

In Week 1 you saw both approaches at a high level. Here we go deeper, because this choice matters for Docker and CI/CD as well. Both approaches are valid, but they solve reproducibility at different levels.

For this track, uv is the recommended route because uv.lock pins the full dependency tree, including the transitive dependencies that your direct packages pull in.

<aside> 💡 Pick one workflow and use it consistently across local dev, CI, and Docker. If you are starting fresh, prefer uv.

</aside>

<aside> 📘 Core program connection: In the Core program with JavaScript you used npm install, package.json, and package-lock.json to manage project dependencies. In the Data Track you solve the same problem in Python with requirements.txt or with pyproject.toml plus uv.lock. The goal is the same in both tracks: declare what your project needs, lock exact versions, and make installs reproducible across machines and CI. Refresh the Core program chapter here: https://www.notion.so/hackyourfuture/Package-managers-2b250f64ffc9800d8c76e5fec3aa8095

</aside>

Option A: requirements.txt (classic workflow)

This is the simplest and most portable approach. You list your direct packages and pin versions.

pandas==2.2.1
requests==2.31.0
pydantic==2.6.1

Then install with pip inside a virtual environment:

python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

This works well, but it does not give you the same built-in lock behavior for the full dependency tree that uv.lock provides.

Pinning your direct dependencies is a good start, but it does not pin their dependencies. You might list requests==2.31.0, but requests depends on urllib3. If urllib3 releases a breaking change, pip can pull in a newer version the next time someone runs pip install, even though your requirements.txt did not change.

# requirements.txt contains:
# requests==2.31.0
#
# urllib3 is not listed, so pip resolves a version at install time

import requests

response = requests.get("https://api.example.com/data")

Two teammates running pip install -r requirements.txt a week apart can end up with different environments. A broken CI run with no code change is often the first sign that this is happening.
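You can see the gap directly in any environment: the set of installed packages is always larger than the set you pinned. A minimal sketch (the pinned set is a stand-in for your real requirements.txt):

```python
# Compare what requirements.txt pins against what is actually installed.
# Everything installed but not pinned is a transitive dependency that
# requirements.txt does not control.
from importlib.metadata import distributions

pinned = {"pandas", "requests", "pydantic"}  # names from requirements.txt
installed = {
    d.metadata["Name"].lower() for d in distributions() if d.metadata["Name"]
}

unpinned = sorted(installed - pinned)
print(f"{len(unpinned)} installed packages are not pinned directly")
```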

<aside> ⚠️ Pinning top-level packages controls what you install directly. It does not fully control what those packages install underneath.

</aside>
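One way to tighten the requirements.txt workflow is pip freeze, which writes out every installed package, transitive ones included, at exact versions. A sketch (the output filename is just a convention):

```shell
# Snapshot the full environment, including transitive packages,
# into a fully pinned file.
python -m pip freeze > requirements-lock.txt

# Reinstalling from this file reproduces the whole tree:
# pip install -r requirements-lock.txt
wc -l requirements-lock.txt
```

The trade-off is that pip freeze mixes direct and transitive packages into one flat list, which is exactly the distinction a tool like uv keeps for you automatically.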

So which approach should you choose?

<aside> 💡 Use requirements.txt when you are working in an existing codebase that already uses it. Use uv for new projects where you control the setup, because it gives you a proper lock file with no extra workflow.

</aside>

Option B: uv with pyproject.toml and uv.lock

uv uses pyproject.toml as the source of truth and writes exact versions, including transitive dependencies, to uv.lock.

[project]
name = "weather-pipeline"
version = "0.1.0"
requires-python = ">=3.11"
dependencies = [
  "pandas==2.2.1",
  "requests==2.31.0",
]

Then sync and run:

# Install dependencies pinned in uv.lock
uv sync

# Run a command inside the managed environment
uv run python -m src.pipeline

<aside> 💡 Commit uv.lock. It is the record of the exact versions your CI and teammates should use.

</aside>

This is the main reason uv is recommended in this track: you get a faster workflow and stronger guarantees that CI, your laptop, and production install the same dependency graph.

Which package manager should you choose?

Use requirements.txt when:

- You are working in an existing codebase that already uses it
- You want the simplest, most portable setup and can accept weaker guarantees for transitive dependencies

Use uv when:

- You are starting a new project and control the setup
- You want the full dependency tree, transitive packages included, pinned in a committed lock file

For this track, the recommendation is: prefer uv for new projects, and follow the existing standard in codebases that already use requirements.txt.

Runtime vs dev dependencies

Keep production dependencies separate from development tools like linters and tests. This keeps your container images smaller and your CI runs faster.

requirements.txt approach: keep a second file, such as requirements-dev.txt, that includes the production list plus the dev tools, and install it only in development.
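A sketch of that pattern (filenames and versions are illustrative; the -r line is pip's way of including another requirements file):

```shell
# requirements-dev.txt: the production list plus dev tools.
cat > requirements-dev.txt <<'EOF'
-r requirements.txt
pytest==8.2.0
ruff==0.5.1
EOF

# Install dev tools only in development:
# pip install -r requirements-dev.txt
cat requirements-dev.txt
```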

uv approach: declare an optional dev group in pyproject.toml and install it only when you need it (for example with uv sync --extra dev):

[project.optional-dependencies]
dev = [
  "pytest==8.2.0",
  "ruff==0.5.1",
]

<aside> ⚠️ If you install dev tools in production containers, you increase image size and risk extra vulnerabilities.

</aside>

Updating dependencies safely

A simple workflow:

  1. Update your dependency file (requirements.txt or pyproject.toml).

  2. Reinstall or sync.

  3. Commit the dependency files and lock file if you use uv.

<aside> ⌨️ Hands on: Add pydantic to your dependency list, install it, and confirm your pipeline still runs.

</aside>

That small habit keeps your CI consistent with local development.
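A quick way to confirm that a newly added package, such as the pydantic from the hands-on above, is actually visible to your environment (a sketch that is safe to run either way):

```python
# Check that the newly added dependency can actually be imported.
import importlib.util

spec = importlib.util.find_spec("pydantic")
if spec:
    print("pydantic is installed")
else:
    print("pydantic is missing - run your install/sync step")
```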

What reproducible CI actually means

A reproducible CI run means the same commit installs the same dependency set every time the pipeline runs. That matters because CI is your safety check before a deploy. If CI changes underneath you because a dependency resolver picks a newer transitive package, the signal becomes noisy: a failing run may mean a dependency shifted, not that your code broke.

This does not mean requirements.txt is unusable. It means requirements.txt gives you a simpler workflow with weaker guarantees, while uv.lock gives you stronger guarantees with an extra lock file. That is why uv is recommended in this track for new projects, especially once CI and Docker are involved.

<aside> 💡 If you join an existing project that already uses requirements.txt, follow the project standard first. If you start a new project, prefer uv so your CI and local installs stay aligned more easily.

</aside>

The concept of deterministic builds applies across all languages and toolchains.

<aside> 🤓 Curious Geek: How lock files became standard

</aside>

Exercises

  1. Explain one advantage of requirements.txt and one advantage of uv.

  2. Add a dev dependency group and describe when you would install it.

  3. Identify one dependency in your Week 3 or Week 4 pipeline that should be pinned and explain why.

🧠 Knowledge Check

  1. What problem does a lock file solve for CI runs?

  2. Why is uv the recommended route in this track?

  3. Why should dev dependencies be kept out of production images?

Extra reading

<aside> 💡 If you chose uv, see the Dockerfile with uv section in the Docker Fundamentals chapter for the correct setup. The mechanics differ from requirements.txt, especially the --frozen flag.

</aside>