Week 5 - Containers and CI/CD for Data Projects

Overview

1. Introduction to Containers and CI/CD

2. Dependency Management

3. Docker Fundamentals

4. Azure Container Registry

5. Python CI Pipeline

6. Practice

7. Assignment

8. Gotchas & Pitfalls



Dependency Management


Your pipeline is only as reliable as its dependencies. A pipeline that runs today can break tomorrow if a library updates or a teammate installs a different version. Dependency management solves that by making your Python environment reproducible.

Topics

This chapter compares requirements.txt and uv, then shows how to manage dependencies with either option. In Week 1 you learned the basics of virtual environments, package installs, and lock files. See: Python Setup

- Why dependency management matters: reproducibility across environments

- requirements.txt: creating, pinning versions, and installing with pip

- uv: pyproject.toml, uv.lock, and virtual environment management

- Choosing one approach and using it consistently across the project

- Pinning exact versions vs version ranges: trade-offs

- Separating production and development dependencies

- Common pitfalls: conflicting versions, missing transitive dependencies

- How dependency files feed into your Dockerfile

- Keeping dependencies up to date: manual review and Dependabot

Concepts

requirements.txt vs uv

A package manager decides how your project installs external libraries. In this chapter, the practical choice is between the classic pip plus requirements.txt workflow and the modern uv workflow with pyproject.toml and uv.lock.

In Week 1 you saw both approaches at a high level. Here we go deeper, because this choice matters for Docker and CI/CD as well. Both approaches are valid, but they solve reproducibility at different levels.

For this track, uv is the recommended route because uv.lock pins the full dependency tree, including the transitive dependencies that your direct packages pull in.

<aside> 💡 Pick one workflow and use it consistently across local dev, CI, and Docker. If you are starting fresh, prefer uv.

</aside>

<aside> 📘 Core program connection: In the Core program with JavaScript you used npm install, package.json, and package-lock.json to manage project dependencies. In the Data Track you solve the same problem in Python with requirements.txt or with pyproject.toml plus uv.lock. The goal is the same in both tracks: declare what your project needs, lock exact versions, and make installs reproducible across machines and CI. Refresh the Core program chapter here: https://www.notion.so/hackyourfuture/Package-managers-2b250f64ffc9800d8c76e5fec3aa8095

</aside>

Option A: requirements.txt (classic workflow)

This is the simplest and most portable approach. You list your direct packages and pin versions.

pandas==2.2.1
requests==2.31.0
pydantic==2.6.1

Then install with pip inside a virtual environment:

python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

This works well, but it does not give you the same built-in lock behavior for the full dependency tree that uv.lock provides.

Pinning your direct dependencies is a good start, but it does not pin their dependencies. You might list requests==2.31.0, but requests depends on urllib3. If urllib3 releases a breaking change, pip can pull in a newer version the next time someone runs pip install, even though your requirements.txt did not change.

# requirements.txt contains:
# requests==2.31.0
#
# urllib3 is not listed, so pip resolves a version at install time

import requests

response = requests.get("https://api.example.com/data")

Two teammates running pip install -r requirements.txt a week apart can end up with different environments. A broken CI run with no code change is often the first sign that this is happening.
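You can see the gap directly in any environment: the set of installed packages is always larger than the set you pinned. A minimal sketch (the pinned set is a stand-in for your real requirements.txt):

```python
# Compare what requirements.txt pins against what is actually installed.
# Everything installed but not pinned is a transitive dependency that
# requirements.txt does not control.
from importlib.metadata import distributions

pinned = {"pandas", "requests", "pydantic"}  # names from requirements.txt
installed = {
    d.metadata["Name"].lower() for d in distributions() if d.metadata["Name"]
}

unpinned = sorted(installed - pinned)
print(f"{len(unpinned)} installed packages are not pinned directly")
```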

<aside> ⚠️ Pinning top-level packages controls what you install directly. It does not fully control what those packages install underneath.

</aside>
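One way to tighten the requirements.txt workflow is pip freeze, which writes out every installed package, transitive ones included, at exact versions. A sketch (the output filename is just a convention):

```shell
# Snapshot the full environment, including transitive packages,
# into a fully pinned file.
python -m pip freeze > requirements-lock.txt

# Reinstalling from this file reproduces the whole tree:
# pip install -r requirements-lock.txt
wc -l requirements-lock.txt
```

The trade-off is that pip freeze mixes direct and transitive packages into one flat list, which is exactly the distinction a tool like uv keeps for you automatically.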

So which approach should you choose?

<aside> 💡 Use requirements.txt when you are working in an existing codebase that already uses it. Use uv for new projects where you control the setup, because it gives you a proper lock file with no extra workflow.

</aside>

Option B: uv with pyproject.toml and uv.lock

uv uses pyproject.toml as the source of truth and writes exact versions, including transitive dependencies, to uv.lock.

[project]
name = "weather-pipeline"
version = "0.1.0"
requires-python = ">=3.11"
dependencies = [
  "pandas==2.2.1",
  "requests==2.31.0",
]

Then sync and run:

# Install dependencies pinned in uv.lock
uv sync

# Run a command inside the managed environment
uv run python -m src.pipeline

<aside> 💡 Commit uv.lock. It is the record of the exact versions your CI and teammates should use.

</aside>

This is the main reason uv is recommended in this track: you get a faster workflow and stronger guarantees that CI, your laptop, and production install the same dependency graph.

Which package manager should you choose?

Use requirements.txt when:

- You are working in an existing codebase that already uses it
- You want the simplest, most portable setup and can accept weaker guarantees for transitive dependencies

Use uv when:

- You are starting a new project and control the setup
- You want the full dependency tree, transitive packages included, pinned in a committed lock file

For this track, the recommendation is: prefer uv for new projects, and follow the existing standard in codebases that already use requirements.txt.

Runtime vs dev dependencies

Keep production dependencies separate from development tools like linters and tests. This keeps your container images smaller and your CI runs faster.

requirements.txt approach: keep a second file, such as requirements-dev.txt, that includes the production list plus the dev tools, and install it only in development.
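A sketch of that pattern (filenames and versions are illustrative; the -r line is pip's way of including another requirements file):

```shell
# requirements-dev.txt: the production list plus dev tools.
cat > requirements-dev.txt <<'EOF'
-r requirements.txt
pytest==8.2.0
ruff==0.5.1
EOF

# Install dev tools only in development:
# pip install -r requirements-dev.txt
cat requirements-dev.txt
```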

uv approach: declare an optional dev group in pyproject.toml and install it only when you need it (for example with uv sync --extra dev):

[project.optional-dependencies]
dev = [
  "pytest==8.2.0",
  "ruff==0.5.1",
]

<aside> ⚠️ If you install dev tools in production containers, you increase image size and risk extra vulnerabilities.

</aside>

Updating dependencies safely

A simple workflow:

  1. Update your dependency file (requirements.txt or pyproject.toml).

  2. Reinstall or sync.

  3. Commit the dependency files and lock file if you use uv.

<aside> ⌨️ Hands on: Add pydantic to your dependency list, install it, and confirm your pipeline still runs.

</aside>

That small habit keeps your CI consistent with local development.
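A quick way to confirm that a newly added package, such as the pydantic from the hands-on above, is actually visible to your environment (a sketch that is safe to run either way):

```python
# Check that the newly added dependency can actually be imported.
import importlib.util

spec = importlib.util.find_spec("pydantic")
if spec:
    print("pydantic is installed")
else:
    print("pydantic is missing - run your install/sync step")
```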

What reproducible CI actually means

A reproducible CI run means the same commit installs the same dependency set every time the pipeline runs. That matters because CI is your safety check before a deploy. If CI changes underneath you because a dependency resolver picks a newer transitive package, the signal becomes noisy: a failing run may mean a dependency shifted, not that your code broke.

This does not mean requirements.txt is unusable. It means requirements.txt gives you a simpler workflow with weaker guarantees, while uv.lock gives you stronger guarantees with an extra lock file. That is why uv is recommended in this track for new projects, especially once CI and Docker are involved.

<aside> 💡 If you join an existing project that already uses requirements.txt, follow the project standard first. If you start a new project, prefer uv so your CI and local installs stay aligned more easily.

</aside>

The concept of deterministic builds applies across all languages and toolchains.

<aside> 🤓 Curious Geek: How lock files became standard

</aside>

Exercises

  1. Explain one advantage of requirements.txt and one advantage of uv.

  2. Add a dev dependency group and describe when you would install it.

  3. Identify one dependency in your Week 3 or Week 4 pipeline that should be pinned and explain why.

🧠 Knowledge Check

  1. What problem does a lock file solve for CI runs?

  2. Why is uv the recommended route in this track?

  3. Why should dev dependencies be kept out of production images?

Extra reading

<aside> 💡 If you chose uv, see the Dockerfile with uv section in the Docker Fundamentals chapter for the correct setup. The mechanics differ from requirements.txt, especially the --frozen flag.

</aside>