Week 2 - Structuring Data Pipelines

Configuration & Secrets (.env)

In the previous chapter, you saw how a data pipeline moves data from sources to storage, transforming it along the way. This chapter covers how to keep your pipelines safe and flexible by handling the settings and secrets they need, such as API keys.

You'll see how to use environment variables, .env files, and a centralized config module so your credentials don't get hardcoded in your code. By the end, your pipelines will be easier to configure, safer to share, and ready to run in different environments.

Why Hardcoding is a Problem

Imagine this:

API_KEY = "123abc"
DB_PASSWORD = "secret123"

If you commit this to Git, push it to GitHub, or share your code, anyone who can read the repository can see your credentials. This is a security disaster: keys leak, databases get compromised, and production jobs break.

Hardcoding also makes it hard to switch environments (development, staging, production) because you'd have to change the code each time.

<aside> 💡 Even "temporary" hardcoded secrets tend to stay in codebases forever. Treat every secret as if it will be leaked.

</aside>

The same caution applies when you ask an LLM for help.

<aside> ⚠️ Using AI to help: Never paste your real .env contents, API keys, or database passwords into ChatGPT, Claude, or any LLM (⚠️ no PII or production credentials!). When you need help debugging config code, replace real values with API_KEY=PLACEHOLDER style fakes first.

</aside>
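One way to follow that advice is to redact values before sharing anything. The sketch below is a minimal helper, assuming simple KEY=VALUE lines; redact_env_text is a hypothetical name, not part of any library.

```python
def redact_env_text(env_text: str, placeholder: str = "PLACEHOLDER") -> str:
    """Return a copy of KEY=VALUE text with every value replaced by a placeholder."""
    redacted_lines = []
    for line in env_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            # Blank lines and comments are kept as they are.
            redacted_lines.append(line)
        elif "=" in stripped:
            # Keep the variable name, drop the value.
            key, _, _value = stripped.partition("=")
            redacted_lines.append(f"{key.strip()}={placeholder}")
        else:
            # Anything else might be part of a secret, so hide the whole line.
            redacted_lines.append(placeholder)
    return "\n".join(redacted_lines)


sample = 'API_KEY="123abc"\n# database\nDB_PASSWORD="secret123"'
print(redact_env_text(sample))
```

Comments are passed through unchanged, so still read the output yourself before pasting it anywhere.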

The solution? Separate configuration from code.

Environment Variables

Environment variables are settings that live outside your code. Your operating system keeps them, and Python can read them with the os module.

import os

api_key = os.environ.get("API_KEY")
print(api_key)
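If API_KEY is not set, the snippet above prints None rather than raising an error. The following sketch contrasts the three standard ways to read a variable; PIPELINE_DEMO_KEY is a made-up name used only for illustration.

```python
import os

# PIPELINE_DEMO_KEY is a hypothetical variable name used only for this demo.
os.environ.pop("PIPELINE_DEMO_KEY", None)  # make sure it starts out unset

# 1. .get() returns None when the variable is missing (no error).
missing = os.environ.get("PIPELINE_DEMO_KEY")

# 2. .get() with a second argument returns that default instead.
with_default = os.environ.get("PIPELINE_DEMO_KEY", "fallback")

# 3. Square brackets raise KeyError when the variable is missing.
try:
    os.environ["PIPELINE_DEMO_KEY"]
    raised = False
except KeyError:
    raised = True

# Once the variable is set, every form returns the same string.
os.environ["PIPELINE_DEMO_KEY"] = "123abc"
found = os.environ.get("PIPELINE_DEMO_KEY")

print(missing, with_default, raised, found)
```

Which form to use is a design choice: a default suits optional settings, while a required secret is usually better off failing early.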

The .env Pattern

.env files are simple text files that hold your secrets locally.

Example .env file:

API_KEY="123abc"
DB_PASSWORD="secret123"

To load these variables into Python, use the python-dotenv library.

# install with: pip install python-dotenv
from dotenv import load_dotenv
import os

load_dotenv()  # reads .env file and loads variables

api_key = os.environ.get("API_KEY")
db_password = os.environ.get("DB_PASSWORD")
print(api_key, db_password)
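To build intuition for what loading a .env file involves, here is a simplified, standard-library-only sketch. It is not python-dotenv's actual implementation (the real library handles more cases, such as variable interpolation), and parse_env_text, load_into_environ, and the DEMO_ names are hypothetical. It mirrors one behavior worth knowing: by default, load_dotenv() does not overwrite variables that are already set, unless you pass override=True.

```python
import os


def parse_env_text(env_text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in env_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace, then one pair of matching quotes.
        value = value.strip()
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        values[key.strip()] = value
    return values


def load_into_environ(values: dict[str, str], override: bool = False) -> None:
    """Copy values into os.environ; existing variables win unless override=True."""
    for key, value in values.items():
        if override or key not in os.environ:
            os.environ[key] = value


# DEMO_API_KEY and DEMO_DB_PASSWORD are hypothetical names for this demo.
os.environ["DEMO_API_KEY"] = "from-shell"
os.environ.pop("DEMO_DB_PASSWORD", None)

env_text = 'DEMO_API_KEY="123abc"\nDEMO_DB_PASSWORD="secret123"\n'
load_into_environ(parse_env_text(env_text))

print(os.environ["DEMO_API_KEY"])      # the value set before loading is kept
print(os.environ["DEMO_DB_PASSWORD"])  # the value filled in from the text
```

In a real project, keep using load_dotenv(); the sketch only illustrates the idea.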

<aside> 🎬 Terminal Tutorial: Setting Up .env with python-dotenv

</aside>

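Advantages of .env: your secrets stay out of the source code, and you can change settings per environment without editing the code. Add .env to your .gitignore so the file is never committed.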

Centralizing Settings with config.py

Once you have environment variables, you might find yourself calling os.environ.get(...) everywhere. That can get messy. Instead, create a single config.py module. Save the following as config.py in your project root:

# config.py
import os
from dotenv import load_dotenv

load_dotenv()

API_KEY = os.environ.get("API_KEY")
DB_PASSWORD = os.environ.get("DB_PASSWORD")
DB_URL = os.environ.get("DB_URL", "sqlite:///default.db")

Now, from any other file in the same project (with config.py and .env present alongside it):

<!-- runner:expect-fail -->

from config import API_KEY, DB_URL

print("Connecting to database at", DB_URL)

This makes your code cleaner and more maintainable.
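One thing to keep in mind as config.py grows: every environment variable arrives as a string, so numbers and flags need explicit conversion. Note that bool("false") is True in Python, which is why the sketch below compares text instead. BATCH_SIZE, DEBUG, env_int, and env_bool are hypothetical names for illustration, not part of the config.py shown here.

```python
import os

# BATCH_SIZE and DEBUG are hypothetical settings, set here only for the demo.
os.environ["BATCH_SIZE"] = "500"
os.environ["DEBUG"] = "true"


def env_int(name: str, default: int) -> int:
    """Read an integer setting; a non-numeric value raises ValueError."""
    raw = os.environ.get(name)
    return default if raw is None else int(raw)


def env_bool(name: str, default: bool = False) -> bool:
    """Read a boolean setting; only a few spellings count as true."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


BATCH_SIZE = env_int("BATCH_SIZE", default=100)
DEBUG = env_bool("DEBUG")

print(BATCH_SIZE + 1)  # arithmetic works because BATCH_SIZE is an int
print(DEBUG)
```

Doing the conversion once in config.py means the rest of the pipeline can import values that already have the right type.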

<aside> 💡 Beyond .env: production secrets live in a vault. A .env file is the right pattern for local development. In production, you store secrets in a managed vault like Azure Key Vault, not on disk. The vault enforces access policies, rotation, and audit logging. Week 12 covers the migration from .env to Key Vault: same os.environ.get(...) interface, different source of truth.

</aside>

For now, focus on the local pattern. The first habit is making missing values fail loudly.

<aside> ⌨️ Hands on: Implement a get_config(var_name) function that reads an environment variable and raises a ValueError if it's not set. Never let None slip through silently!

🚀 Try it in the widget: https://lasse.be/simple-hyf-teach-widget/?week=2&chapter=config_secrets&exercise=w2_config_secrets__safe_config&lang=python

</aside>

The pattern itself isn't new: it has a name and a manifesto.

<aside> 🤓 Curious Geek: The 12-Factor App

The idea of separating config from code comes from the 12-Factor App methodology, a set of principles written by Heroku engineers in 2011.

Factor III states: "Store config in the environment." This means every setting that changes between environments (dev, staging, prod) should be an environment variable, never hardcoded.

Most modern deployment tools (Docker, Kubernetes, Heroku, Railway) follow this pattern natively!

</aside>

With the pattern in hand, the next step is to apply it.

<aside> 📝 Practice: Apply this pattern in Practice Exercise 1: Move Secrets to .env.

</aside>

Extra reading

<aside> 💡 In the wild: The python-dotenv library you just learned about is used by thousands of Python projects. Open its README to see the same patterns: load_dotenv(), os.environ, and .env.example files.

</aside>


Next up: Separation of Concerns, where you split the "god function" into a thin I/O layer and pure logic functions, so your business rules can be tested without touching the filesystem.


The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0 *https://hackyourfuture.net/*
