Week 2 - Structuring Pipelines

Week 2 Glossary

A single place to look up every term you meet this week. Entries are grouped by the chapter where the term first appears, so the glossary mirrors your reading order. Each entry has a stable ### term header so chapters can link straight to it on first use of the term.

Ch1: Introduction to Data Pipelines

Data pipeline

A process that moves data from a source (file, API, database) to a destination, applying transformations along the way. The unit of work for a data engineer. Reliability matters more than cleverness: a simple pipeline that runs every night beats a complex one that breaks on Wednesdays.

Script vs system

A script is a one-off file you run by hand; a system is code that runs unattended on infrastructure you do not own, with config, logging, error handling, and tests. Week 2 is about turning your Week 1 scripts into systems.

Refactor

Restructuring existing code without changing its behaviour. The goal is to make the code easier to understand, test, or extend, not to add features.

Resource group

An Azure container that groups related cloud resources (databases, storage accounts, virtual machines) so you can manage them as one unit. You meet it briefly in Ch1 and use it for real later in the track.

Ch2: Configuration & Secrets

Configuration

The settings your pipeline needs that change between environments: input paths, output paths, batch sizes, feature flags. Anything that is not "the logic" itself.

Secret

A credential your pipeline needs to access another system: API key, database password, OAuth token. Secrets must never be committed to git.

Environment variable

A key-value setting that lives in the operating system process, outside your code. Python reads them with os.environ.get("NAME").
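
A minimal sketch of reading an environment variable with a fallback. The variable names here are hypothetical and chosen only for illustration; the value is set inside the script purely so the example runs on its own.

```python
import os

# PIPELINE_BATCH_SIZE is a hypothetical name used only for this example.
# Normally the shell or a .env loader sets it, not your own code.
os.environ["PIPELINE_BATCH_SIZE"] = "500"

# Environment variables are always strings, so convert explicitly.
batch_size = int(os.environ.get("PIPELINE_BATCH_SIZE", "100"))

# .get() returns None (or the default you pass) when the variable is not set.
missing = os.environ.get("PIPELINE_DOES_NOT_EXIST")
```

Passing a default keeps local runs working, while the explicit int(...) makes a malformed value fail early instead of deep inside the pipeline.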

.env file

A plain-text file in your project root that holds local environment-variable values. Loaded into the process by the python-dotenv library. Always added to .gitignore.

.env.example

A committed template that lists the variable names (with placeholder values) so a teammate cloning the repo knows what to set in their own .env.

config.py

A single Python module that loads every environment variable once and exposes them as named imports. Stops os.environ.get(...) from being scattered through your codebase.

12-Factor App

A 2011 set of principles for building deployable services. Factor III ("Store config in the environment") is where the .env pattern comes from.

Ch3: Separation of Concerns

Separation of concerns

A design principle that says each module/function should have one reason to change. In data pipelines: I/O lives at the edges, business logic lives in the middle.

God function

A single function that reads, transforms, and writes data all at once. The anti-pattern this chapter exists to dismantle. Untestable, unreusable, fragile.

Humble Object

A design pattern where the parts of your code that touch the outside world (files, APIs, databases) are kept as thin as possible: they delegate all decisions to pure functions.
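
A small sketch of the idea, assuming a CSV source with hypothetical name and price columns. io.StringIO stands in for real files so the example is self-contained.

```python
import csv
import io


def keep_positive_prices(rows):
    """Pure decision logic: no files, no network."""
    return [r for r in rows if float(r["price"]) > 0]


def run(source, sink):
    """Humble shell: read, delegate, write. No decisions are made here."""
    rows = list(csv.DictReader(source))
    kept = keep_positive_prices(rows)
    writer = csv.DictWriter(sink, fieldnames=["name", "price"])
    writer.writeheader()
    writer.writerows(kept)


# In-memory stand-ins for an input file and an output file.
source = io.StringIO("name,price\napple,1.20\nrefund,-5\n")
sink = io.StringIO()
run(source, sink)
```

All the logic worth testing sits in keep_positive_prices, which needs nothing but a list of dicts; run is thin enough that one end-to-end check covers it.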

Pure function

A function whose output depends only on its inputs, with no side effects. Same input → same output, every time. The easiest kind of code to test.

Side effect

Any change a function makes outside of returning a value: writing a file, mutating a list, printing to stdout, hitting a network. Pure functions have none.
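
To make the contrast with a pure function concrete, here is a hedged sketch. Both function names and the audit_log list are invented for the example.

```python
def add_tax(price: float, rate: float) -> float:
    """Pure: the result depends only on the arguments."""
    return round(price * (1 + rate), 2)


audit_log = []


def add_tax_and_log(price: float, rate: float) -> float:
    """Impure: appends to a list defined outside the function (a side effect)."""
    total = round(price * (1 + rate), 2)
    audit_log.append(f"taxed {price} at {rate}")
    return total


first = add_tax(100.0, 0.21)
second = add_tax(100.0, 0.21)   # same inputs, same output, nothing else changes
add_tax_and_log(100.0, 0.21)    # also grows audit_log as a hidden extra
```

Testing add_tax needs only an assertion on the return value; testing add_tax_and_log also means inspecting and resetting audit_log between tests.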

Dependency injection

Instead of letting a function fetch its own dependencies (data, connections, config), pass them in as arguments. Makes the function testable in isolation.
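
A minimal sketch of the pattern, with hypothetical names: load_prices does not know where its rows come from, so a test can hand it a fake reader instead of a real file or database.

```python
def load_prices(read_rows):
    """read_rows is injected: any zero-argument callable that returns row dicts."""
    return [float(row["price"]) for row in read_rows()]


def fake_reader():
    """Test double standing in for a real file or database reader."""
    return [{"price": "9.99"}, {"price": "20"}]


prices = load_prices(fake_reader)
```

In production the same function would receive a reader that opens the real source; the function body does not change.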

I/O (Input/Output)

Any operation that reads from or writes to the outside world: disk, network, database, terminal. Slow, unreliable, hard to test, which is why you isolate it.

ETL

Extract, Transform, Load. Read data from a source, transform it in memory, write to a target. The classic pipeline shape.

ELT

Extract, Load, Transform. Load raw data into a warehouse first, transform there using SQL. The modern cloud-warehouse shape (Snowflake, BigQuery, Synapse). You meet ELT properly in Week 10.

Ch4: OOP vs Functional Programming

OOP (Object-Oriented Programming)

A paradigm built around classes that bundle state and behaviour. Use it when you need to remember things between calls or share configuration across methods.

Functional programming

A paradigm built around pure functions composed into pipelines. Use it for transformations: data in, data out, no memory.

State

Any data that persists between method calls: open connections, accumulated errors, counters. The signal that you might want a class.

Multi-paradigm language

A language (like Python or JavaScript) that supports several paradigms (OOP, functional, procedural) so you can pick the right tool per situation. By contrast, Java is primarily object-oriented and Haskell is primarily functional.

Decorator

A function that wraps another function or class to modify its behaviour. Written as @decorator_name on the line directly above. @dataclass is the first decorator most people meet in Python.
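
As an illustration, here is a hand-written decorator. The names count_calls and clean are made up for this sketch.

```python
import functools


def count_calls(func):
    """Wrap func so that every call is counted on the wrapper itself."""
    @functools.wraps(func)  # keep the original name and docstring
    def wrapper(*args, **kwargs):
        wrapper.calls += 1
        return func(*args, **kwargs)

    wrapper.calls = 0
    return wrapper


@count_calls
def clean(name: str) -> str:
    return name.strip().lower()


result = clean("  Ada ")
```

Writing @count_calls above the function is shorthand for clean = count_calls(clean).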

@classmethod

A method that receives the class itself (cls) instead of an instance (self). Used for factory methods that build instances from alternative inputs (e.g. from_env, from_dict).

@staticmethod

A method that lives on a class but accesses neither self nor cls. A namespacing convenience for utility functions related to the class.

Factory method

A method that creates and returns an instance of its class, often from a different input format than the default constructor accepts.
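
The sketch below shows a @classmethod factory next to a @staticmethod helper. DbConfig and its fields are hypothetical.

```python
class DbConfig:
    def __init__(self, host: str, port: int):
        self.host = host
        self.port = port

    @classmethod
    def from_dict(cls, data: dict) -> "DbConfig":
        """Factory method: build an instance from a plain dict."""
        return cls(host=data["host"], port=int(data["port"]))

    @staticmethod
    def is_valid_port(port: int) -> bool:
        """Utility that needs neither self nor cls."""
        return 0 < port < 65536


# The dict might come from parsed JSON or environment variables.
config = DbConfig.from_dict({"host": "localhost", "port": "5432"})
```

Because from_dict calls cls(...) rather than DbConfig(...), a subclass that inherits it gets instances of the subclass.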

Ch5: Dataclasses for Data Objects

Dataclass

A class decorated with @dataclass that auto-generates __init__, __repr__, and equality methods from its field declarations. Cuts the boilerplate out of "classes that just hold data."
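
A short example of what the decorator generates, using a hypothetical Product class.

```python
from dataclasses import dataclass


@dataclass
class Product:
    name: str
    price: float


a = Product("apple", 1.2)   # __init__ was generated from the two fields
b = Product("apple", 1.2)

same = a == b               # generated __eq__ compares field by field
text = repr(a)              # generated __repr__ lists the fields
```

Without @dataclass, two instances of a plain class with the same attribute values would not compare equal unless you wrote __eq__ yourself.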

__post_init__

A method that runs automatically right after the dataclass-generated __init__ finishes. The standard place to put validation rules ("price must be positive").
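
A sketch of the "price must be positive" rule expressed in __post_init__. The Product class is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Product:
    name: str
    price: float

    def __post_init__(self):
        # Runs right after the generated __init__ has set the fields.
        if self.price <= 0:
            raise ValueError("price must be positive")


ok = Product("apple", 1.2)

try:
    Product("broken", -1.0)
    rejected = False
except ValueError:
    rejected = True
```

Validating at construction time means an invalid object never exists, so downstream functions do not need to re-check.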

field(default_factory=...)

The dataclass-safe way to give a field a mutable default (list, dict, set). Calling field(default_factory=list) creates a fresh list for each instance, avoiding the shared-mutable-default trap.
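
The following sketch shows that each instance gets its own list. Batch is a made-up class for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Batch:
    # A bare `rows: list = []` is rejected by dataclasses with a ValueError.
    rows: list = field(default_factory=list)


first = Batch()
second = Batch()
first.rows.append({"id": 1})  # only affects `first`
```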

Encapsulation

Bundling data and the methods that operate on that data into a single unit (a class). Lets you keep behaviour close to the data it depends on.

Serialization

Converting an in-memory object into a portable format (dict, JSON, CSV row) so it can leave the process: written to disk, sent over a network, stored in a database.

asdict()

The dataclasses.asdict helper that recursively converts a dataclass instance into a plain dict. The standard exit point from "rich object" land back to "portable data" land.
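
A brief sketch of the recursive conversion, with two hypothetical classes, followed by a JSON dump to show the result is portable.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Address:
    city: str


@dataclass
class Customer:
    name: str
    address: Address


customer = Customer("Ada", Address("London"))

plain = asdict(customer)     # nested dataclasses become nested dicts
payload = json.dumps(plain)  # plain dicts serialize without custom code
```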

NamedTuple

An older immutable alternative to dataclasses (typing.NamedTuple, or the original collections.namedtuple factory). Behaves like a tuple whose fields you can also access by name. Mostly superseded by dataclasses today.
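
For comparison with a dataclass, a minimal sketch using a hypothetical Point type.

```python
from typing import NamedTuple


class Point(NamedTuple):
    x: float
    y: float


p = Point(1.0, 2.0)
by_name = p.x         # access by field name
by_index = p[0]       # still a tuple, so positional access works
x, y = p              # and so does unpacking

try:
    p.x = 5.0         # fields cannot be reassigned
    mutable = True
except AttributeError:
    mutable = False
```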

Ch6: Functional Composition

Functional composition

Building a complex transformation by chaining smaller pure functions, where each function's output is the next function's input.

Composable function

A function designed to chain: it takes data in, returns new data, never mutates its input, never returns None.

Mutation

Changing the contents of an object in place (e.g. row["price"] = ..., rows.append(...)). The opposite of returning a new object. The source of most "why is the data wrong?" bugs in pipelines.

Spread pattern ({**r, "key": value})

Python syntax that creates a new dict by copying every key from r and overriding "key". The standard way to write composable transforms without mutation.
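
A sketch of a transform written with this pattern. The function name, the price_with_vat key, and the 0.2 rate are assumptions for the example.

```python
def add_vat(row: dict, rate: float = 0.2) -> dict:
    """Return a new dict; the caller's dict is left untouched."""
    return {**row, "price_with_vat": round(row["price"] * (1 + rate), 2)}


original = {"name": "apple", "price": 10.0}
updated = add_vat(original)
```

Note that the copy is shallow: nested lists or dicts inside row are still shared between the old and new dict.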

map()

A built-in that applies a function to every item in an iterable, returning an iterator. List comprehensions are the more idiomatic Python equivalent.

filter()

A built-in that keeps only the items for which a predicate returns a truthy value, returning an iterator. List comprehensions with an if clause are the more idiomatic Python equivalent.

reduce()

A functools helper that combines all items in a collection into a single result by repeatedly applying a two-argument function. Rarely used directly in Python: sum(), max(), min() cover most cases.

List comprehension

Python's built-in syntax for building a new list from an existing iterable, e.g. [r for r in rows if r["price"] > 0]. Preferred over map/filter because it reads top-to-bottom.
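
The same small task written both ways, as a side-by-side sketch on made-up sample rows.

```python
from functools import reduce

rows = [{"price": 5}, {"price": -1}, {"price": 7}]

# map / filter / reduce version
positive = filter(lambda r: r["price"] > 0, rows)
prices_fn = list(map(lambda r: r["price"], positive))
total_fn = reduce(lambda acc, p: acc + p, prices_fn, 0)

# comprehension version, with sum() replacing reduce()
prices_lc = [r["price"] for r in rows if r["price"] > 0]
total_lc = sum(prices_lc)
```

Both versions leave rows untouched and produce the same values; the comprehension just needs no lambdas and no list(...) call.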

Pipeline pattern (pipe)

A small helper that takes data and a sequence of functions, applying each in turn. Reads like a recipe: "take raw data, do A, then B, then C." toolz.pipe is the production version.
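
One way such a helper could be written, as a sketch. It follows the same pipe(data, *funcs) calling convention as toolz.pipe, but it is a hand-rolled illustration, and the two step functions are hypothetical.

```python
from functools import reduce


def pipe(data, *funcs):
    """Feed data through each function in order, left to right."""
    return reduce(lambda acc, func: func(acc), funcs, data)


def drop_negative(rows):
    return [r for r in rows if r["price"] >= 0]


def add_currency(rows):
    return [{**r, "currency": "EUR"} for r in rows]


raw = [{"price": 5}, {"price": -1}]
clean = pipe(raw, drop_negative, add_currency)
```

Each step takes the full dataset and returns a new one, which is what lets the steps be reordered or tested one at a time.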

Unix philosophy

The 1978 design principle "do one thing well; chain tools together with pipes." The mental model behind both Unix commands and functional Python.

Ch7: Testing with Pytest

pytest