Week 2 - Structuring Pipelines
A single place to look up every term you meet this week. Entries are grouped by the chapter where the term first appears, so the glossary mirrors your reading order. Each entry has a stable ### term header so chapters can link straight to it on first use of the term.
A process that moves data from a source (file, API, database) to a destination, applying transformations along the way. The unit of work for a data engineer. Reliability matters more than cleverness: a simple pipeline that runs every night beats a complex one that breaks on Wednesdays.
A script is a one-off file you run by hand; a system is code that runs unattended on infrastructure you do not own, with config, logging, error handling, and tests. Week 2 is about turning your Week 1 scripts into systems.
Restructuring existing code without changing its behaviour. The goal is to make the code easier to understand, test, or extend, not to add features.
A logical container in Azure that groups related cloud resources (databases, storage accounts, virtual machines) so you can manage them as one unit. You meet it briefly in Ch1 and use it for real later in the track.
The settings your pipeline needs that change between environments: input paths, output paths, batch sizes, feature flags. Anything that is not "the logic" itself.
A credential your pipeline needs to access another system: API key, database password, OAuth token. Secrets must never be committed to git.
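A key-value setting that lives in the environment of the running process, outside your code. It is usually set by the shell or operating system that starts the process. Python reads one with os.environ.get("NAME"), which returns None if the variable is not set.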
A plain-text file in your project root that holds local environment-variable values. Loaded into the process by the python-dotenv library. Always added to .gitignore.
A committed template that lists the variable names (with placeholder values) so a teammate cloning the repo knows what to set in their own .env.
A single Python module that loads every environment variable once and exposes them as named imports. Stops os.environ.get(...) from being scattered through your codebase.
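A minimal sketch of such a config module, using only the standard library. The variable names (INPUT_PATH, BATCH_SIZE, API_KEY), the defaults, and the load_settings helper are hypothetical and not taken from the course material. In a real project you would typically call python-dotenv's load_dotenv() once before reading os.environ, so that values from .env are present; that step is left out here.

```python
# config.py (hypothetical module name): read every setting in one place.
import os


def load_settings(environ=os.environ):
    """Return the pipeline settings as a plain dict.

    `environ` defaults to the real process environment, but an ordinary
    dict can be passed in instead, which keeps the function easy to test.
    """
    api_key = environ.get("API_KEY")  # a secret: deliberately no default
    if api_key is None:
        raise RuntimeError("API_KEY is not set; copy .env.example to .env")
    return {
        "input_path": environ.get("INPUT_PATH", "data/input.csv"),
        # Environment values are always strings, so convert explicitly.
        "batch_size": int(environ.get("BATCH_SIZE", "500")),
        "api_key": api_key,
    }
```

The rest of the codebase would then import from this one module rather than calling os.environ.get(...) itself.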
A 2011 set of principles for building deployable services. Factor III ("Store config in the environment") is where the .env pattern comes from.
A design principle that says each module/function should have one reason to change. In data pipelines: I/O lives at the edges, business logic lives in the middle.
A single function that reads, transforms, and writes data all at once. The anti-pattern this chapter exists to dismantle. Untestable, unreusable, fragile.
A design pattern where the parts of your code that touch the outside world (files, APIs, databases) are kept as thin as possible: they delegate all decisions to pure functions.
A function whose output depends only on its inputs, with no side effects. Same input → same output, every time. The easiest kind of code to test.
Any change a function makes outside of returning a value: writing a file, mutating a list, printing to stdout, hitting a network. Pure functions have none.
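To make the contrast concrete, here is a small sketch with hypothetical function names. The first function has a side effect (it rewrites a module-level variable), so calling it twice with the same input gives two different answers. The second is pure.

```python
running_total = 0  # module-level state, present only to show a side effect


def add_to_total(price):
    """Impure: rewrites a global, so the result depends on call history."""
    global running_total
    running_total += price
    return running_total


def total(prices):
    """Pure: the result depends only on the argument."""
    return sum(prices)


first = add_to_total(5)    # 5
second = add_to_total(5)   # 10: same input, different output
stable = total([5, 5])     # 10, however many times you call it
```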
Instead of letting a function fetch its own dependencies (data, connections, config), pass them in as arguments. Makes the function testable in isolation.
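A minimal sketch of the idea, with a hypothetical invoice record. The first version fetches the current date itself, so its result changes with the calendar. The second receives the date as an argument, so a test can pin it.

```python
from datetime import date


def is_overdue_hardwired(invoice):
    """Fetches its own dependency (today's date): hard to test reliably."""
    return invoice["due"] < date.today()


def is_overdue(invoice, today):
    """Dependency passed in: the caller decides what 'today' is."""
    return invoice["due"] < today


invoice = {"id": 7, "due": date(2024, 1, 31)}
late = is_overdue(invoice, today=date(2024, 2, 1))
on_time = is_overdue(invoice, today=date(2024, 1, 31))
```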
Any operation that reads from or writes to the outside world: disk, network, database, terminal. Slow, unreliable, hard to test, which is why you isolate it.
Extract, Transform, Load. Read data from a source, transform it in memory, write to a target. The classic pipeline shape.
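The ETL shape can be sketched as three small functions, with I/O kept at the two edges and a pure transform in the middle. The column names (sku, price) and the filtering rule are assumptions made for illustration. In-memory io.StringIO objects stand in for real files.

```python
import csv
import io


def extract(source):
    """I/O edge: read CSV rows from any file-like object."""
    return list(csv.DictReader(source))


def transform(rows):
    """Pure core: keep rows with a positive price and convert it to float."""
    return [
        {**r, "price": float(r["price"])}
        for r in rows
        if float(r["price"]) > 0
    ]


def load(rows, target):
    """I/O edge: write rows as CSV to any file-like object."""
    writer = csv.DictWriter(target, fieldnames=["sku", "price"])
    writer.writeheader()
    writer.writerows(rows)


source = io.StringIO("sku,price\nA1,9.50\nB2,-1\n")
target = io.StringIO()
load(transform(extract(source)), target)
```

Because transform never touches a file, it can be tested with a plain list of dicts.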
Extract, Load, Transform. Load raw data into a warehouse first, transform there using SQL. The modern cloud-warehouse shape (Snowflake, BigQuery, Synapse). You meet ELT properly in Week 10.
A paradigm built around classes that bundle state and behaviour. Use it when you need to remember things between calls or share configuration across methods.
A paradigm built around pure functions composed into pipelines. Use it for transformations: data in, data out, no memory.
Any data that persists between method calls: open connections, accumulated errors, counters. The signal that you might want a class.
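A small sketch of state that justifies a class: a hypothetical RowValidator that shares one piece of configuration across calls and accumulates error messages as it goes.

```python
class RowValidator:
    """Remembers rejected rows between calls, which is why it is a class."""

    def __init__(self, required_fields):
        self.required_fields = required_fields  # configuration shared by calls
        self.errors = []                        # state that accumulates

    def check(self, row):
        missing = [f for f in self.required_fields if f not in row]
        if missing:
            self.errors.append(f"row {row.get('id')}: missing {missing}")
            return False
        return True


validator = RowValidator(["id", "price"])
results = [validator.check(r) for r in [{"id": 1, "price": 5}, {"id": 2}]]
```

After the loop, validator.errors still holds the message for the second row, which a stateless function could not do without returning it each time.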
A language (like Python or JavaScript) that supports several paradigms (OOP, functional, procedural) so you can pick the right tool per situation. Java is mostly OOP-only; Haskell is mostly functional-only.
A function that wraps another function or class to modify its behaviour. Written as @decorator_name on the line directly above the def or class statement it applies to. @dataclass is often the first decorator people meet in Python.
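A minimal sketch of writing and applying a decorator. The names count_calls and clean_name are hypothetical. functools.wraps is the standard-library helper that copies the wrapped function's name and docstring onto the wrapper.

```python
import functools


def count_calls(func):
    """Wrap `func` so that every call increments a counter on the wrapper."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        wrapper.calls += 1
        return func(*args, **kwargs)
    wrapper.calls = 0
    return wrapper


@count_calls          # same as: clean_name = count_calls(clean_name)
def clean_name(name):
    return name.strip().title()


cleaned = [clean_name("  ada lovelace "), clean_name("alan turing")]
```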
A method that receives the class itself (cls) instead of an instance (self). Used for factory methods that build instances from alternative inputs (e.g. from_env, from_dict).
A method that lives on a class but accesses neither self nor cls. A namespacing convenience for utility functions related to the class.
A method that creates and returns an instance of its class, often from a different input format than the default constructor accepts.
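The three ideas above (class method, static method, factory method) can be seen together in one short sketch. DbConfig, its fields, and the DB_HOST / DB_PORT keys are hypothetical.

```python
class DbConfig:
    def __init__(self, host, port):
        self.host = host
        self.port = port

    @classmethod
    def from_dict(cls, data):
        """Factory method: build an instance from a dict of strings."""
        return cls(host=data["DB_HOST"], port=cls.parse_port(data["DB_PORT"]))

    @staticmethod
    def parse_port(value):
        """Utility that needs neither self nor cls."""
        port = int(value)
        if not 0 < port < 65536:
            raise ValueError(f"port out of range: {port}")
        return port


config = DbConfig.from_dict({"DB_HOST": "localhost", "DB_PORT": "5432"})
```

Because from_dict calls cls(...) rather than DbConfig(...), a subclass that inherits it gets instances of the subclass.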
A class decorated with @dataclass that auto-generates __init__, __repr__, and equality methods from its field declarations. Cuts the boilerplate out of "classes that just hold data."
A method that runs automatically right after the dataclass-generated __init__ finishes. The standard place to put validation rules ("price must be positive").
The dataclass-safe way to give a field a mutable default (list, dict, set). Calling field(default_factory=list) creates a fresh list for each instance, avoiding the shared-mutable-default trap.
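A short sketch that combines the last three entries: a dataclass, validation in __post_init__, and a mutable default through field(default_factory=list). The Product class and its fields are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Product:
    sku: str
    price: float
    tags: list = field(default_factory=list)  # a fresh list per instance

    def __post_init__(self):
        # Runs right after the generated __init__ has set the fields.
        if self.price <= 0:
            raise ValueError(f"price must be positive, got {self.price}")


a = Product("A1", 9.5)
b = Product("B2", 3.0)
a.tags.append("sale")   # does not affect b.tags
```

Writing tags: list = [] instead is rejected by @dataclass with a ValueError at class-definition time, which is how the decorator guards against the shared-default trap.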
Bundling data and the methods that operate on that data into a single unit (a class). Lets you keep behaviour close to the data it depends on.
Converting an in-memory object into a portable format (dict, JSON, CSV row) so it can leave the process: written to disk, sent over a network, stored in a database.
The dataclasses.asdict helper that recursively converts a dataclass instance into a plain dict. The standard exit point from "rich object" land back to "portable data" land.
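As an illustration, with hypothetical Customer and Address classes, asdict turns a nested dataclass into nested plain dicts, which json.dumps can then serialize.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Address:
    city: str


@dataclass
class Customer:
    name: str
    address: Address


customer = Customer("Ada", Address("London"))
as_plain_dict = asdict(customer)      # nested dataclass becomes a nested dict
as_json = json.dumps(as_plain_dict)   # now it can leave the process
```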
An older immutable alternative to dataclasses (collections.namedtuple). Behaves like a tuple whose fields you can also read by name (p.x as well as p[0]). Mostly superseded by dataclasses today.
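A brief sketch of how a named tuple behaves, using a hypothetical Point type: fields can be read by name or by position, and "changing" a field means building a new tuple.

```python
from collections import namedtuple

Point = namedtuple("Point", ["x", "y"])
p = Point(x=1, y=2)

by_name = p.x              # attribute access
by_index = p[1]            # it is still a tuple
moved = p._replace(x=10)   # returns a new Point; p itself is unchanged
```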
Building a complex transformation by chaining smaller pure functions, where each function's output is the next function's input.
A function designed to chain: it takes data in, returns new data, never mutates its input, never returns None.
Changing the contents of an object in place (e.g. row["price"] = ..., rows.append(...)). The opposite of returning a new object. The source of most "why is the data wrong?" bugs in pipelines.
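Python syntax, written {**r, "key": value}, that creates a new dict by copying every key-value pair from r and overriding "key". The standard way to write composable transforms without mutation. The copy is shallow: nested lists or dicts are still shared with r.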
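A short sketch of the dict-unpacking form {**r, "key": value} on a hypothetical row. It shows that the original dict is left alone, and also that the copy is shallow, so nested objects are shared between the two dicts.

```python
r = {"sku": "A1", "price": 20.0, "tags": ["new"]}

# New dict: every pair from r, with "price" overridden.
updated = {**r, "price": 18.0}

original_price = r["price"]                 # r was not mutated
shares_tags = updated["tags"] is r["tags"]  # shallow copy: same list object
```

If a transform needs to change a nested value, it has to copy that nested value as well.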
A built-in that applies a function to every item in an iterable, returning an iterator. List comprehensions are the more idiomatic Python equivalent.
A built-in that keeps only the items for which a predicate returns a truthy value, returning an iterator. List comprehensions with an if clause are the more idiomatic Python equivalent.
A functools helper that combines all items in a collection into a single result by repeatedly applying a two-argument function. Rarely used directly in Python: sum(), max(), min() cover most cases.
Python's built-in syntax for building a new list from an existing iterable, e.g. [r for r in rows if r["price"] > 0]. Preferred over map/filter because it reads top-to-bottom.
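The four entries above can be compared side by side. This sketch, on made-up rows, computes the same list with map and filter and with a list comprehension, then the same total with functools.reduce and with sum().

```python
from functools import reduce

rows = [{"price": 5}, {"price": -1}, {"price": 7}]

# map and filter return lazy iterators, so wrap them in list() to get values.
positive = filter(lambda r: r["price"] > 0, rows)
prices_fp = list(map(lambda r: r["price"], positive))

# The same result as a list comprehension.
prices_lc = [r["price"] for r in rows if r["price"] > 0]

# reduce folds the list into one value; sum() is the idiomatic shortcut.
total_reduce = reduce(lambda acc, p: acc + p, prices_lc, 0)
total_sum = sum(prices_lc)
```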
A small helper that takes data and a sequence of functions, applying each in turn. Reads like a recipe: "take raw data, do A, then B, then C." toolz.pipe is the production version.
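A minimal sketch of such a helper, chaining three composable functions. The helper takes its arguments in the same order as toolz.pipe (data first, then the functions); the step functions drop_blank, strip_all, and title_case are hypothetical.

```python
def pipe(data, *funcs):
    """Apply each function in turn, feeding each result into the next."""
    for func in funcs:
        data = func(data)
    return data


def drop_blank(names):
    return [n for n in names if n.strip()]


def strip_all(names):
    return [n.strip() for n in names]


def title_case(names):
    return [n.title() for n in names]


raw = ["  ada ", "", "grace hopper"]
clean = pipe(raw, drop_blank, strip_all, title_case)
```

Each step returns a new list and never mutates its input, so raw is unchanged after the call and the steps can be reordered or tested one at a time.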
The 1978 design principle "do one thing well; chain tools together with pipes." The mental model behind both Unix commands and functional Python.