Week 1 - Python Foundations
Week 1 Glossary
A single place to look up every Python and data-engineering term you meet this week. Entries are grouped by the chapter where the term first appears, so the glossary mirrors your reading order. Inside each chapter, related terms sit next to each other under a small themed subsection. While reading Ch1, focus on the Ch1 section and ignore later sections until you reach the matching chapter.
Ch1: Python Setup
Data engineering work
- ETL Pipeline: Extract, Transform, Load. The repeating pattern of pulling data from a source (API, database, file), reshaping it, and writing the result to a target (warehouse, lake, table). Most of what you build in this track is an ETL pipeline of some kind.
- Orchestration: coordinating pipeline tasks, their dependencies, retries, and run history across a full workflow. The job of tools like Apache Airflow (Week 11). Plain cron only handles "when to start"; orchestration handles the rest.
- Data Validation: checking that incoming or outgoing data matches the schema, ranges, and rules you expect (no nulls in a primary key, dates inside a sane window, no rows with negative prices). Catches bad data before it pollutes the warehouse.
- Machine Learning (ML): a field of AI where computers learn patterns from data to make predictions or decisions without being explicitly programmed for every scenario. Data engineers build the pipelines that feed these models.
- MLOps: the operational discipline that keeps machine-learning models running reliably in production: training pipelines, model versioning, monitoring, and rollback. The data-engineering side of ML.
- Pipeline: the end-to-end sequence of data steps you want to run together. In Week 1 your first pipeline is "read CSV → clean → write JSON."
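The "read CSV → clean → write JSON" pipeline described above can be sketched with the standard library alone. This is a minimal illustration, not the assignment solution: the sample data, the column names (id, name, price), and the function names extract, is_valid, transform, and load are all hypothetical, and the CSV is held in a string so the example runs without any files.

```python
import csv
import io
import json

# Hypothetical sample input; a real pipeline would read from a file on disk.
RAW_CSV = """id,name,price
1,  Widget ,9.99
2,Gadget,-5.00
,Orphan,3.50
3,Gizmo,12.50
"""


def extract(text: str) -> list[dict[str, str]]:
    """Extract: parse CSV text into one dict per row."""
    return list(csv.DictReader(io.StringIO(text)))


def is_valid(row: dict[str, str]) -> bool:
    """Validate: require a non-empty id and a non-negative price."""
    return bool(row["id"].strip()) and float(row["price"]) >= 0


def transform(rows: list[dict[str, str]]) -> list[dict]:
    """Transform: drop invalid rows, trim whitespace, convert types."""
    return [
        {"id": int(row["id"]), "name": row["name"].strip(), "price": float(row["price"])}
        for row in rows
        if is_valid(row)
    ]


def load(rows: list[dict]) -> str:
    """Load: serialize to JSON (a real pipeline would write this to a target)."""
    return json.dumps(rows, indent=2)


cleaned = transform(extract(RAW_CSV))
output_json = load(cleaned)
print(output_json)
```

Of the four sample rows, the one with a negative price and the one with a missing id are rejected by the validation step, so only two rows reach the output.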
Python ecosystem
- Python: an interpreted, high-level programming language. The default language for data engineering because of its readable syntax and massive library ecosystem.
- Apache Airflow: an open-source orchestrator written in Python. You meet it properly in Week 11.
- Pandas: the standard Python library for in-memory tabular data manipulation (DataFrames). You meet it properly in Week 4 (Data Processing with Pandas).
- PySpark: the Python interface to Apache Spark, used for distributed processing of big data (datasets that do not fit in one machine's memory).
- Scikit-learn: the standard Python library for classical machine learning.
Hardware and architecture
- Hardware: the physical components of a computer, such as the CPU (Central Processing Unit, the brain) and RAM (Random Access Memory, the short-term memory).
Languages and compilation
- Interpreted language: a language where source code is translated and executed line-by-line at runtime by an interpreter. Python is interpreted.
- Compiled language: a language where source code is translated into machine code ahead of time by a compiler, producing a binary you then execute. C, C++, and Rust are compiled. Compiled code usually runs faster, but the extra compile step makes it slower to iterate on.
- High-level language: a language that hides low-level details (memory management, CPU registers, pointers) so you can focus on the problem you are solving. Python is high-level; C is low-level.
- Machine code: the raw 0s and 1s the CPU executes directly. A compiler turns your source into machine code ahead of time; with an interpreted language, the interpreter (itself a machine-code program) carries out your instructions on your behalf at runtime.
- Assembly language: a low-level programming language that is very close to machine code. It uses slightly more human-readable mnemonics (like MOV or ADD) instead of raw 0s and 1s.
- Garbage collection: the runtime feature that automatically frees memory used by objects you no longer reference, so you do not need to call free() like you would in C. Python has had garbage collection since the start; Python 2.0 added a smarter cycle-detecting collector that catches objects that reference each other.
- C: a classic compiled, low-level language. Much of Python's high-performance data processing is actually written in C under the hood.
- C++: a compiled, middle-level language that expands on C. Often used for high-performance applications and system software.
- Rust: a modern compiled language focused on safety and performance. Many newer Python data tools (like uv and polars) are written in Rust.
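The cycle-detecting collector mentioned in the garbage collection entry can be observed directly. The sketch below assumes CPython (other interpreters may free memory at different moments); the Node class and the variable names are hypothetical. Two objects point at each other, so deleting their names is not enough to free them, and only a collector pass reclaims them.

```python
import gc
import weakref


class Node:
    """A tiny object that can point at another Node."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.partner = None


gc.disable()  # pause automatic cycle collection so the demo is deterministic

a = Node("a")
b = Node("b")
a.partner = b  # a -> b
b.partner = a  # b -> a: the two objects now form a reference cycle

probe = weakref.ref(a)  # a weak reference does not keep the object alive

del a, b  # no names are left, but each object is still referenced by the other
alive_before = probe() is not None

gc.collect()  # run the cycle detector by hand
alive_after = probe() is not None

gc.enable()
print("alive before collect:", alive_before)
print("alive after collect:", alive_after)
```

In everyday code you never call gc.collect() yourself; the collector runs automatically, and it is disabled here only to make the before and after states visible.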
Versioning and releases
- Release: a specific version of a piece of software that has been published for others to download and use. "Python 0.9.0 released" means the maintainers tagged that version, packaged it, and made it available to the public, after which any code with that version number refers to that exact frozen snapshot. Most projects ship many releases over their lifetime: bug-fix patches, feature minor versions, and the occasional breaking major version.
- Bump (a version): jargon for "increase the version number when shipping a new release." MAJOR bumps signal breaking changes, MINOR bumps add features, PATCH bumps ship bug fixes. You will see "bump pandas to 2.2.1" in pull-request titles all the time: it just means "raise the pinned version."
- Semantic versioning (semver): the MAJOR.MINOR.PATCH numbering rule followed, strictly or loosely, by most modern software (Python, Node, npm packages, dbt, Airflow). Bump MAJOR for backwards-incompatible changes, MINOR for backwards-compatible new features, PATCH for bug fixes. So 3.11.12 → 3.11.13 is a bug-fix release; 3.11 → 3.12 adds features; 2 → 3 is a breaking change. Python itself follows the scheme only loosely: a minor release such as 3.12 can also remove long-deprecated modules, so read the release notes before upgrading. See semver.org for the full spec.
- Major version: the leftmost number in a semver string. Bumping it means existing code may break. Python 2 → Python 3 was a major bump that took the ecosystem about a decade to migrate.
- Minor version: the middle number. Bumping it adds new features but keeps existing code working. Python 3.10 → 3.11 was a minor bump that brought significant performance improvements without breaking valid 3.10 code.
- Patch version: the rightmost number. Bumping it ships bug fixes and security patches only. 3.11.11 → 3.11.12 is a patch release: safe to upgrade in production.
- Backwards-incompatible (breaking change): a release where code that worked on the previous version stops working. The print statement becoming a function in Python 3.0 is the canonical example.
- Sunset / End-of-life (EOL): the date after which a software version stops receiving any updates, including security fixes. Python 2 reached EOL on January 1, 2020. After EOL, running the version is a security risk and packages drop support for it.
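The major, minor, and patch rules above can be checked mechanically. The following is a minimal sketch that handles only plain three-number versions (the full semver spec also allows pre-release and build tags, which are ignored here); parse_version and bump_kind are hypothetical helper names, not part of any library.

```python
def parse_version(text: str) -> tuple[int, int, int]:
    """Split 'MAJOR.MINOR.PATCH' into three integers."""
    major, minor, patch = (int(part) for part in text.split("."))
    return major, minor, patch


def bump_kind(old: str, new: str) -> str:
    """Name the leftmost component that changed between two versions."""
    old_v, new_v = parse_version(old), parse_version(new)
    if new_v <= old_v:
        return "not a bump"
    for label, before, after in zip(("major", "minor", "patch"), old_v, new_v):
        if before != after:
            return label
    return "not a bump"  # unreachable, kept so every path returns a string


print(bump_kind("3.11.11", "3.11.12"))  # patch
print(bump_kind("3.10.0", "3.11.0"))    # minor
print(bump_kind("2.7.18", "3.0.0"))     # major
```

Versions are compared as tuples of integers rather than as strings because string comparison goes character by character and would rank "3.9.0" above "3.10.0".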
Cloud infrastructure
- Azure: Microsoft's cloud computing platform. The primary cloud used for class infrastructure in this track.
- AWS: Amazon Web Services. The oldest and largest cloud provider.
- GCP: Google Cloud Platform. Google's suite of cloud computing services.
Virtual environments and packaging
- Virtual environment (venv): an isolated folder containing its own Python interpreter and packages, separate from the system Python. One per project, so projects with conflicting dependency versions cannot break each other.
- Package: a unit of reusable Python code installed from an index like PyPI. pandas, requests, and airflow are packages.
- Package manager: a tool that installs, upgrades, and removes packages from your project. pip is the classic Python package manager; uv is a faster modern alternative.
- pip: the standard Python package installer. Ships with Python. Reads from requirements.txt or installs ad-hoc with pip install <name>.
- requirements.txt: a plain-text file listing the packages a project needs, one per line, optionally with version pins like pandas==2.2.0. The minimum-viable lock file.
- uv: a fast Python package manager and resolver written in Rust. Replaces pip + venv + pip-tools with one tool. Used to manage this repository.
- Lock file (uv.lock): a file that pins not just the packages you chose directly, but every transitive dependency too. Guarantees that everyone (you, your teammate, CI, production) installs the exact same dependency tree.
- pyproject.toml: the modern standard configuration file for Python projects. Declares package metadata, dependencies, and tool settings. uv reads from it.
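A quick way to see the isolation a virtual environment provides is to ask the running interpreter where it lives. This is a small sketch, assuming the environment was created with the standard venv module or a compatible tool; in_virtual_environment is a hypothetical helper name.

```python
import sys


def in_virtual_environment() -> bool:
    """Return True when the running interpreter belongs to a venv."""
    # Inside a venv, sys.prefix points at the environment folder, while
    # sys.base_prefix still points at the Python installation it was built from.
    return sys.prefix != sys.base_prefix


print("interpreter:", sys.executable)
print("inside a venv:", in_virtual_environment())
```

Running the script once with the system Python and once with the project's venv activated should show a different interpreter path each time.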
Editor and shell
- VS Code: the editor recommended for this track. Free, cross-platform, with a strong Python extension by Microsoft.
- Python interpreter (in VS Code): the specific python executable VS Code uses for running and analyzing your code. After creating a venv, you tell VS Code to use that venv's interpreter; otherwise IntelliSense and "Run" pick the wrong one.
- PATH: an environment variable listing the folders the shell searches when you type a command. "Add Python to PATH" during install means the shell can find python from any folder.
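The PATH lookup can be inspected from Python itself. The sketch below prints the first few folders on PATH and then uses shutil.which, which mimics the shell's search, to show which executable each command name would resolve to. The command names checked are only examples; the output depends entirely on your machine.

```python
import os
import shutil

# PATH is one long string; os.pathsep is ":" on macOS/Linux and ";" on Windows.
path_folders = os.environ.get("PATH", "").split(os.pathsep)

print(f"The shell searches {len(path_folders)} folders, starting with:")
for folder in path_folders[:5]:
    print("  ", folder)

# shutil.which returns the first match found on PATH,
# or None when the command cannot be found.
for command in ("python", "python3", "pip"):
    print(command, "->", shutil.which(command))
```

If a command prints None here, the shell will report it as "not found" too, which usually means its folder is missing from PATH.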
Automation and CI/CD
- CI/CD: Continuous Integration and Continuous Deployment. The practice of automatically testing and deploying your code every time you make a change.
- CI runner: the remote machine that executes your CI/CD pipelines (e.g., GitHub Actions runners). It needs a reproducible environment (for example, one installed from a lock file) so tests run exactly the same as on your laptop.
Ch2: Data Types and Variables
Data types
- Integer (int): a numeric data type representing whole numbers (positive, negative, or zero) without any decimal point. Used for counts, IDs, and loop counters.
- Floating-point number (float): a numeric data type representing numbers with decimal points. Floats are binary approximations of real numbers, so small precision errors show up even in simple arithmetic: 0.1 + 0.2 is not exactly 0.3.
- Boolean (bool): a data type that can only have two values: True or False. Used for conditional logic and flags.
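The three types above, and the float precision caveat in particular, are easiest to see in a short script. This is an illustrative sketch with hypothetical variable names; math.isclose and decimal.Decimal are the standard-library tools commonly used to work around float rounding.

```python
import math
from decimal import Decimal

count = 3            # int
price = 0.1 + 0.2    # float
is_ready = True      # bool

print(type(count).__name__, type(price).__name__, type(is_ready).__name__)

# Binary floats cannot store 0.1 or 0.2 exactly, so the sum is slightly off.
print(price == 0.3)               # False
print(math.isclose(price, 0.3))   # True: compare with a tolerance instead

# Decimal does exact base-10 arithmetic when built from strings.
exact = Decimal("0.1") + Decimal("0.2")
print(exact == Decimal("0.3"))    # True

# bool is a subclass of int, so True counts as 1 in arithmetic.
print(sum([True, False, True]))   # 2
```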
Python types and behaviour
- Dynamically typed: a language where the type of a variable is determined at runtime from the value you assign, not declared up front. x = 42 makes x an int; x = "hello" later in the same script reassigns it to a str without complaint. The opposite is statically typed (Java, C#, TypeScript), where types are fixed and checked before the program runs.
- Statically typed: a language where every variable has a type, declared or inferred, that is checked before the program runs. Java, C#, Rust, and TypeScript are statically typed. Catches type mismatches earlier but makes you write more boilerplate. Python's optional type hints (covered in Ch5) bring some of this safety to a dynamically-typed language.
- Mutable: an object whose contents can be changed after creation. Lists, dictionaries, and sets are mutable: my_list.append(3) modifies the original list in place.
- Immutable: an object whose contents cannot be changed after creation. Strings, integers, floats, booleans, and tuples are immutable. Operations that look like they "modify" them actually return a new object: "hi".upper() creates and returns "HI", leaving "hi" unchanged.
- Truthiness: how Python evaluates non-boolean values in a boolean context (like if x:). Values that act like True are called truthy (non-empty strings, non-zero numbers); those that act like False are falsy (None, 0, 0.0, "", [], {}, set()). So if my_list: is the idiomatic way to ask "does this list have anything in it?"
- Indexing: the act of accessing a single element from a sequence (like a list or string) using its position number. Python uses zero-based indexing, meaning the first element is at index 0.
- Slicing: a technique for extracting a sub-portion of a sequence (like a list or string) using the [start:stop:step] syntax.
- Nesting: the practice of placing one data structure inside another (e.g., a list of dictionaries, or a dictionary where one value is another dictionary). Common for representing complex, hierarchical data like JSON.
- PEP 8: the official Python style guide (peps.python.org/pep-0008). Defines naming conventions (snake_case for variables and functions, PascalCase for classes, UPPER_SNAKE_CASE for constants), indentation (4 spaces), maximum line length (79 characters by default; teams may agree on up to 99, and the Black formatter defaults to 88), and other formatting rules. Following PEP 8 makes your code feel idiomatic to other Python developers.
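Several of the behaviours defined above (dynamic typing, mutability, truthiness, indexing, slicing, and nesting) fit into one short script. All variable names and sample values below are hypothetical and chosen only for illustration.

```python
# Dynamic typing: the same name can point at values of different types.
x = 42
first_type = type(x).__name__   # "int"
x = "hello"
second_type = type(x).__name__  # "str"

# Mutable: append changes the list in place, and every name bound to it sees it.
my_list = [1, 2]
alias = my_list
my_list.append(3)
print(alias)                    # [1, 2, 3]

# Immutable: upper() returns a new string and leaves the original alone.
greeting = "hi"
shouted = greeting.upper()
print(greeting, shouted)        # hi HI

# Truthiness: empty containers, zero, and None are falsy.
falsy_values = [None, 0, 0.0, "", [], {}, set()]
print(any(falsy_values))        # False

# Indexing and slicing with [start:stop:step]; stop is excluded.
letters = ["a", "b", "c", "d", "e"]
print(letters[0])               # a
print(letters[-1])              # e
print(letters[1:4])             # ['b', 'c', 'd']
print(letters[::2])             # ['a', 'c', 'e']

# Nesting: a list of dictionaries, shaped like a typical JSON payload.
orders = [
    {"id": 1, "items": ["pen", "ink"]},
    {"id": 2, "items": []},
]
orders_with_items = [order["id"] for order in orders if order["items"]]
print(orders_with_items)        # [1]
```

The last line combines nesting with truthiness: the empty items list in the second order is falsy, so that order is filtered out.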
Ch3: Control Flow
Control flow
- Control flow: the order in which a program's statements are executed. Conditionals (if/elif/else) and loops (for, while) are the two primary tools you use to bend control flow away from straight top-to-bottom execution.
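Here is a minimal sketch of the three constructs named above, a conditional, a for loop, and a while loop, working together. The classify function, the temperature thresholds, and the sample readings are hypothetical.

```python
def classify(temperature_c: float) -> str:
    """Pick exactly one branch with if/elif/else."""
    if temperature_c < 0:
        return "freezing"
    elif temperature_c < 20:
        return "cool"
    else:
        return "warm"


# for: run the body once per item in a sequence.
readings = [-3.0, 12.5, 27.0]
labels = []
for value in readings:
    labels.append(classify(value))
print(labels)  # ['freezing', 'cool', 'warm']

# while: repeat for as long as a condition stays true.
countdown = 3
ticks = []
while countdown > 0:
    ticks.append(countdown)
    countdown -= 1
print(ticks)  # [3, 2, 1]
```

A for loop suits a known collection of items; a while loop suits cases where you only know the stopping condition.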