In this section, you'll set up a professional Python development environment. Having a consistent, reproducible setup is crucial for data engineering work - it ensures your code runs the same way everywhere.
Python has become the lingua franca of the data world. Whether you are building complex data pipelines, training machine learning models, or automating cloud infrastructure, Python is likely the tool you'll use.
Python dominates the data landscape for three main reasons:
As a Data Engineer, you will use Python for:
By learning Python, you're not just learning a programming language; you're gaining the ability to orchestrate the entire lifecycle of data.
https://htmlpreview.github.io/?https://gist.githubusercontent.com/lassebenni/bacb6cda40cbe2207bdbccd348659ab3/raw/python_hierarchy.html
Think of programming languages as layers in a pyramid. At the very bottom is **hardware** (the physical CPU and RAM). Just above that is machine code, the raw 0s and 1s that your processor understands. Then comes assembly language, a slightly more human-readable form of machine code. Climbing higher, we find compiled languages like C, C++, and Rust, which translate your code into machine instructions before running.
At the top of this pyramid sits Python. It's an interpreted, high-level language. "High-level" means Python handles a lot of the low-level details (like memory management) for you, so you can focus on the problem you're solving, not the machine you're running on. "Interpreted" means your Python code is translated and executed line-by-line at runtime, rather than being compiled into machine code beforehand. This makes it incredibly flexible and fast to iterate with, which is exactly what you want when building data pipelines.
🤓 The Curious Geek: Want to know about the difference between compiled and interpreted languages? Check out Compiled vs Interpreted Languages on freeCodeCamp.
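To make "interpreted" and "high-level" concrete, here is a tiny illustrative snippet (not part of the course setup): it runs straight through the interpreter with no compile step, and Python allocates and frees the memory for the new list on its own.

```python
# Runs line-by-line through the interpreter; there is no separate compile step.
numbers = [1, 2, 3, 4, 5]

# Python handles memory management for this new list automatically.
squares = [n * n for n in numbers]

print(squares)  # [1, 4, 9, 16, 25]
```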
Python was created by Dutch programmer Guido van Rossum and first released in 1991. Named after Monty Python (not the snake!), it was designed to be readable and beginner-friendly while remaining powerful enough for complex applications.
Key milestones:
| Year | Event |
|---|---|
| 1991 | Python 0.9.0 released |
| 2000 | Python 2.0 - list comprehensions, garbage collection |
| 2008 | Python 3.0 - major backwards-incompatible release |
| 2020 | Python 2 officially sunset (end of life) |
| 2022 | Python 3.11 - significant performance improvements |
⚠️ Never use Python 2. It's been dead since January 2020. If you see python vs python3 on your system, always use python3. Some old tutorials still reference Python 2 - ignore them.
In data engineering, Python versions are critical because libraries are built and tested against specific versions, and a version mismatch can quietly break your pipelines. This is why we use virtual environments (covered below) and always specify Python versions.
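One practical habit this enables: a pipeline script can check which interpreter it is running on and fail fast on a mismatch. The sketch below assumes your project targets Python 3.11, as this course does; adjust the version tuple for your own projects.

```python
import sys

# Fail fast if this script is run on an unexpected Python version.
if sys.version_info[:2] != (3, 11):
    raise RuntimeError(
        f"This project expects Python 3.11, "
        f"but is running {sys.version_info.major}.{sys.version_info.minor}"
    )

print("Python version OK:", sys.version)
```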
🎬 Want more history? Watch Python: The Documentary (1h 30m) - interviews with Guido van Rossum and how Python became one of the world's most popular languages.
⚠️ Even if you have Python installed from the Core program, make sure you have Python 3.11 specifically. Data engineering projects often require specific Python versions for compatibility.
The recommended way to install Python on macOS is using Homebrew:
# Install Homebrew if you don't have it
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
# Install Python 3.11
brew install [email protected]
# Verify installation
python3.11 --version
# Python 3.11.12
python --version
# Python 3.11.12
# On Ubuntu/Debian Linux, install via apt
sudo apt update
sudo apt install python3.11 python3.11-venv python3-pip
# Verify installation
python3.11 --version
# Python 3.11.12
💡 Virtual environments isolate your project's dependencies from other projects and your system Python. This is essential for reproducible data pipelines.
A virtual environment is like a clean room for your Python project. Each project gets its own isolated set of packages.
https://htmlpreview.github.io/?https://gist.githubusercontent.com/lassebenni/d1b6d921baa6621c0f53c73df53ce424/raw/virtual_env_animation.html
# Navigate to your project folder
cd my-data-project
# Create a virtual environment named 'venv'
python3.11 -m venv venv
# Activate it (macOS/Linux)
source venv/bin/activate
# Activate it (Windows PowerShell)
.\venv\Scripts\Activate.ps1
# Your prompt should now show (venv)
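If you are ever unsure whether activation worked, you can also check from inside Python. In a virtual environment, sys.prefix points at the venv directory while sys.base_prefix still points at the interpreter the venv was created from; the sketch below uses that to report which environment is running.

```python
import sys

# Which interpreter is actually running this script?
print("Interpreter:", sys.executable)

# Inside an activated venv, sys.prefix differs from sys.base_prefix.
if sys.prefix != sys.base_prefix:
    print("Virtual environment active:", sys.prefix)
else:
    print("No virtual environment active - this is the base Python")
```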
With your virtual environment activated:
# Install a package
pip install pandas
# Install multiple packages
pip install pandas numpy
# Save your dependencies to a file
pip freeze > requirements.txt
# Install from requirements file (on another machine)
pip install -r requirements.txt
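You can also inspect installed packages from Python itself using the standard library, which is handy when a pipeline fails on a missing or mismatched dependency. The package names below are just examples for illustration.

```python
from importlib.metadata import version, PackageNotFoundError

# Report which versions of a few example packages exist in this environment.
for package in ("pandas", "numpy"):
    try:
        print(package, version(package))
    except PackageNotFoundError:
        print(package, "is not installed in this environment")
```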
⌨️ Hands-on: Create a new folder called week1-practice, create a virtual environment in it, and install the requests package.
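To verify your hands-on setup, a tiny script like the following (a sketch with an example URL, not a required part of the exercise) confirms that requests is installed and importable in your new environment:

```python
import requests

# Simple check that the requests package is installed and usable.
response = requests.get("https://api.github.com")
print("Status code:", response.status_code)  # 200 means the call succeeded
```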
VS Code needs the Python extension to provide features like IntelliSense, debugging, and linting.
Press Cmd+Shift+X (macOS) or Ctrl+Shift+X (Windows/Linux) to open the Extensions view, then search for "Python" and install the official Python extension from Microsoft.

After creating a virtual environment, tell VS Code to use it:
Press Cmd+Shift+P (macOS) or Ctrl+Shift+P (Windows/Linux), run "Python: Select Interpreter", and pick the interpreter inside your project's venv folder.

💡 VS Code should automatically detect virtual environments in your workspace. Look for ./venv/bin/python or .\venv\Scripts\python.exe in the list.
Add these to your VS Code settings (.vscode/settings.json):
{
"python.defaultInterpreterPath": "${workspaceFolder}/venv/bin/python",
"python.terminal.activateEnvironment": true,
"editor.formatOnSave": true,
"python.formatting.provider": "black"
}
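With "editor.formatOnSave" enabled and Black as the formatter, messy code is rewritten to a consistent style every time you save. Roughly, the first function below would come out looking like the second (an illustrative before/after sketch, not output captured from a Black run):

```python
# Before saving: inconsistent spacing and quote style
def greet(name,greeting='hello'):
    return f'{greeting}, {name}!'

# After saving with Black-style formatting: double quotes, spaces after commas
def greet(name, greeting="hello"):
    return f"{greeting}, {name}!"
```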
Check your understanding:

- Why do we create a separate virtual environment (venv) for each project?
- What does source venv/bin/activate do?

Next lesson: Functions and Modules