Week 1 - Foundational Python

Python Setup

In this section, you'll set up a professional Python development environment. Having a consistent, reproducible setup is crucial for data engineering work - it ensures your code runs the same way everywhere.

Why Python?

Python has become the lingua franca of the data world. Whether you are building complex data pipelines, training machine learning models, or automating cloud infrastructure, Python is likely the tool you'll use.

The Go-To Language for Data & AI

Python dominates the data landscape for three main reasons:

  1. Massive Ecosystem: From Pandas for data manipulation to PySpark for big data and Scikit-learn for machine learning, Python has a library for almost every data task.
  2. Readability & Speed: Python's clean syntax allows you to focus on solving data problems rather than fighting with complex code. You can move from an idea to a working pipeline faster than in almost any other language.
  3. Integration: Python plays well with everything. It has first-class support for all major cloud providers (Azure, AWS, GCP) and can easily interface with high-performance tools written in C or Rust.

Python in Data Engineering (DE)

As a Data Engineer, you will use Python for:

By learning Python, you're not just learning a programming language; you're gaining the ability to orchestrate the entire lifecycle of data.

Where Does Python Sit?

https://htmlpreview.github.io/?https://gist.githubusercontent.com/lassebenni/bacb6cda40cbe2207bdbccd348659ab3/raw/python_hierarchy.html

Think of programming languages as layers in a pyramid. At the very bottom is **hardware** (the physical CPU and RAM). Just above that is machine code, the raw 0s and 1s that your processor understands. Then comes assembly language, a slightly more human-readable form of machine code. Climbing higher, we find compiled languages like C, C++, and Rust, which translate your code into machine instructions before running.

At the top of this pyramid sits Python. It's an interpreted, high-level language. "High-level" means Python handles a lot of the low-level details (like memory management) for you, so you can focus on the problem you're solving, not the machine you're running on. "Interpreted" means your Python code is translated and executed at runtime rather than being compiled into machine code beforehand: the standard interpreter (CPython) compiles your source to bytecode and then executes it statement by statement. This makes Python flexible and fast to iterate with, which is exactly what you want when building data pipelines.
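To make the "translated at runtime" idea concrete, here is a minimal sketch using only built-ins. It assumes CPython (the standard implementation), which turns source text into a code object and then executes it. The source string and the "<example>" label are made up for illustration.

```python
# Source code is just text until Python translates it at runtime.
source = "total = 2 + 3"

# compile() turns the text into a code object (bytecode in CPython).
# "<example>" is an arbitrary label used in tracebacks, not a real file.
code = compile(source, "<example>", "exec")

# exec() runs the code object; the names it defines land in `namespace`.
namespace = {}
exec(code, namespace)

print(type(code).__name__)   # code
print(namespace["total"])    # 5
```

You rarely call compile() yourself; Python does this step automatically every time you run a script.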

🤓 The Curious Geek: Want to know about the difference between compiled and interpreted languages? Check out Compiled vs Interpreted Languages on freeCodeCamp.

A Brief History of Python

Python was created by Dutch programmer Guido van Rossum and first released in 1991. Named after the comedy group Monty Python (not the snake!), it was designed to be readable and beginner-friendly while remaining powerful enough for complex applications.

Key milestones:

| Year | Event |
|------|-------|
| 1991 | Python 0.9.0 released |
| 2000 | Python 2.0 - list comprehensions, cycle-detecting garbage collection |
| 2008 | Python 3.0 - major backwards-incompatible release |
| 2020 | Python 2 officially sunset (end of life) |
| 2022 | Python 3.11 - significant performance improvements |

⚠️ Never use Python 2. It has been unsupported since January 2020. If your system has both a python and a python3 command, always use python3. Some old tutorials still reference Python 2 - ignore them.

Why do versions matter?

In data engineering, Python versions are critical because:

This is why we use virtual environments (covered below) and always specify Python versions.
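One way to make a version requirement explicit is to check it at the top of a script. The sketch below is illustrative only: the REQUIRED constant and both function names are hypothetical, and the 3.11 requirement is taken from this course rather than from any particular library.

```python
import sys

# Assumption for this course: Python 3.11 is required.
REQUIRED = (3, 11)


def matches_required(version: tuple, required: tuple = REQUIRED) -> bool:
    """Return True if the (major, minor) version equals the required one."""
    return tuple(version) == tuple(required)


def describe_version() -> str:
    """Summarize the running interpreter against the required version."""
    current = sys.version_info[:2]  # e.g. (3, 11)
    status = "OK" if matches_required(current) else "unexpected version"
    return f"Python {current[0]}.{current[1]}: {status}"


print(describe_version())
```

In a real pipeline you might raise an error instead of printing, so a job fails fast when it starts on the wrong interpreter.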

🎬 Want more history? Watch Python: The Documentary (1h 30m) - interviews with Guido van Rossum and how Python became one of the world's most popular languages.

Installing Python 3.11

⚠️ Even if you have Python installed from the Core program, make sure you have Python 3.11 specifically. Data engineering projects often require specific Python versions for compatibility.

Installing on macOS

The recommended way to install Python on macOS is using Homebrew:

# Install Homebrew if you don't have it
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

# Install Python 3.11
brew install python@3.11

# Verify installation
python3.11 --version
# Python 3.11.12

Installing on Windows

  1. Download Python 3.11 from python.org
  2. Run the installer
  3. Important: Check "Add Python to PATH" during installation
  4. Verify in PowerShell:
python --version
# Python 3.11.12

Installing on Linux (Ubuntu/Debian)

sudo apt update
sudo apt install python3.11 python3.11-venv python3-pip

# Verify installation
python3.11 --version
# Python 3.11.12

Virtual Environments

💡 Virtual environments isolate your project's dependencies from other projects and your system Python. This is essential for reproducible data pipelines.

A virtual environment is like a clean room for your Python project. Each project gets its own isolated set of packages.

https://htmlpreview.github.io/?https://gist.githubusercontent.com/lassebenni/d1b6d921baa6621c0f53c73df53ce424/raw/virtual_env_animation.html

Why use virtual environments?

Creating a virtual environment

# Navigate to your project folder
cd my-data-project

# Create a virtual environment named 'venv'
python3.11 -m venv venv

# Activate it (macOS/Linux)
source venv/bin/activate

# Activate it (Windows PowerShell)
.\venv\Scripts\Activate.ps1

# Your prompt should now show (venv)
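Besides looking at the prompt, you can ask Python itself whether it is running inside a virtual environment. This is a minimal sketch for environments created with the venv module; the function name is hypothetical.

```python
import sys


def in_virtual_environment() -> bool:
    """Return True when the interpreter is running inside a venv.

    Inside a venv, sys.prefix points at the environment folder, while
    sys.base_prefix still points at the Python installation it was made from.
    """
    return sys.prefix != sys.base_prefix


print("Interpreter:", sys.executable)
print("Inside a virtual environment:", in_virtual_environment())
```

Run it once before and once after activating the environment: the interpreter path should switch to one inside your venv folder.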

Installing packages

With your virtual environment activated:

# Install a package
pip install pandas

# Install multiple packages
pip install pandas numpy

# Save your dependencies to a file
pip freeze > requirements.txt

# Install from requirements file (on another machine)
pip install -r requirements.txt
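To see what pinned dependencies buy you, the following sketch compares "name==version" lines against what is installed in the current environment. It is an illustration, not a replacement for pip: the function names and the sample pins are hypothetical, and lines that are not simple "name==version" pins are skipped.

```python
from importlib import metadata
from typing import Dict, Optional


def parse_pins(requirements_text: str) -> Dict[str, str]:
    """Parse 'name==version' lines, skipping blanks, comments, and other forms."""
    pins = {}
    for line in requirements_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip()] = version.strip()
    return pins


def find_mismatches(pins: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Map each package whose installed version differs from its pin
    to the installed version (None if the package is not installed)."""
    mismatches = {}
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != wanted:
            mismatches[name] = installed
    return mismatches


# Hypothetical file contents; the real output of `pip freeze` will differ.
sample = """
# example pins
pandas==2.2.0
numpy==1.26.4
"""
pins = parse_pins(sample)
print(pins)                   # {'pandas': '2.2.0', 'numpy': '1.26.4'}
print(find_mismatches(pins))  # depends on what your environment contains
```

An empty result from find_mismatches means the environment matches the pins, which is the state `pip install -r requirements.txt` is meant to produce.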

⌨️ Hands-on: Create a new folder called week1-practice, create a virtual environment in it, and install the requests package.

VS Code Python Extension

VS Code needs the Python extension to provide features like IntelliSense, debugging, and linting.

Installation

  1. Open VS Code
  2. Press Cmd+Shift+X (macOS) or Ctrl+Shift+X (Windows/Linux) to open Extensions
  3. Search for "Python"
  4. Install the extension by Microsoft (it should be the first result)

Selecting the Python interpreter

After creating a virtual environment, tell VS Code to use it:

  1. Press Cmd+Shift+P (macOS) or Ctrl+Shift+P (Windows/Linux)
  2. Type "Python: Select Interpreter"
  3. Choose the interpreter from your venv folder

💡 VS Code should automatically detect virtual environments in your workspace. Look for ./venv/bin/python or .\venv\Scripts\python.exe in the list.
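If imports fail even though you installed a package, the usual cause is that VS Code is running a different interpreter than the one you installed into. A small diagnostic sketch (the helper name is hypothetical) that you can run from the VS Code terminal or editor:

```python
import importlib.util
import sys


def is_importable(module_name: str) -> bool:
    """Return True if the current interpreter can find a top-level module."""
    return importlib.util.find_spec(module_name) is not None


# Which Python is actually running this code?
print("Interpreter:", sys.executable)

# "json" ships with Python; "requests" is only found if it was installed
# into the environment this interpreter belongs to.
for name in ("json", "requests"):
    print(f"{name}: {'found' if is_importable(name) else 'NOT found'}")
```

If the interpreter path does not point inside your venv folder, select the correct interpreter as described above and run it again.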

Recommended settings

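Add these to your VS Code settings (.vscode/settings.json). The formatter entry assumes you have installed the Black Formatter extension by Microsoft; older versions of the Python extension used "python.formatting.provider" instead. On Windows, point the interpreter path at venv/Scripts/python.exe.

{
    "python.defaultInterpreterPath": "${workspaceFolder}/venv/bin/python",
    "python.terminal.activateEnvironment": true,
    "editor.formatOnSave": true,
    "[python]": {
        "editor.defaultFormatter": "ms-python.black-formatter"
    }
}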

🧠 Knowledge Check

  1. Why is it important to use a virtual environment (venv) for each project?
  2. What does the command source venv/bin/activate do?
  3. If you see "Module not found" errors even though you installed the package, what setting should you check in VS Code?

Extra reading

Best Practices


Next lesson: Functions and Modules