In this section, you'll set up a professional Python development environment. Having a consistent, reproducible setup is crucial for data engineering work - it ensures your code runs the same way everywhere.
Python has become the lingua franca of the data world. Whether you are building complex data pipelines, training machine learning models, or automating cloud infrastructure, Python is likely the tool you'll use.
Python dominates the data landscape for three main reasons:
As a Data Engineer, you will use Python for:
By learning Python, you're not just learning a programming language; you're gaining the ability to orchestrate the entire lifecycle of data.
https://htmlpreview.github.io/?https://gist.githubusercontent.com/lassebenni/bacb6cda40cbe2207bdbccd348659ab3/raw/python_hierarchy.html
Think of programming languages as layers in a pyramid. At the very bottom is **hardware** (the physical CPU and RAM). Just above that is machine code, the raw 0s and 1s that your processor understands. Then comes assembly language, a slightly more human-readable form of machine code. Climbing higher, we find compiled languages like C, C++, and Rust, which translate your code into machine instructions before running.
At the top of this pyramid sits Python. It's an interpreted, high-level language. "High-level" means Python handles a lot of the low-level details (like memory management) for you, so you can focus on the problem you're solving, not the machine you're running on. "Interpreted" means your Python code is translated and executed line-by-line at runtime, rather than being compiled into machine code beforehand. This makes it incredibly flexible and fast to iterate with, which is exactly what you want when building data pipelines.
🤓 The Curious Geek: Want to know about the difference between compiled and interpreted languages? Check out Compiled vs Interpreted Languages on freeCodeCamp.
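To make "interpreted" and "high-level" concrete, here is a tiny illustrative snippet (not part of the course setup): it runs straight through the interpreter with no compile step, and Python allocates and frees the memory for the new list on its own.

```python
# Runs line-by-line through the interpreter; there is no separate compile step.
numbers = [1, 2, 3, 4, 5]

# Python handles memory management for this new list automatically.
squares = [n * n for n in numbers]

print(squares)  # [1, 4, 9, 16, 25]
```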
Python was created by Dutch programmer Guido van Rossum and first released in 1991. Named after Monty Python (not the snake!), it was designed to be readable and beginner-friendly while remaining powerful enough for complex applications.
Key milestones:
| Year | Event |
|---|---|
| 1991 | Python 0.9.0 released |
| 2000 | Python 2.0 - list comprehensions, garbage collection |
| 2008 | Python 3.0 - major backwards-incompatible release |
| 2020 | Python 2 officially sunset (end of life) |
| 2022 | Python 3.11 - significant performance improvements |
⚠️ Never use Python 2. It's been dead since January 2020. If you see python vs python3 on your system, always use python3. Some old tutorials still reference Python 2 - ignore them.
In data engineering, Python versions are critical because libraries are built and tested against specific versions, and a version mismatch can quietly break your pipelines. This is why we use virtual environments (covered below) and always specify Python versions.
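One practical habit this enables: a pipeline script can check which interpreter it is running on and fail fast on a mismatch. The sketch below assumes your project targets Python 3.11, as this course does; adjust the version tuple for your own projects.

```python
import sys

# Fail fast if this script is run on an unexpected Python version.
if sys.version_info[:2] != (3, 11):
    raise RuntimeError(
        f"This project expects Python 3.11, "
        f"but is running {sys.version_info.major}.{sys.version_info.minor}"
    )

print("Python version OK:", sys.version)
```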
🎬 Want more history? Watch Python: The Documentary (1h 30m) - interviews with Guido van Rossum and how Python became one of the world's most popular languages.
⚠️ Even if you have Python installed from the Core program, make sure you have Python 3.11 specifically. Data engineering projects often require specific Python versions for compatibility.
The recommended way to install Python on macOS is using Homebrew:
# Install Homebrew if you don't have it
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
# Install Python 3.11
brew install [email protected]
# Verify installation
python3.11 --version
# Python 3.11.12
python --version
# Python 3.11.12
# On Ubuntu/Debian Linux, install via apt
sudo apt update
sudo apt install python3.11 python3.11-venv python3-pip
# Verify installation
python3.11 --version
# Python 3.11.12
💡 Virtual environments isolate your project's dependencies from other projects and your system Python. This is essential for reproducible data pipelines.
A virtual environment is like a clean room for your Python project. Each project gets its own isolated set of packages.
https://htmlpreview.github.io/?https://gist.githubusercontent.com/lassebenni/d1b6d921baa6621c0f53c73df53ce424/raw/virtual_env_animation.html
# Navigate to your project folder
cd my-data-project
# Create a virtual environment named 'venv'
python3.11 -m venv venv
# Activate it (macOS/Linux)
source venv/bin/activate
# Activate it (Windows PowerShell)
.\venv\Scripts\Activate.ps1
# Your prompt should now show (venv)
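If you are ever unsure whether activation worked, you can also check from inside Python. In a virtual environment, sys.prefix points at the venv directory while sys.base_prefix still points at the interpreter the venv was created from; the sketch below uses that to report which environment is running.

```python
import sys

# Which interpreter is actually running this script?
print("Interpreter:", sys.executable)

# Inside an activated venv, sys.prefix differs from sys.base_prefix.
if sys.prefix != sys.base_prefix:
    print("Virtual environment active:", sys.prefix)
else:
    print("No virtual environment active - this is the base Python")
```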
With your virtual environment activated:
# Install a package
pip install pandas
# Install multiple packages
pip install pandas numpy
# Save your dependencies to a file
pip freeze > requirements.txt
# Install from requirements file (on another machine)
pip install -r requirements.txt
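You can also inspect installed packages from Python itself using the standard library, which is handy when a pipeline fails on a missing or mismatched dependency. The package names below are just examples for illustration.

```python
from importlib.metadata import version, PackageNotFoundError

# Report which versions of a few example packages exist in this environment.
for package in ("pandas", "numpy"):
    try:
        print(package, version(package))
    except PackageNotFoundError:
        print(package, "is not installed in this environment")
```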
⌨️ Hands-on: Create a new folder called week1-practice, create a virtual environment in it, and install the requests package.
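To verify your hands-on setup, a tiny script like the following (a sketch with an example URL, not a required part of the exercise) confirms that requests is installed and importable in your new environment:

```python
import requests

# Simple check that the requests package is installed and usable.
response = requests.get("https://api.github.com")
print("Status code:", response.status_code)  # 200 means the call succeeded
```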
VS Code needs the Python extension to provide features like IntelliSense, debugging, and linting.
Press Cmd+Shift+X (macOS) or Ctrl+Shift+X (Windows/Linux) to open the Extensions view, then search for "Python" and install the official Python extension from Microsoft.

After creating a virtual environment, tell VS Code to use it:
Press Cmd+Shift+P (macOS) or Ctrl+Shift+P (Windows/Linux), run "Python: Select Interpreter", and pick the interpreter inside your project's venv folder.

💡 VS Code should automatically detect virtual environments in your workspace. Look for ./venv/bin/python or .\venv\Scripts\python.exe in the list.
Add these to your VS Code settings (.vscode/settings.json):
{
"python.defaultInterpreterPath": "${workspaceFolder}/venv/bin/python",
"python.terminal.activateEnvironment": true,
"editor.formatOnSave": true,
"python.formatting.provider": "black"
}
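With "editor.formatOnSave" enabled and Black as the formatter, messy code is rewritten to a consistent style every time you save. Roughly, the first function below would come out looking like the second (an illustrative before/after sketch, not output captured from a Black run):

```python
# Before saving: inconsistent spacing and quote style
def greet(name,greeting='hello'):
    return f'{greeting}, {name}!'

# After saving with Black-style formatting: double quotes, spaces after commas
def greet(name, greeting="hello"):
    return f"{greeting}, {name}!"
```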
Check your understanding:

- Why do we create a separate virtual environment (venv) for each project?
- What does source venv/bin/activate do?

Next lesson: Functions and Modules