Python Setup and Development Environment
In this section, you'll set up a professional Python development environment. A consistent, reproducible setup is crucial for data engineering work: it ensures your code runs the same way everywhere.
Python has become the lingua franca of the data world. Whether you are building complex data pipelines, training machine learning models, or automating cloud infrastructure, Python is likely the tool you'll use.
Python dominates the data landscape, and as a Data Engineer you will use it throughout your work: building and orchestrating data pipelines, transforming and validating data, and automating cloud infrastructure.
By learning Python, you're not just learning a programming language; you're gaining the ability to orchestrate the entire lifecycle of data.
https://htmlpreview.github.io/?https://gist.githubusercontent.com/lassebenni/bacb6cda40cbe2207bdbccd348659ab3/raw/python_hierarchy.html
Think of programming languages as layers in a pyramid. At the very bottom is **hardware** (the physical CPU and RAM). Just above that is machine code, the raw 0s and 1s that your processor understands. Then comes assembly language, a slightly more human-readable form of machine code. Climbing higher, we find compiled languages like C, C++, and Rust, which translate your code into machine instructions before running.
At the top of this pyramid sits Python. It's an interpreted, high-level language. "High-level" means Python handles a lot of the low-level details (like memory management) for you, so you can focus on the problem you're solving, not the machine you're running on. "Interpreted" means your Python code is translated and executed line-by-line at runtime, rather than being compiled into machine code beforehand. This makes it incredibly flexible and fast to iterate with, which is exactly what you want when building data pipelines.
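To see what "high-level" means in practice, here is a tiny sketch (the numbers are made up for illustration). Notice there are no type declarations and no memory management; the interpreter simply executes each statement as it reaches it:

```python
# Python sizes the list, infers the types, and frees memory for you.
sales = [120, 340, 95, 410]      # no declaration needed
total = sum(sales)               # built-in functions work on any iterable
average = total / len(sales)     # division produces a float automatically

print(f"total={total}, average={average}")
# total=965, average=241.25
```

The equivalent in C would require declaring an array, its length, and the types of every variable before doing any arithmetic.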
<aside> 🤓 The Curious Geek: Want to know about the difference between compiled and interpreted languages? Check out Compiled vs Interpreted Languages on freeCodeCamp.
</aside>
Python was created by Dutch programmer Guido van Rossum and first released in 1991. Named after Monty Python (not the snake!), it was designed to be readable and beginner-friendly while remaining powerful enough for complex applications.
Key milestones:
| Year | Event |
|---|---|
| 1991 | Python 0.9.0 released |
| 2000 | Python 2.0 - list comprehensions, garbage collection |
| 2008 | Python 3.0 - major backwards-incompatible release |
| 2020 | Python 2 officially sunset (end of life) |
| 2022 | Python 3.11 - significant performance improvements |
<aside> ⚠️ Never use Python 2. It's been dead since January 2020. If you see python vs python3 on your system, always use python3. Some old tutorials still reference Python 2 - ignore them.
</aside>
In data engineering, Python versions are critical because:
This is why we use virtual environments (covered below) and always specify Python versions.
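Because of this, pipeline scripts often check the interpreter version before doing anything else. A minimal sketch (the `REQUIRED` tuple is whatever your project demands):

```python
import sys

# The minimum Python version this project supports (an example value).
REQUIRED = (3, 11)

# Compare only major.minor; micro releases (3.11.x) are compatible.
version_ok = sys.version_info[:2] >= REQUIRED

print(f"Running Python {sys.version_info.major}.{sys.version_info.minor}, "
      f"need {REQUIRED[0]}.{REQUIRED[1]}+: {'OK' if version_ok else 'UPGRADE'}")
```

In a real pipeline you would raise an error instead of printing, so the job fails fast on the wrong interpreter.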
<aside> 🎬 Want more history? Watch Python: The Documentary (1h 30m) - interviews with Guido van Rossum and how Python became one of the world's most popular languages.
</aside>
<aside> ⚠️ Even if you have Python installed from the Core program, make sure you have Python 3.11 specifically. Data engineering projects often require specific Python versions for compatibility.
</aside>
The recommended way to install Python on macOS is using Homebrew:

```shell
# Install Homebrew if you don't have it
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

# Install Python 3.11
brew install [email protected]

# Verify installation
python3.11 --version
# Python 3.11.12
```
On Ubuntu/Debian Linux, install via apt (if your release doesn't package Python 3.11, you may need a third-party repository such as the deadsnakes PPA):

```shell
sudo apt update
sudo apt install python3.11 python3.11-venv python3-pip

# Verify installation
python3.11 --version
# Python 3.11.12
```
<aside> 💡 Virtual environments isolate your project's dependencies from other projects and your system Python. This is essential for reproducible data pipelines.
</aside>
A virtual environment is like a clean room for your Python project. Each project gets its own isolated set of packages.
https://htmlpreview.github.io/?https://gist.githubusercontent.com/lassebenni/d1b6d921baa6621c0f53c73df53ce424/raw/virtual_env_animation.html
```shell
# Navigate to your project folder
cd my-data-project

# Create a virtual environment named 'venv'
python3.11 -m venv venv

# Activate it (macOS/Linux)
source venv/bin/activate

# Activate it (Windows PowerShell)
.\venv\Scripts\Activate.ps1

# Your prompt should now show (venv)
```
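If you're ever unsure whether a script is running inside a virtual environment, Python itself can tell you. Inside a venv, `sys.prefix` points at the venv directory while `sys.base_prefix` still points at the base installation:

```python
import sys

# These two paths differ only when a virtual environment is active.
in_venv = sys.prefix != sys.base_prefix

print("Inside a virtual environment" if in_venv else "Using the base Python")
print("Packages install into:", sys.prefix)
```

This check is handy in pipeline entry points to catch "oops, I forgot to activate" mistakes early.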
<aside>
💡 Pro Tip: The modern way with uv
While venv and pip are standard, many professional Data Engineers now use uv. It's an incredibly fast Python package manager written in Rust.
If you have uv installed, you can replace the steps above with:
```shell
uv venv            # Create venv
uv pip install X   # Install X
```
Or even better, use uv run script.py to run a script with its dependencies automatically! We use uv to manage this repository.
</aside>
<aside>
📘 Core program connection: In the Core program with JavaScript you used npm, package.json, and package-lock.json to install packages and keep versions reproducible. In Python, uv solves a very similar problem with pyproject.toml and uv.lock. The exact tools are different, but the goal is the same: every machine and CI runner should install the same dependency set. Refresh the Core program chapter here: https://www.notion.so/hackyourfuture/Package-managers-2b250f64ffc9800d8c76e5fec3aa8095
</aside>
With your virtual environment activated:
```shell
# Install a package
pip install pandas

# Install multiple packages
pip install pandas numpy

# Save your dependencies to a file
pip freeze > requirements.txt

# Install from requirements file (on another machine)
pip install -r requirements.txt
```
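`pip freeze` reads the metadata of every installed distribution. You can inspect the same information from Python with the standard-library `importlib.metadata` module (available since Python 3.8), which is useful for debugging "which version do I actually have?" questions:

```python
from importlib import metadata

# List a few installed distributions and their versions --
# the same name==version pairs that `pip freeze` writes out.
for dist in sorted(metadata.distributions(),
                   key=lambda d: d.metadata["Name"].lower())[:5]:
    print(f"{dist.metadata['Name']}=={dist.version}")
```

For a single package, `metadata.version("pandas")` returns just its version string (and raises `PackageNotFoundError` if it isn't installed).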
A package manager is a tool that installs and updates external libraries for your project. In Python, the classic tool is pip. Newer tools like uv do the same job, but also improve speed and reproducibility.
requirements.txt is a simple way to list direct dependencies. It works well and you will still see it in many Python projects and Dockerfiles.
uv goes further by using a lock file called uv.lock. That lock file pins not only the packages you chose directly, but also the packages those packages depend on. This makes installs more reproducible across laptops, CI, and production.
```shell
# Sync from pyproject.toml and uv.lock
uv sync
```
<aside> 💡 A lock file answers a practical question: "Will my teammate, the CI runner, and production install the exact same dependency tree?"
</aside>
<aside>
💡 You will still see requirements.txt in many real Python projects and in older codebases. It is widely used and worth recognizing. In this track, uv is the recommended route for new work because uv.lock gives stronger reproducibility.
</aside>
Week 5 will revisit this topic in more depth: what package managers do, why lock files matter, and when to choose requirements.txt or uv. See: Dependency Management
<aside> ⌨️ Hands-on: Create a new folder called week1-practice, create a virtual environment in it, and install the requests package.
</aside>
VS Code needs the Python extension to provide features like IntelliSense, debugging, and linting.