Week 4: Data Processing

Jupyter Notebooks

While production data pipelines belong in .py modules, the initial phase of any data project is usually exploration: you need to hit an API, load a CSV, and see what the data actually looks like before you can design the system.

Jupyter Notebooks (.ipynb) are the industry-standard tool for this discovery phase.

By the end of this chapter, you should be able to run an exploratory notebook in VS Code, use cell-by-cell execution to inspect a DataFrame without reloading data, and explain when a notebook is the right tool versus a .py file.

Concepts

Why use a Notebook?

In a standard Python script, if you want to see the result of a transformation, you have to run the entire script. If your script takes 30 seconds to load data from an API, you wait 30 seconds every time you change a single line of cleaning logic.

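Notebooks solve this by breaking code into cells. Each cell runs on its own, and the kernel keeps every variable in memory between runs, so you load the data once and then re-run only the cell you are changing.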

<aside> ⚠️ Cell execution order matters. A notebook's kernel remembers state in the order you ran cells, not the order they appear on screen. If you run Cell 5 before Cell 3, Cell 3 sees whatever Cell 5 defined or changed, so the notebook may behave differently than a fresh top-to-bottom run. Always restart the kernel and use Run All before treating a notebook's output as correct.

</aside>
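The effect is easy to reproduce outside a notebook. The sketch below is a rough stand-in, not how Jupyter is implemented: it treats each cell as a string and runs it with `exec` against one shared dictionary, which plays the role of the kernel's memory. The cell contents and the `run` helper are made up for illustration.

```python
# Three pretend cells. Cell 3 overwrites `amounts` with a doubled copy.
cells = {
    1: "factor = 2",
    2: "amounts = [120, 90, 200]",
    3: "amounts = [a * factor for a in amounts]",
}


def run(order):
    """Run the given cells, in the given order, against a fresh namespace."""
    namespace = {}  # stands in for the kernel's memory
    for number in order:
        exec(cells[number], namespace)
    return namespace["amounts"]


# A clean top-to-bottom run.
top_to_bottom = run([1, 2, 3])

# The same notebook, but Cell 3 was run a second time while experimenting.
reran_cell_3 = run([1, 2, 3, 3])

print(top_to_bottom)
print(reran_cell_3)
```

Both runs use the same three cells, yet they end with different values in `amounts`, because the second run of Cell 3 starts from the state the first run left behind. Restarting the kernel is what resets that state.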

<aside> ⚠️ The golden rule: Notebooks are for research. Scripts are for production. Use the notebook to meet your data, then graduate your successful code into your real .py modules.

</aside>
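As a rough illustration of what "graduating" code can look like, the sketch below wraps two cleaning steps in a function that could live in a .py module. The function name `clean_orders` and the choice of steps are hypothetical, and the column names `amount` and `order_date` are assumed to match the orders data used in the hands-on section of this chapter.

```python
import pandas as pd


def clean_orders(orders: pd.DataFrame) -> pd.DataFrame:
    """Drop orders without an amount and parse order_date as a datetime."""
    # Work on a copy so the caller's DataFrame is left untouched.
    cleaned = orders.dropna(subset=["amount"]).copy()
    cleaned["order_date"] = pd.to_datetime(cleaned["order_date"])
    return cleaned


if __name__ == "__main__":
    # Tiny made-up sample, only here to show the function running.
    sample = pd.DataFrame(
        {
            "order_id": [1, 2, 3],
            "amount": [120.0, None, 200.0],
            "order_date": ["2024-01-02", "2024-01-03", "2024-01-04"],
        }
    )
    print(clean_orders(sample))
```

Unlike a notebook cell, the function takes its input as a parameter and returns a new DataFrame, so it does not depend on which cells happened to run before it.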

Setting up in VS Code

VS Code has built-in support for notebooks.

  1. Create a new file named discovery.ipynb.
  2. VS Code will detect the extension and ask you to install the Jupyter Extension if it is missing.
  3. Select a kernel: the Python environment that will run your code. Choose your .venv or the global Python 3.12 installation.
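If you are unsure which environment the notebook ended up on, one way to check is to run a small cell like the sketch below. It only uses the standard library; the variable names are illustrative.

```python
import sys

# Path of the Python interpreter the kernel is running on.
print(sys.executable)

# Version of that interpreter as a dotted string.
python_version = ".".join(str(part) for part in sys.version_info[:3])
print(python_version)

# In a virtual environment, sys.prefix points at the environment while
# sys.base_prefix points at the Python installation it was created from.
in_virtualenv = sys.prefix != sys.base_prefix
print(in_virtualenv)
```

If the printed path does not point into your .venv folder, switch kernels with the kernel picker in the top-right corner of the notebook.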

⌨️ Hands on: Visual Exploration

In your discovery.ipynb, try the following cells. The data is embedded directly so there is no external file to set up.

Cell 1: Import pandas and create a small DataFrame.

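from io import StringIO

import pandas as pd

# Embed the CSV as text so the cell needs no external file.
# Order 3 has no amount; pandas reads the empty field as NaN.
_csv = StringIO("""order_id,customer_id,region,amount,order_date
1,100,NL,120,2024-01-02
2,101,BE,90,2024-01-03
3,102,NL,,2024-01-03
4,103,DE,200,2024-01-04
5,100,NL,50,2024-01-05""")
orders = pd.read_csv(_csv)
orders  # the last expression in a cell is rendered as its output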

Cell 2: Run orders.head(). Notice how the table renders in the notebook instead of printing to a terminal.

Cell 3: Run orders.info() or orders.describe(). Re-run this cell without re-running Cell 1.

Cell 4: Tweak a column (for example orders['amount'] * 1.2) and run only that cell. The DataFrame from Cell 1 is still in memory.
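Written out as code, Cells 2 to 4 might look like the sketch below. It repeats a condensed Cell 1 so the block runs on its own, and it uses `print` because only a notebook renders a bare expression. The variable names `first_rows`, `summary`, and `marked_up` are just for illustration.

```python
from io import StringIO

import pandas as pd

# Cell 1 (condensed): load the data once.
orders = pd.read_csv(StringIO("""order_id,customer_id,region,amount,order_date
1,100,NL,120,2024-01-02
2,101,BE,90,2024-01-03
3,102,NL,,2024-01-03
4,103,DE,200,2024-01-04
5,100,NL,50,2024-01-05"""))

# Cell 2: look at the first rows.
first_rows = orders.head(3)
print(first_rows)

# Cell 3: summary statistics for the numeric columns.
summary = orders.describe()
print(summary)

# Cell 4: try a transformation. This builds a new Series;
# the `orders` DataFrame itself is not modified.
marked_up = orders["amount"] * 1.2
print(marked_up)
```

Note that the missing amount for order 3 stays missing after the multiplication, which is the kind of detail this cell-by-cell inspection is meant to surface.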

<aside> 🤓 Curious Geek: Notebooks are JSON

If you open an .ipynb file in a plain text editor, you will see it is a large JSON file. It stores your code, your Markdown, and the output of every cell you ran: rendered tables as HTML and charts as base64-encoded images. This is why notebooks grow large and produce noisy diffs: always run "Clear All Outputs" before committing them to Git.

</aside>
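To make the JSON structure concrete, the sketch below builds a hand-written, heavily trimmed notebook as a Python dictionary and strips its outputs, roughly what "Clear All Outputs" does. The `clear_outputs` helper is hypothetical, and a real notebook carries more metadata than shown here.

```python
import json

# A minimal notebook in the nbformat 4 layout: one Markdown cell, one code cell.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Discovery"]},
        {
            "cell_type": "code",
            "execution_count": 1,
            "metadata": {},
            "source": ["orders.head()"],
            "outputs": [
                {
                    "output_type": "execute_result",
                    "execution_count": 1,
                    "metadata": {},
                    "data": {"text/plain": ["   order_id  customer_id ..."]},
                }
            ],
        },
    ],
}


def clear_outputs(nb: dict) -> dict:
    """Return a copy of the notebook with all code-cell outputs removed."""
    cleared = json.loads(json.dumps(nb))  # deep copy via a JSON round-trip
    for cell in cleared["cells"]:
        if cell["cell_type"] == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleared


size_before = len(json.dumps(notebook))
size_after = len(json.dumps(clear_outputs(notebook)))
print(size_before, size_after)
```

Even with a single short text output the cleared version is smaller; with embedded charts the difference is far larger.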

The state-persistence model becomes intuitive once you try it: tweak the expression in Cell 4 a few times and notice you never wait for data to reload.

<aside> 🚀 Try it in the widget: Explore Orders DataFrame

</aside>

Notebooks also pair well with AI-assisted documentation.

<aside> 💡 Using AI to help: Ask an LLM to write Markdown documentation for a cell you have just run. Copy the cell and its output, paste them into the chat, and ask: "Explain what this transformation does in professional documentation style." (⚠️ Make sure the output contains no PII or sensitive company data!)

</aside>

Test the state-persistence pattern: given a DataFrame already in memory, add a computed column without reloading data.

<aside> 🚀 Try it in the widget: Apply Column Markup

</aside>

Extra reading

Knowledge Check

Test your recall before moving on.

<aside> 🚀 Try it in the widget: Interactive Quiz: Jupyter Notebooks

</aside>

https://lasse.be/simple-hyf-teach-widget/mcq.html?bank=week_4_ch10_jupyter_notebooks_quiz&embed=1

If the notebook model felt unfamiliar, this walkthrough covers the same concepts from a different angle.

<aside> 🎬 Struggling with this concept? Watch this beginner-friendly video:

Jupyter Notebook Tutorial: Introduction, Setup, and Walkthrough

</aside>

https://www.youtube.com/watch?v=HW29067qVWk

The practice exercises use the same orders dataset you explored here. Open any exercise in the Codespace and try running it cell by cell.

<aside> ⌨️ Hands on: Practice with Exercise 1: Quick EDA on Orders.

</aside>


Next up: Practice, where you apply the Week 4 Pandas patterns in exercises.


The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0 *https://hackyourfuture.net/*

Built with ❤️ by the HackYourFuture community · Thank you, contributors