While production data pipelines belong in .py modules, the initial phase of any data project is usually exploration: you need to hit an API, load a CSV, and see what the data actually looks like before you can design the system.
Jupyter Notebooks (.ipynb) are the industry-standard tool for this discovery phase.
By the end of this chapter, you should be able to run an exploratory notebook in VS Code, use cell-by-cell execution to inspect a DataFrame without reloading data, and explain when a notebook is the right tool versus a .py file.
In a standard Python script, if you want to see the result of a transformation, you have to run the entire script. If your script takes 30 seconds to load data from an API, you wait 30 seconds every time you change a single line of cleaning logic.
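Notebooks solve this by breaking code into cells that you run one at a time. The kernel keeps the results of earlier cells in memory, so you can change and re-run a single cell without repeating the slow steps before it.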
<aside> ⚠️ Cell execution order matters. A notebook's kernel remembers everything in the order you ran cells, not the order they appear on screen. If you run Cell 5 before Cell 3, the variables from Cell 5 already exist when Cell 3 runs, so Cell 3 may behave differently than it would in a fresh top-to-bottom run. Always use Run All (or Restart Kernel and Run All) before treating a notebook's output as correct.
</aside>
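The hazard described above can be imitated in plain Python. This is only a sketch: the two functions are hypothetical stand-ins for notebook cells, and the tiny DataFrame is made up for illustration. It shows how re-running a cell that overwrites a column gives a different result than a fresh run.

```python
import pandas as pd


def load_orders() -> pd.DataFrame:
    # Stand-in for "Cell 1": a tiny made-up frame instead of a real load.
    return pd.DataFrame({"order_id": [1, 2], "amount": [100.0, 200.0]})


def double_amount_cell(df: pd.DataFrame) -> None:
    # Stand-in for a later cell that overwrites a column in place.
    df["amount"] = df["amount"] * 2


# Interactive session: the mutating cell is accidentally run twice.
orders = load_orders()
double_amount_cell(orders)
double_amount_cell(orders)
interactive_result = orders["amount"].tolist()

# Fresh top-to-bottom run: every cell runs exactly once.
orders = load_orders()
double_amount_cell(orders)
fresh_result = orders["amount"].tolist()

print("interactive:", interactive_result)  # [400.0, 800.0]
print("fresh run:  ", fresh_result)        # [200.0, 400.0]
```

The two results differ even though the code on screen is identical, which is why a restart followed by Run All is the safer check.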
<aside>
⚠️ The golden rule: Notebooks are for research. Scripts are for production. Use the notebook to meet your data, then graduate your successful code into your real .py modules.
</aside>
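As a rough illustration of what "graduating" code might look like, the sketch below wraps a cleaning step in a function that could live in a .py module. The function name `clean_orders`, the column names, and the sample rows are all assumptions made for this example, not part of the course code.

```python
import pandas as pd


def clean_orders(raw: pd.DataFrame) -> pd.DataFrame:
    """Drop rows with no amount and parse order_date as a datetime."""
    # Work on a copy so the caller's DataFrame is left untouched.
    cleaned = raw.dropna(subset=["amount"]).copy()
    cleaned["order_date"] = pd.to_datetime(cleaned["order_date"])
    return cleaned


# Made-up sample rows, only to show the function being called.
sample = pd.DataFrame(
    {
        "amount": [120.0, None],
        "order_date": ["2024-01-02", "2024-01-03"],
    }
)
cleaned = clean_orders(sample)
print(cleaned)
```

A function like this takes its input as a parameter instead of relying on whatever happens to be in the kernel's memory, which makes it easier to test and reuse.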
VS Code has built-in support for notebooks.
Create a file named discovery.ipynb, then select a kernel: either .venv or the global Python 3.12 installation.
In your discovery.ipynb, try the following cells. The data is embedded directly so there is no external file to set up.
Cell 1: Import pandas and create a small DataFrame.
from io import StringIO
import pandas as pd
_csv = StringIO("""order_id,customer_id,region,amount,order_date
1,100,NL,120,2024-01-02
2,101,BE,90,2024-01-03
3,102,NL,,2024-01-03
4,103,DE,200,2024-01-04
5,100,NL,50,2024-01-05""")
orders = pd.read_csv(_csv)
orders
Cell 2: Run orders.head(). Notice how the table renders in the notebook instead of printing to a terminal.
Cell 3: Run orders.info() or orders.describe(). Re-run this cell without re-running Cell 1.
Cell 4: Tweak a column (for example orders['amount'] * 1.2) and run only that cell. The DataFrame from Cell 1 is still in memory.
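Cells 3 and 4 might look roughly like the sketch below. To keep the example self-contained it rebuilds `orders` from the same embedded CSV; in the notebook, `orders` would already be in memory from Cell 1. The variable names `missing_per_column`, `amount_summary`, and `marked_up` are illustrative choices.

```python
from io import StringIO

import pandas as pd

csv_text = """order_id,customer_id,region,amount,order_date
1,100,NL,120,2024-01-02
2,101,BE,90,2024-01-03
3,102,NL,,2024-01-03
4,103,DE,200,2024-01-04
5,100,NL,50,2024-01-05"""
orders = pd.read_csv(StringIO(csv_text))

# Cell 3 style: inspect structure and summary statistics.
orders.info()
missing_per_column = orders.isna().sum()
amount_summary = orders["amount"].describe()
print(missing_per_column)
print(amount_summary)

# Cell 4 style: the expression returns a new Series;
# the orders DataFrame itself is not modified.
marked_up = orders["amount"] * 1.2
print(marked_up)
```

Because the expression in the last step does not assign back to `orders`, it can be tweaked and re-run any number of times without changing the loaded data.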
<aside> 🤓 Curious Geek: Notebooks are JSON
If you open an .ipynb file in a plain text editor, you will see it is a large JSON file. It stores your code, your Markdown, and the outputs of every cell you ran, including tables and base64-encoded images for charts. This is why notebooks grow large: always run "Clear All Outputs" before committing them to Git.
</aside>
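To make the JSON structure concrete, here is a minimal sketch using only the standard library. The notebook dictionary is hand-written and simplified (real files carry more metadata), and `clear_outputs` is a hypothetical helper that imitates what "Clear All Outputs" does, not the editor's actual implementation.

```python
import json

# A hand-written, simplified notebook: one Markdown cell and one code cell
# whose output has been saved alongside the code.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Orders EDA"]},
        {
            "cell_type": "code",
            "execution_count": 1,
            "metadata": {},
            "source": ["1 + 1"],
            "outputs": [
                {
                    "output_type": "execute_result",
                    "execution_count": 1,
                    "metadata": {},
                    "data": {"text/plain": ["2"]},
                }
            ],
        },
    ],
}


def clear_outputs(nb: dict) -> dict:
    """Return a copy of the notebook with code-cell outputs removed."""
    cleared = json.loads(json.dumps(nb))  # simple deep copy via JSON
    for cell in cleared["cells"]:
        if cell["cell_type"] == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleared


cleared_notebook = clear_outputs(notebook)
size_before = len(json.dumps(notebook))
size_after = len(json.dumps(cleared_notebook))
print(size_before, size_after)
```

The cleared copy is smaller because the saved output is gone while the code and Markdown are kept, which is the effect you want before a commit.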
The state-persistence model becomes intuitive once you try it: tweak the expression in Cell 4 a few times and notice you never wait for data to reload.
<aside> 🚀 Try it in the widget: Explore Orders DataFrame
</aside>
Notebooks also pair well with AI-assisted documentation.
<aside> 💡 Using AI to help: Ask an LLM to write Markdown documentation for a cell you have just run. Copy the cell and its output, then paste them in with a prompt such as: "Explain what this transformation does in professional documentation style." (⚠️ Ensure no PII or sensitive company data is included!)
</aside>
Test the state-persistence pattern: given a DataFrame already in memory, add a computed column without reloading data.
<aside> 🚀 Try it in the widget: Apply Column Markup
</aside>
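One possible way to add a computed column to a DataFrame that is already in memory is sketched below. The column name `amount_with_markup` and the 1.2 factor are illustrative assumptions, and the small frame stands in for the `orders` data a notebook would already hold.

```python
import pandas as pd

# Stand-in for a DataFrame that an earlier cell already loaded.
orders = pd.DataFrame(
    {
        "order_id": [1, 2, 4],
        "amount": [120.0, 90.0, 200.0],
    }
)

# Derive the new column from the untouched "amount" column.
# Re-running this cell gives the same result every time.
orders = orders.assign(amount_with_markup=orders["amount"] * 1.2)
first_run = orders["amount_with_markup"].tolist()

# Simulate running the same cell a second time.
orders = orders.assign(amount_with_markup=orders["amount"] * 1.2)
second_run = orders["amount_with_markup"].tolist()

print(orders)
```

Writing the result to a new column, instead of overwriting `amount`, keeps the cell safe to re-run while you experiment.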
.ipynb files and why notebooks became the default EDA tool across Python, R, and Julia.
.py file?
.py module when it is ready for production?
Test your recall before moving on.
<aside> 🚀 Try it in the widget: Interactive Quiz: Jupyter Notebooks
</aside>
https://lasse.be/simple-hyf-teach-widget/mcq.html?bank=week_4_ch10_jupyter_notebooks_quiz&embed=1
If the notebook model felt unfamiliar, this walkthrough covers the same concepts from a different angle.
<aside> 🎬 Struggling with this concept? Watch this beginner-friendly video:
Jupyter Notebook Tutorial: Introduction, Setup, and Walkthrough
</aside>
https://www.youtube.com/watch?v=HW29067qVWk
The practice exercises use the same orders dataset you explored here. Open any exercise in the Codespace and try running it cell by cell.
<aside> ⌨️ Hands on: Practice with Exercise 1: Quick EDA on Orders.
</aside>
Next up: Practice, where you apply the Week 4 Pandas patterns in exercises.
The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0 *https://hackyourfuture.net/*
