Jupyter Notebooks

While production data pipelines belong in .py modules, the initial phase of any data project is usually exploration: you need to hit an API, load a CSV, and see what the data actually looks like before you can design the system.

Jupyter Notebooks (.ipynb) are the industry-standard tool for this discovery phase.

By the end of this chapter, you should be able to run an exploratory notebook in VS Code, use cell-by-cell execution to inspect a DataFrame without reloading data, and explain when a notebook is the right tool versus a .py file.

Concepts

Why use a Notebook?

In a standard Python script, if you want to see the result of a transformation, you have to run the entire script. If your script takes 30 seconds to load data from an API, you wait 30 seconds every time you change a single line of cleaning logic.

Notebooks solve this by breaking code into cells:

State persistence: You can load data in Cell 1, and it stays in memory. Run Cell 2 (a transformation) hundreds of times without reloading the data.
Rich output: Instead of the plain text output of a terminal, DataFrames are rendered as formatted HTML tables.
Documentation: Write Markdown cells alongside your code to explain your findings or document why a certain cleaning step was necessary.
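The sketch below illustrates the state-persistence idea with plain Python. The `slow_load` function is a hypothetical stand-in for a slow API call (it sleeps for a fraction of a second instead of 30 seconds), and the comments mark where the cell boundaries would be in a notebook.

```python
import time


def slow_load():
    """Hypothetical stand-in for a slow API call or large file read."""
    time.sleep(0.2)  # pretend this is the 30-second load
    return [120, 90, 200, 50]


# Cell 1: run once. The result stays in the kernel's memory.
start = time.perf_counter()
amounts = slow_load()
load_seconds = time.perf_counter() - start

# Cell 2: re-run as often as you like. It reuses `amounts`
# and never calls slow_load() again.
start = time.perf_counter()
marked_up = [round(a * 1.2, 2) for a in amounts]
transform_seconds = time.perf_counter() - start

print(f"load: {load_seconds:.2f}s, transform: {transform_seconds:.4f}s")
print(marked_up)
```

In a script, both parts run every time. In a notebook, only the second part is repeated while you iterate on the transformation.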

<aside> ⚠️ Cell execution order matters. A notebook's kernel remembers everything in the order you ran cells, not the order they appear on screen. If you run Cell 5 before Cell 3, the variables from Cell 5 already exist when Cell 3 runs, so the notebook may behave differently than in a fresh top-to-bottom run. Always use Run All (or, to also discard leftover variables, Restart Kernel and Run All) before treating a notebook's output as correct.

</aside>
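The effect of execution order can be simulated in a few lines. In this sketch each string stands in for one notebook cell and a dictionary plays the role of the kernel's memory. It is an illustration of the behavior only, not how Jupyter is implemented, and the cell contents are made up for the example.

```python
# Each string stands in for one notebook cell.
cells = {
    1: "price = 100",
    2: "price = price * 2",
    3: "report = f'price is {price}'",
}


def run(order):
    """Run the cells in the given order against a fresh namespace."""
    kernel = {}  # a fresh namespace is like "Restart Kernel"
    for number in order:
        exec(cells[number], kernel)
    return kernel["report"]


top_to_bottom = run([1, 2, 3])
cell_2_twice = run([1, 2, 2, 3])  # re-running a cell that changes state
cell_2_skipped = run([1, 3])      # forgetting to run a cell

print(top_to_bottom)
print(cell_2_twice)
print(cell_2_skipped)
```

The three runs produce three different reports from the same cells, which is why a restart followed by a full top-to-bottom run is the only result worth trusting.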

<aside> ⚠️ The golden rule: Notebooks are for research. Scripts are for production. Use the notebook to meet your data, then graduate your successful code into your real .py modules.

</aside>
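As a rough sketch of what graduating code might look like, a cleaning step that worked in a notebook cell can be wrapped in a function inside a .py module. The function name `fill_missing_amounts`, the `amount` column, and the choice of filling with a default value are all assumptions made for illustration; your own cleaning rule will differ.

```python
import pandas as pd


def fill_missing_amounts(orders: pd.DataFrame, default: float = 0.0) -> pd.DataFrame:
    """Return a copy of `orders` with missing `amount` values replaced by `default`."""
    cleaned = orders.copy()  # leave the caller's DataFrame untouched
    cleaned["amount"] = cleaned["amount"].fillna(default)
    return cleaned


if __name__ == "__main__":
    # Tiny made-up sample so the module can be run directly.
    sample = pd.DataFrame({"order_id": [1, 2], "amount": [120.0, None]})
    print(fill_missing_amounts(sample))
```

Unlike a cell, a function like this takes its input explicitly and does not depend on what happened to be in memory, so it can be imported, tested, and reused.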

Setting up in VS Code

VS Code supports notebooks through the Jupyter extension.

Create a new file named discovery.ipynb.
VS Code will detect the .ipynb file extension and prompt you to install the Jupyter extension if it is missing.
Select a kernel: the Python environment that will run your code. Choose your .venv or the global Python 3.12 installation.
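If you are unsure which environment the notebook ended up using, a quick check like the one below can be run in any cell. It relies only on the standard library.

```python
import sys

# Path of the Python interpreter backing this kernel.
# With a virtual environment selected, it should point into your .venv folder.
print(sys.executable)

# Interpreter version as (major, minor, micro), to confirm the release you expect.
print(sys.version_info[:3])
```

If the path does not match the environment you intended, pick a different kernel and run the cell again.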

⌨️ Hands on: Visual Exploration

In your discovery.ipynb, try the following cells. The data is embedded directly so there is no external file to set up.

Cell 1: Import pandas and create a small DataFrame.

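from io import StringIO
import pandas as pd

# Inline CSV so the notebook needs no external file.
# Order 3 has no amount, so there is a missing value to explore.
_csv = StringIO("""order_id,customer_id,region,amount,order_date
1,100,NL,120,2024-01-02
2,101,BE,90,2024-01-03
3,102,NL,,2024-01-03
4,103,DE,200,2024-01-04
5,100,NL,50,2024-01-05""")
orders = pd.read_csv(_csv)
orders  # the last expression in a cell is displayed as its output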

Cell 2: Run orders.head(). Notice how the table renders in the notebook instead of printing to a terminal.

Cell 3: Run orders.info() or orders.describe(). Re-run this cell without re-running Cell 1.
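One possible follow-up to the inspection step is to count missing values per column. The sketch below rebuilds the same five-row DataFrame so it runs on its own. In the notebook you would skip the loading lines, because `orders` is already in memory.

```python
from io import StringIO
import pandas as pd

# Same data as Cell 1, repeated here only to keep the example self-contained.
orders = pd.read_csv(StringIO("""order_id,customer_id,region,amount,order_date
1,100,NL,120,2024-01-02
2,101,BE,90,2024-01-03
3,102,NL,,2024-01-03
4,103,DE,200,2024-01-04
5,100,NL,50,2024-01-05"""))

# Number of missing values in each column.
missing_per_column = orders.isna().sum()
print(missing_per_column)

# Summary statistics for one column; missing values are excluded from the count.
print(orders["amount"].describe())
```

The missing amount in order 3 is also why pandas reads the `amount` column as floating point rather than integer.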

Cell 4: Try an expression on a column (for example orders['amount'] * 1.2) and run only that cell. The DataFrame from Cell 1 is still in memory.
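It is worth noting that an expression on its own only displays a result. To keep it, assign it to a column. The sketch below shows the difference; the column name `amount_with_markup` is a hypothetical choice, and the loading lines are repeated only so the example runs on its own.

```python
from io import StringIO
import pandas as pd

orders = pd.read_csv(StringIO("""order_id,customer_id,region,amount,order_date
1,100,NL,120,2024-01-02
2,101,BE,90,2024-01-03
3,102,NL,,2024-01-03
4,103,DE,200,2024-01-04
5,100,NL,50,2024-01-05"""))

# The expression returns a new Series; `orders` itself is unchanged.
marked_up = orders["amount"] * 1.2
columns_before = list(orders.columns)

# Assigning the Series stores it as a new column on the DataFrame.
orders["amount_with_markup"] = marked_up
print(orders[["order_id", "amount", "amount_with_markup"]])
```

Because the new column is computed from `amount` rather than from itself, re-running the cell gives the same result every time.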

<aside> 🤓 Curious Geek: Notebooks are JSON

If you open an .ipynb file in a plain text editor, you will see it is a large JSON file. It stores your code, your Markdown, and the cell outputs themselves, such as base64-encoded images for charts and HTML for tables. This is why notebooks grow large, so always run "Clear All Outputs" before committing them to Git.

</aside>
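To make the size effect concrete, the sketch below builds a small hand-written notebook as a Python dictionary in the nbformat 4 layout, measures its JSON size, then empties the outputs. The notebook content is invented for the example, and the loop is only an approximation of what "Clear All Outputs" does.

```python
import json

# A minimal, hand-written notebook structure (nbformat 4 layout).
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Discovery"]},
        {
            "cell_type": "code",
            "execution_count": 1,
            "metadata": {},
            "source": ["orders.head()"],
            "outputs": [
                {
                    "output_type": "execute_result",
                    "execution_count": 1,
                    "metadata": {},
                    # A rendered table is stored as HTML inside the file.
                    "data": {
                        "text/html": ["<table>" + "<tr><td>row</td></tr>" * 500 + "</table>"]
                    },
                }
            ],
        },
    ],
}

size_with_outputs = len(json.dumps(notebook))

# Roughly what clearing outputs does: empty each code cell's outputs
# and reset its execution counter.
for cell in notebook["cells"]:
    if cell["cell_type"] == "code":
        cell["outputs"] = []
        cell["execution_count"] = None

size_cleared = len(json.dumps(notebook))
print(size_with_outputs, size_cleared)
```

The code and Markdown survive untouched; only the stored output, which accounts for most of the size here, is removed.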

The state-persistence model becomes intuitive once you try it: tweak the expression in Cell 4 a few times and notice you never wait for data to reload.

<aside> 🚀 Try it in the widget: Explore Orders DataFrame

</aside>

Notebooks also pair well with AI-assisted documentation.

<aside> 💡 Using AI to help: Ask an LLM to write Markdown documentation for a cell you have just run. Copy the cell and its output, paste them into the chat, and add a prompt such as: "Explain what this transformation does in professional documentation style." (⚠️ Ensure no PII or sensitive company data is included!)

</aside>

Test the state-persistence pattern: given a DataFrame already in memory, add a computed column without reloading data.

<aside> 🚀 Try it in the widget: Apply Column Markup

</aside>

Extra reading

Working with Jupyter Notebooks in Visual Studio Code: how to run, debug, and export notebooks in VS Code, including the variable explorer and data viewer.
Jupyter.org: The project mission: the project homepage explaining the open standards behind .ipynb files and why notebooks became the default EDA tool across Python, R, and Julia.

Knowledge Check

1. When should you use a notebook instead of a .py file?
2. What does it mean that a notebook "persists state"?
3. Why should you graduate code from a notebook to a .py module when it is ready for production?
4. What happens to the file size of a notebook if you leave a large chart or table in the output?