Week 3 - Ingesting and Validating Data
Most data in the real world lives behind APIs. Stock prices, weather forecasts, social media posts, payment transactions: if you want to ingest it, you need to make HTTP requests.
This chapter teaches you how to fetch data from REST APIs in Python, handle authentication, deal with pagination, and respect rate limits. By the end, you will fetch real weather data from the Open-Meteo API.
By the end of this chapter, you should be able to:
- Make a GET request with requests, including a timeout and raise_for_status().
- Handle a 429 Too Many Requests response by honouring the Retry-After header and pacing future calls.

When your browser loads a webpage, it sends an HTTP request to a server and gets back a response. API calls work the same way, except you get structured data (usually JSON) instead of HTML. Most public data APIs you ingest from this week are REST APIs.
Recall the HTTP fundamentals from the Core program. Here is a quick reference focused on data ingestion:
| Concept | What It Means | Example |
|---|---|---|
| Method | What you want to do | GET (read data), POST (send data) |
| URL | Where the data lives | https://api.open-meteo.com/v1/forecast |
| Headers | Metadata about the request | Authorization: Bearer sk-abc123 |
| Query parameters | Filters for the data | ?latitude=55.67&longitude=12.56 |
| HTTP status code | Did it work? | 200 (yes), 404 (not found), 500 (server error) |
| Response body | The actual data | {"temperature": 18.5, ...} |
For data ingestion, you will use GET requests 95% of the time. You are reading data, not creating it.
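To make the table concrete, here is a minimal sketch that assembles the example URL and query parameters from the table and then splits the result back into its parts. It uses only the standard library and makes no network call; the variable names are illustrative.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Base URL and query parameters taken from the reference table above.
base_url = "https://api.open-meteo.com/v1/forecast"
query = {"latitude": 55.67, "longitude": 12.56}

# urlencode turns the dict into "latitude=55.67&longitude=12.56".
full_url = f"{base_url}?{urlencode(query)}"

# Splitting the URL again shows which part is the host, the path, and the query.
parts = urlsplit(full_url)
decoded_query = parse_qs(parts.query)

print(full_url)
print(parts.netloc, parts.path)  # api.open-meteo.com /v1/forecast
print(decoded_query)             # {'latitude': ['55.67'], 'longitude': ['12.56']}
```

Note that query values always travel as text: the numbers you pass in come back as strings on the other side.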
The requests Library

Python's requests library makes HTTP calls simple.
```python
import requests

response = requests.get(
    "https://api.open-meteo.com/v1/forecast",
    params={
        "latitude": 55.67,
        "longitude": 12.56,
        "hourly": "temperature_2m",
    },
    timeout=10,
)
response.raise_for_status()  # Raises exception for 4xx/5xx status codes
data = response.json()  # Parse JSON response into a Python dict
```
Key things to notice:
- params builds the query string for you (no manual ?latitude=55.67&longitude=12.56)
- timeout=10 (a timeout of 10 seconds) prevents your script from hanging forever if the server does not respond
- raise_for_status() turns HTTP errors into Python exceptions you can catch

<aside>
💡 Always set a timeout. Without it, requests.get() will wait forever if the server never responds. Your pipeline will hang silently instead of failing cleanly.
</aside>
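One way to catch the exceptions mentioned above is sketched below. The function name get_json_or_none and the get parameter are assumptions made for illustration: get defaults to requests.get and exists only so the function can be tried without a network connection. The exception classes are the ones requests provides.

```python
import requests


def get_json_or_none(url, params=None, get=requests.get):
    """Return the parsed JSON body, or None if the request fails.

    `get` is a hypothetical injection point; in normal use leave the default.
    """
    try:
        response = get(url, params=params, timeout=10)
        response.raise_for_status()
        return response.json()
    except requests.exceptions.Timeout:
        # The server did not answer within the timeout.
        print(f"Timed out after 10s: {url}")
    except requests.exceptions.HTTPError as error:
        # raise_for_status() saw a 4xx or 5xx status code.
        print(f"Server returned an error status: {error}")
    except requests.exceptions.RequestException as error:
        # Base class for the other errors requests raises (DNS, connection, ...).
        print(f"Request failed: {error}")
    return None
```

Returning None keeps the sketch short. In a real pipeline you may prefer to let the exception propagate so the failure is visible to whatever schedules the job.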
If you used fetch() in JavaScript during the Core program, the Python requests library works similarly:
<aside>
📘 Core Program Refresher: requests.get() is Python's equivalent of fetch() in JavaScript. Both return a response object you need to parse, and both have a .json() method to do it. The main difference: Python's requests is synchronous by default, while JavaScript's fetch() is async and returns a Promise.
</aside>
Many APIs require authentication. You already learned about public vs private APIs and auth tokens in the Core program, where you used Bearer tokens with fetch(). Here are the three most common patterns in Python:
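```python
import os
import requests

# Pattern 1: API key sent as a query parameter
response = requests.get(
    "https://api.example.com/data",
    params={"api_key": os.environ["API_KEY"]},
    timeout=10,
)
```

```python
import os
import requests

# Pattern 2: API key sent as a Bearer token in the Authorization header
response = requests.get(
    "https://api.example.com/data",
    headers={"Authorization": f"Bearer {os.environ['API_KEY']}"},
    timeout=10,
)
```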
The third pattern is no authentication at all: some APIs are free and open. The Open-Meteo API used this week requires no key.
You will use the no-auth path for Open-Meteo this week; keep the two key-based patterns above as reference for when you ingest from an authenticated API later in your career.
Regardless of the method, never hardcode API keys in your source code. Use the .env + python-dotenv pattern from Configuration and Secrets to manage credentials securely.
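As a rough sketch of that advice, the helper below reads the key from the environment and fails with a clear message when it is missing. The function name build_auth_headers and the variable name API_KEY are assumptions for illustration; the env parameter exists only so the helper can be tried with a plain dict.

```python
import os


def build_auth_headers(env=os.environ, key_name="API_KEY"):
    """Build a Bearer Authorization header from an environment variable.

    With python-dotenv, calling load_dotenv() once at startup would copy
    the values from your .env file into os.environ before this runs.
    """
    api_key = env.get(key_name)
    if not api_key:
        # Fail early and clearly instead of sending a request without a key.
        raise RuntimeError(f"{key_name} is not set; add it to your environment or .env file")
    return {"Authorization": f"Bearer {api_key}"}


# Example with an explicit dict instead of the real environment:
print(build_auth_headers({"API_KEY": "example-key"}))
# {'Authorization': 'Bearer example-key'}
```

Failing at startup when the key is missing is usually easier to debug than a 401 response halfway through a run.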
<aside>
⌨️ Hands on: Fetch the current weather for your city using the Open-Meteo API. Use requests.get() with the URL https://api.open-meteo.com/v1/forecast, passing your city's latitude, longitude, and "hourly": "temperature_2m" as parameters. Print the first 5 temperature values.
</aside>
<aside> 🚀 Try it in the widget: https://lasse.be/simple-hyf-teach-widget/?week=3&chapter=ingesting_apis&exercise=w3_ingesting_apis__fetch_weather&lang=python
</aside>
APIs rarely return all their data at once. When a dataset has thousands of records, the API splits it into pages. You need to loop through all the pages to get the complete dataset. This technique is called pagination.
Page-based (offset) pagination is the most common pattern: you specify which "page" of results you want.
```python
import requests


def fetch_all_records(base_url: str) -> list[dict]:
    """Fetch all records from a paginated API using offset pagination."""
    all_records = []
    page = 1
    page_size = 100

    while True:
        response = requests.get(
            base_url,
            params={"page": page, "per_page": page_size},
            timeout=10,
        )
        response.raise_for_status()
        data = response.json()

        records = data["results"]
        all_records.extend(records)

        # Stop when the page has fewer records than requested (last page)
        if len(records) < page_size:
            break
        page += 1

    return all_records
```
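To watch the stop condition work without calling a real API, here is a sketch of the same loop with the HTTP call swapped for a callable. The names fetch_all_pages, get_page, and the fake data source are all hypothetical.

```python
def fetch_all_pages(get_page, page_size=100):
    """Collect every record from a page-numbered source.

    `get_page(page, per_page)` is a hypothetical callable that returns
    the list of records for one page.
    """
    all_records = []
    page = 1
    while True:
        records = get_page(page, page_size)
        all_records.extend(records)
        # A short (or empty) page means there is nothing left to fetch.
        if len(records) < page_size:
            break
        page += 1
    return all_records


# Fake data source standing in for an API with 250 records.
FAKE_DATA = [{"id": i} for i in range(250)]
calls = []


def fake_get_page(page, per_page):
    calls.append(page)
    start = (page - 1) * per_page
    return FAKE_DATA[start:start + per_page]


records = fetch_all_pages(fake_get_page, page_size=100)
print(len(records), calls)  # 250 [1, 2, 3]
```

If the total is an exact multiple of the page size, the loop makes one extra request that returns an empty page before it stops. That is harmless, but worth knowing when you count requests against a rate limit.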
Some APIs return a next_cursor token instead of page numbers. You pass the cursor to get the next batch.
```python
import requests


def fetch_all_with_cursor(base_url: str) -> list[dict]:
    """Fetch all records using cursor-based pagination."""
    all_records = []
    cursor = None

    while True:
        params = {"limit": 100}
        if cursor:
            params["cursor"] = cursor

        response = requests.get(base_url, params=params, timeout=10)
        response.raise_for_status()
        data = response.json()

        all_records.extend(data["results"])

        cursor = data.get("next_cursor")
        if not cursor:
            break

    return all_records
```
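A variation on the cursor loop is to yield records as they arrive instead of building one big list, which helps when the dataset is large. The sketch below assumes the same response shape as above (results and next_cursor); the names iter_cursor_records and fetch_page, and the fake pages, are hypothetical.

```python
from typing import Callable, Iterator, Optional


def iter_cursor_records(fetch_page: Callable[[Optional[str]], dict]) -> Iterator[dict]:
    """Yield records one by one, following next_cursor until it is empty.

    `fetch_page(cursor)` is a hypothetical callable returning one decoded
    page: {"results": [...], "next_cursor": "..." or None}.
    """
    cursor = None
    while True:
        data = fetch_page(cursor)
        yield from data["results"]
        cursor = data.get("next_cursor")
        if not cursor:
            break


# Fake pages standing in for API responses, keyed by the cursor that requests them.
FAKE_PAGES = {
    None: {"results": [{"id": 1}, {"id": 2}], "next_cursor": "abc"},
    "abc": {"results": [{"id": 3}], "next_cursor": "def"},
    "def": {"results": [{"id": 4}], "next_cursor": None},
}


def fake_fetch_page(cursor):
    return FAKE_PAGES[cursor]


ids = [record["id"] for record in iter_cursor_records(fake_fetch_page)]
print(ids)  # [1, 2, 3, 4]
```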
<aside>
💡 How do you know which pagination style an API uses? Read the API documentation. Look for parameters like page, offset, cursor, or next in the response.
</aside>
If you want to see these patterns running against real services, three short demo scripts mirror the loop shape above against three public APIs:
<aside> 📦 Pagination demos (≤40 lines each, no auth needed):
- skip += limit with total from the body.
- next URL field.
- response.links["next"] (auto-parsed by requests).

Each one prints a per-page summary so you can see the loop advance.
</aside>
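The third style in the list above, following response.links["next"], can be sketched as below. requests parses the Link response header into response.links, a dict keyed by rel. The function name, the get parameter, and the assumption that each page body is a JSON list are illustrative and not taken from any of the demo scripts.

```python
def fetch_all_via_link_header(start_url, get):
    """Follow rel="next" links until the server stops sending one.

    `get` is expected to behave like requests.get. Assumption: each
    page body is a JSON list of records.
    """
    all_records = []
    url = start_url
    while url:
        response = get(url, timeout=10)
        response.raise_for_status()
        all_records.extend(response.json())
        # No "next" entry means this was the last page, so url becomes None.
        url = response.links.get("next", {}).get("url")
    return all_records
```

In real use you would pass requests.get (or a Session's get method) as the second argument. Because the server supplies the full next URL, you do not build page or cursor parameters yourself.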
APIs have rate limits: a maximum number of requests you can make per time period. If you exceed the limit, the API returns a 429 Too Many Requests response.
```python
import time
import requests


def fetch_with_rate_limit(urls: list[str], delay: float = 0.5) -> list[dict]:
    """Fetch multiple URLs with a delay between requests."""
    results = []

    for url in urls:
        response = requests.get(url, timeout=10)

        if response.status_code == 429:
            # Server says "slow down" - wait and retry
            retry_after = int(response.headers.get("Retry-After", 5))
            print(f"Rate limited, waiting {retry_after}s...")
            time.sleep(retry_after)
            response = requests.get(url, timeout=10)

        response.raise_for_status()
        results.append(response.json())

        # Be polite: wait between requests even if not rate-limited
        time.sleep(delay)

    return results
```
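One detail the function above glosses over: the Retry-After header can hold either a number of seconds or an HTTP date, and int() raises ValueError on the date form. A more tolerant parser might look like the sketch below; the name parse_retry_after and the 5-second default are assumptions carried over from the example.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def parse_retry_after(value, now=None, default=5.0):
    """Return how many seconds to wait, given a Retry-After header value."""
    if value is None:
        return default
    value = value.strip()
    if value.isdigit():
        # Form 1: delay in seconds, e.g. "120"
        return float(value)
    try:
        # Form 2: HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT"
        retry_at = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return default
    if retry_at.tzinfo is None:
        retry_at = retry_at.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # Never return a negative wait if the date is already in the past.
    return max(0.0, (retry_at - now).total_seconds())


print(parse_retry_after("120"))  # 120.0
print(parse_retry_after(None))   # 5.0
```

You could then call time.sleep(parse_retry_after(response.headers.get("Retry-After"))) in place of the int(...) conversion.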
Key rules:
- Honour the Retry-After header when you get a 429.

Here is a complete function that fetches weather data using everything from this chapter. The retry-with-backoff loop (2 ** attempt) is explained in the next chapter; for now, read it as "try, sleep longer, try again":
```python
import os
import time
import logging
import requests

logger = logging.getLogger(__name__)


def fetch_weather(
    latitude: float,
    longitude: float,
    days: int = 7,
    max_retries: int = 3,
) -> list[dict]:
    """Fetch hourly weather data from Open-Meteo API."""
    url = "https://api.open-meteo.com/v1/forecast"
    params = {
        "latitude": latitude,
        "longitude": longitude,
        "hourly": "temperature_2m,relative_humidity_2m",
        "forecast_days": days,
    }

    for attempt in range(max_retries):
        try:
            response = requests.get(url, params=params, timeout=10)
            response.raise_for_status()
            data = response.json()

            # Transform API response into list of records
            hourly = data["hourly"]
            records = []
            for i, timestamp in enumerate(hourly["time"]):
                records.append({
                    "timestamp": timestamp,
                    "temperature_c": hourly["temperature_2m"][i],
                    "humidity_pct": hourly["relative_humidity_2m"][i],
                    "latitude": latitude,
                    "longitude": longitude,
                })

            logger.info(f"Fetched {len(records)} weather records")
            return records

        except requests.exceptions.RequestException as e:
            wait_time = 2 ** attempt
            logger.warning(f"Attempt {attempt + 1} failed: {e}, retrying in {wait_time}s")
            time.sleep(wait_time)

    logger.error("Failed to fetch weather data after all retries")
    return []
```
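This function combines:
- a GET request with params and a timeout
- raise_for_status() to turn HTTP errors into exceptions
- a transformation of the JSON response into a flat list of records
- retries with exponential backoff, with a log message for each failure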
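To see what 2 ** attempt does to the waiting time, the sketch below lists the delays for a given number of retries. The helper name backoff_delays and the cap parameter are additions for illustration; fetch_weather has no cap.

```python
def backoff_delays(max_retries: int, base: float = 1.0, cap: float = 30.0) -> list[float]:
    """Return the wait (in seconds) before each retry: base * 2 ** attempt, capped."""
    return [min(cap, base * 2 ** attempt) for attempt in range(max_retries)]


print(backoff_delays(3))  # [1.0, 2.0, 4.0]
print(backoff_delays(7))  # [1.0, 2.0, 4.0, 8.0, 16.0, 30.0, 30.0]
```

With max_retries=3, as in fetch_weather, the waits are 1, 2, and 4 seconds. The cap only matters for longer retry sequences, where doubling would otherwise grow the wait without limit.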
<aside> 🤓 Curious Geek: REST vs GraphQL
REST APIs have separate URLs for each resource (/users, /products). You might need multiple requests to get related data. GraphQL APIs have a single endpoint where you specify exactly what data you want in the request body. GraphQL is popular with frontend developers, but most data engineering still uses REST APIs.
</aside>
The API patterns you just learned are standard across production ingestion systems.
<aside>
💡 In the wild: dlt (data load tool) is an open-source Python library used by data teams at hundreds of companies. Under the hood it wraps requests with the same timeout, retry, and pagination patterns you just built by hand. Browsing its REST API source connector shows how these manual patterns scale to a full production ingestion framework.
</aside>
The fastest way to lock in these patterns is to use them on a small problem.
- Why should you always set the timeout parameter when making HTTP requests?
- How should you respond to a 429 Too Many Requests response?
- Why is the params={...} argument to requests.get() better than building the URL string yourself?

<aside> 🚀 Try it in the widget: Interactive Quiz: Ingesting from APIs
</aside>
https://lasse.be/simple-hyf-teach-widget/mcq.html?bank=week_3_ch2_ingesting_apis_quiz&embed=1
Prefer a video walkthrough? Here is a beginner-friendly tour of requests:
<aside> 🎬 Struggling with this concept? Watch this beginner-friendly video:
</aside>