# Week 3 - Ingesting and Validating Data

## Introduction to Data Ingestion
In Week 1, you learned to read tracebacks and use try/except to catch errors. In Week 2, you learned that except: pass is dangerous. But "don't silence errors" is only the beginning. When your pipeline talks to external systems, using the requests library you learned in Chapter 2, errors are not bugs to fix: they are events to handle.
An API returns a 500 error. Do you crash? Retry? Skip? The answer depends on the type of error, and getting this wrong means either losing data or hammering a server that is already struggling.
This chapter teaches you how to handle errors like a production system, not a homework script.
Not all errors are equal. The most important distinction in production error handling is between transient and permanent errors:
Transient errors are temporary. The server is busy, the network hiccuped, or the database is restarting. If you try again in a few seconds, it will probably work.
Permanent errors will never succeed no matter how many times you retry. The API key is wrong, the URL does not exist, or the data itself is invalid.
| Type | Examples | What To Do |
|---|---|---|
| Transient | 500 Server Error, 429 Too Many Requests, ConnectionTimeout | Retry with backoff |
| Permanent | 401 Unauthorized, 404 Not Found, ValidationError | Log and skip |
```python
# BAD: treats all errors the same ❌
try:
    response = requests.get(url)
    data = response.json()
except Exception:
    print("Something went wrong")
```
```python
# GOOD: distinguishes transient from permanent ✅
import requests

try:
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    data = response.json()
except requests.exceptions.ConnectionError:
    # Transient: network issue, retry later
    print("Connection failed, will retry")
except requests.exceptions.HTTPError as e:
    if e.response.status_code == 429:
        # Transient: rate limited, wait and retry
        print("Rate limited, backing off")
    elif e.response.status_code == 401:
        # Permanent: bad credentials, stop trying
        print("Authentication failed, check API key")
        raise
```
<aside> 💡 A simple rule: if the same request could succeed later without any code changes, it is transient. If it will always fail, it is permanent.
</aside>
The try/except pattern you use here is the Python version of what you already know from JavaScript:
<aside>
📘 Core Program Refresher: Python's try/except works just like JavaScript's try/catch. The keywords change (except instead of catch, raise instead of throw), but the pattern is the same: attempt an operation, catch specific error types, and decide what to do.
</aside>
In Week 1 (Chapter 3), you wrote a basic retry loop with while and time.sleep(1). That was a fixed-delay retry. When a transient error occurs, you need something smarter.
<aside> 🎬 Animation: Exponential Backoff - Wait Longer Between Retries
</aside>
Fixed-delay retry sends a new request every second, no matter what. If the server is overloaded, you are making things worse by hammering it at a constant rate.
Exponential backoff waits longer between each retry: 1 second, then 2, then 4, then 8. This gives the server time to recover.
```python
import time
import requests

def fetch_with_retry(url: str, max_retries: int = 3) -> dict:
    """Fetch a URL with exponential backoff on failure."""
    for attempt in range(max_retries):
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            return response.json()
        except (requests.exceptions.ConnectionError, requests.exceptions.Timeout):
            if attempt == max_retries - 1:
                raise  # Give up after final attempt
            wait_time = 2 ** attempt  # 1, 2, 4 seconds
            print(f"Attempt {attempt + 1} failed, retrying in {wait_time}s...")
            time.sleep(wait_time)
```
The pattern is simple:

1. Try the request.
2. On a transient error, wait 2^attempt seconds before trying again.
3. After max_retries attempts, give up and raise the error.

<aside> ⚠️ Never retry permanent errors. Retrying a 401 Unauthorized 100 times will not make your API key correct. It will get your IP banned.
</aside>
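One refinement you will often see in production (not required for this chapter): adding random jitter to each wait, so that many clients hitting the same struggling server do not all retry at the same moment. A sketch, with backoff_delays as an illustrative helper name:

```python
import random

def backoff_delays(max_retries: int = 4, base: float = 1.0) -> list[float]:
    """Exponential backoff schedule with up to 1s of random jitter per wait."""
    return [base * (2 ** attempt) + random.uniform(0, 1)
            for attempt in range(max_retries)]

# Without jitter the schedule is 1, 2, 4, 8 seconds; with jitter each wait
# is nudged forward by up to a second, spreading clients' retries apart.
```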
Now it's time to practice implementing this retry logic yourself.
<aside>
⌨️ Hands on: Modify the fetch_with_retry function to also handle HTTP 429 (Too Many Requests) and HTTP 503 (Service Unavailable) as retryable errors. Hint: check response.status_code before calling raise_for_status().
</aside>
You can test your implementation directly in the browser:
<aside> 🚀 Try it in the widget: https://lasse.be/simple-hyf-teach-widget/?week=3&chapter=error_handling&exercise=w3_error_handling__retry_function&lang=python
</aside>
In Week 2, your pipeline crashed on the first bad row. You saw a preview of error accumulation in Week 2's PipelineRunner (Chapter 4), which appended failures to an errors list instead of crashing. That made sense for local data: fix the file and re-run.
With external data, you cannot fix the source. Instead, you accumulate errors: process everything you can, collect the failures, and report them at the end.
```python
# BAD: crashes on first bad record ❌
for record in raw_data:
    validated = WeatherReading(**record)  # Crashes if invalid
    save(validated)
```

```python
# GOOD: accumulates errors, processes everything ✅
valid_records = []
error_records = []

for i, record in enumerate(raw_data):
    try:
        validated = WeatherReading(**record)
        valid_records.append(validated)
    except (ValueError, TypeError) as e:
        error_records.append({
            "index": i,
            "record": record,
            "error": str(e),
        })

print(f"Processed: {len(valid_records)} valid, {len(error_records)} errors")
```
This pattern is critical. In production, you might ingest 10,000 weather readings. If 3 are bad, you do not want to lose the other 9,997.
<aside> 💡 When debugging production pipelines, your error logs are your best friend. Include the error type, the failing record, and a timestamp in every log entry. Future-you will thank present-you.
</aside>
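As a sketch of that advice, each entry in the error list can carry exactly those fields (the field names here are illustrative, not a required schema):

```python
from datetime import datetime, timezone

def make_error_entry(index: int, record: dict, error: Exception) -> dict:
    """Build a structured error-log entry: error type, failing record, timestamp."""
    return {
        "index": index,
        "record": record,
        "error_type": type(error).__name__,   # e.g. "ValueError"
        "message": str(error),
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }

entry = make_error_entry(7, {"temperature": "oops"}, ValueError("bad temp"))
print(entry["error_type"])   # ValueError
```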
Here is how the patterns combine into a real ingestion function:
This function uses the logging module you learned in Week 1 (Chapter 8) instead of print(). If you need a refresher on log levels and logging.getLogger(__name__), revisit that chapter.
```python
import time
import logging
import requests

logger = logging.getLogger(__name__)

RETRYABLE_STATUS_CODES = {429, 500, 502, 503, 504}

def fetch_weather_data(url: str, max_retries: int = 3) -> dict | None:
    """Fetch weather data with retry logic and error classification."""
    for attempt in range(max_retries):
        try:
            response = requests.get(url, timeout=10)
            if response.status_code in RETRYABLE_STATUS_CODES:
                wait_time = 2 ** attempt
                logger.warning(
                    f"Got {response.status_code}, retry {attempt + 1}/{max_retries} "
                    f"in {wait_time}s"
                )
                time.sleep(wait_time)
                continue
            response.raise_for_status()
            return response.json()
        except requests.exceptions.ConnectionError:
            if attempt == max_retries - 1:
                logger.error(f"Connection failed after {max_retries} attempts: {url}")
                return None
            wait_time = 2 ** attempt
            logger.warning(f"Connection error, retrying in {wait_time}s...")
            time.sleep(wait_time)
        except requests.exceptions.HTTPError as e:
            # Permanent error, do not retry
            logger.error(f"Permanent error {e.response.status_code}: {url}")
            return None
    logger.error(f"All {max_retries} retries exhausted: {url}")
    return None
```
Notice:

- Retryable status codes get exponential backoff before the next attempt.
- Permanent errors return None immediately (no retry).
- Exhausted retries return None instead of crashing, so the caller can continue.

<aside>
💡 In the wild: The retry patterns in this chapter are built into production HTTP libraries. Check out urllib3 (which powers requests under the hood) and httpx: both implement retry logic internally.
</aside>
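For instance, a requests Session can be configured to retry transient failures automatically via urllib3's Retry class. A sketch, assuming requests (and its bundled urllib3) is installed; the URL is a placeholder:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Retry transient statuses with exponentially growing delays,
# handled inside the library instead of in your own loop.
retry = Retry(
    total=3,
    backoff_factor=1,                            # delays grow exponentially
    status_forcelist=[429, 500, 502, 503, 504],  # same transient codes as above
)
session = requests.Session()
session.mount("https://", HTTPAdapter(max_retries=retry))

# Every request made through this session now retries for you:
# session.get("https://api.example.com/weather", timeout=10)
```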
If you want to skip writing retry logic by hand, there is a library for that:
<aside>
🤓 Curious Geek: The tenacity Library
Writing retry logic by hand works for learning, but in production, use the tenacity library. It gives you decorators like @retry(stop=stop_after_attempt(3), wait=wait_exponential()) that handle all the retry patterns for you. Less code, fewer bugs.
</aside>
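To see what such a decorator does under the hood, here is a minimal hand-rolled version (a sketch only; tenacity's real implementation handles many more cases, and flaky_fetch is a made-up stand-in that fails twice before succeeding):

```python
import time
import functools

def retry_on(exceptions, max_attempts: int = 3, base_delay: float = 1.0):
    """Decorator: retry the wrapped function with exponential backoff."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == max_attempts - 1:
                        raise  # out of attempts: re-raise the last error
                    time.sleep(base_delay * (2 ** attempt))
        return wrapper
    return decorator

calls = {"n": 0}

@retry_on((ConnectionError,), max_attempts=3, base_delay=0.01)
def flaky_fetch() -> dict:
    """Raises a (built-in) ConnectionError twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated transient failure")
    return {"ok": True}

print(flaky_fetch())  # succeeds on the third attempt: {'ok': True}
```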
<aside>
💡 Using AI to help: Paste your retry logic (⚠️ Ensure no PII or sensitive company data is included!) into an LLM and ask: "How can I simulate a transient network error to test this locally?" It can help you write mock responses or use tools like responses to verify your error handling without waiting for the real API to fail.
</aside>
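For example, with only the standard library's unittest.mock you can fake a fetcher that fails twice and then succeeds. The names below are illustrative, and the built-in ConnectionError stands in for requests' exception of the same name:

```python
from unittest.mock import Mock

# A fake "requests.get"-style callable: raises twice, then returns a response.
fake_get = Mock(side_effect=[
    ConnectionError("simulated network drop"),
    ConnectionError("simulated network drop"),
    Mock(status_code=200),
])

def fetch_with_retry(get, url: str, max_retries: int = 3):
    """Simplified retry loop so the mock can be exercised without a network."""
    for attempt in range(max_retries):
        try:
            return get(url)
        except ConnectionError:
            if attempt == max_retries - 1:
                raise

response = fetch_with_retry(fake_get, "https://api.example.com/weather")
print(response.status_code)       # 200: the third call succeeded
print(fake_get.call_count)        # 3: two failures plus one success
```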
The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0

Found a mistake or have a suggestion? Let us know in the feedback form.