Week 3 - Ingesting and Validating Data

Introduction to Data Ingestion

Ingesting from APIs

Production Error Handling

Reading Multiple File Formats

Data Validation with Pydantic

Writing to Databases

Practice

Assignment: Build a Validated Ingestion Pipeline

Gotchas & Pitfalls

Lesson Plan

🛡️ Production Error Handling

In Week 1, you learned to read tracebacks and use try/except to catch errors. In Week 2, you learned that except: pass is dangerous. But "don't silence errors" is only the beginning. When your pipeline talks to external systems, such as the APIs you call with the requests library from Chapter 2, errors are not bugs to fix: they are events to handle.

An API returns a 500 error. Do you crash? Retry? Skip? The answer depends on the type of error, and getting this wrong means either losing data or hammering a server that is already struggling.

This chapter teaches you how to handle errors like a production system, not a homework script.

Transient vs Permanent Errors

Not all errors are equal. The most important distinction in production error handling:

Transient errors are temporary. The server is busy, the network hiccuped, or the database is restarting. If you try again in a few seconds, it will probably work.

Permanent errors will never succeed no matter how many times you retry. The API key is wrong, the URL does not exist, or the data itself is invalid.

| Type | Examples | What to do |
| --- | --- | --- |
| Transient | 500 Server Error, 429 Too Many Requests, ConnectionTimeout | Retry with backoff |
| Permanent | 401 Unauthorized, 404 Not Found, ValidationError | Log and skip |
# BAD: treats all errors the same ❌
try:
    response = requests.get(url)
    data = response.json()
except Exception:
    print("Something went wrong")
# GOOD: distinguishes transient from permanent ✅
import requests

try:
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    data = response.json()
except requests.exceptions.ConnectionError:
    # Transient: network issue, retry later
    print("Connection failed, will retry")
except requests.exceptions.HTTPError as e:
    if e.response.status_code == 429:
        # Transient: rate limited, wait and retry
        print("Rate limited, backing off")
    elif e.response.status_code == 401:
        # Permanent: bad credentials, stop trying
        print("Authentication failed, check API key")
        raise
    else:
        # Any other HTTP error: do not silence it
        raise

<aside> 💡 A simple rule: if the same request could succeed later without any code changes, it is transient. If it will always fail, it is permanent.

</aside>
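That rule can be captured in a small helper. A minimal sketch, where the name `is_transient` and the exact set of status codes are our own choices, not a standard API:

```python
# Status codes that can succeed on a later attempt without code changes.
TRANSIENT_STATUS_CODES = {429, 500, 502, 503, 504}

def is_transient(status_code: int) -> bool:
    """Return True if a request with this status code is worth retrying."""
    return status_code in TRANSIENT_STATUS_CODES

print(is_transient(503))  # True: a busy server may recover
print(is_transient(401))  # False: a bad API key never will
```

Centralizing the decision in one function means every retry loop in your pipeline classifies errors the same way.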

The try/except pattern you use here is the Python version of what you already know from JavaScript:

<aside> 📘 Core Program Refresher: Python's try/except is the direct counterpart of JavaScript's try/catch. The keywords change (except instead of catch, raise instead of throw), but the pattern is the same: attempt an operation, catch specific error types, and decide what to do.

</aside>

Retry with Exponential Backoff

In Week 1 (Chapter 3), you wrote a basic retry loop with while and time.sleep(1). That was a fixed-delay retry. When a transient error occurs, you need something smarter.

<aside> 🎬 Animation: Exponential Backoff - Wait Longer Between Retries

</aside>

A fixed-delay retry sends requests at a constant interval. If the server is overloaded, you are making things worse by hammering it at a steady pace.

Exponential backoff waits longer between each retry: 1 second, then 2, then 4, then 8. This gives the server time to recover.

import time
import requests

def fetch_with_retry(url: str, max_retries: int = 3) -> dict:
    """Fetch a URL with exponential backoff on failure."""
    for attempt in range(max_retries):
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            return response.json()
        except (requests.exceptions.ConnectionError, requests.exceptions.Timeout):
            if attempt == max_retries - 1:
                raise  # Give up after final attempt
            wait_time = 2 ** attempt  # 1, 2, 4 seconds
            print(f"Attempt {attempt + 1} failed, retrying in {wait_time}s...")
            time.sleep(wait_time)

The pattern is simple:

  1. Try the request
  2. If it fails with a transient error, wait 2^attempt seconds
  3. Try again
  4. After max_retries attempts, give up and raise the error
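You can see the schedule this produces by evaluating 2 ** attempt for each attempt:

```python
max_retries = 4

# The delay doubles after every failed attempt: 1, 2, 4, 8 seconds.
waits = [2 ** attempt for attempt in range(max_retries)]
print(waits)       # [1, 2, 4, 8]
print(sum(waits))  # 15 seconds of total waiting before giving up
```

Doubling keeps the total wait bounded while quickly backing away from a struggling server.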

<aside> ⚠️ Never retry permanent errors. Retrying a 401 Unauthorized 100 times will not make your API key correct, but it may get your IP banned.

</aside>

Now it's time to practice implementing this retry logic yourself.

<aside> ⌨️ Hands on: Modify the fetch_with_retry function to also handle HTTP 429 (Too Many Requests) and HTTP 503 (Service Unavailable) as retryable errors. Hint: check response.status_code before calling raise_for_status().

</aside>

You can test your implementation directly in the browser:

<aside> 🚀 Try it in the widget: https://lasse.be/simple-hyf-teach-widget/?week=3&chapter=error_handling&exercise=w3_error_handling__retry_function&lang=python

</aside>

The Error Accumulation Pattern

In Week 2, your pipeline crashed on the first bad row. The PipelineRunner from that week (Chapter 4) previewed error accumulation: it appended failures to an errors list instead of crashing. That made sense for local data: fix the file and re-run.

With external data, you cannot fix the source. Instead, you accumulate errors: process everything you can, collect the failures, and report them at the end.

# BAD: crashes on first bad record ❌
for record in raw_data:
    validated = WeatherReading(**record)  # Crashes if invalid
    save(validated)
# GOOD: accumulates errors, processes everything ✅
valid_records = []
error_records = []

for i, record in enumerate(raw_data):
    try:
        validated = WeatherReading(**record)
        valid_records.append(validated)
    except (ValueError, TypeError) as e:
        error_records.append({
            "index": i,
            "record": record,
            "error": str(e),
        })

print(f"Processed: {len(valid_records)} valid, {len(error_records)} errors")

This pattern is critical. In production, you might ingest 10,000 weather readings. If 3 are bad, you do not want to lose the other 9,997.

<aside> 💡 When debugging production pipelines, your error logs are your best friend. Include the error type, the failing record, and a timestamp in every log entry. Future-you will thank present-you.

</aside>
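One way to follow that advice is to build each log entry as a small dictionary and append it as a JSON line. This is a sketch, not a standard: the field names, the `make_error_entry` helper, and the `errors.jsonl` filename are all our own choices.

```python
import json
from datetime import datetime, timezone

def make_error_entry(index: int, record: dict, error: Exception) -> dict:
    """Build one structured error-log entry: type, message, record, timestamp."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "error_type": type(error).__name__,  # e.g. "ValueError"
        "error": str(error),
        "index": index,
        "record": record,
    }

def log_error_record(index: int, record: dict, error: Exception,
                     path: str = "errors.jsonl") -> None:
    """Append the entry to a JSON Lines file, one error per line."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(make_error_entry(index, record, error)) + "\n")

entry = make_error_entry(7, {"temp": "abc"}, ValueError("bad temperature"))
print(entry["error_type"])  # ValueError
```

A JSON Lines file is easy to grep, easy to load back into Python, and each line carries everything future-you needs to reproduce the failure.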

Putting It Together: A Robust Fetch Function

Here is how the patterns combine into a real ingestion function:

This function uses the logging module you learned in Week 1 (Chapter 8) instead of print(). If you need a refresher on log levels and logging.getLogger(__name__), revisit that chapter.
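If you run the function below in a plain script, remember that log output only looks useful once logging is configured. A minimal setup, where the format string is one reasonable choice rather than the only one:

```python
import logging

# Adds timestamps, the logger name, and the level to every log line.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
)

logger = logging.getLogger(__name__)
logger.warning("Connection error, retrying in 2s...")
```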

import time
import logging
import requests

logger = logging.getLogger(__name__)

RETRYABLE_STATUS_CODES = {429, 500, 502, 503, 504}

def fetch_weather_data(url: str, max_retries: int = 3) -> dict | None:
    """Fetch weather data with retry logic and error classification."""
    for attempt in range(max_retries):
        try:
            response = requests.get(url, timeout=10)

            if response.status_code in RETRYABLE_STATUS_CODES:
                wait_time = 2 ** attempt
                logger.warning(
                    f"Got {response.status_code}, retry {attempt + 1}/{max_retries} "
                    f"in {wait_time}s"
                )
                time.sleep(wait_time)
                continue

            response.raise_for_status()
            return response.json()

        except (requests.exceptions.ConnectionError, requests.exceptions.Timeout):
            if attempt == max_retries - 1:
                logger.error(f"Network error after {max_retries} attempts: {url}")
                return None
            wait_time = 2 ** attempt
            logger.warning(f"Network error, retrying in {wait_time}s...")
            time.sleep(wait_time)

        except requests.exceptions.HTTPError as e:
            # Permanent error, do not retry
            logger.error(f"Permanent error {e.response.status_code}: {url}")
            return None

    logger.error(f"All {max_retries} retries exhausted: {url}")
    return None

Notice how the pieces fit together: retryable status codes trigger backoff before raise_for_status() ever runs, network errors get the same backoff, permanent HTTP errors return immediately, and every outcome is logged with enough context to debug later.

<aside> 💡 In the wild: The retry patterns in this chapter are built into production HTTP libraries. Check out urllib3 (which powers requests under the hood) and httpx: both implement retry logic internally.

</aside>

If you want to skip writing retry logic by hand, there is a library for that:

<aside> 🤓 Curious Geek: The tenacity Library

Writing retry logic by hand works for learning, but in production, use the tenacity library. It gives you decorators like @retry(stop=stop_after_attempt(3), wait=wait_exponential()) that handle all the retry patterns for you. Less code, fewer bugs.

</aside>
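To see roughly what tenacity is doing for you, here is a hand-rolled sketch of a retry decorator. It is deliberately simplified and only catches the builtin ConnectionError; the real library handles many more cases.

```python
import functools
import time

def retry(max_attempts: int = 3, base_delay: float = 1.0):
    """Minimal retry decorator with exponential backoff (a sketch of what
    libraries like tenacity provide)."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return func(*args, **kwargs)
                except ConnectionError:
                    if attempt == max_attempts - 1:
                        raise  # out of attempts, let the caller decide
                    time.sleep(base_delay * 2 ** attempt)
        return wrapper
    return decorator

attempts = {"count": 0}

@retry(max_attempts=3, base_delay=0.01)
def flaky():
    """Fails twice, then succeeds, to exercise the decorator."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("network hiccup")
    return "ok"

print(flaky())  # "ok", on the third attempt
```

The decorator form is the appeal: the retry policy lives in one line above the function instead of being woven through its body.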

🧠 Knowledge Check

  1. What is the difference between a transient and a permanent error? Give one example of each.
  2. Why does exponential backoff wait longer between each retry instead of using a fixed delay?
  3. Why should you accumulate validation errors instead of crashing on the first one?

<aside> 💡 Using AI to help: Paste your retry logic (⚠️ Ensure no PII or sensitive company data is included!) into an LLM and ask: "How can I simulate a transient network error to test this locally?" It can help you write mock responses or use tools like responses to verify your error handling without waiting for the real API to fail.

</aside>
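You do not need a real API outage to exercise the retry path. One approach, sketched here with our own hypothetical names (`fetch`, `flaky_get`) and the builtin ConnectionError standing in for the requests exception:

```python
import time
from unittest.mock import patch

def fetch(get, url, max_retries=3):
    """Tiny stand-in for a retry loop, parameterized by the `get` function
    so it can be tested without network access."""
    for attempt in range(max_retries):
        try:
            return get(url)
        except ConnectionError:
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)

calls = []

def flaky_get(url):
    """Fails once with a (builtin) ConnectionError, then succeeds,
    simulating a transient network error."""
    calls.append(url)
    if len(calls) == 1:
        raise ConnectionError("simulated network hiccup")
    return {"temp": 21.5}

with patch("time.sleep"):  # skip the real waiting during the test
    result = fetch(flaky_get, "https://api.example.com/weather")

print(result)      # {'temp': 21.5}
print(len(calls))  # 2: one failure, one success
```

Patching time.sleep keeps the test fast; counting calls proves the retry actually happened.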

Extra reading


The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0


*https://hackyourfuture.net/*

Found a mistake or have a suggestion? Let us know in the feedback form.