Week 3 - Ingesting and Validating Data

Introduction to Data Ingestion

Ingesting from APIs

Production Error Handling

Reading Multiple File Formats

Data Validation with Pydantic

Writing to Databases

Practice

Assignment: Build a Validated Ingestion Pipeline

Gotchas & Pitfalls

Lesson Plan

🛡️ Production Error Handling

In Week 1, you learned to read tracebacks and use try/except to catch errors. In Week 2, you learned that except: pass is dangerous. But "don't silence errors" is only the beginning. When your pipeline talks to external systems, such as the APIs you call with the requests library from Chapter 2, errors are not bugs to fix: they are events to handle.

An API returns a 500 error. Do you crash? Retry? Skip? The answer depends on the type of error, and getting this wrong means either losing data or hammering a server that is already struggling.

This chapter teaches you how to handle errors like a production system, not a homework script.

Transient vs Permanent Errors

Not all errors are equal. The most important distinction in production error handling:

Transient errors are temporary. The server is busy, the network hiccuped, or the database is restarting. If you try again in a few seconds, it will probably work.

Permanent errors will never succeed no matter how many times you retry. The API key is wrong, the URL does not exist, or the data itself is invalid.

| Type | Examples | What to do |
| --- | --- | --- |
| Transient | 500 Server Error, 429 Too Many Requests, ConnectionTimeout | Retry with backoff |
| Permanent | 401 Unauthorized, 404 Not Found, ValidationError | Log and skip |
# BAD: treats all errors the same ❌
try:
    response = requests.get(url)
    data = response.json()
except Exception:
    print("Something went wrong")
# GOOD: distinguishes transient from permanent ✅
import requests

try:
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    data = response.json()
except requests.exceptions.ConnectionError:
    # Transient: network issue, retry later
    print("Connection failed, will retry")
except requests.exceptions.HTTPError as e:
    if e.response.status_code == 429:
        # Transient: rate limited, wait and retry
        print("Rate limited, backing off")
    elif e.response.status_code == 401:
        # Permanent: bad credentials, stop trying
        print("Authentication failed, check API key")
        raise
    else:
        # Any other HTTP error: do not silence it
        raise

<aside> 💡 A simple rule: if the same request could succeed later without any code changes, it is transient. If it will always fail, it is permanent.

</aside>
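That rule can be captured in a small helper. A minimal sketch, where the name `is_transient` and the exact set of status codes are our own choices, not a standard API:

```python
# Status codes that can succeed on a later attempt without code changes.
TRANSIENT_STATUS_CODES = {429, 500, 502, 503, 504}

def is_transient(status_code: int) -> bool:
    """Return True if a request with this status code is worth retrying."""
    return status_code in TRANSIENT_STATUS_CODES

print(is_transient(503))  # True: a busy server may recover
print(is_transient(401))  # False: a bad API key never will
```

Centralizing the decision in one function means every retry loop in your pipeline classifies errors the same way.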

The try/except pattern you use here is the Python version of what you already know from JavaScript:

<aside> 📘 Core Program Refresher: Python's try/except is the direct counterpart of JavaScript's try/catch. The keywords change (except instead of catch, raise instead of throw), but the pattern is the same: attempt an operation, catch specific error types, and decide what to do.

</aside>

Retry with Exponential Backoff

In Week 1 (Chapter 3), you wrote a basic retry loop with while and time.sleep(1). That was a fixed-delay retry. When a transient error occurs, you need something smarter.

<aside> 🎬 Animation: Exponential Backoff - Wait Longer Between Retries

</aside>

A fixed-delay retry sends requests at a constant interval. If the server is overloaded, you are making things worse by hammering it at a steady pace.

Exponential backoff waits longer between each retry: 1 second, then 2, then 4, then 8. This gives the server time to recover.

import time
import requests

def fetch_with_retry(url: str, max_retries: int = 3) -> dict:
    """Fetch a URL with exponential backoff on failure."""
    for attempt in range(max_retries):
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            return response.json()
        except (requests.exceptions.ConnectionError, requests.exceptions.Timeout):
            if attempt == max_retries - 1:
                raise  # Give up after final attempt
            wait_time = 2 ** attempt  # 1, 2, 4 seconds
            print(f"Attempt {attempt + 1} failed, retrying in {wait_time}s...")
            time.sleep(wait_time)

The pattern is simple:

  1. Try the request
  2. If it fails with a transient error, wait 2^attempt seconds
  3. Try again
  4. After max_retries attempts, give up and raise the error
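You can see the schedule this produces by evaluating 2 ** attempt for each attempt:

```python
max_retries = 4

# The delay doubles after every failed attempt: 1, 2, 4, 8 seconds.
waits = [2 ** attempt for attempt in range(max_retries)]
print(waits)       # [1, 2, 4, 8]
print(sum(waits))  # 15 seconds of total waiting before giving up
```

Doubling keeps the total wait bounded while quickly backing away from a struggling server.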

<aside> ⚠️ Never retry permanent errors. Retrying a 401 Unauthorized 100 times will not make your API key correct, but it may get your IP banned.

</aside>

Now it's time to practice implementing this retry logic yourself.

<aside> ⌨️ Hands on: Modify the fetch_with_retry function to also handle HTTP 429 (Too Many Requests) and HTTP 503 (Service Unavailable) as retryable errors. Hint: check response.status_code before calling raise_for_status().

</aside>

You can test your implementation directly in the browser:

<aside> 🚀 Try it in the widget: https://lasse.be/simple-hyf-teach-widget/?week=3&chapter=error_handling&exercise=w3_error_handling__retry_function&lang=python

</aside>

The Error Accumulation Pattern

In Week 2, your pipeline crashed on the first bad row. The PipelineRunner from that week (Chapter 4) previewed error accumulation: it appended failures to an errors list instead of crashing. That made sense for local data: fix the file and re-run.

With external data, you cannot fix the source. Instead, you accumulate errors: process everything you can, collect the failures, and report them at the end.

# BAD: crashes on first bad record ❌
for record in raw_data:
    validated = WeatherReading(**record)  # Crashes if invalid
    save(validated)
# GOOD: accumulates errors, processes everything ✅
valid_records = []
error_records = []

for i, record in enumerate(raw_data):
    try:
        validated = WeatherReading(**record)
        valid_records.append(validated)
    except (ValueError, TypeError) as e:
        error_records.append({
            "index": i,
            "record": record,
            "error": str(e),
        })

print(f"Processed: {len(valid_records)} valid, {len(error_records)} errors")

This pattern is critical. In production, you might ingest 10,000 weather readings. If 3 are bad, you do not want to lose the other 9,997.

<aside> 💡 When debugging production pipelines, your error logs are your best friend. Include the error type, the failing record, and a timestamp in every log entry. Future-you will thank present-you.

</aside>
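One way to follow that advice is to build each log entry as a small dictionary and append it as a JSON line. This is a sketch, not a standard: the field names, the `make_error_entry` helper, and the `errors.jsonl` filename are all our own choices.

```python
import json
from datetime import datetime, timezone

def make_error_entry(index: int, record: dict, error: Exception) -> dict:
    """Build one structured error-log entry: type, message, record, timestamp."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "error_type": type(error).__name__,  # e.g. "ValueError"
        "error": str(error),
        "index": index,
        "record": record,
    }

def log_error_record(index: int, record: dict, error: Exception,
                     path: str = "errors.jsonl") -> None:
    """Append the entry to a JSON Lines file, one error per line."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(make_error_entry(index, record, error)) + "\n")

entry = make_error_entry(7, {"temp": "abc"}, ValueError("bad temperature"))
print(entry["error_type"])  # ValueError
```

A JSON Lines file is easy to grep, easy to load back into Python, and each line carries everything future-you needs to reproduce the failure.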

Putting It Together: A Robust Fetch Function

Here is how the patterns combine into a real ingestion function:

This function uses the logging module you learned in Week 1 (Chapter 8) instead of print(). If you need a refresher on log levels and logging.getLogger(__name__), revisit that chapter.
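If you run the function below in a plain script, remember that log output only looks useful once logging is configured. A minimal setup, where the format string is one reasonable choice rather than the only one:

```python
import logging

# Adds timestamps, the logger name, and the level to every log line.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
)

logger = logging.getLogger(__name__)
logger.warning("Connection error, retrying in 2s...")
```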

import time
import logging
import requests

logger = logging.getLogger(__name__)

RETRYABLE_STATUS_CODES = {429, 500, 502, 503, 504}

def fetch_weather_data(url: str, max_retries: int = 3) -> dict | None:
    """Fetch weather data with retry logic and error classification."""
    for attempt in range(max_retries):
        try:
            response = requests.get(url, timeout=10)

            if response.status_code in RETRYABLE_STATUS_CODES:
                wait_time = 2 ** attempt
                logger.warning(
                    f"Got {response.status_code}, retry {attempt + 1}/{max_retries} "
                    f"in {wait_time}s"
                )
                time.sleep(wait_time)
                continue

            response.raise_for_status()
            return response.json()

        except (requests.exceptions.ConnectionError, requests.exceptions.Timeout):
            if attempt == max_retries - 1:
                logger.error(f"Network error after {max_retries} attempts: {url}")
                return None
            wait_time = 2 ** attempt
            logger.warning(f"Network error, retrying in {wait_time}s...")
            time.sleep(wait_time)

        except requests.exceptions.HTTPError as e:
            # Permanent error, do not retry
            logger.error(f"Permanent error {e.response.status_code}: {url}")
            return None

    logger.error(f"All {max_retries} retries exhausted: {url}")
    return None

Notice how the pieces fit together: retryable status codes trigger backoff before raise_for_status() ever runs, network errors get the same backoff, permanent HTTP errors return immediately, and every outcome is logged with enough context to debug later.

<aside> 💡 In the wild: The retry patterns in this chapter are built into production HTTP libraries. Check out urllib3 (which powers requests under the hood) and httpx: both implement retry logic internally.

</aside>

If you want to skip writing retry logic by hand, there is a library for that:

<aside> 🤓 Curious Geek: The tenacity Library

Writing retry logic by hand works for learning, but in production, use the tenacity library. It gives you decorators like @retry(stop=stop_after_attempt(3), wait=wait_exponential()) that handle all the retry patterns for you. Less code, fewer bugs.

</aside>
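To see roughly what tenacity is doing for you, here is a hand-rolled sketch of a retry decorator. It is deliberately simplified and only catches the builtin ConnectionError; the real library handles many more cases.

```python
import functools
import time

def retry(max_attempts: int = 3, base_delay: float = 1.0):
    """Minimal retry decorator with exponential backoff (a sketch of what
    libraries like tenacity provide)."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return func(*args, **kwargs)
                except ConnectionError:
                    if attempt == max_attempts - 1:
                        raise  # out of attempts, let the caller decide
                    time.sleep(base_delay * 2 ** attempt)
        return wrapper
    return decorator

attempts = {"count": 0}

@retry(max_attempts=3, base_delay=0.01)
def flaky():
    """Fails twice, then succeeds, to exercise the decorator."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("network hiccup")
    return "ok"

print(flaky())  # "ok", on the third attempt
```

The decorator form is the appeal: the retry policy lives in one line above the function instead of being woven through its body.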

🧠 Knowledge Check

  1. What is the difference between a transient and a permanent error? Give one example of each.
  2. Why does exponential backoff wait longer between each retry instead of using a fixed delay?
  3. Why should you accumulate validation errors instead of crashing on the first one?

<aside> 💡 Using AI to help: Paste your retry logic (⚠️ Ensure no PII or sensitive company data is included!) into an LLM and ask: "How can I simulate a transient network error to test this locally?" It can help you write mock responses or use tools like responses to verify your error handling without waiting for the real API to fail.

</aside>
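You do not need a real API outage to exercise the retry path. One approach, sketched here with our own hypothetical names (`fetch`, `flaky_get`) and the builtin ConnectionError standing in for the requests exception:

```python
import time
from unittest.mock import patch

def fetch(get, url, max_retries=3):
    """Tiny stand-in for a retry loop, parameterized by the `get` function
    so it can be tested without network access."""
    for attempt in range(max_retries):
        try:
            return get(url)
        except ConnectionError:
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)

calls = []

def flaky_get(url):
    """Fails once with a (builtin) ConnectionError, then succeeds,
    simulating a transient network error."""
    calls.append(url)
    if len(calls) == 1:
        raise ConnectionError("simulated network hiccup")
    return {"temp": 21.5}

with patch("time.sleep"):  # skip the real waiting during the test
    result = fetch(flaky_get, "https://api.example.com/weather")

print(result)      # {'temp': 21.5}
print(len(calls))  # 2: one failure, one success
```

Patching time.sleep keeps the test fast; counting calls proves the retry actually happened.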

Extra reading


The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0


*https://hackyourfuture.net/*

Found a mistake or have a suggestion? Let us know in the feedback form.