Week 3 - Ingesting and Validating Data

Introduction to Data Ingestion

Ingesting from APIs

Production Error Handling

Reading Multiple File Formats

Data Validation with Pydantic

Writing to Databases

Practice

Assignment: Build a Validated Ingestion Pipeline

Gotchas & Pitfalls

Lesson Plan

⚠️ Gotchas & Pitfalls

In Week 2, you learned not to silence errors or hardcode paths. Week 3 introduces new sources of pain: APIs that return HTML instead of JSON, databases locked by external tools, and validation frameworks with terrifying error messages.

1. The JSONDecodeError Trap

The Misconception

You assume that because you are calling a JSON API, the response will always be JSON. When an error occurs, you call .json() to read the error message.

The Reality

When backend servers crash, or when cloud providers (like Cloudflare) block your request, they often return an HTML error page (like a 500 Internal Server Error or 502 Bad Gateway). Calling .json() on an HTML response instantly throws a requests.exceptions.JSONDecodeError. Your error handling crashes while trying to handle an error.

# BAD: Assuming all responses are JSON ❌
response = requests.get(url)
if response.status_code != 200:
    # Crashes here if the server returned an HTML error page!
    error_msg = response.json()["error"]
    print(f"Failed: {error_msg}")

# GOOD: Check the content type or catch the decode error ✅
response = requests.get(url)
if response.status_code != 200:
    try:
        error_msg = response.json().get("error", "Unknown error")
    except ValueError: # Catches JSONDecodeError
        error_msg = f"Server returned non-JSON response: {response.text[:100]}"
    print(f"Failed: {error_msg}")

<aside> 💡 Always verify the response format or wrap your .json() calls in a try-except block when analyzing failed requests.

</aside>


2. "Database is locked" (SQLite)

The Misconception

You wrote your database connection code perfectly using a context manager, but your Python script suddenly crashes with sqlite3.OperationalError: database is locked.

The Reality

SQLite supports multiple concurrent readers, but only one writer at a time. If you have your weather.db file open in a database viewer GUI (like DBeaver or DB Browser for SQLite) and you double-click a cell to edit it, the GUI acquires an exclusive write lock.

When you run your Python script while that lock is active in another window, Python waits a few seconds (the timeout default is usually 5 seconds) and then crashes with a locked error.

# BAD: Confused debugging ❌
# You stare at your code, wondering why it's locked when you used a Context Manager.

# GOOD: Close your database GUI connections ✅
# The bug isn't in your Python code. Close or disconnect
# your database viewer tool before running your ingestion pipeline.

<aside> 💡 If you get database is locked, the culprit is almost always a GUI tool you left open on another monitor.

</aside>


3. Panicking at Pydantic Tracebacks

The Misconception

When a data validation fails, Python will give you a simple, concise error message pointing to the exact line of code that is wrong.

The Reality

When data fails Pydantic validation, Pydantic raises a ValidationError. If you do not catch this exception, Python's default behavior will print the entire, massive stack trace to the terminal. For complex models or batch validations, a single failed field can produce 30+ lines of terrifying red terminal output. Beginners often panic and assume their entire script is broken.

# BAD: Letting the terrifying traceback crash the script ❌
reading = WeatherReading(**bad_data) # Explodes with 30 lines of red text

# GOOD: Catch and print the clean error dictionaries ✅
try:
    reading = WeatherReading(**bad_data)
except ValidationError as e:
    # e.errors() returns a clean list of dictionaries explaining exactly what failed
    for error in e.errors():
        field = error["loc"][0]
        msg = error["msg"]
        print(f"Validation failed for '{field}': {msg}")

<aside> 💡 Don't let the visual noise of a Pydantic traceback overwhelm you. Catch the exception and use e.errors() to read the clear, structured feedback inside.

</aside>


4. Naive vs. Aware Timestamps

The Misconception

A timestamp string is plain text. If you save "2025-01-15T10:00:00" to a database, you can confidently compare it later with any other timestamp.

The Reality

API data often comes with Timezone Offsets (e.g., "2025-01-15T10:00:00+02:00"). Local CSV data often comes without offsets ("2025-01-15T10:00:00"), which software calls "timezone naive". If you try to sort, filter, or validate these two formats against each other, they will behave unpredictably.

# BAD: Mixing naive and aware strings ❌
api_time = "2025-01-15T10:00:00+02:00" # Explicitly +2 hours
csv_time = "2025-01-15T10:00:00"       # Who knows? Local time? UTC?
# Trying to merge or compare these as plain strings will lead to silent bugs.

# GOOD: Enforcing a standard in your Validator ✅
# In Pydantic, you can force all datetime strings to be converted to UTC or stripped 
# of their timezone during ingestion, ensuring your database only holds unified data.

<aside> ⚠️ Pick a timezone strategy (usually UTC) and enforce it at the ingestion boundary. Never let mixed timestamp formats leak into your permanent storage.

</aside>

The timezone problem is not unique to data engineering:

<aside> 🤓 Curious Geek: The Y2K of Timezones

In 2007, the US changed its daylight saving time dates. Software that had hardcoded the old dates suddenly calculated wrong times for three weeks each year. The IANA Time Zone Database (used by Python's zoneinfo module) exists specifically to track these changes. Storing timestamps in UTC avoids this class of bugs entirely.

</aside>


🧠 Knowledge Check

  1. Why is it dangerous to call response.json() immediately after detecting a 500 Internal Server error?
  2. You used a context manager (with sqlite3.connect...) but still got a database is locked error. What is the most likely cause?
  3. Which method on a Pydantic ValidationError gives you a clean, structured list of the exact fields that failed?

<aside> 💡 Using AI to help: When you hit one of these gotchas in your own code, paste the error and the relevant code snippet (⚠️ Ensure no PII or sensitive company data is included!) into an LLM. Ask it to identify which gotcha applies and suggest the fix. This is a fast way to build pattern recognition for common pitfalls.

</aside>

Extra reading


The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0

CC BY-NC-SA 4.0 Icons

*https://hackyourfuture.net/*

Found a mistake or have a suggestion? Let us know in the feedback form.