Week 7 - Mid-Track Project

Project Brief

This is your mid-track project. You have one week to design, build, and deploy a data pipeline that runs as a Container App Job on Azure and stores its results in Azure storage. You pick the data source and use case. The technical requirements and deadline are in the chapters below.

The week ends with a technical interview where you present your project and explain your decisions.

What you are building

A complete data pipeline that:

  1. Ingests data from a live external API
  2. Validates the data before storing it (Pydantic)
  3. Transforms the validated data using pandas before storage
  4. Stores results in Azure Postgres (rows) and Azure Blob Storage (raw JSON)
  5. Runs as a Container App Job on Azure on a daily schedule

This is the same architecture you built in Week 6, extended with a pandas transform step, a data source you choose yourself, and a daily CRON schedule so the pipeline runs automatically without you triggering it.

┌─── Azure Container App Job ──────────────────────────────────────────────────┐
│  External API ──► pipeline.py ──► Pydantic ──► pandas ──► Azure Storage      │
└──────────────────────────────────────────────────────────────────────────────┘
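
The validate and transform stages in the diagram can be sketched in a few lines. This is a minimal illustration, not the starter template: the `Reading` model, its field names, the range limits, and the sample records are all hypothetical, and the fetch and storage steps are left out so the example runs offline.

```python
import pandas as pd
from pydantic import BaseModel, Field, ValidationError


class Reading(BaseModel):
    """Hypothetical record shape; replace with the fields your API returns."""

    city: str
    temperature_c: float = Field(ge=-90, le=60)  # assumed plausibility range


def validate(raw_records: list[dict]) -> tuple[list[Reading], list[dict]]:
    """Split raw records into validated models and rejected inputs."""
    valid, rejected = [], []
    for raw in raw_records:
        try:
            valid.append(Reading(**raw))
        except ValidationError:
            rejected.append(raw)
    return valid, rejected


def transform(readings: list[Reading]) -> pd.DataFrame:
    """Average temperature per city, one row per city."""
    frame = pd.DataFrame(
        {
            "city": [r.city for r in readings],
            "temperature_c": [r.temperature_c for r in readings],
        }
    )
    return frame.groupby("city", as_index=False)["temperature_c"].mean()


# Made-up sample data standing in for an API response.
sample = [
    {"city": "Amsterdam", "temperature_c": 11.0},
    {"city": "Amsterdam", "temperature_c": 13.0},
    {"city": "Utrecht", "temperature_c": "not a number"},  # wrong type
    {"city": "Rotterdam", "temperature_c": 500},  # outside the assumed range
]
valid, rejected = validate(sample)
summary = transform(valid)
```

Keeping the rejected records, rather than dropping them silently, gives you something concrete to log and to discuss in the technical interview.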

<aside> 💡 In the wild: Open-source tools like dlt (data load tool) follow the same fetch-validate-store pattern you are building this week. Your project is a simplified version of what production data teams run every day.

</aside>

Choosing your data source

Pick something you find interesting. The best projects come from genuine curiosity. Your data source must be:

<aside> ⚠️ Verify your data source works on Day 1. Call the API, inspect the response, and confirm you can parse it. Do not discover on Day 4 that the API requires OAuth or returns HTML instead of JSON.

</aside>
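
One way to run that Day 1 check is a small script that looks at the status code, the content type, and whether the body parses as JSON. The sketch below uses only the standard library; the function names `check_response` and `check_url` are illustrative, and the canned responses at the bottom are made up so the example runs without a network connection.

```python
import json
import urllib.request


def check_response(status: int, content_type: str, body: str) -> list[str]:
    """Return the problems found; an empty list means the response looks usable."""
    problems = []
    if status != 200:
        problems.append(f"unexpected status {status}")
    if "json" not in content_type.lower():
        problems.append(f"content type is {content_type!r}, not JSON")
    try:
        json.loads(body)
    except json.JSONDecodeError:
        problems.append("body does not parse as JSON")
    return problems


def check_url(url: str) -> list[str]:
    """Fetch a URL and run the checks above (performs a real network call)."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        body = resp.read().decode("utf-8", errors="replace")
        content_type = resp.headers.get("Content-Type", "")
        return check_response(resp.status, content_type, body)


# Offline demonstration with canned responses.
ok = check_response(200, "application/json", '{"hourly": {"time": []}}')
bad = check_response(200, "text/html", "<html>Please log in</html>")
```

To test a real source, call `check_url` with your API's URL. Note that `urllib.request.urlopen` raises `urllib.error.HTTPError` for 4xx and 5xx responses, so an API that requires authentication will usually fail with an exception rather than return a list of problems.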

Example project ideas

These are starting points, not requirements. You can combine ideas or invent your own.

Live APIs

These return fresh data on every run: good for pipelines that run on a schedule.

| Project | Data Source | What to Store |
| --- | --- | --- |
| Weather tracker | Open-Meteo API (no key needed) | Hourly forecasts for Amsterdam, Rotterdam, Utrecht |
| Eredivisie standings | Football-Data.org (free key, register with email) | Dutch league tables and match results |
| F1 race results | Ergast mirror (no key needed) | 2024 race results and Verstappen's championship season; swap 2024 for current to track the live season as it grows |
| Cryptocurrency prices | CoinGecko API (no key needed) | Price snapshots for top 10 coins in EUR |
| GitHub activity monitor | GitHub REST API (no key for public repos) | Commits and PR stats for a repo you follow |
| Space launches | Launch Library 2 (no key needed) | Upcoming launches with rocket, agency, and pad details |

Static datasets (bonus)

These are snapshot CSV files you can add on top of your API pipeline as a bonus. They don't change between runs, so they work best as a second data source alongside your live API: load the CSV once to seed a reference table, then join or enrich your API data against it. Good for practising data cleaning with pandas when the point is the transformation, not the freshness.

| Project | Data Source | What's messy | What to Store |
| --- | --- | --- | --- |
| Amsterdam Airbnb listings | Inside Airbnb CSV | neighbourhood_group empty on most rows; last_review null for new listings; license often blank | Listing name, neighbourhood, room type, price, availability |
| Dutch vehicles (RDW) | RDW Open Data (no key needed) | Field names in Dutch; dates as YYYYMMDD integers; "Geen verstrekking in Open Data" used as a null sentinel | Licence plate, brand, model, APK expiry, colour, weight |
| World Cup matches | TidyTuesday CSV | winning_team is "NA" string on draws; win_conditions has inconsistent spacing ("3- 2" vs "3-2"); stage names are unnormalised | Home/away teams, scores, outcome, stage, year |
| Eurovision history | TidyTuesday CSV | rank and total_points are "NA" strings (2020 cancelled); qualified is "TRUE"/"FALSE" text | Country, artist, song, year, points, final rank |
| Netflix titles | TidyTuesday CSV | director empty for 30% of rows; date_added is "August 14, 2020" string; duration is "93 min" or "4 Seasons" (split into value + unit) | Title, type, country, year added, rating, cleaned duration |
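
Several of the problems listed in the "What's messy" column come down to a few pandas calls. The sketch below is illustrative only: the rows are made up, they reuse the example values quoted in the table, and the `apk_expiry` column name is a hypothetical English rename rather than a field taken from the real RDW dataset.

```python
import pandas as pd

# Made-up rows that reuse the messy values quoted in the table.
titles = pd.DataFrame(
    {
        "title": ["Film A", "Show B"],
        "date_added": ["August 14, 2020", "August 14, 2020"],
        "duration": ["93 min", "4 Seasons"],
    }
)

# Parse the text date with an explicit format instead of letting pandas guess.
titles["date_added"] = pd.to_datetime(titles["date_added"], format="%B %d, %Y")

# Split "93 min" / "4 Seasons" into a numeric value and a lower-cased unit.
parts = titles["duration"].str.extract(r"^(?P<value>\d+)\s+(?P<unit>\w+)$")
titles["duration_value"] = parts["value"].astype(int)
titles["duration_unit"] = parts["unit"].str.lower()

# Dates stored as YYYYMMDD integers: convert to text, then parse.
vehicles = pd.DataFrame({"apk_expiry": [20250314, 20261101]})  # hypothetical column
vehicles["apk_expiry"] = pd.to_datetime(
    vehicles["apk_expiry"].astype(str), format="%Y%m%d"
)

# "NA" stored as a string: coerce anything non-numeric to a real missing value.
points = pd.to_numeric(pd.Series(["120", "NA"]), errors="coerce")
```

Each step replaces a text encoding with a proper type (datetime, integer, missing value), which is the kind of transform the pandas step is meant to demonstrate.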

<aside> 💡 If you are stuck choosing, start with Open-Meteo: no API key, clean JSON, and the Week 6 examples already use it. Two notes on the others. Football-Data.org requires a free key: register at football-data.org (about 2 minutes) and store the key as an environment variable. The GitHub API allows 60 unauthenticated requests per hour, which is tight during development; a free Personal Access Token raises the limit to 5,000 per hour.

</aside>
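
Storing a key as an environment variable keeps it out of your code and your Git history. A minimal sketch of reading it at runtime follows; the helper name `auth_headers` and the variable name `GITHUB_TOKEN` are assumptions chosen for illustration, and the token value is fake.

```python
import os


def auth_headers(env_var: str, header_name: str, prefix: str = "") -> dict[str, str]:
    """Build a request header from a secret stored in an environment variable."""
    token = os.environ.get(env_var)
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable before running")
    return {header_name: f"{prefix}{token}"}


# Demonstration only: a fake value set in-process so the example runs.
# In a real project, set the variable in your shell or in the job's configuration.
os.environ["GITHUB_TOKEN"] = "fake-token-for-demo"
headers = auth_headers("GITHUB_TOKEN", "Authorization", prefix="Bearer ")
```

Failing fast with a clear message when the variable is missing is easier to debug than an authentication error from the API. Header names differ between APIs, so check your data source's documentation for the one it expects.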

You are not limited to this list. If you have a different API in mind, check with your teacher before you start building.

<aside> ⚠️ Using your own API? Your teacher will verify it has enough structure to warrant real Pydantic validation and pandas work. If the response is too flat or simple, your teacher may ask you to also add one of the bonus CSV datasets above so the transform step has enough to work with.

</aside>

Scope guidance

Keep it focused. A working pipeline with one data source and one storage target is a better project than an ambitious plan that is half-finished.

Good scope:

Too ambitious for one week:

You can always add stretch goals after the core pipeline works end to end.

Timeline

| Day | Milestone |
| --- | --- |
| 1 | Pick data source, verify API works, scaffold project structure, deploy hello-world container to Azure |
| 2-3 | Pipeline works locally: ingests, validates, transforms with pandas, stores in Postgres/Blob |
| 4 | Replace hello-world with real pipeline, push to ACR, create scheduled job (--trigger-type Schedule), trigger a manual run to verify |
| 5 | Polish, finalize README, prepare for technical interview |

<aside> ⌨️ Hands on: Deploy the hello-world container on Day 1, before your pipeline code is written. This proves your Azure setup works early. If you hit firewall issues or image pull errors, you have four days to fix them instead of four hours.

</aside>

When you are ready to start coding, see Chapter 2: Project Requirements for the starter template and the requirements checklist.

<aside> 💡 Using AI to help: Use an LLM to help you explore API documentation, generate Pydantic models from example JSON responses, or draft your README. Document what you used in AI_ASSIST.md. (⚠️ Ensure no PII or sensitive company data is included!)

</aside>


Next up: Project Requirements, where you will find the minimum requirements checklist and what a complete project run looks like.


The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0 *https://hackyourfuture.net/*
