This is your mid-track project. You have one week to design, build, and deploy a data pipeline that runs as a Container App Job on Azure and stores its results in Azure storage. You pick the data source and use case. The technical requirements and deadline are in the chapters below.
The week ends with a technical interview where you present your project and explain your decisions.
A complete data pipeline that:
This is the same architecture you built in Week 6, extended with a pandas transform step, a data source you choose yourself, and a daily CRON schedule so the pipeline runs automatically without you triggering it.
┌─── Azure Container App Job ──────────────────────────────────────────────────┐
│ External API ──► pipeline.py ──► Pydantic ──► pandas ──► Azure Storage │
└──────────────────────────────────────────────────────────────────────────────┘
<aside> 💡 In the wild: Open-source tools like dlt (data load tool) follow the same fetch-validate-store pattern you are building this week. Your project is a simplified version of what production data teams run every day.
</aside>
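The fetch, validate, transform, store flow in the diagram can be sketched roughly as below. This is a minimal outline, not the starter template: the `HourlyReading` model, its field names, and the hard-coded sample records are all assumptions made for illustration, and `store` writes CSV to a string where a real pipeline would write to Azure storage.

```python
from io import StringIO

import pandas as pd
from pydantic import BaseModel, ValidationError


class HourlyReading(BaseModel):
    """One validated record. Field names are illustrative assumptions."""
    city: str
    time: str
    temperature_c: float


def fetch() -> list[dict]:
    # Stand-in for the real API call: a hard-coded sample so the sketch
    # runs offline. The last record is deliberately malformed.
    return [
        {"city": "Amsterdam", "time": "2024-05-01T00:00", "temperature_c": 11.2},
        {"city": "Amsterdam", "time": "2024-05-01T01:00", "temperature_c": 10.8},
        {"city": "Utrecht", "time": "2024-05-01T00:00", "temperature_c": "n/a"},
    ]


def validate(raw: list[dict]) -> tuple[list[HourlyReading], int]:
    # Keep records that match the model; count the ones that do not.
    valid, rejected = [], 0
    for item in raw:
        try:
            valid.append(HourlyReading(**item))
        except ValidationError:
            rejected += 1
    return valid, rejected


def transform(records: list[HourlyReading]) -> pd.DataFrame:
    # Build a DataFrame from the validated records and aggregate per city.
    df = pd.DataFrame([dict(r) for r in records])
    df["time"] = pd.to_datetime(df["time"])
    return df.groupby("city", as_index=False)["temperature_c"].mean()


def store(df: pd.DataFrame) -> str:
    # Placeholder for the storage step: serialise to CSV in memory.
    buffer = StringIO()
    df.to_csv(buffer, index=False)
    return buffer.getvalue()


def run_pipeline() -> tuple[pd.DataFrame, int, str]:
    records, rejected = validate(fetch())
    summary = transform(records)
    return summary, rejected, store(summary)
```

Keeping each stage in its own function makes it easy to test the stages separately and to swap the in-memory `store` for a real storage client later.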
Pick something you find interesting. The best projects come from genuine curiosity. Your data source must be:
<aside> ⚠️ Verify your data source works on Day 1. Call the API, inspect the response, and confirm you can parse it. Do not discover on Day 4 that the API requires OAuth or returns HTML instead of JSON.
</aside>
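One way to run the Day 1 check is a small script like the sketch below. It uses only the standard library; the function names and the idea of passing a set of `required_keys` are assumptions for illustration, so adapt them to whatever your chosen API returns.

```python
import json
from urllib.error import HTTPError
from urllib.request import urlopen


def check_api_response(status: int, content_type: str, body: str,
                       required_keys: set[str]) -> list[str]:
    """Return a list of problems; an empty list means the source looks usable."""
    problems = []
    if status in (401, 403):
        problems.append(f"HTTP {status}: the API probably needs a key or OAuth")
    elif status != 200:
        problems.append(f"unexpected HTTP status {status}")
    if "json" not in content_type.lower():
        problems.append(f"content type is {content_type!r}, not JSON")
    try:
        data = json.loads(body)
    except json.JSONDecodeError:
        problems.append("body is not valid JSON")
        return problems
    if not isinstance(data, dict):
        problems.append("top-level JSON value is not an object")
        return problems
    missing = required_keys - set(data)
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    return problems


def check_url(url: str, required_keys: set[str]) -> list[str]:
    # Performs a real HTTP request, then applies the checks above.
    try:
        with urlopen(url, timeout=10) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            content_type = resp.headers.get("Content-Type", "")
            return check_api_response(resp.status, content_type, body, required_keys)
    except HTTPError as err:
        # urlopen raises on 4xx/5xx; report the status code instead of crashing.
        return check_api_response(err.code, "", "", required_keys)
```

Call `check_url` with your endpoint and the top-level keys you expect to see. If the returned list is not empty, fix those problems before you write any pipeline code.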
These are starting points, not requirements. You can combine ideas or invent your own.
These return fresh data on every run: good for pipelines that run on a schedule.
| Project | Data Source | What to Store |
|---|---|---|
| Weather tracker | Open-Meteo API (no key needed) | Hourly forecasts for Amsterdam, Rotterdam, Utrecht |
| Eredivisie standings | Football-Data.org (free key, register with email) | Dutch league tables and match results |
| F1 race results | Ergast mirror (no key needed) | 2024 race results and Verstappen's championship season; swap 2024 for current to track the live season as it grows |
| Cryptocurrency prices | CoinGecko API (no key needed) | Price snapshots for top 10 coins in EUR |
| GitHub activity monitor | GitHub REST API (no key for public repos) | Commits and PR stats for a repo you follow |
| Space launches | Launch Library 2 (no key needed) | Upcoming launches with rocket, agency, and pad details |
These are snapshot CSV files you can add on top of your API pipeline as a bonus. They don't change between runs, so they work best as a second data source alongside your live API: load the CSV once to seed a reference table, then join or enrich your API data against it. Good for practising data cleaning with pandas when the point is the transformation, not the freshness.
| Project | Data Source | What's messy | What to Store |
|---|---|---|---|
| Amsterdam Airbnb listings | Inside Airbnb CSV | neighbourhood_group empty on most rows; last_review null for new listings; license often blank | Listing name, neighbourhood, room type, price, availability |
| Dutch vehicles (RDW) | RDW Open Data (no key needed) | Field names in Dutch; dates as YYYYMMDD integers; "Geen verstrekking in Open Data" used as a null sentinel | Licence plate, brand, model, APK expiry, colour, weight |
| World Cup matches | TidyTuesday CSV | winning_team is "NA" string on draws; win_conditions has inconsistent spacing ("3- 2" vs "3-2"); stage names are unnormalised | Home/away teams, scores, outcome, stage, year |
| Eurovision history | TidyTuesday CSV | rank and total_points are "NA" strings (2020 cancelled); qualified is "TRUE"/"FALSE" text | Country, artist, song, year, points, final rank |
| Netflix titles | TidyTuesday CSV | director empty for 30% of rows; date_added is "August 14, 2020" string; duration is "93 min" or "4 Seasons" (split into value + unit) | Title, type, country, year added, rating, cleaned duration |
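Two of the "What's messy" items above can be cleaned with pandas roughly as in the sketch below. The sample rows are invented, and the column names (`kenteken`, `vervaldatum_apk`, `eerste_kleur`) are assumptions modelled on the descriptions in the table; check the real files before relying on them.

```python
import pandas as pd

# Sentinel string the RDW data uses in place of a missing value.
RDW_NULL = "Geen verstrekking in Open Data"

# Invented sample rows; column names are assumptions, not verified
# against the real dataset.
SAMPLE_RDW = pd.DataFrame({
    "kenteken": ["AB123C", "XY987Z"],
    "vervaldatum_apk": [20250314, 20241101],
    "eerste_kleur": ["GRIJS", RDW_NULL],
})


def clean_rdw(df: pd.DataFrame) -> pd.DataFrame:
    # Rename Dutch field names to English ones.
    out = df.rename(columns={
        "kenteken": "licence_plate",
        "vervaldatum_apk": "apk_expiry",
        "eerste_kleur": "colour",
    })
    # Replace the sentinel string with a real missing value.
    out["colour"] = out["colour"].mask(out["colour"] == RDW_NULL)
    # Parse YYYYMMDD integers into proper dates; bad values become NaT.
    out["apk_expiry"] = pd.to_datetime(
        out["apk_expiry"].astype(str), format="%Y%m%d", errors="coerce"
    )
    return out


def split_duration(duration: pd.Series) -> pd.DataFrame:
    # Split strings such as "93 min" or "4 Seasons" into value + unit.
    parts = duration.str.extract(r"^(?P<duration_value>\d+)\s+(?P<duration_unit>\w+)$")
    parts["duration_value"] = pd.to_numeric(parts["duration_value"])
    # Normalise the unit: lower-case and drop a trailing plural "s".
    parts["duration_unit"] = parts["duration_unit"].str.lower().str.rstrip("s")
    return parts
```

For example, `split_duration(pd.Series(["93 min", "4 Seasons"]))` returns one row per input with a numeric `duration_value` and a normalised `duration_unit`.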
<aside> 💡 If you are stuck choosing, start with Open-Meteo: it needs no API key, returns clean JSON, and the Week 6 examples already use it. A note on the others: Football-Data.org requires a free key (register at football-data.org, which takes about 2 minutes, and store the key as an environment variable). The GitHub API allows 60 unauthenticated requests per hour, which is tight during development, so create a free Personal Access Token to raise the limit to 5,000 per hour.
</aside>
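Reading the key from an environment variable, as suggested above, could look like the sketch below. The variable names `FOOTBALL_DATA_API_KEY` and `GITHUB_TOKEN` and the `build_headers` helper are hypothetical. The header names reflect how these two APIs are commonly called, but confirm them in each provider's documentation.

```python
import os


def build_headers(service: str) -> dict[str, str]:
    """Build request headers without hard-coding secrets in the source."""
    headers = {"Accept": "application/json"}
    if service == "football-data":
        # Fail fast with a KeyError if the key was never configured.
        headers["X-Auth-Token"] = os.environ["FOOTBALL_DATA_API_KEY"]
    elif service == "github":
        # The token is optional: without it, requests fall back to the
        # lower unauthenticated rate limit.
        token = os.environ.get("GITHUB_TOKEN")
        if token:
            headers["Authorization"] = f"Bearer {token}"
    return headers
```

Because the code reads the secret at runtime, the same image can run locally and in Azure with different keys, and the key never ends up in your repository.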
You are not limited to this list. If you have a different API in mind, check with your teacher before you start building.
<aside> ⚠️ Using your own API? Your teacher will verify it has enough structure to warrant real Pydantic validation and pandas work. If the response is too flat or simple, your teacher may ask you to also add one of the bonus CSV datasets above so the transform step has enough to work with.
</aside>
Keep it focused. A working pipeline with one data source and one storage target is a better project than an ambitious plan that is half-finished.
Good scope:
Too ambitious for one week:
You can always add stretch goals after the core pipeline works end to end.
| Day | Milestone |
|---|---|
| 1 | Pick data source, verify API works, scaffold project structure, deploy hello-world container to Azure |
| 2-3 | Pipeline works locally: ingests, validates, transforms with pandas, stores in Postgres/Blob |
| 4 | Replace hello-world with real pipeline, push to ACR, create scheduled job (--trigger-type Schedule), trigger a manual run to verify |
| 5 | Polish, finalize README, prepare for technical interview |
<aside> ⌨️ Hands on: Deploy the hello-world container on Day 1, before your pipeline code is written. This proves your Azure setup works early. If you hit firewall issues or image pull errors, you have four days to fix them instead of four hours.
</aside>
When you are ready to start coding, see Chapter 2: Project Requirements for the starter template and the requirements checklist.
<aside>
💡 Using AI to help: Use an LLM to help you explore API documentation, generate Pydantic models from example JSON responses, or draft your README. Document what you used in AI_ASSIST.md. (⚠️ Ensure no PII or sensitive company data is included!)
</aside>
Next up: Project Requirements, where you will find the minimum requirements checklist and what a complete project run looks like.
The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0 *https://hackyourfuture.net/*
