Project Requirements

This chapter lists exactly what your project must include to pass, what goes beyond the minimum, and how the assessment works. Use the checklists to track your progress.

Minimum requirements

Your project must include all of the following. Missing any item means the project is incomplete.

Data pipeline

[ ] Uses a live external API as the data source
[ ] Validates incoming data before storage (Pydantic models or equivalent)
[ ] Uses pandas to transform or process data before storage (at minimum: parse timestamps, handle nulls, flatten nested fields, or derive a new column; creating a DataFrame and immediately storing it does not count)
[ ] Creates a table in Azure Postgres and writes rows to it
[ ] Writes raw or transformed data to Azure Blob Storage
[ ] Uses logging instead of print for status messages
[ ] Fails fast if required environment variables are missing
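
The last two items can be covered in a few lines at startup. The following is a minimal sketch, not the template's actual code: the variable names in `REQUIRED_VARS` and the helper names `configure_logging` and `missing_env_vars` are assumptions chosen for illustration, so swap in whatever your pipeline really needs.

```python
import logging
import os

# Hypothetical list: replace with the variables your own pipeline requires.
REQUIRED_VARS = ("POSTGRES_URL", "AZURE_STORAGE_CONNECTION_STRING")

logger = logging.getLogger("pipeline")


def configure_logging() -> None:
    """Send timestamped INFO-level messages to stderr instead of using print."""
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(message)s",
        datefmt="%Y-%m-%d %H:%M:%S",
    )


def missing_env_vars(required, environ=os.environ):
    """Return the names of required variables that are unset or empty."""
    return [name for name in required if not environ.get(name)]
```

At the top of your entry point you could call `configure_logging()`, then check `missing_env_vars(REQUIRED_VARS)` and exit with a non-zero status (for example `sys.exit(1)` after `logger.error(...)`) if the list is not empty. Checking before any API call or database connection is what makes the failure fast.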

Containerization

[ ] Has a Dockerfile that builds without errors
[ ] Runs locally with docker run and produces correct output
[ ] Includes a .env.example listing required environment variables (without actual values)
[ ] Uses pinned dependency versions (pyproject.toml + uv.lock, or requirements.txt with pinned versions)

Testing

[ ] At least 2 pytest tests (e.g., Pydantic model accepts valid data, rejects invalid data)
[ ] Tests pass locally with pytest tests/ -v
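
As an illustration of the two-test minimum, here is a sketch of what tests/test_models.py might contain. The `WeatherReading` fields and the temperature bounds are assumptions, not the template's real model; in a real project you would import the model from models.py rather than define it in the test file. The rejection test uses plain try/except so the example has no pytest import, but `pytest.raises(ValidationError)` is the more common way to write it.

```python
from datetime import datetime

from pydantic import BaseModel, Field, ValidationError


class WeatherReading(BaseModel):
    """Hypothetical data shape: one timestamped temperature reading."""

    timestamp: datetime
    temperature_c: float = Field(ge=-90, le=60)  # assumed plausible range


def test_accepts_valid_reading():
    reading = WeatherReading(timestamp="2026-03-30T10:00:00", temperature_c=12.5)
    assert reading.temperature_c == 12.5
    assert reading.timestamp == datetime(2026, 3, 30, 10, 0, 0)


def test_rejects_out_of_range_temperature():
    try:
        WeatherReading(timestamp="2026-03-30T10:00:00", temperature_c=999)
    except ValidationError:
        return  # expected: the value is outside the allowed range
    raise AssertionError("expected ValidationError for temperature_c=999")
```

pytest collects any function whose name starts with `test_`, so both functions run with pytest tests/ -v.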

CI/CD

[ ] GitHub Actions workflow that runs linting, formatting, and tests on push
[ ] Docker image pushed to Azure Container Registry (manually or through CI)

Azure deployment

[ ] Image exists in ACR with a tagged version
[ ] Container App Job created in the shared environment
[ ] Job runs successfully and produces verifiable output (execution history shows Succeeded, rows in Postgres and blobs in Blob Storage)
[ ] Job is created with --trigger-type Schedule, --cron-expression "0 6 * * *", --registry-server, --replica-timeout 300, and --env-vars (covered in Week 6 Chapter 5)

<aside> ⚠️ If Azure resources are unavailable (Postgres unreachable, ACR quota exceeded, Container Apps environment missing), contact your teacher immediately. Do not wait until Day 4 to discover a shared infrastructure issue.

</aside>

Documentation

[ ] README.md with: what the pipeline does, how to run locally, how to trigger in Azure, how to verify results
[ ] AI_ASSIST.md documenting your AI usage (prompts, outputs, what you changed)
[ ] Architecture overview: a short description or diagram of the data flow

Git workflow

[ ] Never pushed directly to main
[ ] Opened and merged at least 3 pull requests during the week (e.g., feat/pipeline, feat/docker, feat/azure-deploy)
[ ] Commit messages describe what changed and why

Cleanup

[ ] Container App Job deleted after the project is evaluated
[ ] No secrets committed to git (connection strings, API keys, passwords)

<aside> ⌨️ Hands on: Before you start coding, copy the minimum requirements checklist above into a file or issue tracker. Check off each item as you complete it. This prevents the "I thought I was done but forgot CI" moment on the last day.

</aside>

Bonus

These are optional but demonstrate deeper understanding:

Error handling: retry failed API calls, send a notification on failure
Static dataset enrichment: load one of the messy CSV files from Chapter 1 as a reference table and join or enrich your API data against it
Aggregated summary table: persist both raw rows AND an aggregated summary (e.g. daily averages or totals) as a second Postgres table
Monitoring: add a health check or log-based alert
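
For the error-handling bonus, one possible approach is a small retry helper with exponential backoff. This is a sketch under assumptions: `call_with_retry` and its parameters are hypothetical names, and you may prefer to catch only the specific exceptions your HTTP client raises instead of `Exception`.

```python
import logging
import time

logger = logging.getLogger("pipeline")


def call_with_retry(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); on failure wait base_delay, 2*base_delay, ... and retry.

    Re-raises the last exception once all attempts are used up.
    The sleep parameter exists so tests can avoid real waiting.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception as exc:
            if attempt == attempts:
                logger.error("Giving up after %d attempts: %s", attempts, exc)
                raise
            delay = base_delay * 2 ** (attempt - 1)
            logger.warning(
                "Attempt %d/%d failed (%s); retrying in %.1fs",
                attempt, attempts, exc, delay,
            )
            sleep(delay)
```

You could wrap your API call as `call_with_retry(fetch_data)`. The point where the helper re-raises is also a natural place to send a failure notification.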

Project structure

Use a clear layout. Here is a recommended structure:

week7-project/
├── src/
│   ├── pipeline.py        # Main pipeline logic
│   ├── models.py          # Pydantic validation models
│   └── storage.py         # Database/blob storage functions
├── tests/
│   └── test_models.py     # Pydantic model tests
├── .github/
│   └── workflows/
│       └── ci.yml         # Linting, formatting, and tests
├── conftest.py            # Makes src/ importable when running pytest
├── Dockerfile
├── pyproject.toml
├── uv.lock
├── .env.example
├── .gitignore
├── README.md
└── AI_ASSIST.md

Getting started with the template

A starter template with this structure is available at github.com/HackYourFuture/data-mid-project.

Open the template repo.

Click Use this template → Create a new repository.

Give your repo a name, set it to Public, and click Create repository.

Clone your new repo and start replacing the stubs with your own logic.

The template gives you working boilerplate so you can focus on your own logic:

Ready to use: logging setup, env var validation at startup, Dockerfile, CI workflow, .env.example, test structure
Replace with your own: fetch_data() in pipeline.py (your API call), WeatherReading in models.py (your data shape), the transform() function in pipeline.py (your pandas logic), the table schema and INSERT in storage.py

The transform() function already contains commented-out examples: parsing timestamps, deriving columns, dropping nulls, renaming columns. Remove the examples that don't apply and add your own.
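
To make those four operations concrete, here is a rough sketch of a transform step. It is not the template's code: the column names `timestamp` and `temperature_c` and the derived `temp_f` column are assumptions for a weather-style dataset, so adapt them to your own data shape.

```python
import pandas as pd


def transform(records: list[dict]) -> pd.DataFrame:
    """Turn validated records into a DataFrame ready for storage."""
    df = pd.DataFrame(records)

    # Parse ISO timestamp strings into timezone-aware datetimes (UTC).
    df["timestamp"] = pd.to_datetime(df["timestamp"], utc=True)

    # Drop rows where the measurement is missing.
    df = df.dropna(subset=["temperature_c"])

    # Rename a column to match the (hypothetical) table schema.
    df = df.rename(columns={"temperature_c": "temp_c"})

    # Derive a new column from an existing one.
    df["temp_f"] = df["temp_c"] * 9 / 5 + 32

    return df.reset_index(drop=True)
```

Any one of these steps satisfies the pandas requirement; creating the DataFrame and storing it unchanged does not.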

Start by calling your API locally and replacing the fetch_data() stub. Everything else can stay in place until the pipeline works end to end.

<aside> 💡 You do not have to use this exact layout, but your project should be organized enough that someone else can find and understand each component.

</aside>

What a complete run looks like

Here is example output from a successful pipeline run, first locally and then verified in Azure:

Local run:

$ docker run --env-file .env my-pipeline
2026-03-30 10:00:01 INFO Pipeline starting
2026-03-30 10:00:02 INFO Fetched 24 records from Open-Meteo API
2026-03-30 10:00:02 INFO Validated 24 / 24 records
2026-03-30 10:00:02 INFO Transformed 24 rows
2026-03-30 10:00:03 INFO Inserted 24 rows into Postgres
2026-03-30 10:00:03 INFO Uploaded raw data to blob: pipeline/2026-03-30_100003.json
2026-03-30 10:00:03 INFO Pipeline finished: 24 records stored

Azure verification:

$ az containerapp job execution list --name weather-job --resource-group rg-hyf-data --output table
Name              Status     StartTime
----------------  ---------  -------------------
weather-job-abc1  Succeeded  2026-03-30T10:05:00

$ psql "$POSTGRES_URL" -c "SELECT COUNT(*) FROM your_table_name;"  # replace with your table name
 count
-------
    24

$ az storage blob list --account-name hyfstoragedev --container-name raw --prefix pipeline/ --output table
Name                                  Last Modified
------------------------------------  -------------------
pipeline/2026-03-30_100003.json       2026-03-30T10:00:03

Your output will have different numbers and names, but the pattern is the same: logs show counts, Postgres has rows, blob storage has files.
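
The blob name in the sample logs follows a timestamp pattern. If you want something similar, the sketch below shows one way to build it; the function name `blob_name` and the default `pipeline` prefix are assumptions based on the sample output, not a required convention.

```python
from datetime import datetime


def blob_name(run_time: datetime, prefix: str = "pipeline") -> str:
    """Build a blob path such as pipeline/2026-03-30_100003.json."""
    return f"{prefix}/{run_time:%Y-%m-%d_%H%M%S}.json"
```

Including the run time in the name means each run writes a new blob instead of overwriting the previous one, which makes it easier to verify that a scheduled job actually ran.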

Assessment

The project is evaluated through a 15-20 minute technical interview on Day 5. It has four parts:

| Part | Duration | What is evaluated |
| --- | --- | --- |
| Technical questions | 5-7 min | Can you discuss the concepts behind your project? (APIs, Pydantic, pandas, Docker, Azure) |
| Project demo | 5-7 min | Does your deployment work? Can you show the evidence? |
| Code discussion | 5-7 min | Can you explain why you made specific code decisions? |
| Code comprehension | 5 min | Can you read unfamiliar pipeline code, find a bug, and suggest improvements? |

Pass threshold: A grade of 6.0 or higher (out of 10), with no part scored 0.

The interview tests understanding, not memorization. Honest answers about trade-offs and limitations score higher than polished answers that avoid uncertainty. See section 6 of Chapter 3: Gotchas & Pitfalls for specific preparation advice.

Submission

Ensure all your PRs are merged into main.
Your final main branch is the submission.
Include in the final PR description: evidence of a successful Container App Job execution (a screenshot or CLI output) and a short summary of what the pipeline does.

<aside> ⚠️ Delete your Container App Job after the teacher has evaluated your project. Do not leave jobs running. See Week 6 Chapter 6 for why this matters.

</aside>

Before submitting, review your work against the minimum requirements checklist above. A common mistake is submitting a working pipeline but forgetting CI, documentation, or cleanup.

<aside> 💡 Using AI to help: Ask an LLM to review your Dockerfile or README.md for common mistakes before submitting. Prompt: "Review this Dockerfile for a Python data pipeline and point out any issues." Always verify the suggestions yourself. (⚠️ Ensure no PII or sensitive company data is included!)

</aside>

Before you move on, here is some context on why checklists are a legitimate tool in professional engineering.

<aside> 🤓 Curious Geek: Why checklists work

</aside>

Extra reading

Pydantic documentation: data validation library
GitHub Actions quickstart: setting up CI workflows
Azure Container Apps Jobs: official deployment guide

Next up: Gotchas & Pitfalls, where you will find the most common mistakes students make during the Week 7 project and how to avoid them.