Week 4 - Data Processing with Pandas
Introduction to Pandas and DataFrames
Selecting, Filtering, and Sorting Data
Joining and Merging DataFrames
Working with Strings and Dates
Assignment: MessyCorp Goes Pandas
Remember Week 2? You built a pure-Python pipeline to clean MessyCorp's messy sales data using the csv module, dataclasses, and hand-written loops. It worked, but MessyCorp has grown: the data volume is 10× larger, management wants customer-enriched reports, and pure-Python loops are slow at this scale.
Time to rebuild the pipeline with Pandas.
Use the provided CSV files from this repository:
- `Data Track/Week 4/data/messy_sales.csv` (~120 rows, the same schema as Week 2, but bigger)
- `Data Track/Week 4/data/messy_customers.csv` (~37 customers linked by email)

Copy them into your assignment folder under `data/`.
These files are intentionally messy. Expect the same problems from Week 2 (bad casing, missing values, invalid dates) plus new ones: duplicate rows, outlier prices, and join-key mismatches.
Load both files with pd.read_csv(). Then explore:
- `.info()` to see column types and missing values.
- `.describe()` to spot numeric outliers.
- `.head(20)` to eyeball the mess.
- `.isna().sum()` to count missing values per column.

<aside> 💡 Compare this to Week 2, where you had to loop through rows to discover problems. Pandas gives you a full picture in seconds.
</aside>
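The exploration steps above can be sketched as follows. The inline sample below is a stand-in for `data/messy_sales.csv` (its column names are illustrative; use your real file paths and schema):

```python
import io
import pandas as pd

# Tiny inline sample standing in for data/messy_sales.csv.
# In practice: sales = pd.read_csv("data/messy_sales.csv")
raw = io.StringIO(
    "transaction_id,product_name,price,quantity,order_date\n"
    "T1,  widget ,9.99,2,2024-03-01\n"
    "T2,GADGET,not_a_number,1,03/02/2024\n"
    "T2,GADGET,not_a_number,1,03/02/2024\n"
    "T3,widget,4999.99,,2024-03-05\n"
)
sales = pd.read_csv(raw)

sales.info()              # column dtypes and non-null counts
print(sales.describe())   # numeric summary -- outliers stand out here
print(sales.head(20))     # eyeball the mess
print(sales.isna().sum()) # missing values per column
```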
Apply Pandas operations to fix the data. Think about how each step replaces a for loop from Week 2:
- `.str.strip().str.title()`, replacing your Week 2 string-cleaning function.
- `.str.lower().str.strip()` on both DataFrames.
- `pd.to_numeric(errors='coerce')`, which handles non-numeric values gracefully.
- `pd.to_datetime(errors='coerce')`, replacing your manual date validation.
- Handle rows with missing `product_name`, `price`, or `quantity`.
- `.drop_duplicates(subset='transaction_id', keep='first')`, a new concept not in Week 2.

<aside> 💡 Stuck on a cleaning step? Try describing the problem to an LLM: "I have a Pandas column with mixed-case strings and leading whitespace. How do I normalize it?" Compare its suggestion to the Pandas docs.
</aside>
<aside> ⌨️ Hands on: How should you handle outlier prices (e.g., a single item at 4999.99)? There's no single right answer: clip them, flag them, or leave them. Add a comment explaining your choice.
</aside>
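One way the cleaning steps above can fit together (a minimal sketch on an inline sample; the outlier handling shown here, flagging rather than dropping, is just one of the acceptable choices):

```python
import io
import pandas as pd

raw = io.StringIO(
    "transaction_id,product_name,price,quantity,order_date\n"
    "T1,  widget ,9.99,2,2024-03-01\n"
    "T2,gadget,oops,1,not-a-date\n"
    "T2,gadget,oops,1,not-a-date\n"
    "T3,doohickey,4999.99,3,2024-03-05\n"
)
sales = pd.read_csv(raw)

# Normalize text: strip whitespace, title-case product names
sales["product_name"] = sales["product_name"].str.strip().str.title()

# Coerce numerics and dates; bad values become NaN / NaT instead of raising
sales["price"] = pd.to_numeric(sales["price"], errors="coerce")
sales["order_date"] = pd.to_datetime(sales["order_date"], errors="coerce")

# Drop duplicate transactions, keeping the first occurrence
sales = sales.drop_duplicates(subset="transaction_id", keep="first")

# One option for outlier prices: flag them instead of dropping them
sales["price_outlier"] = sales["price"] > 1000
```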
- Normalize `customer_email` in both DataFrames using `.str.lower().str.strip()`.
- Merge on `customer_email` using an inner join.
- Add an `is_high_value` column: True where `price * quantity >= 150`.

<aside>
⌨️ Hands on: After your inner join, some orders disappear because their emails don't match any customer. Try a left join instead and inspect the rows where `customer_name` is NaN. What would you report to MessyCorp about these orphan orders?
</aside>
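The join steps can be sketched like this (tiny in-memory DataFrames stand in for the two cleaned files; the emails and columns are illustrative):

```python
import pandas as pd

orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "customer_email": ["  Ana@X.com", "bo@y.com ", "ghost@z.com"],
    "price": [50.0, 80.0, 20.0],
    "quantity": [4, 1, 2],
})
customers = pd.DataFrame({
    "customer_email": ["ana@x.com", "bo@y.com"],
    "customer_name": ["Ana", "Bo"],
})

# Normalize the join key on both sides before merging
for df in (orders, customers):
    df["customer_email"] = df["customer_email"].str.lower().str.strip()

# Inner join: orphan orders (unmatched emails) are silently dropped
inner = orders.merge(customers, on="customer_email", how="inner")

# Left join keeps orphan orders; their customer_name is NaN
left = orders.merge(customers, on="customer_email", how="left")
orphans = left[left["customer_name"].isna()]

# Flag high-value orders
left["is_high_value"] = left["price"] * left["quantity"] >= 150
```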
Create these summary tables using groupby and named aggregations:
- Weekly revenue: derive a `week` column using `.dt.isocalendar().week`, then group by `week` and `region`. Columns: `week`, `region`, `total_revenue`, `order_count`.
- Customer summary: group by `customer_email`. Columns: `customer_email`, `customer_name`, `region`, `loyalty_tier`, `total_spent`, `avg_order`, `order_count`.
- Category performance: group by `category`. Columns: `category`, `total_revenue`, `order_count`.
- Loyalty tier summary: group by `loyalty_tier`. Columns: `loyalty_tier`, `avg_spent`, `customer_count`.

<aside>
💡 In Week 2, building just one report table took a lot of loop-based code. With Pandas groupby, each report is 2-3 lines.
</aside>
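The weekly revenue table, for example, can be built in a few lines with named aggregations (a sketch on inline sample data; your real column names may differ):

```python
import pandas as pd

orders = pd.DataFrame({
    "order_date": pd.to_datetime(["2024-03-01", "2024-03-02", "2024-03-08"]),
    "region": ["North", "North", "South"],
    "price": [10.0, 20.0, 30.0],
    "quantity": [1, 2, 1],
})

# Derive the ISO week number from the order date
orders["week"] = orders["order_date"].dt.isocalendar().week
orders["revenue"] = orders["price"] * orders["quantity"]

# Named aggregations: each output column is (source_column, agg_function)
weekly_revenue = (
    orders.groupby(["week", "region"], as_index=False)
          .agg(total_revenue=("revenue", "sum"),
               order_count=("revenue", "size"))
)
```

The other report tables follow the same pattern with different group keys and aggregations.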
Write the report tables to an output/ folder:
- `weekly_revenue.csv`
- `customer_summary.parquet`
- `category_performance.csv`

You must upload at least one output file to Azure Blob Storage. Use `weekly_revenue.csv`.
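Writing the output files can be sketched as follows (`weekly_revenue` here is a stand-in DataFrame; the Parquet line is commented out because it needs an extra engine such as pyarrow):

```python
from pathlib import Path

import pandas as pd

out_dir = Path("output")
out_dir.mkdir(exist_ok=True)

# Stand-in for the real report table built in the previous step
weekly_revenue = pd.DataFrame({
    "week": [9], "region": ["North"], "total_revenue": [50.0], "order_count": [2],
})

weekly_revenue.to_csv(out_dir / "weekly_revenue.csv", index=False)

# Parquet output needs an engine installed, e.g. pip install pyarrow:
# customer_summary.to_parquet(out_dir / "customer_summary.parquet", index=False)
```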
Name your container `week4-outputs`.

<aside> 💡 Azure calls buckets containers.
</aside>
Store the connection string in the environment variable `AZURE_STORAGE_CONNECTION_STRING`.

<aside> ⚠️ Do not hardcode secrets or commit them to git.
</aside>
Install the SDK and upload the file:
```bash
pip install azure-storage-blob
```

```python
from pathlib import Path
import os

from azure.storage.blob import BlobServiceClient

# Connection string comes from the environment -- never hardcode it
connection_string = os.environ["AZURE_STORAGE_CONNECTION_STRING"]
container_name = "week4-outputs"
file_path = Path("output/weekly_revenue.csv")
blob_name = file_path.name

service = BlobServiceClient.from_connection_string(connection_string)
container = service.get_container_client(container_name)
if not container.exists():
    container.create_container()

with open(file_path, "rb") as file_handle:
    container.upload_blob(blob_name, file_handle, overwrite=True)
```
Create a short file AI_ASSIST.md and describe:
```
week4-assignment/
├── data/
│   ├── messy_sales.csv
│   └── messy_customers.csv
├── output/
├── src/
│   ├── ingest.py
│   ├── clean.py
│   ├── transform.py
│   └── report.py
├── AI_ASSIST.md
└── README.md
```
- Use `pathlib.Path` for file paths.
- Use `logging` instead of `print` for pipeline status.