Week 4 - Data Processing with Pandas

Introduction to Pandas and DataFrames

Selecting, Filtering, and Sorting Data

Grouping and Aggregation

Joining and Merging DataFrames

Working with Strings and Dates

Advanced Transformations

Writing Data

Visualizing Data with Pandas

Alternatives to Pandas

Practice

Assignment: MessyCorp Goes Pandas

Gotchas & Pitfalls

Lesson Plan

🎒 Assignment: MessyCorp Goes Pandas

Remember Week 2? You built a pure-Python pipeline to clean MessyCorp's messy sales data using the csv module, dataclasses, and hand-written loops. It worked, but MessyCorp has grown. The data volume is 10× bigger, management wants customer-enriched reports, and pure-Python loops become slow at this scale.

Time to rebuild the pipeline with Pandas.

Task 1: Get the Input Data

Use the provided CSV files from this repository: messy_sales.csv and messy_customers.csv.

Copy them into your assignment folder under data/.

These files are intentionally messy. Expect the same problems from Week 2 (bad casing, missing values, invalid dates) plus new ones: duplicate rows, outlier prices, and join-key mismatches.

Task 2: Load and Explore

Load both files with pd.read_csv(). Then explore:

  1. Use .info() to see column types and missing values.
  2. Use .describe() to spot numeric outliers.
  3. Use .head(20) to eyeball the mess.
  4. Use .isna().sum() to count missing values per column.
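The four exploration calls above can be sketched as follows. This is a toy stand-in for data/messy_sales.csv (the column names are assumptions from the task description; your real file may differ):

```python
import io
import pandas as pd

# Toy CSV standing in for data/messy_sales.csv -- column names are illustrative
csv_text = """transaction_id,product_name,price,order_date
1,  widget ,9.99,2024-01-05
2,GADGET,abc,2024-13-40
3,gizmo,4999.99,
"""
sales = pd.read_csv(io.StringIO(csv_text))

sales.info()                  # column dtypes and non-null counts
print(sales.describe())       # numeric summary; outliers show up in min/max
print(sales.head(20))         # eyeball the mess
missing = sales.isna().sum()  # missing values per column
print(missing)
```

Note how `price` loads as an object column because of the non-numeric value — `.info()` surfaces that immediately, where Week 2 needed a loop and a try/except.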

<aside> 💡 Compare this to Week 2, where you had to loop through rows to discover problems. Pandas gives you a full picture in seconds.

</aside>

Task 3: Clean the Sales Data

Apply Pandas operations to fix the data. Think about how each step replaces a for loop from Week 2:

  1. Normalize product names: .str.strip().str.title(), replacing your Week 2 string-cleaning function.
  2. Normalize customer emails: .str.lower().str.strip() on both DataFrames.
  3. Convert price to numeric: pd.to_numeric(errors='coerce'), which handles non-numeric values gracefully.
  4. Parse dates: pd.to_datetime(errors='coerce'), replacing your manual date validation.
  5. Drop bad rows: remove rows whose price or date failed to parse (NaN/NaT after the coercions above), e.g. with .dropna(subset=[...]).
  6. Remove duplicate transactions: Use .drop_duplicates(subset='transaction_id', keep='first'), a new concept not in Week 2.
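Put together, the cleaning steps above might look like this sketch (the frame and its column names are assumptions standing in for your loaded sales data):

```python
import pandas as pd

# Toy frame standing in for the loaded sales data -- column names are assumptions
sales = pd.DataFrame({
    "transaction_id": [1, 2, 2, 3],
    "product_name": ["  widget ", "GADGET", "GADGET", "gizmo"],
    "price": ["9.99", "abc", "abc", "4999.99"],
    "order_date": ["2024-01-05", "2024-02-30", "2024-02-30", None],
})

# 1. Normalize product names (replaces the Week 2 string-cleaning function)
sales["product_name"] = sales["product_name"].str.strip().str.title()

# 2. Coerce price to numeric; unparseable values become NaN
sales["price"] = pd.to_numeric(sales["price"], errors="coerce")

# 3. Parse dates; invalid dates (like Feb 30) become NaT
sales["order_date"] = pd.to_datetime(sales["order_date"], errors="coerce")

# 4. Drop rows whose price or date failed to parse
sales = sales.dropna(subset=["price", "order_date"])

# 5. Remove duplicate transactions, keeping the first occurrence
sales = sales.drop_duplicates(subset="transaction_id", keep="first")
```

Each numbered line replaces what was an entire loop (or function) in your Week 2 pipeline.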

<aside> 💡 Stuck on a cleaning step? Try describing the problem to an LLM: "I have a Pandas column with mixed-case strings and leading whitespace. How do I normalize it?" Compare its suggestion to the Pandas docs.

</aside>

<aside> ⌨️ Hands on: How should you handle outlier prices (e.g., a single item at 4999.99)? There's no single right answer: clip them, flag them, or leave them. Add a comment explaining your choice.

</aside>
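Two of the outlier options mentioned above, sketched on a toy price series (the 99th-percentile threshold is one arbitrary choice, not the "right" one):

```python
import pandas as pd

prices = pd.Series([9.99, 12.50, 11.00, 4999.99])

# Option A: flag outliers but leave the values untouched
threshold = prices.quantile(0.99)
is_outlier = prices > threshold

# Option B: clip extreme values down to the chosen cap
clipped = prices.clip(upper=threshold)
```

Whichever option you pick, the comment explaining *why* is the part MessyCorp will care about.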

Task 4: Clean and Join Customer Data

  1. Normalize customer_email in both DataFrames using .str.lower().str.strip().
  2. Merge sales with customers on customer_email using an inner join.
  3. Add a is_high_value column: True where price * quantity >= 150.
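The three steps above, as a minimal sketch (toy frames with assumed column names in place of your real DataFrames):

```python
import pandas as pd

sales = pd.DataFrame({
    "customer_email": ["ann@x.com", "BOB@X.COM ", "carol@x.com"],
    "price": [10.0, 80.0, 20.0],
    "quantity": [1, 2, 3],
})
customers = pd.DataFrame({
    "customer_email": ["ann@x.com", "bob@x.com"],
    "customer_name": ["Ann", "Bob"],
})

# 1. Normalize the join key on both sides
for df in (sales, customers):
    df["customer_email"] = df["customer_email"].str.lower().str.strip()

# 2. Inner join: only orders with a matching customer survive
enriched = sales.merge(customers, on="customer_email", how="inner")

# 3. Flag high-value orders
enriched["is_high_value"] = enriched["price"] * enriched["quantity"] >= 150
```

Without the normalization in step 1, "BOB@X.COM " would silently fail to match "bob@x.com" — the same bug class you hunted by hand in Week 2.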

<aside> ⌨️ Hands on: After your inner join, some orders disappeared, because they have emails that don't match any customer. Try a left join instead and inspect the rows where customer_name is NaN. What would you report to MessyCorp about these orphan orders?

</aside>
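Inspecting the orphans from the hands-on above might look like this (toy frames; a hypothetical unmatched email stands in for the real mismatches):

```python
import pandas as pd

sales = pd.DataFrame({
    "customer_email": ["ann@x.com", "ghost@x.com"],  # ghost@ has no customer record
    "price": [10.0, 99.0],
})
customers = pd.DataFrame({
    "customer_email": ["ann@x.com"],
    "customer_name": ["Ann"],
})

# Left join keeps every order; unmatched orders get NaN customer fields
joined = sales.merge(customers, on="customer_email", how="left")

# Orphan orders: present in sales but missing from the customer table
orphans = joined[joined["customer_name"].isna()]
```

The row count of `orphans` (and the revenue it represents) is exactly the number you would report to MessyCorp.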

Task 5: Build Report Tables

Create these summary tables using groupby and named aggregations:

  1. Weekly revenue by region
  2. Customer summary
  3. Category performance
  4. Loyalty tier analysis
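The groupby-with-named-aggregation pattern behind each report, sketched on a toy enriched frame (column names are assumptions; the real weekly report would also group on a date column, e.g. via pd.Grouper):

```python
import pandas as pd

enriched = pd.DataFrame({
    "region": ["North", "North", "South"],
    "price": [10.0, 20.0, 30.0],
    "quantity": [1, 2, 1],
})
enriched["revenue"] = enriched["price"] * enriched["quantity"]

# Named aggregation: each keyword becomes a column in the output
by_region = enriched.groupby("region").agg(
    total_revenue=("revenue", "sum"),
    order_count=("revenue", "size"),
).reset_index()
```

Swap `"region"` for `"customer_email"`, `"category"`, or `"loyalty_tier"` and the same three lines produce the other report tables.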

<aside> 💡 In Week 2, building just one report table took a lot of loop-based code. With Pandas groupby, each report is 2-3 lines.

</aside>

Task 6: Write Outputs

Write the report tables to an output/ folder:
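A minimal sketch of writing one report table; the DataFrame here is a placeholder, and the file name matches the one Task 7 uploads:

```python
from pathlib import Path
import pandas as pd

out_dir = Path("output")
out_dir.mkdir(exist_ok=True)

# Placeholder report table standing in for your real weekly revenue summary
weekly_revenue = pd.DataFrame({"region": ["North"], "revenue": [50.0]})

# index=False keeps pandas' row index out of the CSV
weekly_revenue.to_csv(out_dir / "weekly_revenue.csv", index=False)
```

Repeat `.to_csv()` (with `index=False`) for each of the four report tables from Task 5.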

Task 7: Upload to Azure Blob Storage (Required)

You must upload at least one output file to Azure Blob Storage. Use weekly_revenue.csv.

Step 1: Create a Storage Account (Azure Portal)

  1. Go to Azure Portal.
  2. Search for Storage accounts and click Create.
  3. Fill in the basics: subscription, resource group, a globally unique storage account name, and a region.
  4. Click Review + create, then Create.

Step 2: Create a Container

  1. Open your storage account.
  2. Go to Data storage -> Containers.
  3. Create a container named week4-outputs.
  4. Access level: Private.

<aside> 💡 What other object stores call buckets, Azure Blob Storage calls containers.

</aside>

Step 3: Get a Connection String

  1. In the storage account, go to Access keys.
  2. Copy the Connection string for key1.
  3. Store it in an environment variable called AZURE_STORAGE_CONNECTION_STRING.

<aside> ⚠️ Do not hardcode secrets or commit them to git.

</aside>

Step 4: Upload with Python

Install the SDK:

```shell
pip install azure-storage-blob
```

Then upload the file:

```python
from pathlib import Path
import os

from azure.storage.blob import BlobServiceClient

# Read the connection string from the environment -- never hardcode it
connection_string = os.environ["AZURE_STORAGE_CONNECTION_STRING"]
container_name = "week4-outputs"

file_path = Path("output/weekly_revenue.csv")
blob_name = file_path.name  # the blob is named after the local file

service = BlobServiceClient.from_connection_string(connection_string)
container = service.get_container_client(container_name)

# Create the container on first run if it doesn't exist yet
if not container.exists():
    container.create_container()

with open(file_path, "rb") as file_handle:
    container.upload_blob(blob_name, file_handle, overwrite=True)
```

Task 8: AI Assist Report

Create a short file AI_ASSIST.md and describe:

Technical Requirements

week4-assignment/
├── data/
│   ├── messy_sales.csv
│   └── messy_customers.csv
├── output/
├── src/
│   ├── ingest.py
│   ├── clean.py
│   ├── transform.py
│   └── report.py
├── AI_ASSIST.md
└── README.md

Extra reading