Week 4 - Data Processing with Pandas
Introduction to Pandas and DataFrames
Selecting, Filtering, and Sorting Data
Joining and Merging DataFrames
Working with Strings and Dates
Assignment: MessyCorp Goes Pandas
Remember Week 2? You built a pure-Python pipeline to clean MessyCorp's messy sales data using the csv module, dataclasses, and hand-written loops. It worked, but MessyCorp has grown: the data volume is 10× larger, management wants customer-enriched reports, and pure-Python loops are slow at this scale.
Time to rebuild the pipeline with Pandas.
Use the provided CSV files from this repository:
- `Data Track/Week 4/data/messy_sales.csv` (~120 rows, the same schema as Week 2, but bigger)
- `Data Track/Week 4/data/messy_customers.csv` (~37 customers linked by email)

Copy them into your assignment folder under `data/`.
These files are intentionally messy. Expect the same problems from Week 2 (bad casing, missing values, invalid dates) plus new ones: duplicate rows, outlier prices, and join-key mismatches.
Load both files with pd.read_csv(). Then explore:
- `.info()` to see column types and missing values.
- `.describe()` to spot numeric outliers.
- `.head(20)` to eyeball the mess.
- `.isna().sum()` to count missing values per column.

<aside> 💡 Compare this to Week 2, where you had to loop through rows to discover problems. Pandas gives you a full picture in seconds.
</aside>
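The exploration steps above can be sketched as follows. The inline sample below is a stand-in for `data/messy_sales.csv` (its column names are illustrative; use your real file paths and schema):

```python
import io
import pandas as pd

# Tiny inline sample standing in for data/messy_sales.csv.
# In practice: sales = pd.read_csv("data/messy_sales.csv")
raw = io.StringIO(
    "transaction_id,product_name,price,quantity,order_date\n"
    "T1,  widget ,9.99,2,2024-03-01\n"
    "T2,GADGET,not_a_number,1,03/02/2024\n"
    "T2,GADGET,not_a_number,1,03/02/2024\n"
    "T3,widget,4999.99,,2024-03-05\n"
)
sales = pd.read_csv(raw)

sales.info()              # column dtypes and non-null counts
print(sales.describe())   # numeric summary -- outliers stand out here
print(sales.head(20))     # eyeball the mess
print(sales.isna().sum()) # missing values per column
```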
Apply Pandas operations to fix the data. Think about how each step replaces a for loop from Week 2:
- `.str.strip().str.title()`, replacing your Week 2 string-cleaning function.
- `.str.lower().str.strip()` on both DataFrames.
- `pd.to_numeric(errors='coerce')`, which handles non-numeric values gracefully.
- `pd.to_datetime(errors='coerce')`, replacing your manual date validation.
- Handle rows with missing `product_name`, `price`, or `quantity`.
- `.drop_duplicates(subset='transaction_id', keep='first')`, a new concept not in Week 2.

<aside> 💡 Stuck on a cleaning step? Try describing the problem to an LLM: "I have a Pandas column with mixed-case strings and leading whitespace. How do I normalize it?" Compare its suggestion to the Pandas docs.
</aside>
<aside> ⌨️ Hands on: How should you handle outlier prices (e.g., a single item at 4999.99)? There's no single right answer: clip them, flag them, or leave them. Add a comment explaining your choice.
</aside>
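One way the cleaning steps above can fit together (a minimal sketch on an inline sample; the outlier handling shown here, flagging rather than dropping, is just one of the acceptable choices):

```python
import io
import pandas as pd

raw = io.StringIO(
    "transaction_id,product_name,price,quantity,order_date\n"
    "T1,  widget ,9.99,2,2024-03-01\n"
    "T2,gadget,oops,1,not-a-date\n"
    "T2,gadget,oops,1,not-a-date\n"
    "T3,doohickey,4999.99,3,2024-03-05\n"
)
sales = pd.read_csv(raw)

# Normalize text: strip whitespace, title-case product names
sales["product_name"] = sales["product_name"].str.strip().str.title()

# Coerce numerics and dates; bad values become NaN / NaT instead of raising
sales["price"] = pd.to_numeric(sales["price"], errors="coerce")
sales["order_date"] = pd.to_datetime(sales["order_date"], errors="coerce")

# Drop duplicate transactions, keeping the first occurrence
sales = sales.drop_duplicates(subset="transaction_id", keep="first")

# One option for outlier prices: flag them instead of dropping them
sales["price_outlier"] = sales["price"] > 1000
```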
- Normalize `customer_email` in both DataFrames using `.str.lower().str.strip()`.
- Merge on `customer_email` using an inner join.
- Add an `is_high_value` column: True where `price * quantity >= 150`.

<aside>
⌨️ Hands on: After your inner join, some orders disappear because their emails don't match any customer. Try a left join instead and inspect the rows where `customer_name` is NaN. What would you report to MessyCorp about these orphan orders?
</aside>
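The join steps can be sketched like this (tiny in-memory DataFrames stand in for the two cleaned files; the emails and columns are illustrative):

```python
import pandas as pd

orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "customer_email": ["  Ana@X.com", "bo@y.com ", "ghost@z.com"],
    "price": [50.0, 80.0, 20.0],
    "quantity": [4, 1, 2],
})
customers = pd.DataFrame({
    "customer_email": ["ana@x.com", "bo@y.com"],
    "customer_name": ["Ana", "Bo"],
})

# Normalize the join key on both sides before merging
for df in (orders, customers):
    df["customer_email"] = df["customer_email"].str.lower().str.strip()

# Inner join: orphan orders (unmatched emails) are silently dropped
inner = orders.merge(customers, on="customer_email", how="inner")

# Left join keeps orphan orders; their customer_name is NaN
left = orders.merge(customers, on="customer_email", how="left")
orphans = left[left["customer_name"].isna()]

# Flag high-value orders
left["is_high_value"] = left["price"] * left["quantity"] >= 150
```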
Create these summary tables using groupby and named aggregations:
- Weekly revenue: derive a `week` column using `.dt.isocalendar().week`, then group by `week` and `region`. Columns: `week`, `region`, `total_revenue`, `order_count`.
- Customer summary: group by `customer_email`. Columns: `customer_email`, `customer_name`, `region`, `loyalty_tier`, `total_spent`, `avg_order`, `order_count`.
- Category performance: group by `category`. Columns: `category`, `total_revenue`, `order_count`.
- Loyalty tier summary: group by `loyalty_tier`. Columns: `loyalty_tier`, `avg_spent`, `customer_count`.

<aside>
💡 In Week 2, building just one report table took a lot of loop-based code. With Pandas groupby, each report is 2-3 lines.
</aside>
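The weekly revenue table, for example, can be built in a few lines with named aggregations (a sketch on inline sample data; your real column names may differ):

```python
import pandas as pd

orders = pd.DataFrame({
    "order_date": pd.to_datetime(["2024-03-01", "2024-03-02", "2024-03-08"]),
    "region": ["North", "North", "South"],
    "price": [10.0, 20.0, 30.0],
    "quantity": [1, 2, 1],
})

# Derive the ISO week number from the order date
orders["week"] = orders["order_date"].dt.isocalendar().week
orders["revenue"] = orders["price"] * orders["quantity"]

# Named aggregations: each output column is (source_column, agg_function)
weekly_revenue = (
    orders.groupby(["week", "region"], as_index=False)
          .agg(total_revenue=("revenue", "sum"),
               order_count=("revenue", "size"))
)
```

The other report tables follow the same pattern with different group keys and aggregations.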
Write the report tables to an output/ folder:
- `weekly_revenue.csv`
- `customer_summary.parquet`
- `category_performance.csv`

You must upload at least one output file to Azure Blob Storage. Use `weekly_revenue.csv`.
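Writing the output files can be sketched as follows (`weekly_revenue` here is a stand-in DataFrame; the Parquet line is commented out because it needs an extra engine such as pyarrow):

```python
from pathlib import Path

import pandas as pd

out_dir = Path("output")
out_dir.mkdir(exist_ok=True)

# Stand-in for the real report table built in the previous step
weekly_revenue = pd.DataFrame({
    "week": [9], "region": ["North"], "total_revenue": [50.0], "order_count": [2],
})

weekly_revenue.to_csv(out_dir / "weekly_revenue.csv", index=False)

# Parquet output needs an engine installed, e.g. pip install pyarrow:
# customer_summary.to_parquet(out_dir / "customer_summary.parquet", index=False)
```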
Name your container `week4-outputs`.

<aside> 💡 Azure calls buckets containers.
</aside>
Store the connection string in the environment variable `AZURE_STORAGE_CONNECTION_STRING`.

<aside> ⚠️ Do not hardcode secrets or commit them to git.
</aside>
Install the SDK and upload the file:
```bash
pip install azure-storage-blob
```

```python
from pathlib import Path
import os

from azure.storage.blob import BlobServiceClient

# Connection string comes from the environment -- never hardcode it
connection_string = os.environ["AZURE_STORAGE_CONNECTION_STRING"]
container_name = "week4-outputs"
file_path = Path("output/weekly_revenue.csv")
blob_name = file_path.name

service = BlobServiceClient.from_connection_string(connection_string)
container = service.get_container_client(container_name)
if not container.exists():
    container.create_container()

with open(file_path, "rb") as file_handle:
    container.upload_blob(blob_name, file_handle, overwrite=True)
```
Create a short file AI_ASSIST.md and describe:
```
week4-assignment/
├── data/
│   ├── messy_sales.csv
│   └── messy_customers.csv
├── output/
├── src/
│   ├── ingest.py
│   ├── clean.py
│   ├── transform.py
│   └── report.py
├── AI_ASSIST.md
└── README.md
```
- Use `pathlib.Path` for file paths.
- Use `logging` instead of `print` for pipeline status.