Week 4 - Data Processing with Pandas

Welcome to Week 4! You have learned how to structure code (Week 2) and how to ingest and validate data (Week 3). Now it's time to process that data efficiently. This week introduces Pandas, the industry-standard Python library for high-performance manipulation of tabular data. You will also learn about modern data architectures (ETL vs ELT) and efficient storage formats like Parquet.

By the end of this week, you will be able to load complex datasets, transform them efficiently using vectorized operations, and describe the architectural trade-offs between traditional ETL and modern ELT pipelines.
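As a first taste of what "vectorized" means, here is a minimal sketch that computes the same column twice: once with a row-by-row Python loop and once with a single expression over whole columns. The `orders` table and its column names are invented for illustration and are not part of the course material.

```python
import pandas as pd

# Hypothetical order data, invented for illustration only.
orders = pd.DataFrame(
    {
        "quantity": [2, 1, 5, 3],
        "unit_price": [9.5, 20.0, 3.0, 12.0],
    }
)

# Loop version: visits one row at a time in Python.
totals_loop = []
for _, row in orders.iterrows():
    totals_loop.append(row["quantity"] * row["unit_price"])

# Vectorized version: one expression applied to whole columns at once.
orders["total"] = orders["quantity"] * orders["unit_price"]

# Both approaches produce the same values.
assert orders["total"].tolist() == totals_loop
```

On four rows the difference is invisible, but the loop's cost grows with every row it has to visit in Python, while the vectorized expression hands the whole column to optimized array code.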

Learning goals

Master the Pandas library for tabular data manipulation (DataFrames and Series)
Select, filter, and sort data efficiently using loc, iloc, and boolean indexing
Perform grouping and aggregation operations to summarize data by categories
Join and merge multiple DataFrames using different join types (inner, outer, left, right)
Clean and transform text data using string operations and pattern matching
Work with datetime data: parsing, extracting components, and time-based calculations
Apply advanced transformations: pivoting, melting, and window functions
Replace slow Python loops with high-performance vectorized operations
Handle data quality issues (missing values, duplicates) within DataFrames
Export processed data to CSV, Parquet, and SQLite databases
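To make the goals above concrete, the following sketch strings several of them together on a tiny dataset: handling duplicates and missing values, cleaning text, parsing dates, filtering with `loc`, merging, grouping, and exporting. The `sales` and `stores` tables, their columns, and the table name `sales_per_city` are hypothetical and chosen only for this example. Parquet export (`DataFrame.to_parquet`) is left out because it needs an extra engine such as `pyarrow` to be installed.

```python
import sqlite3

import pandas as pd

# Hypothetical sales and store tables, invented for illustration only.
sales = pd.DataFrame(
    {
        "store_id": [1, 1, 2, 2, 2],
        "sold_at": ["2024-01-05", "2024-01-05", "2024-02-10", "2024-02-11", None],
        "product": [" apple", " apple", "Pear ", "APPLE", "pear"],
        "amount": [10.0, 10.0, 4.0, None, 6.0],
    }
)
stores = pd.DataFrame({"store_id": [1, 2], "city": ["Amsterdam", "Utrecht"]})

# Data quality: drop exact duplicate rows and rows with a missing amount.
clean = sales.drop_duplicates().dropna(subset=["amount"]).copy()

# Text cleaning: trim whitespace and normalize case with string methods.
clean["product"] = clean["product"].str.strip().str.lower()

# Datetime parsing and component extraction (missing dates become NaT).
clean["sold_at"] = pd.to_datetime(clean["sold_at"])
clean["month"] = clean["sold_at"].dt.month

# Boolean indexing with loc: keep rows whose amount is above 5.
large = clean.loc[clean["amount"] > 5, ["store_id", "product", "amount"]]

# Left join to attach the city, then group and aggregate per city.
merged = clean.merge(stores, on="store_id", how="left")
per_city = merged.groupby("city", as_index=False)["amount"].sum()

# Export: CSV as text, and a table in an in-memory SQLite database.
csv_text = per_city.to_csv(index=False)
conn = sqlite3.connect(":memory:")
per_city.to_sql("sales_per_city", conn, index=False)
row_count = conn.execute("SELECT COUNT(*) FROM sales_per_city").fetchone()[0]
conn.close()
```

Each step returns a new DataFrame or column, so the pipeline reads top to bottom without any explicit loops.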

CC BY-NC-SA 4.0

*https://hackyourfuture.net/*
