Gotchas & Pitfalls
Content coming soon...
Suggested Topics
- Using collect() on large Spark DataFrames: pulling every row to the driver can exhaust its memory and crash it
- Ignoring partitioning: too few partitions leave executors idle, while too many add scheduling overhead; both cause performance problems
- Offset management in streaming: auto-commit can lose messages or process duplicates
- Serialization mismatches: producer and consumer must agree on the message format
- Cluster sizing: over-provisioning wastes money, under-provisioning causes timeouts
- Streaming without backpressure: a slow consumer falls further and further behind
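- Azure Event Hubs pricing: throughput units are billed by the hour and there is no free tier; for learning, use the lowest tier that covers your needs (the Kafka endpoint requires Standard or higher) and delete the namespace when you are done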
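
The offset management topic above can be illustrated without a broker. The following is a minimal sketch in plain Python, not Kafka or Event Hubs client code: `run_consumer`, `commit_first`, and `crash_at` are hypothetical names invented for this illustration, and the "log" is just a list. It assumes a single consumer that crashes once and then restarts from the last committed offset.

```python
from typing import List, Optional


def run_consumer(log: List[str], commit_first: bool, crash_at: Optional[int]) -> List[str]:
    """Simulate a consumer that crashes once while handling offset `crash_at`
    and then restarts from the last committed offset.

    commit_first=True  stores the offset before the work is done, which is
                       what can happen with auto-commit (messages may be lost).
    commit_first=False stores the offset only after the work is done
                       (messages may be processed twice).
    """
    committed = 0                 # next offset to read after a (re)start
    processed: List[str] = []     # side effects of processing, in order
    crashed = False

    while committed < len(log):
        offset = committed        # (re)start from the committed position
        while offset < len(log):
            if commit_first:
                committed = offset + 1          # commit before processing
            if offset == crash_at and not crashed:
                crashed = True
                if not commit_first:
                    # The work finished, but the crash hit before the commit.
                    processed.append(log[offset])
                break                           # simulate the restart
            processed.append(log[offset])
            if not commit_first:
                committed = offset + 1          # commit after processing
            offset += 1

    return processed


messages = ["a", "b", "c", "d"]
lost = run_consumer(messages, commit_first=True, crash_at=2)
duplicated = run_consumer(messages, commit_first=False, crash_at=2)
print(lost)        # message "c" is missing
print(duplicated)  # message "c" appears twice
```

Committing after processing is usually the safer default, provided the processing step is idempotent so that a replayed message does no harm.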
The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0

*https://hackyourfuture.net/*