Week 13 - Big Data & Streaming

Gotchas & Pitfalls

Content coming soon...

Suggested Topics

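Using collect() on large Spark DataFrames: collect() pulls every row to the driver and can exhaust its memory; use take(n), limit(n), or show() to inspect data, and write large results to storage instead
Ignoring partitioning: too few partitions leave executors idle, too many add scheduling overhead and produce many small files
Offset management in streaming: auto-commit can commit an offset before its message is processed (lost messages), or the consumer can crash before committing a message it already processed (duplicates)
Serialization mismatches: producer and consumer must agree on the message format and schema (for example JSON versus Avro), or deserialization fails
Cluster sizing: over-provisioning wastes money, under-provisioning causes slow jobs, out-of-memory errors, and timeouts
Streaming without backpressure: a slow consumer falls further and further behind, and lag or buffered data grows without bound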
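Azure Event Hubs pricing: throughput units are billed by the hour and there is no free tier; for learning, use the Basic tier (or Standard if you need the Kafka endpoint) with a single throughput unit, and delete the namespace when you are done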
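
The offset-management pitfall in the list above can be illustrated without a broker. The following is a minimal sketch in plain Python, assuming an in-memory list as the message log and a single simulated crash. The function `run_consumer` and its parameters `commit_first` and `crash_at` are hypothetical names invented for this illustration; they are not part of the Kafka or Event Hubs APIs.

```python
def run_consumer(log, commit_first, crash_at):
    """Consume `log`, crash once while handling offset `crash_at`, then restart.

    Returns every message that was processed across both runs.
    A restart always resumes from the last committed offset.
    """
    committed = 0          # next offset to read after a restart
    processed = []
    crash_pending = True   # the simulated crash happens only once

    while committed < len(log):
        offset = committed
        crash_now = crash_pending and offset == crash_at

        if commit_first:
            # Commit before processing (what a badly timed auto-commit can do).
            committed = offset + 1
            if crash_now:
                crash_pending = False
                continue   # crashed before processing: the message is skipped
            processed.append(log[offset])
        else:
            # Process first, commit afterwards.
            processed.append(log[offset])
            if crash_now:
                crash_pending = False
                continue   # crashed before committing: the message is re-read
            committed = offset + 1

    return processed


messages = ["a", "b", "c"]
print(run_consumer(messages, commit_first=True, crash_at=1))   # ['a', 'c']
print(run_consumer(messages, commit_first=False, crash_at=1))  # ['a', 'b', 'b', 'c']
```

Committing first loses message "b" (at-most-once delivery), while committing after processing handles "b" twice (at-least-once delivery). In practice the second behaviour is usually preferred, combined with processing that tolerates duplicates.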
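
The backpressure pitfall in the list above can also be sketched with a small, deterministic simulation. This is an illustrative model only, assuming fixed per-tick rates and a bounded buffer as one simple form of backpressure. The function `simulate` and its parameters are hypothetical names, not part of any streaming library.

```python
def simulate(ticks, produce_rate, consume_rate, max_buffer=None):
    """Return the backlog size at the end of each tick.

    With `max_buffer=None` the producer is never slowed down.
    With a `max_buffer`, the producer may only add what still fits,
    which models a bounded buffer pushing back on the producer.
    """
    backlog = 0
    history = []
    for _ in range(ticks):
        incoming = produce_rate
        if max_buffer is not None:
            # Backpressure: throttle the producer to the free buffer space.
            incoming = min(incoming, max_buffer - backlog)
        backlog += incoming
        # The consumer drains at most `consume_rate` messages per tick.
        backlog -= min(backlog, consume_rate)
        history.append(backlog)
    return history


print(simulate(5, produce_rate=10, consume_rate=6))                 # [4, 8, 12, 16, 20]
print(simulate(5, produce_rate=10, consume_rate=6, max_buffer=12))  # [4, 6, 6, 6, 6]
```

Without a bound, the backlog grows by four messages every tick. With the bound, the backlog levels off because the producer is slowed to the consumer's pace.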

The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0 *https://hackyourfuture.net/*
