Apache Spark Core Concepts
Content coming soon...
Suggested Topics
- What is Apache Spark: a distributed computing engine for large-scale data processing
- Spark architecture: driver, executors, cluster manager, and how work gets distributed
- RDDs (Resilient Distributed Datasets): the foundational abstraction
- Spark DataFrames: the pandas-like API for distributed data
- Transformations vs actions: lazy evaluation and when computation actually happens
- Partitioning and shuffling: how data moves across the cluster and why it matters for performance
- PySpark: writing Spark jobs in Python
- Spark SQL: running SQL queries on distributed data
- When Spark is overkill: understanding the overhead of distributed computing
The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0

*https://hackyourfuture.net/*