Apache Spark Core Concepts
Content coming soon...
Suggested Topics
- What is Apache Spark: a distributed computing engine for large-scale data processing
- Spark architecture: driver, executors, cluster manager, and how work gets distributed
- RDDs (Resilient Distributed Datasets): the foundational abstraction
- Spark DataFrames: the pandas-like API for distributed data
- Transformations vs actions: lazy evaluation and when computation actually happens
- Partitioning and shuffling: how data moves across the cluster and why it matters for performance
- PySpark: writing Spark jobs in Python
- Spark SQL: running SQL queries on distributed data
- When Spark is overkill: understanding the overhead of distributed computing
The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0

*https://hackyourfuture.net/*