Week 13 - Big Data & Streaming


The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0 *https://hackyourfuture.net/*



<aside> 🚧 Planned restructure (April 2026). The five-chapter scaffold below (Spark → Databricks → streaming theory → Kafka) reads as a tool tour rather than a coherent skill set. The tracking issues will reshape this into a single-substrate week that teaches analytics engineering at platform scale on Databricks:

  1. Intro: the lakehouse idea
  2. PySpark notebooks in Databricks
  3. dbt on Databricks: incremental at 100M-row scale (callback to Week 10)
  4. Unity Catalog: governance, lineage, data classification
  5. Intro to Structured Streaming (scoped as a mental-model teaser, not a streaming role)

Kafka moves to week_13__going_further.md with pointers to Confluent's tutorials. Spark-standalone is absorbed into the PySpark chapter (Spark is what runs inside a Databricks notebook: no reason to teach them as separate topics).

See issue #112 (the dbt-on-Databricks chapter specifically) and issue #113 (the full-week restructure) for rationale, open decisions, and phasing. Do not re-scaffold the old tool-tour shape: any new work on Week 13 should land the chapter structure above.

</aside>


Welcome to Week 13! So far you have worked with datasets that fit comfortably on a single machine. This week introduces two areas where traditional approaches break down: big data processing with Apache Spark and streaming data with platforms like Kafka and Azure Event Hubs. These are advanced topics that expand your toolkit for handling scale and real-time data.
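Before you touch Spark itself, it helps to see the mental model it is built on: split the data into partitions, process each partition independently, then combine the partial results. The sketch below simulates that model in plain Python with the standard library. It is *not* Spark (the function names and the toy word list are invented for illustration), but the same split/map/reduce shape is what a Spark word count does across a cluster.

```python
from collections import Counter

def partition(data, n):
    """Split a list into n roughly equal chunks (stand-ins for Spark partitions)."""
    size = (len(data) + n - 1) // n
    return [data[i:i + size] for i in range(0, len(data), size)]

def map_partition(words):
    """Count words within one partition; each chunk is processed independently,
    the way Spark executors work on their own partitions."""
    return Counter(words)

def reduce_counts(partial_counts):
    """Merge per-partition results into one answer (the shuffle/reduce step)."""
    total = Counter()
    for c in partial_counts:
        total += c
    return total

words = "big data big streaming data data".split()
parts = partition(words, 2)
result = reduce_counts(map_partition(p) for p in parts)
print(result["data"])  # → 3
```

In real Spark the partitions live on different machines and the merge involves network shuffles, but the three-step shape is the same; keeping this picture in mind makes the Spark API far less mysterious.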

By the end of this week, you will understand how distributed computing works, run transformations in a Databricks notebook, and explore streaming concepts through Kafka theory and hands-on practice with Azure Event Hubs.
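One streaming-theory idea you will meet this week, the tumbling window, can likewise be simulated without any streaming platform. The sketch below groups timestamped events into fixed, non-overlapping 10-second windows; the event data and window size are invented for illustration, and a real consumer would receive these events from Kafka or Event Hubs rather than a list.

```python
from collections import defaultdict

WINDOW_SECONDS = 10

def window_start(ts):
    """Floor a timestamp to the start of its tumbling window."""
    return ts - (ts % WINDOW_SECONDS)

# (timestamp_seconds, event) pairs, as a consumer might see them arrive
events = [(1, "click"), (4, "click"), (12, "view"), (15, "click"), (23, "view")]

# Count events per window: [0, 10), [10, 20), [20, 30)
counts = defaultdict(int)
for ts, _ in events:
    counts[window_start(ts)] += 1

print(dict(counts))  # → {0: 2, 10: 2, 20: 1}
```

Real streaming engines add complications this sketch ignores (events arriving late or out of order, and deciding when a window is "done"), which is exactly what the streaming-theory chapter covers.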

Learning goals


Prerequisites

Chapters

  1. Introduction to Big Data and Streaming
  2. Apache Spark Core Concepts
  3. Databricks
  4. Streaming Theory
  5. Streaming Platforms: Kafka and Azure Event Hubs
  6. Practice
  7. Assignment: Big Data and Streaming Lab
  8. Gotchas & Pitfalls

Lesson plan