Week 13 - Big Data & Streaming


The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0 *https://hackyourfuture.net/*



<aside> 🚧 Planned restructure (April 2026). The five-chapter scaffold below (Spark → Databricks → streaming theory → Kafka) reads as a tool tour rather than a coherent skill set. The tracking issues will reshape this into a single-substrate week that teaches analytics engineering at platform scale on Databricks:

  1. Intro: the lakehouse idea
  2. PySpark notebooks in Databricks
  3. dbt on Databricks: incremental at 100M-row scale (callback to Week 10)
  4. Unity Catalog: governance, lineage, data classification
  5. Intro to Structured Streaming (scoped as a mental-model teaser, not a streaming role)

Kafka moves to week_13__going_further.md with pointers to Confluent's tutorials. Spark-standalone is absorbed into the PySpark chapter (Spark is what runs inside a Databricks notebook: no reason to teach them as separate topics).

See issue #112 (the dbt-on-Databricks chapter specifically) and issue #113 (the full-week restructure) for rationale, open decisions, and phasing. Do not re-scaffold the old tool-tour shape: any new work on Week 13 should land the chapter structure above.

</aside>


Welcome to Week 13! So far you have worked with datasets that fit comfortably on a single machine. This week introduces two areas where traditional approaches break down: big data processing with Apache Spark and streaming data with platforms like Kafka and Azure Event Hubs. These are advanced topics that expand your toolkit for handling scale and real-time data.
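Before you touch Spark itself, it helps to see the mental model it is built on: split the data into partitions, process each partition independently, then combine the partial results. The sketch below simulates that model in plain Python with the standard library. It is *not* Spark (the function names and the toy word list are invented for illustration), but the same split/map/reduce shape is what a Spark word count does across a cluster.

```python
from collections import Counter

def partition(data, n):
    """Split a list into n roughly equal chunks (stand-ins for Spark partitions)."""
    size = (len(data) + n - 1) // n
    return [data[i:i + size] for i in range(0, len(data), size)]

def map_partition(words):
    """Count words within one partition; each chunk is processed independently,
    the way Spark executors work on their own partitions."""
    return Counter(words)

def reduce_counts(partial_counts):
    """Merge per-partition results into one answer (the shuffle/reduce step)."""
    total = Counter()
    for c in partial_counts:
        total += c
    return total

words = "big data big streaming data data".split()
parts = partition(words, 2)
result = reduce_counts(map_partition(p) for p in parts)
print(result["data"])  # → 3
```

In real Spark the partitions live on different machines and the merge involves network shuffles, but the three-step shape is the same; keeping this picture in mind makes the Spark API far less mysterious.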

By the end of this week, you will understand how distributed computing works, run transformations in a Databricks notebook, and explore streaming concepts through Kafka theory and hands-on practice with Azure Event Hubs.
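One streaming-theory idea you will meet this week, the tumbling window, can likewise be simulated without any streaming platform. The sketch below groups timestamped events into fixed, non-overlapping 10-second windows; the event data and window size are invented for illustration, and a real consumer would receive these events from Kafka or Event Hubs rather than a list.

```python
from collections import defaultdict

WINDOW_SECONDS = 10

def window_start(ts):
    """Floor a timestamp to the start of its tumbling window."""
    return ts - (ts % WINDOW_SECONDS)

# (timestamp_seconds, event) pairs, as a consumer might see them arrive
events = [(1, "click"), (4, "click"), (12, "view"), (15, "click"), (23, "view")]

# Count events per window: [0, 10), [10, 20), [20, 30)
counts = defaultdict(int)
for ts, _ in events:
    counts[window_start(ts)] += 1

print(dict(counts))  # → {0: 2, 10: 2, 20: 1}
```

Real streaming engines add complications this sketch ignores (events arriving late or out of order, and deciding when a window is "done"), which is exactly what the streaming-theory chapter covers.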

Learning goals


Prerequisites

Chapters

  1. Introduction to Big Data and Streaming
  2. Apache Spark Core Concepts
  3. Databricks
  4. Streaming Theory
  5. Streaming Platforms: Kafka and Azure Event Hubs
  6. Practice
  7. Assignment: Big Data and Streaming Lab
  8. Gotchas & Pitfalls

Lesson plan