Week 9 - SQL for Analytics

1. Introduction to SQL for Analytics

2. Joins, CTEs, and Aggregations

3. Data Validation Queries

4. OLAP vs OLTP and Modern Warehouses

5. Data Modeling Concepts

6. Building Views in Azure PostgreSQL

7. Gotchas & Pitfalls

8. Practice

9. Assignment

Lesson Plan

Introduction to SQL for Analytics

Welcome back 😃 After handing in your mid-track project, navigating the complexities of Docker and automating your first CI/CD pipelines, it’s now time to actually make sense of the data you’ve been moving!

In this chapter, we transition from building data pipelines to working directly with data in a structured and analytical way. While Python is commonly used to orchestrate and process data workflows, SQL is the primary language for querying and transforming data at scale within modern data systems.

This shift moves the focus from how data is moved and prepared, to how it is explored, shaped, and turned into insights, which is the ultimate goal!

SQL: The lingua franca for data analytics

In the tech world, things move fast, sometimes too fast. Tools like Airflow might get replaced in a decade. But SQL? SQL was invented in the early 1970s and it is still the undisputed champion of the data world.

💡

Fun fact: SQL was originally developed at IBM and was first called SEQUEL (Structured English Query Language) because it was designed to read almost like English. Due to trademark issues, the name was shortened to SQL, but many people still pronounce it “sequel” today.

In a world where everyone uses different programming languages, everyone still speaks SQL! It is really the universal language that connects Data Engineers, Analysts, and Business Stakeholders.

Operational vs. Analytical SQL

In this section, we explore the distinction between operational SQL, which focuses on day-to-day system behavior, and analytical SQL, which is designed for aggregating, transforming, and analyzing data to generate insights.

We can summarise the main differences between OLTP and OLAP as:

| | OLTP (Operational) | OLAP (Analytical) |
| --- | --- | --- |
| Purpose | Running day-to-day operations | Analysing data to generate insights |
| Typical queries | Short reads and writes on individual records | Large scans, joins, and aggregations |
| Data | Current, transactional | Historical, often spanning millions of rows |
| Typical users | Applications and end users | Analysts, data engineers, BI tools |
| Example | Booking a single taxi ride | Computing the average fare per month |
… But why not just do this in Python?

You’ve just spent weeks getting good at Python, so it’s natural you might think, "I'll just pd.read_sql() everything into a DataFrame and do my work there."

Although tempting, this is a quick path to performance issues, increased costs, and operational inefficiencies. Databases are highly optimised systems designed to do exactly what SQL asks them to do: filter, join, aggregate, and transform data directly where it lives. When you pull large datasets into Python unnecessarily, you:

- move large volumes of data over the network before any real work has started,
- consume memory and compute on the client instead of using the database's optimised engine,
- duplicate work the database could have done faster, closer to the data.

💡

The Principle of Data Locality: Moving data is expensive. If your dataset is stored in an Azure PostgreSQL instance and you want to compute the average fare, SQL performs this computation inside the database and returns a single aggregated result. Python, on the other hand, can easily end up pulling large portions of the dataset into memory first.
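To make this concrete, here is a minimal sketch, assuming a hypothetical `trips` table with a numeric `fare_amount` column: the aggregation runs entirely inside the database, and only a single row travels back to the client.

```sql
-- Hypothetical "trips" table with a numeric "fare_amount" column.
-- The database scans and averages internally; only one aggregated
-- row is sent back over the network.
SELECT AVG(fare_amount) AS avg_fare
FROM trips;
```

The anti-pattern would be `SELECT * FROM trips` followed by `df["fare_amount"].mean()` in pandas: the same result, but only after shipping every row across the network into memory.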

Choosing between SQL and Python depends not only on which transformation you are aiming at, but also on where and how that transformation is best executed.

Use SQL when:

- the data already lives in a database,
- the operation is relational (filtering, joining, aggregating, reshaping),
- you want to minimise data movement and let the database engine do the heavy lifting.

Use Python when:

- the logic is hard to express in SQL (complex branching, iteration, custom algorithms),
- you need external libraries, API calls, machine learning, or unstructured data handling,
- the data is not in a database to begin with.

A useful rule of thumb:

If the data is already in a database and the operation is relational, use SQL. Otherwise, use Python.

How SQL analytical queries fit into the data pipeline: ingestion, transformation, reporting

Let’s now zoom out a bit to understand SQL’s role: in a modern data pipeline, SQL is most often used in the transformation and reporting layers, sitting between raw data ingestion and business-facing outputs.

As you’ve seen in previous chapters, a typical flow looks like this:

  1. Raw Data Ingestion (APIs, event logs, external files, etc.)
  2. Transformation

This is where SQL plays a central role. Data is cleaned, joined, reshaped, and aggregated into meaningful structures. This may include:

- removing duplicates and invalid records,
- joining raw tables into analysis-ready datasets,
- aggregating events into daily or monthly summaries,
- applying business rules as derived columns.

  3. Reporting / Consumption

Finally, transformed datasets are used to power dashboards, reports, or downstream analytics tools. This layer focuses on readability, performance, and business relevance.

SQL is particularly powerful in the transformation stage because it allows these operations to happen directly inside the database, close to the data.
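As a hedged sketch of such a transformation (the table and column names here are hypothetical), a single query can clean and aggregate raw trip data into a daily summary without the data ever leaving the database:

```sql
-- Hypothetical example: clean raw trips and aggregate them per day.
SELECT
    pickup_time::date                   AS trip_date,
    COUNT(*)                            AS total_trips,
    ROUND(AVG(fare_amount)::numeric, 2) AS avg_fare
FROM trips
WHERE fare_amount > 0           -- drop invalid fares
  AND pickup_time IS NOT NULL   -- drop incomplete records
GROUP BY pickup_time::date
ORDER BY trip_date;
```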

A common workflow is:

SQL prepares clean datasets → Python processes or models them → BI tools visualise the results.

When the modeling steps are not needed, BI tools like Tableau or Power BI connect directly to the output of your SQL datasets. Most BI tools also allow you to connect to your SQL database directly and write custom SQL queries within the BI tool interface itself!
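A common way to expose such an output is a database view, a topic we return to later in this chapter. As a minimal sketch (using the same hypothetical table as above), a view gives the BI tool a stable, table-like object to query:

```sql
-- Hypothetical view a BI tool could connect to directly.
-- The query runs inside the database whenever the view is read.
CREATE VIEW daily_trip_summary AS
SELECT
    pickup_time::date AS trip_date,
    COUNT(*)          AS total_trips,
    AVG(fare_amount)  AS avg_fare
FROM trips
WHERE fare_amount > 0
GROUP BY pickup_time::date;
```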

Common analytical question patterns

Analytical questions can often be grouped into a small number of recurring patterns. Learning to recognise them helps you translate business questions into SQL more effectively:

- Aggregation: “What is the total revenue per month?”
- Ranking: “Who are our top 10 customers by spend?”
- Trends over time: “How are daily trips changing week over week?”
- Segmentation: “How does behaviour differ between customer groups?”
- Data quality: “Are there duplicate, missing, or inconsistent records?”
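As a small preview (again with hypothetical names), the ranking pattern typically maps to a GROUP BY combined with ORDER BY and LIMIT:

```sql
-- Ranking pattern: the 10 pickup zones with the highest total fares.
SELECT
    pickup_zone,
    SUM(fare_amount) AS total_fares
FROM trips
GROUP BY pickup_zone
ORDER BY total_fares DESC
LIMIT 10;
```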
In the next chapters, we’ll look into how to answer such questions using SQL.

Overview of SQL dialects

Although SQL is a standard language, different database systems implement their own dialects. These dialects share a common core but differ in functions, syntax details, and performance features.

  1. PostgreSQL
  2. T-SQL (Microsoft SQL Server, Azure SQL)
  3. BigQuery SQL (Google Cloud)
  4. Snowflake SQL

Today, most companies use cloud-based set-ups such as Azure, Google BigQuery, or Snowflake. Despite the differences between dialects, most core concepts and functions remain transferable.

In this course we use PostgreSQL, which is considered one of the most standards-aligned SQL dialects.
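To give a feel for how small many dialect differences are, compare how you would fetch the first ten rows of a (hypothetical) `trips` table:

```sql
-- PostgreSQL, BigQuery, and Snowflake use LIMIT:
SELECT * FROM trips LIMIT 10;

-- T-SQL (SQL Server / Azure SQL) uses TOP instead:
SELECT TOP 10 * FROM trips;
```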


The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0: https://hackyourfuture.net/


Built with ❤️ by the HackYourFuture community · Thank you, contributors
