Week 9 - SQL for Analytics

1. Introduction to SQL for Analytics

2. Joins, CTEs, and Aggregations

3. Data Validation Queries

4. OLAP vs OLTP and Modern Warehouses

5. Data Modeling Concepts

6. Building Views in Azure PostgreSQL

7. Gotchas & Pitfalls

8. Practice

9. Assignment

Lesson Plan

Introduction to SQL for Analytics

Welcome back 😃 After handing in your mid-track project, navigating the complexities of Docker and automating your first CI/CD pipelines, it’s now time to actually make sense of the data you’ve been moving!

In this chapter, we transition from building data pipelines to working directly with data in a structured and analytical way. While Python is commonly used to orchestrate and process data workflows, SQL is the primary language for querying and transforming data at scale within modern data systems.

This shift moves the focus from how data is moved and prepared, to how it is explored, shaped, and turned into insights, which is the ultimate goal!

SQL: The lingua franca for data analytics

In the tech world, things move fast, sometimes too fast. Tools like Airflow might get replaced in a decade. But SQL? SQL was invented in the early 1970s and it is still the undisputed champion of the data world.

💡

Fun fact: SQL was originally developed at IBM and was first called SEQUEL (Structured English Query Language) because it was designed to read almost like English. Due to trademark issues, the name was shortened to SQL, but many people still pronounce it “sequel” today.

In a world where everyone uses different programming languages, everyone still speaks SQL! It is really the universal language that connects Data Engineers, Analysts, and Business Stakeholders.

Operational vs. Analytical SQL

In this section, we explore the distinction between operational SQL, which focuses on day-to-day system behavior, and analytical SQL, which is designed for aggregating, transforming, and analyzing data to generate insights.

We can summarise the main differences between OLTP and OLAP as:

| | OLTP (Operational) | OLAP (Analytical) |
| --- | --- | --- |
| Purpose | Running day-to-day operations | Analysing data to generate insights |
| Typical queries | Short reads and writes on individual records | Large scans, joins, and aggregations |
| Data | Current, transactional | Historical, often spanning millions of rows |
| Typical users | Applications and end users | Analysts, data engineers, BI tools |
| Example | Booking a single taxi ride | Computing the average fare per month |
… But why not just do this in Python?

You’ve just spent weeks getting good at Python, so it’s natural you might think, "I'll just pd.read_sql() everything into a DataFrame and do my work there."

Although tempting, this is a quick path to performance issues, increased costs, and operational inefficiencies. Databases are highly optimised systems designed to do exactly what SQL asks them to do: filter, join, aggregate, and transform data directly where it lives. When you pull large datasets into Python unnecessarily, you:

- move large volumes of data over the network before any real work has started,
- consume memory and compute on the client instead of using the database's optimised engine,
- duplicate work the database could have done faster, closer to the data.

💡

The Principle of Data Locality: Moving data is expensive. If your dataset is stored in an Azure PostgreSQL instance and you want to compute the average fare, SQL performs this computation inside the database and returns a single aggregated result. Python, on the other hand, can easily end up pulling large portions of the dataset into memory first.
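To make this concrete, here is a minimal sketch, assuming a hypothetical `trips` table with a numeric `fare_amount` column: the aggregation runs entirely inside the database, and only a single row travels back to the client.

```sql
-- Hypothetical "trips" table with a numeric "fare_amount" column.
-- The database scans and averages internally; only one aggregated
-- row is sent back over the network.
SELECT AVG(fare_amount) AS avg_fare
FROM trips;
```

The anti-pattern would be `SELECT * FROM trips` followed by `df["fare_amount"].mean()` in pandas: the same result, but only after shipping every row across the network into memory.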

Choosing between SQL and Python depends not only on which transformation you are aiming at, but also on where and how that transformation is best executed.

Use SQL when:

- the data already lives in a database,
- the operation is relational (filtering, joining, aggregating, reshaping),
- you want to minimise data movement and let the database engine do the heavy lifting.

Use Python when:

- the logic is hard to express in SQL (complex branching, iteration, custom algorithms),
- you need external libraries, API calls, machine learning, or unstructured data handling,
- the data is not in a database to begin with.

A useful rule of thumb:

If the data is already in a database and the operation is relational, use SQL. Otherwise, use Python.

How SQL analytical queries fit into the data pipeline: ingestion, transformation, reporting

Let’s now zoom out a bit to understand SQL’s role: in a modern data pipeline, SQL is most often used in the transformation and reporting layers, sitting between raw data ingestion and business-facing outputs.

As you’ve seen in previous chapters, a typical flow looks like this:

  1. Raw Data Ingestion (APIs, event logs, external files, etc.)
  2. Transformation

This is where SQL plays a central role. Data is cleaned, joined, reshaped, and aggregated into meaningful structures. This may include:

- removing duplicates and invalid records,
- joining raw tables into analysis-ready datasets,
- aggregating events into daily or monthly summaries,
- applying business rules as derived columns.

  3. Reporting / Consumption

Finally, transformed datasets are used to power dashboards, reports, or downstream analytics tools. This layer focuses on readability, performance, and business relevance.

SQL is particularly powerful in the transformation stage because it allows these operations to happen directly inside the database, close to the data.
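As a hedged sketch of such a transformation (the table and column names here are hypothetical), a single query can clean and aggregate raw trip data into a daily summary without the data ever leaving the database:

```sql
-- Hypothetical example: clean raw trips and aggregate them per day.
SELECT
    pickup_time::date                   AS trip_date,
    COUNT(*)                            AS total_trips,
    ROUND(AVG(fare_amount)::numeric, 2) AS avg_fare
FROM trips
WHERE fare_amount > 0           -- drop invalid fares
  AND pickup_time IS NOT NULL   -- drop incomplete records
GROUP BY pickup_time::date
ORDER BY trip_date;
```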

A common workflow is:

SQL prepares clean datasets → Python processes or models them → BI tools visualise the results.

When the modeling steps are not needed, BI tools like Tableau or Power BI connect directly to the output of your SQL datasets. Most BI tools also allow you to connect to your SQL database directly and write custom SQL queries within the BI tool interface itself!
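A common way to expose such an output is a database view, a topic we return to later in this chapter. As a minimal sketch (using the same hypothetical table as above), a view gives the BI tool a stable, table-like object to query:

```sql
-- Hypothetical view a BI tool could connect to directly.
-- The query runs inside the database whenever the view is read.
CREATE VIEW daily_trip_summary AS
SELECT
    pickup_time::date AS trip_date,
    COUNT(*)          AS total_trips,
    AVG(fare_amount)  AS avg_fare
FROM trips
WHERE fare_amount > 0
GROUP BY pickup_time::date;
```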

Common analytical question patterns

Analytical questions can often be grouped into a small number of recurring patterns. Learning to recognise them helps you translate business questions into SQL more effectively:

- Aggregation: “What is the total revenue per month?”
- Ranking: “Who are our top 10 customers by spend?”
- Trends over time: “How are daily trips changing week over week?”
- Segmentation: “How does behaviour differ between customer groups?”
- Data quality: “Are there duplicate, missing, or inconsistent records?”
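As a small preview (again with hypothetical names), the ranking pattern typically maps to a GROUP BY combined with ORDER BY and LIMIT:

```sql
-- Ranking pattern: the 10 pickup zones with the highest total fares.
SELECT
    pickup_zone,
    SUM(fare_amount) AS total_fares
FROM trips
GROUP BY pickup_zone
ORDER BY total_fares DESC
LIMIT 10;
```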
In the next chapters, we’ll look into how to answer such questions using SQL.

Overview of SQL dialects

Although SQL is a standard language, different database systems implement their own dialects. These dialects share a common core but differ in functions, syntax details, and performance features.

  1. PostgreSQL
  2. T-SQL (Microsoft SQL Server, Azure SQL)
  3. BigQuery SQL (Google Cloud)
  4. Snowflake SQL

Today, most companies use cloud-based set-ups such as Azure, Google BigQuery, or Snowflake. Despite the differences between dialects, most core concepts and functions remain transferable.

In this course we use PostgreSQL, which is considered one of the most standards-aligned SQL dialects.
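To give a feel for how small many dialect differences are, compare how you would fetch the first ten rows of a (hypothetical) `trips` table:

```sql
-- PostgreSQL, BigQuery, and Snowflake use LIMIT:
SELECT * FROM trips LIMIT 10;

-- T-SQL (SQL Server / Azure SQL) uses TOP instead:
SELECT TOP 10 * FROM trips;
```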


The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0: https://hackyourfuture.net/


Built with ❤️ by the HackYourFuture community · Thank you, contributors
