Week 3 - Ingesting and Validating Data
History: APIs and Data Transfer
This page is optional. Nothing here is required for Week 3's learning goals or the assignment. It exists for students who want to understand how the data formats and protocols you used this week came to be. Read it in one sitting, or skip it and come back when you are curious.
Week 5's History of Containers and CI/CD traces how applications got packaged and shipped. This page covers a different arc: how machines learned to talk to each other and exchange structured data, from mainframe-era batch files to today's REST APIs and binary protocols.
Background for Reading Multiple File Formats
Before the internet connected companies, businesses still needed to exchange orders, invoices, and shipment notices between their mainframe computers. The answer was Electronic Data Interchange (EDI), a set of fixed-format message standards that emerged in the 1960s. Companies sent EDI files over private leased lines and, later, over value-added networks (VANs) run by intermediary companies.
EDI messages were structured but hard for humans to read. An X12 interchange might begin with ISA*00* *00* *ZZ*SENDER*ZZ*RECEIVER*200101*.... Each element's meaning came from its position within a segment, and a standards body defined its allowed length, data type, and code values. The dominant standards were ANSI X12 (used in North America) and EDIFACT (Europe, sponsored by the UN).
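To make the "meaning comes from position" idea concrete, here is a minimal sketch of splitting one delimited EDI-style segment into its elements. It assumes `*` is the element separator, as in the sample above. The `parse_segment` helper and the sample segment are invented for illustration and are not taken from a real message.

```python
def parse_segment(segment: str, element_sep: str = "*") -> dict:
    """Split one delimited EDI-style segment into its tag and positional elements."""
    tag, *elements = segment.strip().split(element_sep)
    return {"tag": tag, "elements": elements}


# Hypothetical segment, made up for this example.
sample = "PO1*1*10*EA*9.95**VP*WIDGET-42"
parsed = parse_segment(sample)

print(parsed["tag"])
print(parsed["elements"])
```

Note the empty string between `9.95` and `VP`: an unused element still occupies its slot, so a parser has to keep empty positions rather than drop them.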
<aside> 🤓 Curious Geek: EDIFACT is still alive
UN/EDIFACT is not a museum artifact. The automotive, retail, and healthcare industries still exchange billions of EDIFACT messages per year. When you ingest CSV exports from an old ERP system, you are often looking at data that originated as an EDI batch, converted one step at a time over decades.
</aside>
As a data engineer, you still encounter the legacy of this era: flat files, fixed-width formats, batch-window scheduling ("send us the previous day's orders by 6 AM"), and the expectation that data arrives as files rather than API calls. Chapter 4's CSV and Parquet work sits in this lineage.
Background for Ingesting from APIs
In the late 1990s, the internet gave companies a cheaper alternative to private leased lines: they could exchange data over HTTP. The question was what format to use. The answer the industry standardized on was XML, a hierarchical text format ratified by the W3C in 1998.
XML looked like HTML but was designed for data, not presentation. A purchase order that took a 200-byte EDI string now became a 2,000-byte XML document with named tags, nested elements, and a schema (XSD) describing the allowed structure. Verbose, but readable by humans and machines.
On top of XML, the industry built SOAP (Simple Object Access Protocol, 1998). SOAP defined how to wrap a request and a response in XML envelopes and send them over HTTP. It came with a stack of companion standards: WSDL (describing what operations a service offered), WS-Security (message signing and encryption), WS-ReliableMessaging (delivery guarantees), and more. Together these were called "WS-*" (WS-star), and implementing them all correctly required specialized middleware and teams of enterprise architects.
SOAP was powerful, but the learning curve was steep, and generating and parsing XML was slow. A service that returned a list of ten orders might wrap 2 KB of data in 10 KB of envelope overhead.
Background for Ingesting from APIs
The next catalyst was not a new protocol. It was a new use of an existing one. In 2005, Jesse James Garrett published an essay coining the term AJAX (Asynchronous JavaScript and XML): the technique of using the browser's built-in XMLHttpRequest object to fetch data from a server after the page had already loaded, and update just part of the page rather than reloading it.
Google Maps launched the same year. Before AJAX, a map required a page reload to pan. With AJAX, dragging the map fetched new tiles silently. Gmail showed new emails without refreshing the inbox. Suddenly the browser was not a document viewer: it was a client making API calls.
Two things followed. First, developers realised that XML was painful to parse in JavaScript. JSON (JavaScript Object Notation) was a simpler format: native to JavaScript, human-readable, and often markedly smaller than the equivalent XML. By 2006 it had its own RFC (4627). Second, the volume of API calls exploded. SOAP services built for overnight batch runs were not designed to handle thousands of browser requests per minute.
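If you want to see the size difference for yourself, the sketch below serializes the same small record as JSON and as XML using only the standard library. The `order` record is invented for illustration, and the exact ratio depends heavily on your data, so treat the result as a rough indication rather than a rule.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical record, made up for this comparison.
order = {"id": 42, "customer": "ACME", "total": 199.5}

# Compact JSON: no spaces after separators.
json_bytes = json.dumps(order, separators=(",", ":")).encode("utf-8")

# Equivalent XML: one child element per field, no declaration or whitespace.
root = ET.Element("order")
for key, value in order.items():
    ET.SubElement(root, key).text = str(value)
xml_bytes = ET.tostring(root, encoding="unicode").encode("utf-8")

print(json_bytes.decode("utf-8"))
print(xml_bytes.decode("utf-8"))
print(len(json_bytes), "bytes of JSON vs", len(xml_bytes), "bytes of XML")
```

Most of the difference comes from XML repeating every field name in a closing tag.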
Background for Ingesting from APIs
REST (Representational State Transfer) is not a protocol. It is an architectural style described by Roy Fielding in his 2000 doctoral dissertation. Fielding argued that the web's own design (URLs identify resources, HTTP verbs express operations, responses are cacheable) was already a good API model. You did not need new standards: just use HTTP correctly.
A REST API represents resources as URLs (/orders/42) and uses standard HTTP methods (GET to read, POST to create, PUT/PATCH to update, DELETE to remove). Responses are usually JSON. There is no envelope, no schema file to download, no separate toolchain.
The contrast with SOAP was stark. Where a SOAP service required a WSDL file, generated client stubs, and a WS-* middleware stack, a REST endpoint could be called with curl in one line. This is exactly what you did in Ingesting from APIs with requests.get().
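The mapping from HTTP methods to operations on a resource can be sketched with the standard library alone. The example below only builds the requests and never sends them. The host `api.example.com`, the `/orders` paths, and the `build_request` helper are assumptions made for illustration.

```python
import json
from typing import Optional
from urllib.request import Request

BASE_URL = "https://api.example.com"  # hypothetical host


def build_request(method: str, path: str, body: Optional[dict] = None) -> Request:
    """Build (but do not send) an HTTP request for a JSON API."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Accept": "application/json"}
    if data is not None:
        headers["Content-Type"] = "application/json"
    return Request(BASE_URL + path, data=data, headers=headers, method=method)


read_order = build_request("GET", "/orders/42")                        # read one order
create_order = build_request("POST", "/orders", {"customer": "ACME"})  # create an order
delete_order = build_request("DELETE", "/orders/42")                   # remove an order

for req in (read_order, create_order, delete_order):
    print(req.get_method(), req.full_url)
```

Passing one of these objects to `urllib.request.urlopen()` would actually send it. The point here is that the URL names the resource and the method names the operation, with no envelope in between.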
<aside> 🤓 Curious Geek: "Simple" is relative
REST is simple compared to SOAP. But "REST" has become a loose term: most APIs called "REST APIs" violate one or more of Fielding's constraints. Fielding himself wrote a frustrated blog post in 2008 noting that APIs advertising REST often ignored statelessness, caching, or hypermedia. What the industry calls REST is usually just "JSON over HTTP". The constraint most often skipped is HATEOAS (Hypermedia as the Engine of Application State), where each response includes links to the actions a client can take next. Few public APIs implement it.
</aside>
For data engineers, REST APIs are the dominant source of external data today. Pagination patterns (?page=1, ?cursor=abc, Link: headers), rate-limit headers (X-RateLimit-Remaining), and authentication schemes (API keys, OAuth tokens) are all conventions you work with daily when calling them.
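Cursor pagination, one of the patterns mentioned above, usually boils down to a loop like the sketch below. The response shape (`items` and `next_cursor`) is an assumption: real APIs use different field names, so check the documentation of the one you are calling. A fake in-memory API stands in for the network so the example runs as-is.

```python
from typing import Callable, Iterator, Optional


def iter_records(fetch_page: Callable[[Optional[str]], dict]) -> Iterator[dict]:
    """Yield records page by page until the API stops returning a cursor."""
    cursor = None
    while True:
        page = fetch_page(cursor)
        yield from page["items"]
        cursor = page.get("next_cursor")
        if not cursor:
            break


# Fake API for illustration: maps a cursor to the page it returns.
FAKE_PAGES = {
    None: {"items": [{"id": 1}, {"id": 2}], "next_cursor": "abc"},
    "abc": {"items": [{"id": 3}], "next_cursor": None},
}


def fake_fetch(cursor: Optional[str]) -> dict:
    return FAKE_PAGES[cursor]


records = list(iter_records(fake_fetch))
print(records)  # [{'id': 1}, {'id': 2}, {'id': 3}]
```

In a real pipeline `fetch_page` would wrap an HTTP call such as `requests.get()` with the cursor passed as a query parameter. Keeping it as an injected function makes the loop easy to test without a network.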
Background for the Going Further section on binary protocols
JSON is human-readable but slow to parse and verbose over the wire. As data volumes grew, companies began using binary serialization formats that skip the string-parsing step entirely.
Protocol Buffers (Protobuf), released by Google in 2008, describe a schema in a .proto file. A code generator produces typed read/write classes in your language of choice. A Protobuf message is typically several times smaller than the equivalent JSON and faster to parse. Apache Avro (2009, from the Hadoop project) took a similar approach but stores the writer's schema alongside the data in each file, so a reader can still decode files written with an older version of the schema.
gRPC (open-sourced by Google in 2015, with a 1.0 release in 2016) built an RPC framework on top of Protobuf and HTTP/2. It gives you typed service definitions, bidirectional streaming, and automatic code generation for about a dozen languages. It is common in microservice communication where teams control both endpoints, and increasingly common as the transport for real-time data pipelines (Kafka-to-service calls, Pub/Sub delivery).
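To get a feel for why a schema known in advance shrinks the payload, here is a rough sketch using Python's built-in `struct` module. This is not Protobuf and does not use its wire format. It is a stand-in that shows the same underlying idea: when both sides agree on field order and types, field names and quotes never need to be sent. The `ORDER_FORMAT` layout and the `order` record are invented for illustration.

```python
import json
import struct

# Assumed layout, agreed on by writer and reader ahead of time:
# little-endian uint32 id, uint16 quantity, float64 total.
ORDER_FORMAT = "<IHd"

# Hypothetical record, made up for this comparison.
order = {"id": 42, "quantity": 3, "total": 199.5}

json_bytes = json.dumps(order).encode("utf-8")
binary_bytes = struct.pack(
    ORDER_FORMAT, order["id"], order["quantity"], order["total"]
)

# The reader needs the same format string to make sense of the bytes.
decoded = struct.unpack(ORDER_FORMAT, binary_bytes)

print(len(json_bytes), "bytes of JSON vs", len(binary_bytes), "bytes of binary")
print(decoded)
```

The trade-off is visible in the last step: without `ORDER_FORMAT` the bytes are meaningless, which is why binary formats push you toward explicit schema management.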
As a data engineer you will encounter Protobuf most often when consuming Kafka or Pub/Sub topics (event streams), calling internal microservice APIs, and reading exports from Google products. Avro appears in Hadoop-lineage systems: HDFS, Hive, and data lake formats built on those.
Each generation solved the previous generation's biggest pain, and left a new one behind:
| Era | Technology | What it solved | What it left behind |
|---|---|---|---|
| 1960s-1990s | EDI / X12 / EDIFACT | Machine-readable business documents | Fixed formats, batch-only, private networks |
| Late 1990s | SOAP / WSDL / XML | Internet-accessible services with schemas | Enormous verbosity and toolchain complexity |
| Mid 2000s | AJAX + JSON | Browser clients, lighter payloads | Informal standards, explosion of incompatible APIs |
| 2000s-present | REST over HTTP | Simplicity: one URL, one verb, one JSON body | No formal schema, no guaranteed behavior |
| 2008-present | Protobuf / Avro / gRPC | Fast binary transport, typed schemas, streaming | Requires schema management, not human-readable |
As a data engineer you do not pick one: you inherit all of them. A single pipeline may ingest a nightly EDI flat file, poll a REST API every hour, and consume a Protobuf Kafka topic in real time. Understanding the history explains why each format exists, what it is optimized for, and where it breaks down.
The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0 *https://hackyourfuture.net/*
