Week 11 - OOP concepts & LLMs

Object oriented programming

Classes and objects

Encapsulation

Code style: clean code

LLMs

Tokenization

Inference

Tools

RAGs

Using LLMs in Code

Practice

Assignment

Back to core program

Tokens and Tokenization

Before an LLM can process your prompt, it must convert the text into numbers. It does this by splitting the text into tokens: chunks that can be whole words, parts of words, or punctuation. For example, Hello might be a single token, while unbelievable could be split into three: un, believ, and able. Each LLM defines its own tokens using a token vocabulary.
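To make this concrete, here is a minimal sketch of one way a tokenizer could split text into known chunks. The vocabulary below is made up for illustration; real LLM tokenizers learn vocabularies of tens of thousands of tokens from data (for example with byte-pair encoding), and their exact splits will differ.

```python
# Hypothetical toy vocabulary; real vocabularies are learned, not hand-written.
VOCAB = ["hello", "un", "believ", "able", "!", " "]

def tokenize(text):
    """Greedy longest-match tokenization against a fixed vocabulary."""
    tokens = []
    i = 0
    lowered = text.lower()
    while i < len(lowered):
        # Try the longest vocabulary entry that matches at position i.
        match = None
        for piece in sorted(VOCAB, key=len, reverse=True):
            if lowered.startswith(piece, i):
                match = piece
                break
        if match is None:
            # Unknown character: fall back to a single-character token.
            match = lowered[i]
        tokens.append(match)
        i += len(match)
    return tokens

print(tokenize("Hello unbelievable!"))
# ['hello', ' ', 'un', 'believ', 'able', '!']
```

Note how the model never sees the word unbelievable as a whole, only the chunks it was split into. You can compare this toy behaviour with a real tokenizer in the hands-on exercise below.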

<aside> 💡

LLMs work in tokens, not text. That's why you'll see the word "tokens" frequently when reading about or working with LLMs.

</aside>

Watch: Tokens high level overview

The following video provides an excellent introduction to tokens with clear examples. Watch it in full.

https://www.youtube.com/watch?v=nKSk_TiR8YA

Watch: Deep dive to tokenization

The following video explains the motivation behind tokenization very well.

<aside> 💭

Watch the following video’s Tokenization chapter (from 7:47 until 14:30)

</aside>

https://youtu.be/7xTGNNLPyMI?si=yR4FoffvqRb-mJf9&t=467

<aside> ⚠️

A larger token vocabulary means each token can represent a longer chunk of text, so the same text becomes a shorter token sequence. Shorter sequences are faster and cheaper to process, and fit more content into the context window.

</aside>

Tokens matter to you as a developer for three practical reasons:

  1. API costs are priced per token, not per word or character.
  2. The context window is measured in tokens, and it has a hard limit.
  3. LLMs struggle with character-level tasks like counting letters or reversing a string, because they never see individual characters, only chunks.
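The first two points can be sketched with a few lines of arithmetic. The price and context limit below are made-up illustration values, not the pricing or limits of any real provider.

```python
# Hypothetical values for illustration only.
PRICE_PER_1K_TOKENS = 0.0005   # dollars per 1,000 tokens (made up)
CONTEXT_WINDOW_TOKENS = 8000   # hard limit on prompt + response (made up)

def estimate(prompt_tokens, response_tokens):
    """Estimate the cost of a request and whether it fits the context window."""
    total = prompt_tokens + response_tokens
    cost = total / 1000 * PRICE_PER_1K_TOKENS
    fits = total <= CONTEXT_WINDOW_TOKENS
    return cost, fits

cost, fits = estimate(prompt_tokens=6000, response_tokens=1500)
print(f"estimated cost: ${cost:.4f}, fits in context window: {fits}")
```

If the same text tokenized into fewer tokens (for example, with a larger vocabulary), both the cost and the context usage would drop.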

<aside> ⌨️

Hands on: Open https://tiktokenizer.vercel.app/ and play with the tool yourself

</aside>

Additional Resources

The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0


*https://hackyourfuture.net/*

Found a mistake or have a suggestion? Let us know in the feedback form.