Before an LLM can process your prompt, it needs to convert text into numbers. It does this by splitting text into tokens - chunks that can be words, parts of words, or punctuation. For example, "Hello" can be a single token, while "unbelievable" might be three: "un", "believe", and "able". Each LLM defines its own tokens using a token vocabulary.
<aside> 💡
LLMs work in tokens, not text. That's why you'll see the word "tokens" frequently when reading about or working with LLMs.
</aside>
The following video provides an excellent introduction to tokens with clear examples. Watch it in full.
https://www.youtube.com/watch?v=nKSk_TiR8YA
The following video explains the motivation behind tokenization really well.
<aside> 💭
Watch the following video’s Tokenization chapter (from 7:47 until 14:30)
</aside>
https://youtu.be/7xTGNNLPyMI?si=yR4FoffvqRb-mJf9&t=467
<aside> ⚠️
A larger token vocabulary means more concepts map to a single token, resulting in shorter sequences. Shorter sequences are faster, cheaper, and fit more content into the context window.
</aside>
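You can see this effect with a small experiment. The snippet below is a sketch, not a real tokenizer: it compares a character-only vocabulary against one containing larger chunks, showing that the bigger vocabulary covers the same text in far fewer tokens.

```python
def greedy_tokenize(text: str, vocab: set[str]) -> list[str]:
    """Greedily match the longest vocabulary entry; fall back to single characters."""
    tokens, i = [], 0
    while i < len(text):
        match = next(
            (text[i:j] for j in range(len(text), i, -1) if text[i:j] in vocab),
            text[i],  # fallback: one character
        )
        tokens.append(match)
        i += len(match)
    return tokens

small_vocab = set("tokenizat")           # characters only
large_vocab = {"token", "ization"}       # includes common multi-character chunks

print(len(greedy_tokenize("tokenization", small_vocab)))  # 12 tokens
print(len(greedy_tokenize("tokenization", large_vocab)))  # 2 tokens
```

The same word costs 12 tokens with the tiny vocabulary but only 2 with the larger one - the trade-off the callout above describes.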
Tokens matter to you as a developer for three practical reasons: they determine how fast the model responds, what each request costs, and how much content fits in the context window.
<aside> ⌨️
Hands on: Open https://tiktokenizer.vercel.app/ and play with the tool yourself
</aside>
The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0
