Before an LLM can process your prompt, it needs to convert text into numbers. It does this by splitting text into tokens - chunks that can be words, parts of words, or punctuation. For example, "Hello" can be a single token, while "unbelievable" might be three: "un", "believe", and "able". Each LLM defines its own tokens using a token vocabulary.
<aside> 💡
LLMs work in tokens, not text. That's why you'll see the word "tokens" frequently when reading about or working with LLMs.
</aside>
The following video provides an excellent introduction to tokens with clear examples. Watch it in full.
https://www.youtube.com/watch?v=nKSk_TiR8YA
The following video explains the motivation behind tokenization really well.
<aside> 💭
Watch the following video’s Tokenization chapter (from 7:47 until 14:30)
</aside>
https://youtu.be/7xTGNNLPyMI?si=yR4FoffvqRb-mJf9&t=467
<aside> ⚠️
A larger token vocabulary means more concepts map to a single token, resulting in shorter sequences. Shorter sequences are faster, cheaper, and fit more content into the context window.
</aside>
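You can see this effect with a small experiment. The snippet below is a sketch, not a real tokenizer: it compares a character-only vocabulary against one containing larger chunks, showing that the bigger vocabulary covers the same text in far fewer tokens.

```python
def greedy_tokenize(text: str, vocab: set[str]) -> list[str]:
    """Greedily match the longest vocabulary entry; fall back to single characters."""
    tokens, i = [], 0
    while i < len(text):
        match = next(
            (text[i:j] for j in range(len(text), i, -1) if text[i:j] in vocab),
            text[i],  # fallback: one character
        )
        tokens.append(match)
        i += len(match)
    return tokens

small_vocab = set("tokenizat")           # characters only
large_vocab = {"token", "ization"}       # includes common multi-character chunks

print(len(greedy_tokenize("tokenization", small_vocab)))  # 12 tokens
print(len(greedy_tokenize("tokenization", large_vocab)))  # 2 tokens
```

The same word costs 12 tokens with the tiny vocabulary but only 2 with the larger one - the trade-off the callout above describes.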
Tokens matter to you as a developer for three practical reasons: they determine how fast the model responds, what each request costs, and how much content fits in the context window.
<aside> ⌨️
Hands on: Open https://tiktokenizer.vercel.app/ and play with the tool yourself
</aside>
The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0
