When you send a prompt, the model doesn't write the full response in one go. It generates one token at a time, each time calculating the most probable next token based on everything before it - your prompt plus all the tokens it has already generated. This loop repeats until the response is complete.
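The loop described above can be sketched in a few lines of Python. The "model" here is just a lookup table from the previous token to candidate next tokens (a real LLM conditions on the entire sequence, and the tokens and probabilities below are invented for illustration), but the generate-one-token-then-repeat shape is the same:

```python
# Toy autoregressive loop. NEXT_TOKEN_PROBS stands in for a real LLM:
# it maps the last token to candidate next tokens with probabilities.
NEXT_TOKEN_PROBS = {
    "<start>": [("The", 1.0)],
    "The":     [("cat", 0.6), ("dog", 0.4)],
    "cat":     [("sat", 0.7), ("ran", 0.3)],
    "dog":     [("sat", 0.5), ("ran", 0.5)],
    "sat":     [("<end>", 1.0)],
    "ran":     [("<end>", 1.0)],
}

def generate(max_tokens=10):
    tokens = ["<start>"]
    for _ in range(max_tokens):
        candidates = NEXT_TOKEN_PROBS[tokens[-1]]
        # Greedy decoding: always commit to the most probable next token.
        # Once appended, a token is never revised.
        next_token = max(candidates, key=lambda c: c[1])[0]
        if next_token == "<end>":
            break
        tokens.append(next_token)
    return " ".join(tokens[1:])

print(generate())  # → "The cat sat"
```

Note that each iteration only ever *appends*: there is no step where an earlier token gets reconsidered, which is the point made in the callout below.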

Source: "Large Language Models explained briefly" by 3Blue1Brown
<aside> 💭
This has an important implication: the model has no ability to go back and revise. Each token is committed before the next one is chosen. What looks like a coherent, considered response is actually the result of thousands of small, sequential predictions.
</aside>
Looking at the animated image above, you may have noticed that the LLM doesn't produce just one candidate for the next token: it offers multiple options, each with a probability.
Why doesn't the model always choose the most probable token?
If the model always picked the most probable token, every response to the same prompt would be identical and predictable. Instead, the model sometimes picks a less likely token, which is what produces varied, creative, and natural-sounding responses. It's a simple mathematical trick for introducing a controlled amount of randomness.
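Picking a less likely token is just a weighted random draw over the candidates. A minimal sketch, using made-up tokens and probabilities and Python's standard-library `random.choices` (real decoders work on thousands of candidates at once):

```python
import random

# Illustrative candidate next tokens with their probabilities.
candidates = [("sat", 0.7), ("ran", 0.2), ("slept", 0.1)]

def sample_token(candidates, rng):
    """Draw one token at random, weighted by its probability."""
    tokens = [t for t, _ in candidates]
    weights = [p for _, p in candidates]
    return rng.choices(tokens, weights=weights, k=1)[0]

rng = random.Random(42)  # seeded so the demo is reproducible
picks = [sample_token(candidates, rng) for _ in range(1000)]

# Over many draws, "sat" dominates, but the less likely tokens still
# appear sometimes: that occasional surprise is what varies the output.
print({t: picks.count(t) for t, _ in candidates})
```

Run the sampler twice with different seeds and you get different sequences from the same probabilities, which is exactly why asking an LLM the same question twice can yield two different answers.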
How much randomness is introduced is controlled by three sampling parameters: temperature, top-k, and top-p.
The context window is the total amount of text the model can "see" at any given moment. Your prompt, the conversation history, and the response being generated all share this space. Once the limit is reached, older content is dropped—not gradually forgotten, but removed entirely, as if it never existed. This is measured in tokens, not words or characters. Modern models have huge context windows - from hundreds of thousands to millions of tokens.
https://www.youtube.com/watch?v=-QVoIxEpFkM
The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0

Found a mistake or have a suggestion? Let us know in the feedback form.