When you send a prompt, the model doesn't write the full response in one go. It generates one token at a time, each time calculating the most probable next token based on everything before it - your prompt plus all the tokens it has already generated. This loop repeats until the response is complete.
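The loop described above can be sketched in a few lines of Python. The "model" here is just a lookup table from the previous token to candidate next tokens (a real LLM conditions on the entire sequence, and the tokens and probabilities below are invented for illustration), but the generate-one-token-then-repeat shape is the same:

```python
# Toy autoregressive loop. NEXT_TOKEN_PROBS stands in for a real LLM:
# it maps the last token to candidate next tokens with probabilities.
NEXT_TOKEN_PROBS = {
    "<start>": [("The", 1.0)],
    "The":     [("cat", 0.6), ("dog", 0.4)],
    "cat":     [("sat", 0.7), ("ran", 0.3)],
    "dog":     [("sat", 0.5), ("ran", 0.5)],
    "sat":     [("<end>", 1.0)],
    "ran":     [("<end>", 1.0)],
}

def generate(max_tokens=10):
    tokens = ["<start>"]
    for _ in range(max_tokens):
        candidates = NEXT_TOKEN_PROBS[tokens[-1]]
        # Greedy decoding: always commit to the most probable next token.
        # Once appended, a token is never revised.
        next_token = max(candidates, key=lambda c: c[1])[0]
        if next_token == "<end>":
            break
        tokens.append(next_token)
    return " ".join(tokens[1:])

print(generate())  # → "The cat sat"
```

Note that each iteration only ever *appends*: there is no step where an earlier token gets reconsidered, which is the point made in the callout below.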

Source: "Large Language Models explained briefly" by 3Blue1Brown
<aside> 💭
This has an important implication: the model has no ability to go back and revise. Each token is committed before the next one is chosen. What looks like a coherent, considered response is actually the result of thousands of small, sequential predictions.
</aside>
Looking at the animated image above, you may have noticed that the LLM doesn't produce just one candidate for the next token: it offers multiple options, each with a probability.
Why doesn't the model always choose the most probable token?
If the model always picked the most probable token, every response to the same prompt would be identical and predictable. Instead, the model sometimes picks a less likely token, which is what produces varied, creative, and natural-sounding responses. It's a simple mathematical trick for introducing a controlled amount of randomness.
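Picking a less likely token is just a weighted random draw over the candidates. A minimal sketch, using made-up tokens and probabilities and Python's standard-library `random.choices` (real decoders work on thousands of candidates at once):

```python
import random

# Illustrative candidate next tokens with their probabilities.
candidates = [("sat", 0.7), ("ran", 0.2), ("slept", 0.1)]

def sample_token(candidates, rng):
    """Draw one token at random, weighted by its probability."""
    tokens = [t for t, _ in candidates]
    weights = [p for _, p in candidates]
    return rng.choices(tokens, weights=weights, k=1)[0]

rng = random.Random(42)  # seeded so the demo is reproducible
picks = [sample_token(candidates, rng) for _ in range(1000)]

# Over many draws, "sat" dominates, but the less likely tokens still
# appear sometimes: that occasional surprise is what varies the output.
print({t: picks.count(t) for t, _ in candidates})
```

Run the sampler twice with different seeds and you get different sequences from the same probabilities, which is exactly why asking an LLM the same question twice can yield two different answers.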
How much randomness is introduced is controlled by three sampling parameters: temperature, top-k, and top-p.
The context window is the total amount of text the model can "see" at any given moment. Your prompt, the conversation history, and the response being generated all share this space. Once the limit is reached, older content is dropped—not gradually forgotten, but removed entirely, as if it never existed. This is measured in tokens, not words or characters. Modern models have huge context windows - from hundreds of thousands to millions of tokens.
https://www.youtube.com/watch?v=-QVoIxEpFkM
The HackYourFuture curriculum is licensed under CC BY-NC-SA 4.0

Found a mistake or have a suggestion? Let us know in the feedback form.