Throughout the program, you have been using AI tools like ChatGPT and Claude through their websites. These tools run on powerful servers in the cloud. When you send a message, it travels over the internet to a data center, gets processed, and the response is sent back to you.
But what if you could run an AI model on your own machine? No internet needed, no account required, and your conversations stay completely private. This is what local LLMs allow you to do.
In this section, you will use Docker to run Ollama - a framework that makes it easy to download and run open-source language models locally. With one command, you will have a working AI chatbot running inside a container on your laptop. Even better, you will have a free, unlimited API for interacting with the local AI.
We will use the official Ollama image on DockerHub. Run the following command to pull and run a new Ollama container on your machine.
docker run -d -p 11434:11434 --name ollama ollama/ollama
<aside> ⚠️
Big download ahead: Over 3GB. It may take a while.
</aside>
<aside> 💡
The -d flag stands for “Detached”. This means your container runs in the background. You can still start and stop it using the Docker UI or the Docker CLI.
</aside>
Confirm that the container is running with the following command:
docker ps
Let’s get inside the container itself by running the following command:
docker exec -ti ollama bash
You should see a new prompt such as:
root@be038027677d:/#
You are now inside the container, logged in as the root user in a bash shell. You can use common commands such as cd, ls, and cat to explore the container’s file system.
From now on, we will use the ollama command that comes with the container.
ollama --help
Ollama supports a wide variety of open-source models. Check out the full library.
What do parameters mean?
When you see a model described as "1B" or "7B," the number refers to billions of parameters. More parameters generally means the model can understand more complex questions and give better answers, but it also means the model is larger and needs more memory to run.
A rough rule of thumb: every 1B parameters needs about 1 GB of RAM. So a 7B model needs roughly 7 GB of available memory just for the model, on top of what your operating system and other apps are using. Here is a recommendation table for your reference:
| Your RAM | Recommended size | Example models |
|---|---|---|
| 8 GB | Up to 3B | tinyllama, llama3.2:1b, qwen3:0.6b |
| 16 GB | Up to 7B | mistral, llama3.2:3b, gemma3:4b |
| 32 GB | Up to 14B | phi3:14b, qwen2.5:14b |
<aside> ⚠️
Huge models with 70B+ parameters are not meant to run on personal computers. They require a very powerful machine with very expensive hardware.
</aside>
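The rule of thumb and the recommendation table above can be sketched as a small helper. The function names here are purely illustrative - they are not part of Ollama or any library:

```javascript
// Rough rule of thumb: ~1 GB of RAM per billion parameters,
// just for the model weights.
function approxRamGB(billionsOfParams) {
  return billionsOfParams; // e.g. a 7B model needs roughly 7 GB
}

// Mirrors the recommendation table above, which leaves headroom
// for the operating system and other apps.
function recommendedMaxParamsB(totalRamGB) {
  if (totalRamGB >= 32) return 14;
  if (totalRamGB >= 16) return 7;
  if (totalRamGB >= 8) return 3;
  return 1;
}

console.log(approxRamGB(7)); // 7
console.log(recommendedMaxParamsB(16)); // 7
```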
In our demo, we will use the llama3.2:1b model. It’s great for learning purposes.
Inside your Ollama container, pull the llama3.2:1b model:
ollama pull llama3.2:1b
It may take a couple of minutes to download the model.
While still in the container shell, run the model via the CLI by typing:
ollama run llama3.2:1b
<aside> ⚠️
This command will load the model into memory. You will see an increase of about 1 GB in RAM usage.
</aside>
You can now chat with the model directly in the terminal. Try asking: “What is the difference between JavaScript and Java?” Pretty good answer for such a small model.
To exit the chat, type /bye
<aside> ⚠️
The model will still be loaded in memory after leaving the chat. Ollama will remove it from memory after a couple of minutes of inactivity.
</aside>
<aside> 💡
Use ollama ps to see which models are loaded and ollama stop to unload models from memory.
</aside>
The Ollama container comes with a built-in HTTP API, available at http://localhost:11434.
Generating a response
Send an HTTP POST request to http://localhost:11434/api/generate with the following JSON body:
{
  "model": "llama3.2:1b",
  "prompt": "What is the capital of the Netherlands?",
  "stream": false
}
<aside> ⌨️
Hands on: Send the request above using Postman or curl.
</aside>
In the example above, we turned off streaming ("stream": false) to keep the response simple. The request may take up to a minute to return; the longer the LLM’s reply, the longer the wait.
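From JavaScript, you can send the same request with the fetch function built into Node 18+. This is a sketch that assumes the container is running and the llama3.2:1b model has been pulled; the buildPayload helper is just an illustrative name, not part of any library:

```javascript
// Call Ollama's /api/generate endpoint with a plain POST request.
const OLLAMA_URL = "http://localhost:11434/api/generate";

// Build the same JSON body shown above.
function buildPayload(prompt) {
  return { model: "llama3.2:1b", prompt, stream: false };
}

async function generate(prompt) {
  const res = await fetch(OLLAMA_URL, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(buildPayload(prompt)),
  });
  const data = await res.json();
  return data.response; // with stream: false, the full reply arrives as one string
}

// Uncomment to try it against your running container:
// console.log(await generate("What is the capital of the Netherlands?"));
```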
Learn more about the HTTP API in the Ollama API documentation.
Good news! Interacting with Ollama from JavaScript can be done with the same openai package from Week 11. The only change you need to make is to point the baseURL at your Ollama server. The code should already look familiar:
import OpenAI from "openai";

const openai = new OpenAI({
  baseURL: "http://localhost:11434/v1/",
  apiKey: "not-important", // Can be any non-empty string
  timeout: 60000, // 60 seconds; local models can be slow
});

const response = await openai.chat.completions.create({
  model: "llama3.2:1b",
  messages: [{ role: "user", content: "Why is the sky blue? Give a quick explanation." }],
});

const reply = response.choices[0].message.content;
console.log(reply);
Waiting 30 seconds or even a minute for a response can be annoying. Luckily, the API has an option to stream data token by token, so you can start showing some response immediately.
The following code shows how to do it.
import OpenAI from "openai";

const openai = new OpenAI({
  baseURL: "http://localhost:11434/v1/",
  apiKey: "not-important",
  timeout: 60000,
});

const stream = await openai.chat.completions.create({
  model: "llama3.2:1b",
  messages: [{ role: "user", content: "Why is the sky blue?" }],
  stream: true,
});

for await (const chunk of stream) {
  const content = chunk.choices[0]?.delta?.content;
  process.stdout.write(content || "");
}
Notes from the code:
- stream: true enables streaming.
- The chunks are consumed with a for await loop.
- process.stdout.write() is similar to console.log(), but it doesn’t add a new line at the end.
In this demo, we ran the model on the CPU. This works well for small models, but for larger models, the CPU cannot keep up. If you have a powerful graphics card (GPU) on your machine, you can have Ollama use it to speed up response generation significantly.
Learn more: