Week 13 - Systems

App Development lifecycle

Continuous integration (CI)

Continuous Delivery (CD)

Packaging and Docker

Docker setup

Building an API server

Deployments

Intro to Cloud

Using local LLMs

Appendix 1 yaml syntax

Appendix 2 Docker commands

Practice

Assignment

Core program

Running an LLM on your own machine

Throughout the program, you have been using AI tools like ChatGPT and Claude through their websites. These tools run on powerful servers in the cloud. When you send a message, it travels over the internet to a data center, gets processed, and the response is sent back to you.

But what if you could run an AI model on your own machine? No internet needed, no account required, and your conversations stay completely private. This is what local LLMs allow you to do.

In this section, you will use Docker to run Ollama, a framework that makes it easy to download and run open-source language models on your own machine. With one command, you will have a working AI chatbot running inside a container on your laptop. Even better, you will have a free, unlimited API to interact with the local AI.

Step 1: Pull and run Ollama container

We will use the official Ollama image on DockerHub. Run the following command to pull and run a new Ollama container on your machine.

docker run -d -p 11434:11434 --name ollama ollama/ollama

<aside> ⚠️

Big download ahead: Over 3GB. It may take a while.

</aside>

<aside> 💡

The -d flag stands for “Detached”. This means your container runs in the background. You can still start and stop it using the Docker UI or the Docker CLI.

</aside>

Confirm that the container is running with the command

docker ps

Step 2: Get into the container

Let’s get inside the container itself by running the following command:

docker exec -ti ollama bash

You should see a new prompt similar to:

root@be038027677d:/# 

You are now inside the container, logged in as the root user in a bash shell. You can use common commands such as cd, ls, and cat to explore the container’s file system.

From now on, we will use the ollama command that comes with the container.

ollama --help

Step 3: Choosing a model

Ollama supports a wide variety of open-source models. Check out the full library.

What do parameters mean?

When you see a model described as "1B" or "7B," the number refers to billions of parameters. More parameters generally means the model can understand more complex questions and give better answers, but it also means the model is larger and needs more memory to run.

A rough rule of thumb: every 1B parameters needs about 1 GB of RAM. So a 7B model needs roughly 7 GB of available memory just for the model, on top of what your operating system and other apps are using. Here is a recommendation table for your reference:

| Your RAM | Recommended size | Example models |
| --- | --- | --- |
| 8 GB | Up to 3B | tinyllama, llama3.2:1b, qwen3:0.6b |
| 16 GB | Up to 7B | mistral, llama3.2:3b, gemma3:4b |
| 32 GB | Up to 14B | phi3:14b, qwen2.5:14b |

<aside> ⚠️

Huge models with 70B+ parameters are not meant to run on personal computers. They require a very powerful machine with very expensive hardware.

</aside>
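The table above can be sketched in a few lines of JavaScript. The function name and thresholds below are just an illustration of the rule of thumb (about 1 GB of RAM per 1B parameters), not part of Ollama:

```javascript
// Hypothetical helper: map available RAM to the model sizes
// recommended in the table above.
function recommendedModelSize(ramGB) {
  if (ramGB >= 32) return "up to 14B";
  if (ramGB >= 16) return "up to 7B";
  if (ramGB >= 8) return "up to 3B";
  return "too little RAM for a local model";
}

console.log(recommendedModelSize(16)); // → "up to 7B"
```

Remember that your operating system and other apps also need memory, so treat these numbers as upper bounds rather than targets.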

In our demo, we will use the llama3.2:1b model. It’s great for learning purposes.

Step 4: Pull a model

Inside your Ollama container, pull the llama3.2:1b model:

ollama pull llama3.2:1b

It may take a couple of minutes to download the model.

Step 5: Run the model locally

While still in the container shell, run the model via the CLI by typing:

ollama run llama3.2:1b

<aside> ⚠️

This command will load the model into memory. You will see an increase of about 1 GB in RAM usage.

</aside>

You can now interact with the model directly in the terminal. Try a prompt such as “What is the difference between JavaScript and Java?”. For such a small model, the answer is pretty good.

To exit the chat, type /bye

<aside> ⚠️

The model will still be loaded in memory after leaving the chat. Ollama will remove it from memory after a couple of minutes of inactivity.

</aside>

<aside> 💡

Use ollama ps to see which models are loaded and ollama stop to unload models from memory.

</aside>

Step 6: Using the HTTP API

The Ollama container comes with a built-in HTTP API, available at http://localhost:11434.

Generating a response: send an HTTP POST request to http://localhost:11434/api/generate with the JSON body:

{
  "model": "llama3.2:1b",
  "prompt": "What is the capital of the Netherlands?",
  "stream": false
}

<aside> ⌨️

Hands on: Send the request above using Postman or curl.

</aside>

In the example above, we turned off streaming to keep the response simple. The request may take up to a minute to return, depending on the length of the LLM’s reply.
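If you prefer code over Postman, the same request can be sent from Node.js (version 18 or later, which ships with fetch). The `buildGenerateRequest` helper below is just this sketch's way of organizing the call; only the URL and JSON body come from the API described above:

```javascript
// Sketch: the /api/generate call from Node.js (18+).
// buildGenerateRequest is a hypothetical helper for this example.
function buildGenerateRequest(model, prompt) {
  return {
    url: "http://localhost:11434/api/generate",
    options: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ model, prompt, stream: false }),
    },
  };
}

const { url, options } = buildGenerateRequest(
  "llama3.2:1b",
  "What is the capital of the Netherlands?"
);

// With the Ollama container running, send it like this:
// const res = await fetch(url, options);
// const data = await res.json();
// console.log(data.response); // the model's reply
console.log(url); // → http://localhost:11434/api/generate
```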

Learn more about the HTTP API in the Ollama API documentation

Step 7: Connecting with JavaScript

Good news! Interacting with Ollama from JavaScript can be done with the same openai package from Week 11. The only changes you need to make:

  1. Change the baseURL to your Ollama server
  2. Set the API key to any non-empty string. You don’t need a real API key when interacting with Ollama locally
  3. Increase the timeout to leave room for longer responses

The code should look already familiar:

import OpenAI from "openai";

const openai = new OpenAI({
  baseURL: "http://localhost:11434/v1/",
  apiKey: "not-important", // Can be any non-empty string
  timeout: 60000, // 60 seconds, in case the model is slow to respond
});

const response = await openai.chat.completions.create({
  model: "llama3.2:1b",
  messages: [{ role: "user", content: "Why is the sky blue? Give a quick explanation." }],
});

const reply = response.choices[0].message.content;
console.log(reply);

Bonus: Getting fancy with streaming

Waiting 30 seconds or even a minute for a response can be annoying. Luckily, the API has an option to stream data token by token, so you can start showing some response immediately.

The following code shows how to do it.

import OpenAI from "openai";

const openai = new OpenAI({
  baseURL: "http://localhost:11434/v1/",
  apiKey: "not-important",
  timeout: 60000,
});

const stream = await openai.chat.completions.create({
  model: "llama3.2:1b",
  messages: [{ role: "user", content: "Why is the sky blue?" }],
  stream: true
});

for await (const chunk of stream) {
  const content = chunk.choices[0]?.delta?.content;
  process.stdout.write(content || "");
}

Notes from the code:

<aside> ❗

Each streamed chunk carries only a small piece (a “delta”) of the reply. The optional chaining (?.) and the || "" fallback guard against chunks that contain no content, such as the final chunk.

</aside>
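To see how the streaming loop behaves without a running model, you can feed it a fake stream. The async generator below stands in for the live Ollama connection; the chunk shape matches what the openai package yields:

```javascript
// A fake stream (async generator) standing in for a live connection.
// Each chunk carries a small "delta" piece of the reply; the final chunk
// has no content, which is why the `|| ""` guard in the loop matters.
async function* fakeStream() {
  const pieces = ["The sky ", "is blue ", "because..."];
  for (const piece of pieces) {
    yield { choices: [{ delta: { content: piece } }] };
  }
  yield { choices: [{ delta: {} }] }; // final chunk, no content
}

// Same loop shape as the real streaming example above,
// but also accumulating the full reply into a string.
let full = "";
for await (const chunk of fakeStream()) {
  const content = chunk.choices[0]?.delta?.content;
  process.stdout.write(content || "");
  full += content || "";
}
console.log(); // finish the line
// full is now "The sky is blue because..."
```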

Bonus: Running Ollama on your GPU

In this demo, we ran the model on the CPU. This works well for small models, but for larger models, the CPU cannot keep up. If you have a powerful graphics card (GPU) on your machine, you can have Ollama use it to speed up response generation significantly.

Learn more: