Running LLMs locally and building AI agents
Running large language models locally is easier than ever thanks to Ollama. Last weekend, I set out to run LLMs on my machine and build my first AI agents.
Ollama
Ollama is basically a runtime + package manager (like Node.js), built on top of a C++ inference engine. It provides a CLI and a server that act as an LLM host, loading and running models.
Models are mostly .gguf files (from the llama.cpp ecosystem). These are pre-trained, then quantized for smaller size and faster local inference. The files contain the model weights, a tokenizer, and model metadata.

Ollama's runtime is basically a wrapper around llama.cpp (the brain) with API support and a model-loading engine. It uses a Modelfile to customize a base model with a system prompt, parameters, or adapter layers.
FROM llama3:8b
PARAMETER temperature 0.8
SYSTEM "You are a AI assistant."

When we provide a prompt to an LLM:
Step 1 (Tokenizer): The text is split into tokens (subwords).
Step 2 (Embedding lookup): Each token ID is converted into a vector.
Step 3 (Transformers): The vectors are transformed by attention and MLP layers.
Step 4 (Probability): The model outputs a probability distribution over the next token.
Step 5 (Sampling): The next token is picked based on parameters (e.g. temperature).
This loops until a stop (end-of-sequence) token is produced.
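As a toy illustration of the sampling step (not Ollama's actual code), temperature sampling over a next-token distribution looks roughly like this:
# Toy illustration of temperature sampling - not Ollama's real implementation
import math
import random

def sample_next_token(logits, temperature=0.8):
    """Pick the next token from a {token: logit} map using temperature sampling."""
    # Lower temperature -> sharper (more deterministic) distribution
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    # Softmax turns scaled logits into probabilities
    max_l = max(scaled.values())
    exps = {tok: math.exp(v - max_l) for tok, v in scaled.items()}
    total = sum(exps.values())
    probs = {tok: v / total for tok, v in exps.items()}
    # Sample one token according to the distribution
    return random.choices(list(probs), weights=list(probs.values()), k=1)[0]

print(sample_next_token({"cat": 2.0, "dog": 1.5, "<eos>": 0.1}))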
Using Ollama
Start by downloading Ollama and then a model.
# Download a model
ollama pull llama3:8b
# Run the model
ollama run llama3:8b
While running, Ollama exposes a REST API (on port 11434 by default) with endpoints like /api/generate and /api/chat.
curl http://localhost:11434/api/generate -d '{
"model": "llama3:8b",
"prompt": "Explain quantum computing."
}'
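The /api/chat endpoint works the same way but takes a list of chat messages; setting stream to false asks for a single JSON response instead of streamed chunks:
curl http://localhost:11434/api/chat -d '{
"model": "llama3:8b",
"messages": [{"role": "user", "content": "Explain quantum computing."}],
"stream": false
}'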
Creating our first agent
Our game plan: the agent takes the user's input, picks and executes an action, and returns a response to the user. We'll start by writing a small wrapper around the local model.
# model.py
def run_local_model(prompt: str, model: str = "llama3:8b", max_retries: int = 3, temperature: float = 0.3) -> str:
    start = time.time()
    json_prompt = f"""
{prompt}
...
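The body is elided above; here is a rough, simplified sketch of what such a wrapper could look like, assuming it posts to Ollama's /api/generate endpoint with basic retry logic (the exact prompt formatting is omitted, so this is a stand-in, not the post's full code):
# model.py (sketch) - hypothetical body for run_local_model
import time
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local Ollama endpoint

def run_local_model(prompt: str, model: str = "llama3:8b",
                    max_retries: int = 3, temperature: float = 0.3) -> str:
    """Send a prompt to the local Ollama server and return the generated text."""
    start = time.time()
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,                          # ask for a single JSON response
        "options": {"temperature": temperature},
    }
    for attempt in range(max_retries):
        try:
            resp = requests.post(OLLAMA_URL, json=payload, timeout=120)
            resp.raise_for_status()
            print(f"LLM call took {time.time() - start:.1f}s")
            return resp.json()["response"]
        except requests.RequestException:
            if attempt == max_retries - 1:
                raise
            time.sleep(1)                         # brief back-off before retrying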
Then we define all the possible actions the agent can take.
# actions.py
class Actions:
    @staticmethod
    def action1(params):
        ...

    @staticmethod
    def action2(params):
        ...
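For illustration, here is what the two actions our planner will offer later in this post (api_call and echo) could look like; the implementations and return values are my own sketch, not the post's original code:
# actions.py (sketch) - illustrative implementations of the planner's two actions
import requests

class Actions:
    @staticmethod
    def api_call(params: dict) -> str:
        """Make an HTTP request and optionally search the response for a keyword."""
        method = params["method"].upper()         # e.g. "GET" or "POST"
        url = params["url"]
        key = params.get("key")                   # optional keyword to look for
        resp = requests.request(method, url, timeout=30)
        body = resp.text
        if key:
            return f'"{key}" found' if key.lower() in body.lower() else f'"{key}" not found'
        return body[:500]                         # truncate long responses

    @staticmethod
    def echo(params: dict) -> str:
        """Return the message back to the user."""
        return params["message"]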
Now we tie these together. The AgentManager class orchestrates the agent workflow: it asks the LLM to create a step-by-step plan based on the user's input and the available actions, then executes that plan.
# agent_manager.py
class AgentManager:
    ...
    async def run(self):
        # 1. Get a plan from the LLM
        plan_prompt = f"""
You are an AI task planner.
The user says:
{self.user_input}
Available actions and required parameters:
1. api_call
   - method (string, required) e.g. "GET" or "POST"
   - url (string, required)
   - key (string, optional) - a keyword to search for in the API response
2. echo
   - message (string, required)
"""
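The rest of run() is omitted above. Here is a rough sketch of how it could continue, assuming we ask the planner to reply with a JSON array of steps and reuse run_local_model and Actions from earlier; the condensed prompt and the __init__ are my own stand-ins:
# agent_manager.py (sketch) - hypothetical continuation of run()
import json
from model import run_local_model
from actions import Actions

class AgentManager:
    def __init__(self, user_input: str):
        self.user_input = user_input

    async def run(self) -> str:
        # Condensed version of the planner prompt shown above, plus an output-format hint
        plan_prompt = (
            f"You are an AI task planner. The user says: {self.user_input}\n"
            "Available actions: api_call(method, url, key), echo(message).\n"
            'Respond ONLY with a JSON array of steps, e.g. '
            '[{"action": "echo", "params": {"message": "hi"}}]'
        )

        # 2. Ask the local model for a plan and parse it as JSON
        raw_plan = run_local_model(plan_prompt)
        steps = json.loads(raw_plan)

        # 3. Execute each step by dispatching to the matching Actions method
        results = []
        for step in steps:
            handler = getattr(Actions, step["action"])
            results.append(handler(step["params"]))

        # 4. Return the combined results as the agent's response
        return "\n".join(str(r) for r in results)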
Putting it all together
Let's build a frontend to access our agent. All we need is a simple HTML page that makes API calls to our backend. The backend handles these endpoints, passes the input to the AgentManager class, and returns the response.
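A minimal backend sketch, assuming FastAPI (the framework, the /chat endpoint, and the request shape are my own choices, not from the post):
# app.py (sketch) - minimal backend wiring the agent to an HTTP endpoint
from fastapi import FastAPI
from pydantic import BaseModel
from agent_manager import AgentManager

app = FastAPI()

class ChatRequest(BaseModel):
    message: str

@app.post("/chat")
async def chat(req: ChatRequest):
    # Pass the user's input to the agent and return its response
    agent = AgentManager(req.message)
    response = await agent.run()
    return {"response": response}
The HTML page then only needs a fetch call that POSTs the user's message to /chat and renders the JSON response.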

With Ollama, running LLMs locally is straightforward, and building AI agents is a fun way to explore their potential.