Running LLMs locally and building AI agents

Running large language models locally is easier than ever thanks to Ollama. Last weekend, I set out to run LLMs on my own machine and build my first AI agents.

Ollama

Ollama is basically a runtime plus package manager (like Node.js), built on top of a C++ inference engine. It ships a CLI and a server that act as an LLM host, loading and running models.

Models are mostly .gguf files (from the llama.cpp ecosystem). These are pre-trained models, quantized for smaller size and faster local inference. Each file contains the model weights, the tokenizer, and model metadata.

How Ollama handles models

Ollama's runtime is basically a wrapper around llama.cpp (the brain), adding an API layer and a model-loading engine. It uses a Modelfile to layer customizations, such as parameters and a system prompt, on top of a base model.


  FROM llama3:8b
  PARAMETER temperature 0.8
  SYSTEM "You are a AI assistant."

How an LLM works

When we provide a prompt to an LLM:

Step 1 (Tokenizer): The text is split into tokens (subwords).
Step 2 (Embedding lookup): Each token ID is converted into a vector.
Step 3 (Transformer layers): The vectors are transformed by the attention mechanism and MLP layers.
Step 4 (Probability): The model outputs a probability distribution over the next token.
Step 5 (Sampling): The next token is picked based on parameters (e.g. temperature).

This loop repeats until a stop condition is reached (the end-of-sequence token).
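
To make the sampling step concrete, here is a toy Python sketch of temperature-based sampling over a made-up next-token distribution (just the idea, not how Ollama implements it internally):

  import math
  import random

  def sample_next_token(logits: dict[str, float], temperature: float = 0.8) -> str:
      """Pick the next token from raw scores using temperature sampling."""
      # Lower temperature sharpens the distribution, higher temperature flattens it
      scaled = {tok: score / temperature for tok, score in logits.items()}
      # Softmax: turn the scaled scores into probabilities
      max_s = max(scaled.values())
      exps = {tok: math.exp(s - max_s) for tok, s in scaled.items()}
      total = sum(exps.values())
      probs = {tok: e / total for tok, e in exps.items()}
      # Draw one token according to its probability
      return random.choices(list(probs), weights=list(probs.values()))[0]

  # Toy next-token scores for the prompt "The sky is"
  print(sample_next_token({"blue": 3.2, "clear": 2.1, "falling": 0.3}))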

Using Ollama

Start by installing Ollama and then pulling a model.


  # Download a model
  ollama pull llama3:8b

  # Run the model
  ollama run llama3

While running, Ollama exposes a REST API with endpoints like /api/generate and /api/chat.


  curl http://localhost:11434/api/generate -d '{
    "model": "llama3:8b",
    "prompt": "Explain quantum computing."
  }'

Creating our first agent

Our game plan: the agent takes the user's input, picks and executes an action, and returns a response to the user. We'll start by writing a thin wrapper around the model.

  # model.py
  import time

  def run_local_model(prompt: str, model: str = "llama3:8b", max_retries: int = 3, temperature: float = 0.3) -> str:
    start = time.time()
    json_prompt = f"""
    {prompt}
    ...
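
One simple way to implement the body is to call Ollama's /api/generate endpoint with the requests library and retry on failure, roughly like this (simplified):

  # model.py -- a simplified sketch of the full function
  import time

  import requests

  OLLAMA_URL = "http://localhost:11434/api/generate"

  def run_local_model(prompt: str, model: str = "llama3:8b",
                      max_retries: int = 3, temperature: float = 0.3) -> str:
      start = time.time()
      payload = {
          "model": model,
          "prompt": prompt,
          "stream": False,  # return a single JSON object instead of a token stream
          "options": {"temperature": temperature},
      }
      for attempt in range(1, max_retries + 1):
          try:
              resp = requests.post(OLLAMA_URL, json=payload, timeout=120)
              resp.raise_for_status()
              result = resp.json()["response"]
              print(f"Model call took {time.time() - start:.1f}s")
              return result
          except (requests.RequestException, KeyError):
              if attempt == max_retries:
                  raise
              time.sleep(1)  # brief pause before retrying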

Next, we list all the possible actions the agent can take.

  # actions.py
  class Actions:

    @staticmethod
    def action1(params):
      ...

    @staticmethod
    def action2(params):
      ...
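
As a rough idea, concrete versions of the two actions used in the planner prompt below (api_call and echo) could look something like this:

  # actions.py -- a rough concrete version of the skeleton above
  import requests

  class Actions:

      @staticmethod
      def api_call(params: dict) -> str:
          """Call an HTTP API and optionally look for a keyword in the response."""
          method = params["method"].upper()  # e.g. "GET" or "POST"
          resp = requests.request(method, params["url"], timeout=30)
          body = resp.text
          key = params.get("key")
          if key:
              return f'"{key}" found in response' if key.lower() in body.lower() else f'"{key}" not found'
          return body[:500]  # keep long responses short

      @staticmethod
      def echo(params: dict) -> str:
          """Return the provided message as-is."""
          return params["message"]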

Now the agent ties these pieces together. The AgentManager class orchestrates the workflow by asking the LLM to create a step-by-step plan based on the user's input and the available actions.

  # agent_manager.py
  class AgentManager:
    ...
    async def run(self):

        # 1. Get a plan from the LLM
        plan_prompt = f"""
        You are an AI task planner.

        The user says:
        {self.user_input}

        Available actions and required parameters:

        1. api_call
        - method (string, required) e.g. "GET" or "POST"
        - url (string, required)
        - key (string, optional) - a keyword to search for in the API response

        2. echo
        - message (string, required)
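
The rest of run() is omitted above; roughly, it parses the plan returned by the LLM and dispatches each step to the matching action. A sketch, assuming the LLM is asked to reply with a JSON list of steps:

  # agent_manager.py -- a rough sketch of the rest of run()
  import json

  from actions import Actions
  from model import run_local_model

  class AgentManager:
      def __init__(self, user_input: str):
          self.user_input = user_input

      async def run(self) -> str:
          # 1. Get a plan from the LLM (using the planner prompt shown above,
          #    asking for a JSON list of steps)
          plan_prompt = f"...planner prompt from above for: {self.user_input}"
          raw_plan = run_local_model(plan_prompt)

          # 2. Parse the plan, e.g.
          #    [{"action": "api_call", "params": {"method": "GET", "url": "..."}}]
          steps = json.loads(raw_plan)

          # 3. Execute each step by dispatching to the matching Actions method
          results = []
          for step in steps:
              handler = getattr(Actions, step["action"])
              results.append(handler(step["params"]))

          # 4. Return the combined results back to the caller
          return "\n".join(str(r) for r in results)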

Putting it all together

Let's build a simple frontend to access our agent: all we need is an HTML page that makes API calls to our backend. The backend handles those calls, passes the input to the AgentManager class, and returns the agent's response.
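
As a rough sketch, the backend could be a few lines of FastAPI (any small web framework would work) exposing a single endpoint for the HTML page to call:

  # backend.py -- minimal FastAPI backend
  from fastapi import FastAPI
  from pydantic import BaseModel

  from agent_manager import AgentManager

  app = FastAPI()

  class ChatRequest(BaseModel):
      message: str

  @app.post("/chat")
  async def chat(req: ChatRequest):
      # Hand the user's input to the agent and return its response to the page
      agent = AgentManager(req.message)
      return {"response": await agent.run()}

  # Run with: uvicorn backend:app --reload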

Running LLMs locally

With Ollama, running LLMs locally is straightforward, and building AI agents is a fun way to explore their potential.

See the complete code on GitHub.

Posted on Aug 09, 2025