How to Build a Local AI Agent with Python & Ollama

Abhishek madoliya | 5 Mar 2026 | 14 min read
Quick Answer

A complete guide to building a fully local, privacy-first AI agent using open-source tools: no cloud APIs, no data sent off your machine, and no recurring cost.

To build a local AI agent with Python, Ollama, LangChain, and RAG:

  1. Install Ollama and pull a local LLM model
  2. Set up a Python virtual environment and install dependencies
  3. Load and chunk documents for your knowledge base
  4. Generate embeddings using a local embedding model
  5. Store vectors in ChromaDB (local vector database)
  6. Build a retrieval chain with LangChain
  7. Run your AI agent and query it locally

Why Run Your AI Agent Locally?

Developers who rely exclusively on cloud AI APIs — OpenAI, Anthropic Claude, Gemini — face a trio of real problems: recurring API costs, data privacy risks, and hard dependency on external infrastructure. The moment the API goes down or pricing changes, your application breaks or gets expensive.

There is a better approach: build a local AI agent that runs entirely on your machine. With Ollama serving open-source LLMs and LangChain orchestrating the retrieval pipeline, you can build a powerful, private AI assistant that:

  • Processes and retrieves from your own documents
  • Generates answers using a locally running large language model
  • Never sends any data to an external server
  • Costs exactly zero dollars beyond the hardware you already own

This guide walks you through every step of the process — from installing Ollama to running a full Retrieval-Augmented Generation (RAG) pipeline. If you are evaluating open-source models for this pipeline, check our guide to the best open-source LLMs for coding in 2026.

If you are weighing AI automation platforms, note that developers often compare OpenClaw and n8n for automation workflows and AI agents. Our detailed comparison covers the key differences, features, and use cases: OpenClaw vs n8n: Which AI Automation Tool is Better in 2026?

What Is a Local AI Agent?

A local AI agent is an application that uses a locally executed large language model to reason about information, retrieve context from a knowledge base, and respond to user queries — all without issuing a single network request to a cloud provider.

Three concepts underpin every local AI agent:

Large Language Model (LLM)

A neural network trained on massive text corpora that generates coherent, contextually relevant responses. Models like Llama 3, Mistral, and Qwen 2.5 are production-quality and can run on consumer hardware.

Retrieval-Augmented Generation (RAG)

RAG is an architectural pattern that injects relevant documents into the LLM's context window before it generates a response. Instead of relying solely on the model's training data, the agent first retrieves the most relevant chunks from your private knowledge base, then generates a grounded answer.
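The retrieve-then-generate idea can be shown without any model at all. The sketch below is a toy, not the pipeline built later in this guide: it ranks chunks by plain word overlap instead of embeddings, and the names `tokenize`, `retrieve`, and `build_prompt` are hypothetical helpers introduced only for illustration.

```python
import re

def tokenize(text: str) -> set[str]:
    """Lowercase a string and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Rank chunks by how many words they share with the question (toy scoring)."""
    q = tokenize(question)
    ranked = sorted(chunks, key=lambda c: len(q & tokenize(c)), reverse=True)
    return ranked[:k]

def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Inject the retrieved chunks into the prompt ahead of the question."""
    context = "\n\n".join(context_chunks)
    return f"Context:\n{context}\n\nQuestion: {question}\n\nAnswer:"

knowledge_base = [
    "Ollama serves open-source models on localhost port 11434.",
    "ChromaDB persists vectors to a local directory.",
    "Bananas are rich in potassium.",
]

question = "Which port does Ollama use?"
top_chunks = retrieve(question, knowledge_base, k=1)
prompt = build_prompt(question, top_chunks)
print(prompt)
```

A real agent swaps the word-overlap scoring for embedding similarity, but the shape stays the same: retrieve first, then hand the model a prompt that already contains the evidence.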

Agentic Orchestration (LangChain)

LangChain connects the retriever, the prompt template, the LLM, and the output parser into a coherent chain. It handles chunking strategies, retrieval logic, and chain composition so you can build complex workflows with minimal boilerplate.

Tech Stack Overview

Every component in this stack is open-source and runs completely offline once models are downloaded.

Tool | Role | Why This Choice
Python 3.11+ | Primary language | Best ecosystem for AI/ML tooling
Ollama | Local LLM runtime | One-command model downloads, OpenAI-compatible API
LangChain | Agent orchestration | Document loaders, splitters, retrievers, chains
ChromaDB | Vector database | Embedded, no server required, fast similarity search
mxbai-embed-large | Embedding model | State-of-the-art local embeddings via Ollama

Prefer working with cloud-assisted local tools? See our OpenRouter API + Next.js tutorial for a hybrid approach.

System Architecture

Before writing any code, it helps to understand how data flows through the agent at runtime.

User Query → LangChain: Prompt Template + Chain → Retriever (Similarity Search) → ChromaDB (Local Vector Store) → Ollama: Local LLM (Llama 3) → Generated Response

At query time the agent embeds the user's question, searches ChromaDB for the top-k most similar document chunks, injects those chunks into the prompt, and passes the full prompt to the local LLM for generation. The entire inference loop takes place on your machine.
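The four query-time stages can be written as one small function with the components passed in as callables. This is a sketch of the data flow only: `answer_query` and the `fake_*` stand-ins are hypothetical names, and the stand-ins exist so the example runs without Ollama or ChromaDB installed.

```python
from typing import Callable, Sequence

def answer_query(
    question: str,
    embed: Callable[[str], Sequence[float]],
    search: Callable[[Sequence[float], int], list[str]],
    generate: Callable[[str], str],
    k: int = 4,
) -> str:
    """Run the four query-time stages in order and return the model's answer."""
    query_vector = embed(question)        # 1. embed the question
    chunks = search(query_vector, k)      # 2. top-k similarity search
    context = "\n\n".join(chunks)         # 3. inject chunks into the prompt
    prompt = f"Context:\n{context}\n\nQuestion: {question}\n\nAnswer:"
    return generate(prompt)               # 4. local LLM generation

# Stand-in components (assumptions for illustration, not real model calls).
def fake_embed(text: str) -> list[float]:
    return [float(len(text))]

def fake_search(vector: Sequence[float], k: int) -> list[str]:
    return ["chunk A", "chunk B", "chunk C"][:k]

def fake_generate(prompt: str) -> str:
    return f"(model saw {prompt.count('chunk')} chunks)"

result = answer_query("What is RAG?", fake_embed, fake_search, fake_generate, k=2)
print(result)  # (model saw 2 chunks)
```

Keeping the stages behind plain callables makes it easy to test the flow in isolation before wiring in the real embedding model, vector store, and LLM.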

Prerequisites

  • Python 3.9 or higher (3.11+ recommended)
  • 8 GB RAM minimum — 16 GB for comfortable Llama 3 8B inference
  • Ollama installed (see Step 1)
  • Basic familiarity with the Python command line
  • A directory of text files, PDFs, or Markdown documents as your knowledge base

Pro Tip: GPU acceleration is not required but dramatically speeds up inference. Ollama automatically offloads layers to a CUDA or Metal GPU if one is detected. Even on CPU-only machines, Llama 3 8B delivers usable response times for document Q&A workloads.

Step 1 — Install Ollama and Download Models

Ollama manages the local model lifecycle (download, storage, and serving) behind a minimal CLI. It serves a REST API at http://localhost:11434, including OpenAI-compatible endpoints under /v1.

Linux

Terminal
curl -fsSL https://ollama.com/install.sh | sh

macOS / Windows

The install script targets Linux. On macOS and Windows, download the installer from ollama.com/download and run it. On Windows, Ollama adds itself to your PATH automatically.

Pull the Models

Terminal
# Main language model (LLM) — Llama 3 8B
ollama pull llama3

# Embedding model for vector search
ollama pull mxbai-embed-large

The LLM handles generation. The embedding model converts text chunks and queries into dense vector representations so that ChromaDB can perform semantic similarity search.
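Once the models are pulled, a quick way to confirm the server answers is to call its REST API directly. The sketch below uses only the standard library and Ollama's `/api/generate` endpoint with streaming turned off. The helper names `build_generate_request` and `generate` and the 120-second timeout are assumptions made for this example.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # Ollama's default address, as used in this guide

def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt to a running Ollama server and return the generated text."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]

# Building the request needs no server, so it is safe to inspect offline.
req = build_generate_request("llama3", "Reply with the single word: ready")
print(req.full_url)
print(req.data.decode("utf-8"))
```

With `ollama serve` running and `llama3` pulled, calling `generate("llama3", "Reply with the single word: ready")` should return a short completion. The first call may be slow while the model loads into memory.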

Alternative models: mistral, qwen2.5:7b, and phi3 also work for document Q&A, with phi3 being the lightest of the three. Compare open-source model performance in our open-source LLMs for coding guide.

Step 2 — Set Up the Python Environment

Isolate dependencies in a virtual environment to avoid conflicts with other projects.

Terminal
# Create and activate virtual environment
python -m venv venv

# Linux / macOS
source venv/bin/activate

# Windows
.\venv\Scripts\activate

Now install the required packages:

Terminal
pip install langchain langchain-ollama langchain-community chromadb
pip install tiktoken unstructured pypdf

Package | Purpose
langchain | Core chain and retriever abstractions
langchain-ollama | Ollama LLM and embeddings integration
langchain-community | Document loaders (PDF, Markdown, etc.)
chromadb | Local embedded vector database
tiktoken | Token counting for chunk sizing
unstructured | Parsers for additional file types (not needed by the plain-text loader used below)
pypdf | PDF parsing for knowledge base ingestion

Set up the project directory structure:

Terminal
mkdir -p local-ai-agent/data
cd local-ai-agent
touch ingest.py agent.py

Step 3 — Load Documents for RAG

Your agent's knowledge comes from documents you provide. Drop any text files, PDFs, or Markdown files inside the data/ directory. LangChain's document loaders convert them into a list of Document objects that can be embedded and stored.

ingest.py
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load all .txt files from the data directory (change the glob to "**/*.md" for Markdown)
loader = DirectoryLoader(
    "./data",
    glob="**/*.txt",
    loader_cls=TextLoader,
    loader_kwargs={"encoding": "utf-8"}
)

documents = loader.load()
print(f"Loaded {len(documents)} documents")

# Split into overlapping chunks for better retrieval accuracy
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64
)

chunks = splitter.split_documents(documents)
print(f"Created {len(chunks)} text chunks")

The chunk size (512 characters, because RecursiveCharacterTextSplitter measures length in characters by default) controls how much text each vector represents. Smaller chunks improve retrieval precision; larger chunks preserve more surrounding context. The overlap (64 characters) reduces the chance that important context is split across chunk boundaries.
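To see what size and overlap do, here is a minimal character-based sliding window. It is a simplification and an assumption-laden sketch: LangChain's splitter also tries to break on paragraph and sentence separators, which this hypothetical `chunk_text` helper ignores.

```python
def chunk_text(text: str, chunk_size: int = 512, chunk_overlap: int = 64) -> list[str]:
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reaches the end of the text
        start += step
    return chunks

# Tiny values make the overlap visible: each chunk repeats the last 2 characters
# of the previous one.
demo = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2)
print(demo)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

Each window starts `chunk_size - chunk_overlap` characters after the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when it is shorter than the overlap.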

Step 4 — Create Embeddings with Ollama

Embeddings are high-dimensional vector representations of text. Semantically similar passages end up close together in vector space, which is what enables semantic search. mxbai-embed-large produces state-of-the-art embeddings entirely on your local machine.

ingest.py (continued)
from langchain_ollama import OllamaEmbeddings

embeddings = OllamaEmbeddings(
    model="mxbai-embed-large",
    base_url="http://localhost:11434"
)

# Quick test — verify the embedding model is running
sample_vector = embeddings.embed_query("Test query")
print(f"Embedding dimension: {len(sample_vector)}")

Why local embeddings matter: If you use OpenAI embeddings to index your documents, you are sending every document chunk to an external API, which defeats the purpose of a private agent. mxbai-embed-large keeps the whole embedding step offline while remaining strong enough for document retrieval.
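"Close together in vector space" is usually measured with cosine similarity. The sketch below computes it by hand on three made-up 3-dimensional vectors. The values are hypothetical and chosen only for illustration; real embeddings have far more dimensions.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made "embeddings" (hypothetical values, not produced by any model).
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

related = cosine_similarity(cat, kitten)
unrelated = cosine_similarity(cat, invoice)
print(f"cat vs kitten:  {related:.3f}")
print(f"cat vs invoice: {unrelated:.3f}")
```

The related pair scores close to 1 and the unrelated pair close to 0, which is the property semantic search relies on.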

Step 5 — Store Vectors in ChromaDB

ChromaDB is an embedded vector database that runs as a library inside your Python process. No separate database server is needed. It persists to disk so you only run the ingestion step once.

ingest.py (continued)
from langchain_community.vectorstores import Chroma

# Persist to disk so re-ingestion is unnecessary
db = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db"
)

# Chroma 0.4+ writes to persist_directory automatically, so no explicit persist() call is needed
print("Vector store saved to ./chroma_db")

ChromaDB stores both the raw text chunks and their vector representations. At query time it compares the query embedding against the stored vectors through an HNSW index and returns the top-k nearest chunks. The default distance metric is squared L2 (Euclidean); cosine distance can be configured when the collection is created. This is the core retrieval mechanism behind RAG.
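Top-k retrieval itself is simple to sketch as a brute-force scan. The example below ranks stored vectors by squared L2 distance and also checks a useful identity: for unit-length vectors, squared L2 distance equals 2 minus twice the cosine similarity, so both metrics give the same ranking. The `top_k` helper and the 2-dimensional vectors are hypothetical, and a real vector store uses an index rather than scanning every vector.

```python
import math

def normalize(v: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

def squared_l2(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def top_k(query: list[float], stored: dict[str, list[float]], k: int) -> list[str]:
    """Return the ids of the k stored vectors nearest to the query (brute force)."""
    return sorted(stored, key=lambda doc_id: squared_l2(query, stored[doc_id]))[:k]

stored = {
    "doc-a": normalize([1.0, 0.0]),
    "doc-b": normalize([1.0, 1.0]),
    "doc-c": normalize([0.0, 1.0]),
}
query = normalize([0.9, 0.1])

nearest = top_k(query, stored, k=2)
print(nearest)  # ['doc-a', 'doc-b']

# For unit vectors: ||q - v||^2 == 2 - 2 * cos(q, v)
for vec in stored.values():
    cosine = sum(x * y for x, y in zip(query, vec))
    assert abs(squared_l2(query, vec) - (2 - 2 * cosine)) < 1e-9
```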

Step 6 — Build the Retrieval Chain

The retrieval chain wires together the local LLM, the vector store retriever, and a prompt template into a single callable pipeline. When you pass a question, LangChain automatically fetches the relevant chunks and constructs the full prompt.

agent.py
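from langchain_ollama import OllamaLLM, OllamaEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.prompts import PromptTemplate

# RetrievalQA is a legacy chain; its import path depends on the LangChain version
try:
    from langchain.chains import RetrievalQA          # LangChain 0.x
except ImportError:
    from langchain_classic.chains import RetrievalQA  # LangChain 1.x (assumes langchain-classic is installed)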

# ── Load the persisted vector store ──
embeddings = OllamaEmbeddings(model="mxbai-embed-large")

db = Chroma(
    persist_directory="./chroma_db",
    embedding_function=embeddings
)

retriever = db.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 4}        # retrieve top 4 most relevant chunks
)

# ── Initialise the local LLM ──
llm = OllamaLLM(
    model="llama3",
    temperature=0.1,
    base_url="http://localhost:11434"
)

# ── Custom prompt template ──
prompt_template = PromptTemplate(
    input_variables=["context", "question"],
    template="""You are a knowledgeable assistant. Use the context below to answer the question accurately.
If the answer is not found in the context, say "I don't have enough information."

Context:
{context}

Question: {question}

Answer:"""
)

# ── Assemble the chain ──
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    chain_type_kwargs={"prompt": prompt_template},
    return_source_documents=True
)

The stuff chain type concatenates all retrieved documents into a single prompt. For retrieval sets too large for the model's context window, consider the map_reduce chain type, which processes each chunk separately and then combines the partial answers, or refine, which updates a running answer one chunk at a time.
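A rough budget check can tell you when stuffing is safe. The sketch below rests on loud assumptions: about 4 characters per token, an 8192-token context window, 200 tokens of prompt overhead, and 1024 tokens reserved for the answer. None of these numbers come from LangChain or Ollama, so adjust them to the model and settings you actually run.

```python
def estimate_tokens(text: str) -> int:
    """Very rough token estimate: about 4 characters per token (an assumption)."""
    return max(1, len(text) // 4)

def choose_chain_type(
    chunks: list[str],
    prompt_overhead_tokens: int = 200,   # assumed size of the instructions + question
    context_window: int = 8192,          # assumed context size; check your model settings
    reserve_for_answer: int = 1024,      # assumed room left for the generated answer
) -> str:
    """Suggest 'stuff' when all chunks fit in one prompt, else 'map_reduce'."""
    needed = prompt_overhead_tokens + sum(estimate_tokens(c) for c in chunks)
    budget = context_window - reserve_for_answer
    return "stuff" if needed <= budget else "map_reduce"

small_set = ["x" * 512] * 4     # 4 chunks of 512 characters, as retrieved with k=4
large_set = ["x" * 512] * 400   # an unusually large retrieval set

print(choose_chain_type(small_set))  # stuff
print(choose_chain_type(large_set))  # map_reduce
```

With four 512-character chunks the prompt is far below the budget, which is why the stuff chain is a sensible default for this guide's settings.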

Step 7 — Run the AI Agent

With the retrieval chain assembled, add an interactive loop to your agent script so users can query the knowledge base from the terminal.

agent.py (continued)
def run_agent():
    print("\n╔══════════════════════════════════╗")
    print("║  Local AI Agent  |  Ollama + RAG  ║")
    print("╚══════════════════════════════════╝")
    print("Type 'exit' to quit.\n")

    while True:
        query = input("Ask your agent: ").strip()

        if query.lower() in ("exit", "quit"):
            print("Shutting down agent.")
            break

        if not query:
            continue

        result = qa_chain.invoke({"query": query})
        print(f"\nAnswer: {result['result']}")

        # Optionally show which source documents were used (sorted for stable output)
        sources = sorted({doc.metadata.get("source", "Unknown")
                          for doc in result["source_documents"]})
        print(f"Sources: {', '.join(sources)}\n")

if __name__ == "__main__":
    run_agent()

Running the Agent

First run ingestion to build the vector store:

Terminal
python ingest.py

Then launch the interactive agent:

Terminal
python agent.py

Example session:

Output
╔══════════════════════════════════╗
║  Local AI Agent  |  Ollama + RAG  ║
╚══════════════════════════════════╝
Type 'exit' to quit.

Ask your agent: What are the key benefits of RAG pipelines?

Answer: RAG pipelines improve LLM accuracy by grounding responses in
retrieved factual context, reducing hallucination and providing
up-to-date information beyond the model's training cutoff.

Sources: data/rag_overview.txt

Example Use Cases for Local AI Agents

Document Q&A

Query internal PDFs, wikis, and runbooks without exposing proprietary content.

Coding Assistant

Feed a codebase as documents; ask architecture and debugging questions locally.

Research Assistant

Index papers and reports for semantic search and automatic summarisation.

Customer Support Bot

Power an offline support chatbot from your product documentation.

Knowledge Base Chat

Serve employees with a private AI across internal company data.

Local Automation Agent

Combine with tool-calling to automate workflows without cloud dependencies.

Advantages and Limitations

Advantages

Complete Privacy

No data leaves the machine. Critical for healthcare, legal, and financial data.

Zero API Cost

No per-token billing. Run millions of queries at fixed hardware cost.

Offline Capable

Once models are pulled, the agent runs with no internet connection required.

Fully Customisable

Swap models, prompts, retrievers, and tools without API contract constraints.

Limitations

Hardware Requirements

8B parameter models need 8–16 GB RAM. Larger models require more VRAM.

Slower Inference

CPU inference is significantly slower than GPU-backed cloud APIs.

Model Quality Gap

Open-source 8B models are capable but don't yet match GPT-4 on complex reasoning.

Manual Updates

You are responsible for pulling updated model versions and re-ingesting documents.

Advanced Improvements

Once the basic pipeline is running, several upgrades materially improve agent quality:

LangGraph Multi-Step Agents

Replace the simple RetrievalQA chain with a LangGraph stateful agent that can call multiple tools in sequence — web search, code execution, file writing — before producing a final answer. This is the foundation of modern agentic orchestration.

Tool Calling and Function Execution

Ollama supports function calling on compatible models. You can give the agent tools like a Python REPL, a web scraper, or a terminal executor, enabling it to act rather than just retrieve. Explore this pattern in our Openclaw Claude Code setup guide.
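On the Python side, tool calling comes down to mapping a tool name chosen by the model onto a real function. The sketch below shows only that dispatch step. The tool-call shape `{"function": {"name": ..., "arguments": {...}}}` is modelled on the entries Ollama's chat API returns, but treat it as an assumption to verify against your Ollama version; `add`, `word_count`, `TOOLS`, and `dispatch` are hypothetical names.

```python
from typing import Any, Callable

def add(a: float, b: float) -> float:
    """Example tool: add two numbers."""
    return a + b

def word_count(text: str) -> int:
    """Example tool: count whitespace-separated words."""
    return len(text.split())

# Registry of functions the agent is allowed to call.
TOOLS: dict[str, Callable[..., Any]] = {"add": add, "word_count": word_count}

def dispatch(tool_call: dict) -> Any:
    """Look up the requested tool and call it with the model-supplied arguments."""
    spec = tool_call["function"]
    name = spec["name"]
    if name not in TOOLS:
        raise KeyError(f"Unknown tool: {name}")
    return TOOLS[name](**spec.get("arguments", {}))

call = {"function": {"name": "add", "arguments": {"a": 2, "b": 3}}}
print(dispatch(call))  # 5
```

An explicit registry means the model can only trigger functions you have deliberately exposed, which matters once tools include a REPL or a terminal executor.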

Hybrid Search (BM25 + Vector)

Combine dense vector search with sparse BM25 keyword search using LangChain's EnsembleRetriever. Hybrid search outperforms pure vector search on queries containing specific terms like product names or error codes.
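One common way to merge a keyword ranking with a vector ranking is weighted Reciprocal Rank Fusion (RRF). The sketch below is a standalone illustration of that technique, not LangChain's implementation. The constant `c=60` is the value conventionally used with RRF, and the document ids are made up.

```python
from typing import Optional

def reciprocal_rank_fusion(
    rankings: list[list[str]],
    weights: Optional[list[float]] = None,
    c: int = 60,
) -> list[str]:
    """Fuse several ranked lists: each list adds weight / (c + rank) to a document's score."""
    weights = weights or [1.0] * len(rankings)
    scores: dict[str, float] = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (c + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

bm25_ranking = ["err-502", "install", "faq"]        # keyword search favours the exact error code
vector_ranking = ["install", "network", "err-502"]  # semantic search favours related topics

fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
print(fused)  # ['install', 'err-502', 'network', 'faq']
```

Documents that appear high in both lists rise to the top, while a document found by only one retriever still makes the fused list.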

Persistent Conversation Memory

Add a ConversationBufferMemory or ConversationSummaryMemory to maintain multi-turn dialogue context across queries, enabling the agent to answer follow-up questions coherently.
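The idea behind buffer-style memory fits in a few lines: keep the last few turns and prepend them to the next prompt. The sketch below is a hand-rolled illustration, not one of LangChain's memory classes. `WindowMemory` and its methods are hypothetical names.

```python
from collections import deque

class WindowMemory:
    """Keep only the most recent question/answer turns."""

    def __init__(self, max_turns: int = 3):
        # deque(maxlen=...) silently drops the oldest turn when full
        self.turns: deque = deque(maxlen=max_turns)

    def add(self, question: str, answer: str) -> None:
        self.turns.append((question, answer))

    def as_prompt_prefix(self) -> str:
        """Render the stored turns as text to prepend to the next prompt."""
        return "\n".join(f"User: {q}\nAssistant: {a}" for q, a in self.turns)

memory = WindowMemory(max_turns=2)
memory.add("What is RAG?", "Retrieval-Augmented Generation.")
memory.add("Who serves the model?", "Ollama.")
memory.add("Where are vectors stored?", "ChromaDB.")

prefix = memory.as_prompt_prefix()
print(prefix)
```

Only the two most recent turns survive, which keeps the prompt short. A summarising memory trades that hard cutoff for a compressed description of older turns.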

Open-Source AI IDEs as Development Interfaces

Consider using an AI-native IDE to develop against your local agent. Our open-source AI IDEs guide for 2026 covers the best options for local agentic development workflows.

Common Errors and Fixes

Ollama connection refused

Ollama must be running as a background process. Start it with ollama serve in a separate terminal, then retry. On Windows, check that the Ollama system tray application is running. For detailed setup errors on Windows see the Openclaw Windows setup errors fix guide.
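A small pre-flight check can turn a stack trace into a clear message. The sketch below queries Ollama's `/api/tags` endpoint, which lists pulled models, and returns `None` when the server cannot be reached. The function name `list_local_models`, the 3-second timeout, and the injectable `opener` parameter (used here to simulate a refused connection) are assumptions made for this example.

```python
import json
import urllib.error
import urllib.request
from typing import Callable, Optional

def list_local_models(
    base_url: str = "http://localhost:11434",
    opener: Callable = urllib.request.urlopen,
    timeout: float = 3.0,
) -> Optional[list[str]]:
    """Return the names of pulled models, or None if the server is unreachable."""
    try:
        with opener(f"{base_url}/api/tags", timeout=timeout) as resp:
            data = json.loads(resp.read())
    except (urllib.error.URLError, OSError):
        return None
    return [model["name"] for model in data.get("models", [])]

# Stand-in opener that simulates "connection refused" without touching the network.
def refusing_opener(url: str, timeout: float):
    raise urllib.error.URLError("connection refused")

print(list_local_models(opener=refusing_opener))  # None
```

Calling `list_local_models()` with the defaults at the top of agent.py lets you print "start Ollama with ollama serve" when the result is `None`, or warn when `llama3` is missing from the returned list.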

ChromaDB migration errors

If you upgrade ChromaDB versions, existing persistence directories may be incompatible. Delete ./chroma_db and re-run python ingest.py to rebuild.

Out of memory during inference

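The default llama3 tag is already a 4-bit quantised build, so the largest savings come from switching to a smaller model such as phi3 (ollama pull phi3) or closing other memory-heavy applications. If you pulled a higher-precision variant, a 4-bit build such as llama3:8b-instruct-q4_K_M needs roughly half the memory of an 8-bit build for its weights, usually with modest accuracy loss.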
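Back-of-the-envelope arithmetic helps when choosing a quantisation level: weight memory is roughly parameters times bits per weight divided by 8. The sketch below applies that formula with nominal bit widths, which is an assumption: real quantised files use slightly more bits per weight, and inference also needs memory for the context cache and runtime overhead.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone, in gigabytes (10^9 bytes)."""
    return params_billions * bits_per_weight / 8

for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"8B model, {label}: ~{weight_memory_gb(8, bits):.1f} GB")
# 8B model, 16-bit: ~16.0 GB
# 8B model, 8-bit: ~8.0 GB
# 8B model, 4-bit: ~4.0 GB
```

This is why a 4-bit 8B model fits on an 8 GB machine with little headroom, while 16 GB is more comfortable.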

Frequently Asked Questions

What is RAG in AI?

RAG (Retrieval-Augmented Generation) is an architectural pattern where an LLM retrieves relevant documents from an external knowledge base before generating a response. This grounds the model's answers in factual, up-to-date information and dramatically reduces hallucination.

Can AI agents run fully locally?

Yes. With Ollama handling local model inference and ChromaDB providing a local vector store, the entire pipeline — from embedding to generation — runs on your own hardware with zero external network requests.

Is LangChain required for a local RAG agent?

Not strictly required, but it significantly reduces boilerplate. You could build the retrieval chain manually using the Ollama Python SDK and ChromaDB directly, but LangChain's abstractions for document loading, text splitting, and chain composition save substantial development time.

Which Ollama model is best for RAG?

For most document Q&A tasks, llama3:8b is a good balance of quality and performance on consumer hardware. mistral:7b is a strong alternative that is slightly smaller. For embeddings, this guide uses mxbai-embed-large; nomic-embed-text is a lighter alternative worth benchmarking on your own documents.

How do I add PDF support to the knowledge base?

Replace TextLoader with PyPDFLoader from langchain_community.document_loaders. Use glob="**/*.pdf" in the DirectoryLoader. Ensure pypdf is installed via pip.

Can I use a different vector database?

Yes. LangChain supports FAISS, Qdrant, Weaviate, Milvus, and many others through a common vector store interface, so the retriever code stays largely the same. ChromaDB is the simplest starting point because it requires no external server process.

Official Resources

LangChain Documentation

Chains, retrievers, agents, and all integrations

Ollama GitHub

Source, model library, and API reference

ChromaDB Docs

Persistence, collections, and querying guide

mxbai-embed-large

Embedding model card and benchmarks

Conclusion

Building a local AI agent using Python, Ollama, LangChain, and RAG is one of the most practical things you can do as a developer in 2026. You get a production-capable AI assistant that is fully private, costs nothing to run, works offline, and—crucially—is entirely under your control.

The stack covered here — Llama 3 via Ollama, ChromaDB for vector persistence, and LangChain for orchestration — is battle-tested and actively maintained. Start with the seven steps in this guide, get a basic RAG pipeline running against your own documents, and then extend it with LangGraph, tool calling, and multi-agent coordination as your requirements grow.

The shift from cloud-dependent AI to sovereign, on-premise inference is already underway. Local AI agents are not a curiosity — they're a foundational pattern for the next generation of privacy-first software. Now you know how to build one.