How to Build a Local AI Agent with Python & Ollama

A complete guide to building a fully local, privacy-first AI agent using open-source tools: no cloud APIs, no sending your data anywhere, zero recurring cost.
Quick Answer
To build a local AI agent with Python, Ollama, LangChain, and RAG:
- Install Ollama and pull a local LLM model
- Set up a Python virtual environment and install dependencies
- Load and chunk documents for your knowledge base
- Generate embeddings using a local embedding model
- Store vectors in ChromaDB (local vector database)
- Build a retrieval chain with LangChain
- Run your AI agent and query it locally
Why Run Your AI Agent Locally?
Developers who rely exclusively on cloud AI APIs — OpenAI, Anthropic Claude, Gemini — face a trio of real problems: recurring API costs, data privacy risks, and hard dependency on external infrastructure. The moment the API goes down or pricing changes, your application breaks or gets expensive.
There is a better approach: build a local AI agent that runs entirely on your machine. With Ollama serving open-source LLMs and LangChain orchestrating the retrieval pipeline, you can build a powerful, private AI assistant that:
- Processes and retrieves from your own documents
- Generates answers using a locally running large language model
- Never sends any data to an external server
- Costs exactly zero dollars beyond the hardware you already own
This guide walks you through every step of the process — from installing Ollama to running a full Retrieval-Augmented Generation (RAG) pipeline. If you are evaluating open-source models for this pipeline, check our guide to the best open-source LLMs for coding in 2026.
What Is a Local AI Agent?
A local AI agent is an application that uses a locally executed large language model to reason about information, retrieve context from a knowledge base, and respond to user queries — all without issuing a single network request to a cloud provider.
Three concepts underpin every local AI agent:
Large Language Model (LLM)
A neural network trained on massive text corpora that generates coherent, contextually relevant responses. Models like Llama 3, Mistral, and Qwen 2.5 are production-quality and can run on consumer hardware.
Retrieval-Augmented Generation (RAG)
RAG is an architectural pattern that injects relevant documents into the LLM's context window before it generates a response. Instead of relying solely on the model's training data, the agent first retrieves the most relevant chunks from your private knowledge base, then generates a grounded answer.
Agentic Orchestration (LangChain)
LangChain connects the retriever, the prompt template, the LLM, and the output parser into a coherent chain. It handles chunking strategies, retrieval logic, and chain composition so you can build complex workflows with minimal boilerplate.
Tech Stack Overview
Every component in this stack is open-source and runs completely offline once models are downloaded.
| Tool | Role | Why This Choice |
|---|---|---|
| Python 3.11+ | Primary language | Best ecosystem for AI/ML tooling |
| Ollama | Local LLM runtime | One-command model downloads, OpenAI-compatible API |
| LangChain | Agent orchestration | Document loaders, splitters, retrievers, chains |
| ChromaDB | Vector database | Embedded, no server required, fast similarity search |
| mxbai-embed-large | Embedding model | State-of-the-art local embeddings via Ollama |
System Architecture
Before writing any code, it helps to understand how data flows through the agent at runtime.
At query time the agent embeds the user's question, searches ChromaDB for the top-k most similar document chunks, injects those chunks into the prompt, and passes the full prompt to the local LLM for generation. The entire inference loop takes place on your machine.
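The query-time flow described above can be sketched in plain Python. This is a toy illustration, not the pipeline built later in this guide: `embed` is a bag-of-words stand-in for a real embedding model, `fake_llm` is a placeholder for the local LLM, and all function names are hypothetical.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Step 1 and 2: embed the question, then rank chunks by similarity."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Step 3: inject the retrieved chunks into the prompt."""
    context = "\n\n".join(context_chunks)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

def fake_llm(prompt: str) -> str:
    """Step 4 placeholder: a real agent would send the prompt to the local model."""
    return f"(model would answer here; prompt had {len(prompt)} characters)"

chunks = [
    "Ollama serves open-source models on localhost.",
    "ChromaDB stores document vectors on disk.",
    "Bananas are rich in potassium.",
]
question = "Where does ChromaDB store vectors?"
top = retrieve(question, chunks, k=2)
prompt = build_prompt(question, top)
print(fake_llm(prompt))
```

In the real pipeline each of these four functions is replaced by a component introduced in the steps below: the Ollama embedding model, the ChromaDB retriever, the prompt template, and the local LLM.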
Prerequisites
- Python 3.9 or higher (3.11+ recommended)
- 8 GB RAM minimum — 16 GB for comfortable Llama 3 8B inference
- Ollama installed (see Step 1)
- Basic familiarity with the Python command line
- A directory of text files, PDFs, or Markdown documents as your knowledge base
Pro Tip: GPU acceleration is not required but dramatically speeds up inference. Ollama automatically offloads layers to a CUDA or Metal GPU if one is detected. Even on CPU-only machines, Llama 3 8B delivers usable response times for document Q&A workloads.
Step 1 — Install Ollama and Download Models
Ollama manages the entire local model lifecycle (download, quantization, and serving) behind a minimal CLI. It exposes a REST API at http://localhost:11434, including OpenAI-compatible endpoints under /v1.
Linux
curl -fsSL https://ollama.com/install.sh | sh
macOS / Windows
The install script above targets Linux only. On macOS and Windows, download the installer from ollama.com/download and run it. Ollama will add itself to your PATH automatically.
Pull the Models
# Main language model (LLM) — Llama 3 8B
ollama pull llama3
# Embedding model for vector search
ollama pull mxbai-embed-large
The LLM handles generation. The embedding model converts text chunks and queries into dense vector representations so that ChromaDB can perform semantic similarity search.
Alternative models: mistral, qwen2.5:7b, and phi3 are options of similar or smaller size that also work well for document Q&A. Compare open-source model performance in our open-source LLMs for coding guide.
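Once a model is pulled, the REST API mentioned above can be called without any extra packages. The following is a minimal sketch using Ollama's native /api/generate endpoint and only the standard library; the helper names `build_generate_request` and `generate` are hypothetical, and the call assumes the server is running at the default address with llama3 available.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default address stated in this step

def build_generate_request(prompt: str, model: str = "llama3") -> urllib.request.Request:
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def generate(prompt: str, model: str = "llama3", timeout: float = 120.0) -> str:
    """Send the request and return the generated text (requires a running server)."""
    request = build_generate_request(prompt, model)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]

# Example (only works while Ollama is running):
# print(generate("Reply with the single word: ready"))
```

Setting "stream" to False asks for one JSON object instead of a stream of partial responses, which keeps the parsing to a single json.loads call.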
Step 2 — Set Up the Python Environment
Isolate dependencies in a virtual environment to avoid conflicts with other projects.
# Create and activate virtual environment
python -m venv venv
# Linux / macOS
source venv/bin/activate
# Windows
.\venv\Scripts\activate
Now install the required packages:
pip install langchain langchain-ollama langchain-community chromadb
pip install tiktoken unstructured pypdf
| Package | Purpose |
|---|---|
| langchain | Core chain and retriever abstractions |
| langchain-ollama | Ollama LLM and embeddings integration |
| langchain-community | Document loaders (PDF, Markdown, etc.) |
| chromadb | Local embedded vector database |
| tiktoken | Token counting for chunk sizing |
| pypdf | PDF parsing for knowledge base ingestion |
Set up the project directory structure:
mkdir -p local-ai-agent/data
cd local-ai-agent
touch ingest.py agent.py
Step 3 — Load Documents for RAG
Your agent's knowledge comes from documents you provide. Drop any text files, PDFs, or Markdown files inside the data/ directory. LangChain's document loaders convert them into a list of Document objects that can be embedded and stored.
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load all .txt files from the data directory (recursively).
# For Markdown files, run a second loader with glob="**/*.md".
loader = DirectoryLoader(
    "./data",
    glob="**/*.txt",
    loader_cls=TextLoader,
    loader_kwargs={"encoding": "utf-8"},
)
documents = loader.load()
print(f"Loaded {len(documents)} documents")

# Split into overlapping chunks for better retrieval accuracy.
# chunk_size and chunk_overlap are measured in characters by default.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64,
)
chunks = splitter.split_documents(documents)
print(f"Created {len(chunks)} text chunks")
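The chunk size (512 characters, since RecursiveCharacterTextSplitter counts characters by default rather than tokens) controls how much text each vector represents. Smaller chunks improve retrieval precision; larger chunks preserve more surrounding context. The overlap (64 characters) repeats the end of each chunk at the start of the next, so a sentence that straddles a boundary still appears whole in at least one chunk.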
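To make the interaction between chunk size and overlap concrete, here is a simplified fixed-window sketch. It is not how RecursiveCharacterTextSplitter works internally (that splitter prefers paragraph and sentence separators); the function name `chunk_text` is hypothetical, and lengths are counted in characters with `len`, which is also the splitter's default length function.

```python
def chunk_text(text: str, chunk_size: int = 512, chunk_overlap: int = 64) -> list[str]:
    """Split text into fixed-size windows where consecutive windows share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
        start += step
    return chunks

sample = "".join(str(i % 10) for i in range(1000))
pieces = chunk_text(sample)
print(len(pieces), [len(p) for p in pieces])
```

With the defaults, each window starts 448 characters after the previous one, so the last 64 characters of one chunk are the first 64 characters of the next.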
Step 4 — Create Embeddings with Ollama
Embeddings are high-dimensional vector representations of text. Semantically similar passages end up close together in vector space, which is what enables semantic search. mxbai-embed-large produces state-of-the-art embeddings entirely on your local machine.
from langchain_ollama import OllamaEmbeddings
embeddings = OllamaEmbeddings(
model="mxbai-embed-large",
base_url="http://localhost:11434"
)
# Quick test — verify the embedding model is running
sample_vector = embeddings.embed_query("Test query")
print(f"Embedding dimension: {len(sample_vector)}")
Why local embeddings matter: If you use OpenAI embeddings to index your documents, you are sending every document chunk to an external API, which defeats the purpose of a private agent. mxbai-embed-large offers quality competitive with cloud embedding models while staying entirely offline.
Step 5 — Store Vectors in ChromaDB
ChromaDB is an embedded vector database that runs as a library inside your Python process. No separate database server is needed. It persists to disk so you only run the ingestion step once.
from langchain_community.vectorstores import Chroma

# Passing persist_directory writes the collection to disk,
# so ingestion only needs to run once.
db = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db",
)
# Chroma 0.4+ persists automatically, so no explicit db.persist() call is needed.
print("Vector store saved to ./chroma_db")
ChromaDB stores both the raw text chunks and their vector representations. At query time it uses an approximate nearest-neighbour index to find the stored vectors closest to the query embedding (squared L2 distance by default; cosine distance can be configured per collection) and returns the top-k most relevant chunks. This is the core retrieval mechanism behind RAG.
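The retrieval step can be illustrated with a brute-force sketch over toy three-dimensional vectors. Which metric a collection actually uses depends on its configuration, so the example shows two common ones; the vectors, chunk names, and the helper `top_k` are made up for illustration, and a real vector database uses an index rather than scanning every vector.

```python
import math

def squared_l2(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance: smaller means closer."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 minus cosine similarity: smaller means more similar in direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def top_k(query, stored, k=2, distance=squared_l2):
    """Return the names of the k stored vectors closest to the query."""
    ranked = sorted(stored.items(), key=lambda item: distance(query, item[1]))
    return [name for name, _ in ranked[:k]]

stored = {
    "chunk_a": [0.9, 0.1, 0.0],
    "chunk_b": [0.0, 1.0, 0.0],
    "chunk_c": [0.7, 0.3, 0.1],
}
query = [1.0, 0.0, 0.0]
print(top_k(query, stored, k=2))
print(top_k(query, stored, k=2, distance=cosine_distance))
```

On this toy data both metrics rank the chunks in the same order, but in general they can disagree when vectors differ in length, which is one reason the metric is a per-collection setting.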
Step 6 — Build the Retrieval Chain
The retrieval chain wires together the local LLM, the vector store retriever, and a prompt template into a single callable pipeline. When you pass a question, LangChain automatically fetches the relevant chunks and constructs the full prompt.
from langchain_ollama import OllamaLLM, OllamaEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
# ── Load the persisted vector store ──
embeddings = OllamaEmbeddings(model="mxbai-embed-large")
db = Chroma(
persist_directory="./chroma_db",
embedding_function=embeddings
)
retriever = db.as_retriever(
search_type="similarity",
search_kwargs={"k": 4} # retrieve top 4 most relevant chunks
)
# ── Initialise the local LLM ──
llm = OllamaLLM(
model="llama3",
temperature=0.1,
base_url="http://localhost:11434"
)
# ── Custom prompt template ──
prompt_template = PromptTemplate(
input_variables=["context", "question"],
template="""You are a knowledgeable assistant. Use the context below to answer the question accurately.
If the answer is not found in the context, say "I don't have enough information."
Context:
{context}
Question: {question}
Answer:"""
)
# ── Assemble the chain ──
qa_chain = RetrievalQA.from_chain_type(
llm=llm,
chain_type="stuff",
retriever=retriever,
chain_type_kwargs={"prompt": prompt_template},
return_source_documents=True
)
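The stuff chain type concatenates all retrieved documents into a single prompt. For very large retrieval sets, consider the map_reduce chain type, which processes each chunk separately before combining the results, or the refine chain type, which updates a running answer one chunk at a time. Both trade extra LLM calls for the ability to handle more text than fits in one context window.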
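What "stuffing" amounts to can be shown without LangChain: the retrieved chunks are joined into one context string and substituted into the template. This is a sketch of the idea only; the function name `stuff_prompt` is hypothetical and the blank-line separator between chunks is an assumption made for illustration.

```python
TEMPLATE = """You are a knowledgeable assistant. Use the context below to answer the question accurately.
If the answer is not found in the context, say "I don't have enough information."

Context:
{context}

Question: {question}
Answer:"""

def stuff_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Join every retrieved chunk into one context block and fill the template."""
    context = "\n\n".join(retrieved_chunks)
    return TEMPLATE.format(context=context, question=question)

example = stuff_prompt(
    "What port does Ollama use?",
    ["Ollama listens on localhost:11434.", "ChromaDB persists to ./chroma_db."],
)
print(example)
```

Because every chunk lands in a single prompt, the total length grows with k and with the chunk size, which is why this approach stops working once the retrieved text exceeds the model's context window.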
Step 7 — Run the AI Agent
With the retrieval chain assembled, add an interactive loop to your agent script so users can query the knowledge base from the terminal.
def run_agent():
    print("\n╔══════════════════════════════════╗")
    print("║ Local AI Agent | Ollama + RAG ║")
    print("╚══════════════════════════════════╝")
    print("Type 'exit' to quit.\n")
    while True:
        query = input("Ask your agent: ").strip()
        if query.lower() in ("exit", "quit"):
            print("Shutting down agent.")
            break
        if not query:
            continue
        result = qa_chain.invoke({"query": query})
        print(f"\nAnswer: {result['result']}")
        # Optionally show which source documents were used
        sources = {doc.metadata.get("source", "Unknown")
                   for doc in result["source_documents"]}
        print(f"Sources: {', '.join(sources)}\n")

if __name__ == "__main__":
    run_agent()
Running the Agent
First run ingestion to build the vector store:
python ingest.py
Then launch the interactive agent:
python agent.py
Example session:
╔══════════════════════════════════╗
║ Local AI Agent | Ollama + RAG ║
╚══════════════════════════════════╝
Type 'exit' to quit.
Ask your agent: What are the key benefits of RAG pipelines?
Answer: RAG pipelines improve LLM accuracy by grounding responses in
retrieved factual context, reducing hallucination and providing
up-to-date information beyond the model's training cutoff.
Sources: data/rag_overview.txt
Example Use Cases for Local AI Agents
- Query internal PDFs, wikis, and runbooks without exposing proprietary content.
- Feed a codebase as documents; ask architecture and debugging questions locally.
- Index papers and reports for semantic search and automatic summarisation.
- Power an offline support chatbot from your product documentation.
- Serve employees with a private AI across internal company data.
- Combine with tool-calling to automate workflows without cloud dependencies.
Advantages and Limitations
Advantages
- No data leaves the machine. Critical for healthcare, legal, and financial data.
- No per-token billing. Run millions of queries at fixed hardware cost.
- Once models are pulled, the agent runs with no internet connection required.
- Swap models, prompts, retrievers, and tools without API contract constraints.
Limitations
- 8B parameter models need 8–16 GB RAM. Larger models require more VRAM.
- CPU inference is significantly slower than GPU-backed cloud APIs.
- Open-source 8B models are capable but don't yet match GPT-4 on complex reasoning.
- You are responsible for pulling updated model versions and re-ingesting documents.
Advanced Improvements
Once the basic pipeline is running, several upgrades materially improve agent quality:
LangGraph Multi-Step Agents
Replace the simple RetrievalQA chain with a LangGraph stateful agent that can call multiple tools in sequence (web search, code execution, file writing) before producing a final answer. This is the foundation of modern agentic orchestration.
Tool Calling and Function Execution
Ollama supports function calling on compatible models. You can give the agent tools like a Python REPL, a web scraper, or a terminal executor, enabling it to act rather than just retrieve. Explore this pattern in our Openclaw Claude Code setup guide.
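The "act rather than just retrieve" part comes down to a dispatch step: the model emits a structured request naming a tool and its arguments, and your code runs the matching function. The sketch below shows only that dispatch step. The JSON shape, the tool names, and the `dispatch` helper are hypothetical and are not Ollama's or LangChain's actual tool-calling format.

```python
import json

def word_count(text: str) -> int:
    """Example tool: count whitespace-separated words."""
    return len(text.split())

def add(a: float, b: float) -> float:
    """Example tool: add two numbers."""
    return a + b

# Registry mapping tool names the model may request to Python callables.
TOOLS = {"word_count": word_count, "add": add}

def dispatch(tool_call_json: str):
    """Parse a tool call of the assumed form {"name": ..., "arguments": {...}} and run it."""
    call = json.loads(tool_call_json)
    name = call["name"]
    if name not in TOOLS:
        raise ValueError(f"Unknown tool: {name}")
    return TOOLS[name](**call.get("arguments", {}))

print(dispatch('{"name": "add", "arguments": {"a": 2, "b": 3}}'))
```

Restricting execution to an explicit registry matters more as tools get more powerful: a Python REPL or terminal executor should only ever run through a path you control.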
Hybrid Search (BM25 + Vector)
Combine dense vector search with sparse BM25 keyword search using LangChain's EnsembleRetriever. Hybrid search typically outperforms pure vector search on queries containing specific terms like product names or error codes, because exact keyword matches are not diluted by semantic similarity.
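Merging a keyword ranking with a vector ranking needs a fusion rule. One common choice is reciprocal rank fusion (RRF), sketched here in plain Python; the document IDs are invented, the constant k=60 is a conventional default rather than anything this guide prescribes, and the exact fusion logic inside EnsembleRetriever may differ.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists: each document scores 1 / (k + rank) per list it appears in."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

keyword_hits = ["err_502", "timeouts", "retries"]   # e.g. from BM25
vector_hits = ["timeouts", "latency", "err_502"]    # e.g. from the vector store
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)
```

Documents that appear in both lists rise to the top, while a document found by only one retriever still survives in the fused result instead of being discarded.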
Persistent Conversation Memory
Add a ConversationBufferMemory or ConversationSummaryMemory to maintain multi-turn dialogue context across queries, enabling the agent to answer follow-up questions coherently.
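The idea behind buffer-style memory is simple enough to sketch by hand: keep the last few question and answer pairs and prepend them to the next prompt. The class below is a hypothetical stand-in for illustration, not LangChain's ConversationBufferMemory, and the `max_turns` cap is an assumption added to keep the prompt bounded.

```python
from collections import deque

class BufferMemory:
    """Keep the most recent turns and render them as a prompt prefix."""

    def __init__(self, max_turns: int = 5):
        # deque with maxlen silently drops the oldest turn when full
        self.turns = deque(maxlen=max_turns)

    def add(self, question: str, answer: str) -> None:
        self.turns.append((question, answer))

    def as_prompt_prefix(self) -> str:
        return "\n".join(f"User: {q}\nAssistant: {a}" for q, a in self.turns)

memory = BufferMemory(max_turns=2)
memory.add("What is RAG?", "Retrieval-Augmented Generation.")
memory.add("Which database does this guide use?", "ChromaDB.")
memory.add("Does it need a server?", "No, it runs in-process.")
print(memory.as_prompt_prefix())
```

A summary-style memory replaces the dropped turns with a short LLM-written summary instead of discarding them, at the cost of an extra model call.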
Open-Source AI IDEs as Development Interfaces
Consider using an AI-native IDE to develop against your local agent. Our open-source AI IDEs guide for 2026 covers the best options for local agentic development workflows.
Common Errors and Fixes
Ollama connection refused
Ollama must be running as a background process. Start it with ollama serve in a separate terminal, then retry. On Windows, check that the Ollama system tray application is running. For detailed setup errors on Windows see the Openclaw Windows setup errors fix guide.
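A quick reachability check before starting the agent turns a stack trace into a clear message. This is a minimal standard-library sketch; the helper name `ollama_is_up` is hypothetical, and it assumes the server answers a plain GET on its base URL with HTTP 200 when it is running.

```python
import urllib.request

def ollama_is_up(base_url: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """Return True if an HTTP server answers at base_url, False on refusal or timeout."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError, connection refused, and timeouts are all OSError subclasses
        return False

if not ollama_is_up():
    print("Ollama is not reachable. Start it with: ollama serve")
```

Calling this at the top of agent.py and ingest.py makes the failure mode obvious before any embedding or generation request is attempted.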
ChromaDB migration errors
If you upgrade ChromaDB versions, existing persistence directories may be incompatible. Delete ./chroma_db and re-run python ingest.py to rebuild.
Out of memory during inference
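Switch to a smaller or more heavily quantised model, for example: ollama pull llama3:8b-instruct-q4_K_M. The q4_K_M quantisation reduces memory usage (RAM or VRAM) by approximately 40% relative to an 8-bit build, with minimal accuracy degradation. If that is still too large, try a smaller model such as phi3.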
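A back-of-the-envelope estimate helps when choosing a quantisation level. The sketch below computes weight memory only, as parameters times bits per weight; it ignores the KV cache and runtime overhead, and real formats such as q4_K_M average somewhat more than their nominal bit width, so treat the numbers as rough lower bounds. The function name is hypothetical.

```python
def estimate_weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights in GB (10^9 bytes), excluding KV cache and overhead."""
    return params_billion * bits_per_weight / 8

for label, bits in [("fp16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>5}: ~{estimate_weights_gb(8, bits):.1f} GB for an 8B model")
```

The arithmetic is just 8 billion weights times the bytes per weight: two bytes each at fp16, one byte at 8-bit, half a byte at 4-bit.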
Frequently Asked Questions
What is RAG in AI?
RAG (Retrieval-Augmented Generation) is an architectural pattern where an LLM retrieves relevant documents from an external knowledge base before generating a response. This grounds the model's answers in factual, up-to-date information and dramatically reduces hallucination.
Can AI agents run fully locally?
Yes. With Ollama handling local model inference and ChromaDB providing a local vector store, the entire pipeline — from embedding to generation — runs on your own hardware with zero external network requests.
Is LangChain required for a local RAG agent?
Not strictly required, but it significantly reduces boilerplate. You could build the retrieval chain manually using the Ollama Python SDK and ChromaDB directly, but LangChain's abstractions for document loading, text splitting, and chain composition save substantial development time.
Which Ollama model is best for RAG?
For most document Q&A tasks, llama3:8b offers the best balance of quality and performance on consumer hardware. mistral:7b is a strong alternative with faster inference. For embedding, mxbai-embed-large consistently outperforms nomic-embed-text on retrieval benchmarks.
How do I add PDF support to the knowledge base?
Replace TextLoader with PyPDFLoader from langchain_community.document_loaders. Use glob="**/*.pdf" in the DirectoryLoader. Ensure pypdf is installed via pip.
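Put together, the change described above might look like the following sketch. The function name `load_pdfs` is hypothetical, and the imports are placed inside the function only so the snippet can be pasted into a file before the packages are installed; it assumes langchain-community and pypdf are available when the function is actually called.

```python
def load_pdfs(data_dir: str = "./data"):
    """Load every PDF under data_dir (recursively) as LangChain Document objects."""
    # Lazy import: requires langchain-community and pypdf at call time.
    from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader

    loader = DirectoryLoader(
        data_dir,
        glob="**/*.pdf",        # match PDFs in all subdirectories
        loader_cls=PyPDFLoader,  # parse each match with pypdf
    )
    return loader.load()

# Example: documents = load_pdfs("./data")
```

The returned list can be passed to the same splitter used for text files in Step 3, so the rest of the ingestion script stays unchanged.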
Can I use a different vector database?
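Yes. LangChain supports FAISS, Qdrant, Weaviate, Milvus, and many others through a common vector store interface, so swapping one in mostly means changing the import and the constructor arguments. ChromaDB is the simplest starting point because it requires no external server process.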
Official Resources
- LangChain documentation: chains, retrievers, agents, and all integrations
- Ollama: source, model library, and API reference
- ChromaDB documentation: persistence, collections, and querying guide
- mxbai-embed-large: embedding model card and benchmarks
Conclusion
Building a local AI agent using Python, Ollama, LangChain, and RAG is one of the most practical things you can do as a developer in 2026. You get a production-capable AI assistant that is fully private, costs nothing to run, works offline, and—crucially—is entirely under your control.
The stack covered here — Llama 3 via Ollama, ChromaDB for vector persistence, and LangChain for orchestration — is battle-tested and actively maintained. Start with the seven steps in this guide, get a basic RAG pipeline running against your own documents, and then extend it with LangGraph, tool calling, and multi-agent coordination as your requirements grow.
The shift from cloud-dependent AI to sovereign, on-premise inference is already underway. Local AI agents are not a curiosity — they're a foundational pattern for the next generation of privacy-first software. Now you know how to build one.