Local RAG

A fully local RAG (Retrieval-Augmented Generation) pipeline does the embedding, the indexing, and the generation on hardware you control. No data leaves your machines. Useful for privacy-sensitive workloads, on-prem deployments, edge / offline scenarios, or just learning the mechanics without API bills.

In 2026, this is a practical option. The quality is competitive (not frontier) and the operational footprint is manageable.

What "local" buys you

No data egress. Documents stay on your hardware; queries don't leave.
Cost predictability. Pay for hardware once; no per-token charges.
Offline capability. Works without internet.
No vendor lock-in. Swap any component independently.
Compliance. Easier data-residency story; no third-party processor.

What it doesn't buy:

Frontier quality. Open LLMs trail commercial frontiers by 6-12 months on hard tasks.
Zero ops. Self-hosted means real ops work.
Burst capacity. Capacity is bounded by your hardware.

The minimum stack

Three components:

Embedding model — locally hosted; embeds queries and document chunks.
Vector store — local index of chunk embeddings.
LLM — locally hosted; generates the answer from retrieved chunks.

Plus retrieval orchestration (the glue), a chunking pipeline, and an optional reranker.

Component choices

Embedding model

Options (all local-friendly):

BGE-small / BGE-base / BGE-large (BAAI) — strong open embedding family. BGE-large is competitive with commercial embedding APIs.
e5-base / e5-large (Microsoft) — strong; well-supported.
gte-large / gte-multilingual — solid, multilingual variants.
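Nomic-embed-v1.5 / v2 — Apache 2.0; v1.5 is long-context (8k tokens), v2 is multilingual.
all-MiniLM-L6-v2 — small (about 22M parameters, roughly 90 MB at FP32); fast; weaker but fine for small corpora.
Jina v3 — multilingual; text-only; the weights carry a non-commercial licence, so check the terms before production use.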

For most use cases: BGE-base or BGE-large is the safe default. Run via the sentence-transformers library. BGE-base is CPU-friendly; BGE-large is noticeably slower without a GPU.

Vector store

pgvector — Postgres extension. Reuses your existing Postgres if you have one. With an HNSW index that fits in memory, queries typically take milliseconds at millions of vectors; tens of millions need careful tuning.
Qdrant — single binary, Rust-based, strong filtering.
LanceDB — embedded; columnar; fits well with Python data workflows.
ChromaDB — Python-native; minimal setup.
Weaviate — fuller-featured; can run locally.
FAISS — bare-bones; the index, no server. Good for embedded use.
Just numpy — for under 100k vectors, in-memory cosine search is fine.

For most local deployments: pgvector or Qdrant.
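
The "just numpy" option above can be sketched as a brute-force search: normalise the vectors once, then one matrix-vector product gives every cosine score. The function names and the toy 3-dimensional vectors are illustrative assumptions; real vectors would come from the embedding model.

```python
import numpy as np

def normalize(vectors: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so a dot product equals cosine similarity."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)

def top_k_cosine(query: np.ndarray, index: np.ndarray, k: int = 5):
    """Return (row, score) pairs for the k rows of `index` closest to `query`.

    `index` must already be row-normalised; `query` is a single 1-D vector.
    """
    if k < 1 or len(index) == 0:
        return []
    q = query / max(float(np.linalg.norm(query)), 1e-12)
    scores = index @ q                          # cosine score for every row
    k = min(k, len(scores))
    top = np.argpartition(-scores, k - 1)[:k]   # unordered top-k, linear time
    top = top[np.argsort(-scores[top])]         # sort only the k winners
    return [(int(i), float(scores[i])) for i in top]

# Toy "embeddings" (assumption): three documents in three dimensions.
docs = normalize(np.array([[1.0, 0.0, 0.0],
                           [0.0, 1.0, 0.0],
                           [0.7, 0.7, 0.0]]))
hits = top_k_cosine(np.array([1.0, 0.1, 0.0]), docs, k=2)
print(hits)
```

Every query scans the whole matrix, which is why this only stays comfortable for small corpora; past that, an approximate index such as HNSW takes over.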

LLM

See OpenSourceLLMs for the full landscape. For RAG specifically:

Smaller models often suffice. RAG provides the knowledge; the LLM just needs to read and reason over the retrieved context. A 7B-13B model with strong context handling works.
Strong long-context matters. Models that handle 32k-128k tokens process larger retrieved contexts.

Options:

Llama 3.1 8B — strong; well-supported.
Mistral Small (24B) — good quality / cost trade.
Qwen 2.5 7B / 14B — strong on reasoning; good for code.
Phi-3.5-mini (3.8B) — small, capable; runs on modest hardware.

For laptop deployment: 7B at int4 quantisation. For workstation: 13B-30B. For dedicated GPU: anything up to 70B at int4 fits on an 80 GB H100 (roughly 35 GB of weights, plus room for the KV cache).
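
The sizing rule behind these tiers is simple arithmetic: parameters times bits per weight, divided by eight. A rough sketch follows; the helper name is hypothetical, and the figure covers weights only. Real int4 formats store extra scale data, and the KV cache and activations need headroom on top.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    return params_billion * bits_per_weight / 8

# Model sizes taken from the tiers above; bit widths are nominal.
for name, params, bits in [
    ("7B int4", 7, 4),
    ("14B int4", 14, 4),
    ("70B int4", 70, 4),
    ("70B fp16", 70, 16),
]:
    print(f"{name}: ~{weight_memory_gb(params, bits):.1f} GB of weights")
```

By this estimate a 70B model needs about 35 GB at int4 and about 140 GB at FP16, which is why the int4 variant is the one that fits on a single 80 GB card.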

Reranker

bge-reranker-base / large — strong open reranker.
ms-marco-MiniLM-L-12-v2 — fast, decent.
Skip the reranker if you can't afford the latency; it's optional.

Hardware sizing

Laptop (16-32 GB RAM, modest GPU or none)

Embedding: BGE-small on CPU. A few to a few dozen chunks per second, depending on chunk length and CPU.
Vector store: ChromaDB or pgvector with a few thousand to ~1M docs.
LLM: 7B int4 via llama.cpp. ~5 tok/s on CPU; ~30 tok/s on Apple Silicon M-series.

Suitable for: personal knowledge base; offline assistant; experimentation.

Workstation (64 GB RAM, RTX 4090 or equivalent)

Embedding: BGE-large on GPU. Hundreds per second.
Vector store: pgvector or Qdrant; tens of millions of docs.
LLM: 13B-30B at int4 via vLLM or llama.cpp. 30-100 tok/s.

Suitable for: small team's internal RAG; production for low-volume use.

Dedicated server (A100/H100, 128+ GB system RAM)

Embedding: any model at speed.
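Vector store: 100M+ docs, provided the index uses quantised or disk-backed vectors (raw float32 at 1024 dimensions is about 4 GB per million vectors).
LLM: 70B at int4 via vLLM on an 80 GB card. 50+ tok/s; can serve concurrent requests.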

Suitable for: production workloads, hundreds of users.
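
The document counts in these tiers depend heavily on vector width. A back-of-envelope estimate for the raw vectors, before any index overhead, is count times dimension times bytes per value. The helper name is hypothetical; the dimensions are the published sizes for BGE-small (384) and BGE-large (1024).

```python
def raw_vector_gb(num_vectors: int, dim: int, bytes_per_value: int = 4) -> float:
    """Memory for the raw vectors alone, in GB (10^9 bytes); index overhead is extra."""
    return num_vectors * dim * bytes_per_value / 1e9

for label, n, dim, width in [
    ("1M x 384 float32 (BGE-small)", 1_000_000, 384, 4),
    ("10M x 1024 float32 (BGE-large)", 10_000_000, 1024, 4),
    ("100M x 1024 float32", 100_000_000, 1024, 4),
    ("100M x 1024 int8", 100_000_000, 1024, 1),
]:
    print(f"{label}: ~{raw_vector_gb(n, dim, width):.1f} GB")
```

Ten million BGE-large vectors come to about 41 GB in float32, already most of a 64 GB workstation. At 100M vectors the raw data is about 410 GB, so that scale implies quantised vectors, a smaller embedding model, or an index that lives partly on disk.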

A concrete recipe

For a "chat with your documents" application:

1. Ingestion
   - Parse documents (text, PDF, HTML)
   - Chunk: 256-512 tokens with 10-20% overlap, on semantic boundaries
   - Embed each chunk via BGE-large
   - Insert into pgvector with metadata (document_id, page, etc.)

2. Indexing
   - HNSW index on embedding column
   - Optional: tsvector column for lexical search (Postgres ts_rank is not BM25; true BM25 needs an extension such as ParadeDB)
   - GIN index on metadata for filters

3. Retrieval
   - Query → embed via BGE-large
   - Top-50 by cosine
   - Optional: BM25 top-50; RRF combine
   - Optional: rerank top-50 → top-10 via BGE-reranker
   - Select top 5-10 chunks to feed LLM

4. Generation
   - System prompt + retrieved chunks + user query
   - LLM (Qwen 2.5 14B or similar) at int4 via vLLM
   - Stream response

5. Citation
   - Track which chunks were retrieved
   - LLM cites by source label
   - UI shows expandable citations
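
The chunking step in stage 1 can be sketched as a sliding window over tokens. This is a minimal version under two assumptions: whitespace splitting stands in for the embedding model's real tokenizer, and semantic boundaries (headings, paragraphs) are ignored. The function name is hypothetical.

```python
def chunk_tokens(tokens: list[str], chunk_size: int = 384, overlap: int = 64) -> list[list[str]]:
    """Split `tokens` into windows of `chunk_size` that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window reached the end; avoid a redundant tail chunk
    return chunks

# Whitespace split is a stand-in tokenizer (assumption), so counts are approximate.
text = "word " * 1000
pieces = chunk_tokens(text.split(), chunk_size=384, overlap=64)
print([len(p) for p in pieces])
```

A production version would first split on paragraph or heading boundaries and only fall back to the fixed window for oversized sections.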

This is the standard pattern. A weekend's work to a functional prototype; weeks to a polished product.
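
The optional hybrid step in stage 3 merges the dense and lexical result lists with reciprocal rank fusion (RRF): each document scores 1 / (k + rank) in every list it appears in, and the scores are summed. A minimal sketch, with k = 60 as the commonly used constant; the function name and chunk ids are illustrative.

```python
from collections import defaultdict

def rrf_fuse(rankings: list[list[str]], k: int = 60, top_n: int = 10) -> list[tuple[str, float]]:
    """Fuse ranked id lists; ids that rank well in several lists rise to the top."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by score descending; break ties by id so the output is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:top_n]

dense = ["c3", "c1", "c7"]      # ids ordered by cosine similarity (toy data)
lexical = ["c1", "c9", "c3"]    # ids ordered by lexical score (toy data)
fused = rrf_fuse([dense, lexical])
print([doc_id for doc_id, _ in fused])
```

RRF uses only ranks, so the cosine and lexical scores never need to be put on a common scale.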

Where local RAG falls short

Frontier reasoning quality. Hard multi-step reasoning still favours commercial frontier models.
Multilingual breadth. Best multilingual coverage is in commercial models; some open ones are catching up.
Multimodal complexity. Vision-language local RAG works but trails commercial.
Operational maturity. Self-hosted serving has more pitfalls than calling an API.

For most internal-knowledge / customer-support / document-search use cases, local RAG is competitive. For a frontier-quality consumer assistant, commercial models still win.

Failure modes specific to local

Embedding model swap breaks the index. Embeddings from BGE-base aren't comparable to embeddings from e5. Pin the model version; reindex on changes.
Too-aggressive quantisation hurts retrieval more than generation. Embedding models are smaller; quantisation hurts proportionally more. Stick to FP16 or BF16 for embeddings.
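LLM context overflow. Your 7B model advertises a 32k context; you stuff 30 chunks in; it returns garbled output. Typical causes: the serving stack was launched with a smaller window than the model supports, or answer quality degrades well before the advertised limit. Cap retrieved chunks at 5-10; summarise chunks aggressively if you need more coverage.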
Stale models. Open-weights move fast; the model you picked 6 months ago is now mid-tier. Periodic re-evaluation.
Immature prompt caching. Local serving stacks have prompt (prefix) caching, but it is less mature than commercial offerings. Plan for repeated prompts to be re-processed.
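
One way to enforce "pin the model version; reindex on changes" is to store a fingerprint of the embedding setup next to the index and refuse to query on a mismatch. A minimal sketch; the function names, the choice of fields, and the "v1" revision strings are assumptions, not part of any library.

```python
import hashlib
import json

def index_fingerprint(model_name: str, revision: str, dim: int, normalized: bool) -> str:
    """Stable short hash of everything that changes the embedding space."""
    payload = json.dumps(
        {"model": model_name, "revision": revision, "dim": dim, "normalized": normalized},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]

def check_compatible(stored: str, current: str) -> None:
    """Raise instead of silently comparing vectors from different models."""
    if stored != current:
        raise RuntimeError(
            f"index was built with embedder {stored}, but {current} is loaded; reindex first"
        )

# Written once at ingestion time and kept with the index (for example in a metadata table).
built_with = index_fingerprint("BAAI/bge-large-en-v1.5", "v1", 1024, True)

# Recomputed at query time from whatever model is actually loaded.
loaded = index_fingerprint("BAAI/bge-large-en-v1.5", "v1", 1024, True)
check_compatible(built_with, loaded)  # passes; a different model would raise
```

The check fails loudly at startup rather than showing up later as quietly worse retrieval.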

Pragmatic configuration

For a team starting in 2026:

embedder: BAAI/bge-large-en-v1.5
embed_dim: 1024
chunk_size: 384 tokens
chunk_overlap: 64 tokens
vector_store: pgvector with HNSW (m=16, ef_construction=200)
hybrid: true (BM25 via ParadeDB, or Postgres full-text search via tsvector as a non-BM25 fallback; fused with RRF)
reranker: BAAI/bge-reranker-large (top-50 → top-10)
llm: Qwen2.5-14B-Instruct at int4 via vLLM
serving: vLLM on a single GPU with 24 GB or more (RTX 4090 up to H100); or llama.cpp on Apple M-series workstation
context: top 5 chunks + system prompt + query, ~2-3k tokens (5 x 384 is about 1.9k for the chunks alone)

This stack runs comfortably on a single workstation; serves a small team.
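
The `context` line in the configuration can be enforced in code: take the ranked chunks, stop at the chunk cap or the token budget, and label each source so the LLM can cite it. A minimal sketch; the function names and prompt layout are assumptions, and the word count is a crude stand-in for the serving model's tokenizer.

```python
def approx_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())

def build_prompt(system, query, ranked_chunks, budget_tokens=4000, max_chunks=5,
                 count_tokens=approx_tokens):
    """Assemble system prompt, labelled sources, and query within a token budget.

    `ranked_chunks` is a list of (source, text) pairs, best first.
    Returns the prompt and the list of sources that made it in.
    """
    used = count_tokens(system) + count_tokens(query)
    blocks, cited = [], []
    for source, text in ranked_chunks[:max_chunks]:
        block = f"[{len(blocks) + 1}] {source}\n{text}"
        cost = count_tokens(block)
        if used + cost > budget_tokens:
            break  # stop at the first chunk that would overflow the budget
        blocks.append(block)
        cited.append(source)
        used += cost
    prompt = f"{system}\n\nSources:\n" + "\n\n".join(blocks) + f"\n\nQuestion: {query}"
    return prompt, cited

prompt, cited = build_prompt(
    "Answer using only the sources and cite them by label.",
    "What is the refund window?",
    [("policy.pdf p.3", "Refunds are accepted within 30 days of purchase."),
     ("faq.html", "Contact support to start a refund.")],
)
print(prompt)
```

The returned `cited` list is what a UI would use to render the expandable citations from the recipe.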