Local RAG
A fully local RAG (Retrieval-Augmented Generation) pipeline does the embedding, the indexing, and the generation on hardware you control. No data leaves your machines. Useful for privacy-sensitive workloads, on-prem deployments, edge / offline scenarios, or just learning the mechanics without API bills.
In 2026, this is a practical option. The quality is competitive (not frontier) and the operational footprint is manageable.
What "local" buys you
- **No data egress.** Documents stay on your hardware; queries don't leave.
- **Cost predictability.** Pay for hardware once; no per-token charges.
- **Offline capability.** Works without internet.
- **No vendor lock-in.** Swap any component independently.
- **Compliance.** Easier data-residency story; no third-party processor.
What it doesn't buy:
- **Frontier quality.** Open LLMs trail commercial frontier models by roughly 6-12 months on hard tasks.
- **Zero ops.** Self-hosted means real ops work.
- **Burst capacity.** Capacity is bounded by your hardware.
The minimum stack
Three components:
1. **Embedding model** — locally hosted; embeds queries and document chunks.
2. **Vector store** — local index of chunk embeddings.
3. **LLM** — locally hosted; generates the answer from retrieved chunks.
Around these sit the retrieval orchestration (the glue), a chunking pipeline, and an optional reranker.
Component choices
Embedding model
Options (all local-friendly):
- **BGE-small / BGE-base / BGE-large** (BAAI) — strong open embedding family. BGE-large is competitive with commercial embedding APIs.
- **e5-base / e5-large** (Microsoft) — strong; well-supported.
- **gte-large / gte-multilingual** — solid, multilingual variants.
- **Nomic-embed-v1.5 / v2** — Apache 2.0; long-context.
- **all-MiniLM-L6-v2** — small (~80 MB); fast; weaker but fine for small corpora.
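- **Jina v3** — multilingual; long-context. Text-only, and the released weights carry a non-commercial licence (CC BY-NC 4.0), so check the terms before production use.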
For most use cases, BGE-base or BGE-large is the safe default. Run it via the `sentence-transformers` library; the small and base sizes are CPU-friendly.
Vector store
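- **pgvector** — Postgres extension. Reuses your existing Postgres if you have one. With an HNSW index, approximate search is typically millisecond-scale at millions of vectors; at tens of millions, latency depends on whether the index fits in memory.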
- **Qdrant** — single binary, Rust-based, strong filtering.
- **LanceDB** — embedded; columnar; integrates well with Python data workflows.
- **ChromaDB** — Python-native; minimal setup.
- **Weaviate** — fuller-featured; can run locally.
- **FAISS** — bare-bones; the index, no server. Good for embedded use.
- **Just numpy** — for under 100k vectors, in-memory cosine search is fine.
For most local deployments: pgvector or Qdrant.
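To make the "just numpy" option concrete, here is a minimal sketch of brute-force cosine search. The function name `top_k_cosine` and the toy 3-dimensional vectors are illustrative assumptions; real embeddings would come from your embedding model and have hundreds of dimensions.

```python
import numpy as np

def top_k_cosine(query: np.ndarray, matrix: np.ndarray, k: int = 5) -> list[tuple[int, float]]:
    """Return (row_index, cosine_similarity) for the k rows most similar to query."""
    # Normalise both sides so a dot product equals cosine similarity.
    # Assumes no zero-length vectors (those would divide by zero).
    q = query / np.linalg.norm(query)
    m = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    scores = m @ q
    k = min(k, len(scores))
    # argpartition selects the top k without a full sort; then sort only those k.
    top = np.argpartition(-scores, k - 1)[:k]
    top = top[np.argsort(-scores[top])]
    return [(int(i), float(scores[i])) for i in top]

# Toy corpus: four hypothetical chunk embeddings.
corpus = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
hits = top_k_cosine(np.array([1.0, 0.0, 0.0]), corpus, k=2)
print(hits)  # rows 0 and 1, best match first
```

Storing unit-length vectors at ingestion time avoids repeating the matrix normalisation on every query.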
LLM
See [OpenSourceLLMs]() for the full landscape. For RAG specifically:
- **Smaller models often suffice.** RAG provides the knowledge; the LLM just needs to read and reason over the retrieved context. A 7B-13B model with strong context handling works.
- **Long-context handling matters.** A model that stays reliable across 32k-128k tokens can take more retrieved chunks per query without truncation.
Options:
- **Llama 3.1 8B** — strong; well-supported.
- **Mistral Small** (24B) — good quality / cost trade.
- **Qwen 2.5 7B / 14B** — strong on reasoning; good for code.
- **Phi-3.5-mini** (3.8B) — small, capable; runs on modest hardware.
For laptop deployment: 7B at int4 quantisation. For workstation: 13B-30B. For dedicated GPU: anything up to 70B int4 fits on an H100.
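The sizing rule above follows from simple arithmetic: weight memory is roughly parameter count times bits per weight, divided by eight. The sketch below is a back-of-envelope estimate only. The helper name `weight_memory_gb` is hypothetical, it counts weights alone (the KV cache, activations, and runtime overhead come on top), and real int4 formats store slightly more than 4 bits per weight because of quantisation scales.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone, in GB (10^9 bytes)."""
    # (params_billions * 1e9 weights) * (bits / 8 bytes) / 1e9 bytes per GB
    return params_billions * bits_per_weight / 8

for name, params, bits in [
    ("7B int4", 7, 4),
    ("14B int4", 14, 4),
    ("70B int4", 70, 4),
    ("70B fp16", 70, 16),
]:
    print(f"{name}: ~{weight_memory_gb(params, bits):.1f} GB")
```

On this estimate a 70B model at int4 needs about 35 GB for weights, which leaves room for the KV cache on an 80 GB card, while the same model at FP16 (about 140 GB) does not fit on one.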
Reranker
- **bge-reranker-base / large** — strong open reranker.
- **`ms-marco-MiniLM-L-12-v2`** — fast, decent.
- Skip the reranker if you can't afford the latency; it's optional.
Hardware sizing
Laptop (16-32 GB RAM, modest GPU or none)
- Embedding: BGE-small on CPU. A few embeddings per second.
- Vector store: ChromaDB or pgvector with a few thousand to ~1M docs.
- LLM: 7B int4 via llama.cpp. ~5 tok/s on CPU; ~30 tok/s on Apple Silicon M-series.
Suitable for: personal knowledge base; offline assistant; experimentation.
Workstation (64 GB RAM, RTX 4090 or equivalent)
- Embedding: BGE-large on GPU. Hundreds per second.
- Vector store: pgvector or Qdrant; tens of millions of docs.
- LLM: 13B-30B at int4 via vLLM or llama.cpp. 30-100 tok/s.
Suitable for: small team's internal RAG; production for low-volume use.
Dedicated server (A100/H100, 128+ GB system RAM)
- Embedding: any model at speed.
- Vector store: 100M+ docs.
- LLM: 70B at int4 via vLLM. 50+ tok/s; can serve concurrent requests.
Suitable for: production workloads, hundreds of users.
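The document counts in each tier can be sanity-checked against RAM with a rough estimate of raw vector storage. This is a sketch under stated assumptions: the helper name `raw_vector_gb` is hypothetical, vectors are stored as float32 unless told otherwise, and index structures (HNSW graph links, metadata) add overhead that is not modelled here.

```python
def raw_vector_gb(num_vectors: int, dim: int, bytes_per_value: int = 4) -> float:
    """Raw storage for the vectors alone, in GB (10^9 bytes); excludes index overhead."""
    return num_vectors * dim * bytes_per_value / 1e9

# 1024 dimensions matches BGE-large; the counts are illustrative.
for n in (1_000_000, 10_000_000, 100_000_000):
    print(f"{n:>11,} vectors x 1024 dims: ~{raw_vector_gb(n, 1024):.1f} GB as float32")
```

Ten million 1024-dimensional float32 vectors come to about 41 GB before index overhead, so the workstation tier is close to its memory ceiling at that scale, and the 100M+ tier generally needs smaller embeddings, reduced precision, or disk-backed indexes.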
A concrete recipe
For a "chat with your documents" application:
```
1. Ingestion
- Parse documents (text, PDF, HTML)
- Chunk: 256-512 tokens with 10-20% overlap, on semantic boundaries
- Embed each chunk via BGE-large
- Insert into pgvector with metadata (document_id, page, etc.)
2. Indexing
- HNSW index on embedding column
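- Optional: tsvector column for lexical search (ts_rank is not true BM25; use an extension such as ParadeDB if you need BM25 scoring)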
- GIN index on metadata for filters
3. Retrieval
- Query → embed via BGE-large
- Top-50 by cosine
- Optional: BM25 top-50; RRF combine
- Optional: rerank top-50 → top-10 via BGE-reranker
- Select top 5-10 chunks to feed LLM
4. Generation
- System prompt + retrieved chunks + user query
- LLM (Qwen 2.5 14B or similar) at int4 via vLLM
- Stream response
5. Citation
- Track which chunks were retrieved
- LLM cites by source label
- UI shows expandable citations
```
This is the standard pattern. A weekend's work to a functional prototype; weeks to a polished product.
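The "RRF combine" step in the retrieval stage is the least self-explanatory part of the recipe. Reciprocal rank fusion scores each chunk by summing `1 / (k + rank)` over every ranking it appears in, so it needs only ranks, not comparable scores. The sketch below is illustrative: the function name `rrf_fuse` and the chunk ids are hypothetical, and `k = 60` is the commonly used constant, taken here as an assumption.

```python
from collections import defaultdict

def rrf_fuse(rankings: list[list[str]], k: int = 60, top_n: int = 10) -> list[tuple[str, float]]:
    """Fuse several ranked lists of chunk ids with reciprocal rank fusion."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks start at 1; a chunk missing from a list contributes nothing.
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; tie-break on id so the output is deterministic.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]

dense = ["c3", "c1", "c7", "c2"]    # hypothetical top hits by cosine
lexical = ["c1", "c9", "c3", "c4"]  # hypothetical top hits by BM25
fused = rrf_fuse([dense, lexical], top_n=3)
print([chunk_id for chunk_id, _ in fused])
```

Chunks that appear in both lists (`c1`, `c3`) rise above chunks that rank well in only one, which is the behaviour hybrid retrieval is after.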
Where local RAG falls short
- **Frontier reasoning quality.** Hard multi-step reasoning still favours commercial frontier models.
- **Multilingual breadth.** Best multilingual coverage is in commercial models; some open ones are catching up.
- **Multimodal complexity.** Vision-language local RAG works but trails commercial.
- **Operational maturity.** Self-hosted serving has more pitfalls than calling an API.
For most internal-knowledge / customer-support / document-search use cases, local RAG is competitive. For a frontier-quality consumer assistant, commercial models still win.
Failure modes specific to local
- **Embedding model swap breaks the index.** Embeddings from BGE-base aren't comparable to embeddings from e5. Pin the model version; reindex on changes.
- **Too-aggressive quantisation hurts retrieval more than generation.** Embedding models are smaller; quantisation hurts proportionally more. Stick to FP16 or BF16 for embeddings.
- **LLM context overflow.** Your 7B model advertises a 32k context; you stuff 30 chunks in; it returns garbled output. Cap retrieved chunks at 5-10, and summarise chunks aggressively if you need more coverage.
- **Stale models.** Open-weights models move fast; the model you picked 6 months ago is now mid-tier. Re-evaluate periodically.
- **Immature prompt caching.** Local serving stacks do offer prompt (prefix) caching, but it is less mature than in commercial APIs. Budget for re-processing repeated prompts.
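One way to guard against context overflow is to enforce both a chunk cap and a token budget when assembling the prompt. The following is a minimal sketch, not a reference implementation: `pack_context` is a hypothetical helper, the source labels are made up, and the whitespace word count is a crude stand-in for the model's real tokenizer, which you would pass in as `count_tokens`.

```python
def pack_context(
    chunks: list[tuple[str, str]],
    max_chunks: int = 8,
    max_tokens: int = 4000,
    count_tokens=lambda text: len(text.split()),  # assumption: crude token proxy
) -> str:
    """Join (source_label, text) chunks, best-ranked first, within both limits."""
    parts, used = [], 0
    for label, text in chunks[:max_chunks]:
        cost = count_tokens(text)
        if used + cost > max_tokens:
            break  # stop rather than truncate a chunk mid-sentence
        # Prefix each chunk with its label so the LLM can cite it.
        parts.append(f"[{label}] {text}")
        used += cost
    return "\n\n".join(parts)

ranked = [
    ("doc1#p3", "alpha beta gamma"),
    ("doc2#p1", "delta epsilon"),
    ("doc3#p9", "zeta eta theta iota"),
]
context = pack_context(ranked, max_chunks=5, max_tokens=6)
print(context)
```

With a budget of 6 "tokens", the first two chunks fit and the third is dropped whole. Keeping the source label in front of each chunk also gives the model something to cite.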
Pragmatic configuration
For a team starting in 2026:
```yaml
embedder: BAAI/bge-large-en-v1.5
embed_dim: 1024
chunk_size: 384        # tokens
chunk_overlap: 64      # tokens
vector_store: pgvector
index: {type: hnsw, m: 16, ef_construction: 200}
hybrid: true           # BM25 via ParadeDB, or tsvector full-text search; fused with RRF
reranker: BAAI/bge-reranker-large   # top-50 → top-10
llm: Qwen2.5-14B-Instruct           # int4, served via vLLM
serving: vLLM on a single H100, or llama.cpp on an Apple M-series workstation
context: top 5 chunks + system prompt + query   # ~3-4k tokens
```
With the 14B model at int4, this stack fits on a single workstation-class GPU and can serve a small team.
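The `chunk_size` and `chunk_overlap` settings amount to a sliding window over the token stream. Here is a minimal sketch of that window, assuming the text has already been tokenised. The function name `chunk_tokens` is hypothetical, and the sketch ignores the semantic boundaries (headings, paragraphs) that a production chunker would respect.

```python
def chunk_tokens(tokens: list[str], chunk_size: int = 384, overlap: int = 64) -> list[list[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens with their neighbour."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks

tokens = [f"t{i}" for i in range(1000)]  # stand-in for a tokenised document
chunks = chunk_tokens(tokens)
print(len(chunks), [len(c) for c in chunks])
```

A 1000-token document yields three chunks here, the last one shorter than the rest, and each chunk starts with the final 64 tokens of its predecessor.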
Further reading
- [RagImplementationPatterns]() — RAG patterns generally
- [OpenSourceLLMs]() — picking the LLM
- [VectorDatabases]() — substrate detail
- [HybridRetrieval]() — the fusion step
- [RunningLocalLlms]() — LLM-specific hosting