Open Source LLM Ecosystem
The open source LLM ecosystem has matured rapidly. Open models are competitive with closed APIs for many use cases. Tooling is solid. The question for most teams is when open makes sense, not whether it does.
This page maps the landscape.
Why open source
Cost
At high volume, self-hosted open models cost less than API calls.
Privacy
Data stays on your infrastructure. No API provider sees it.
Control
You choose the model, configuration, deployment. No surprise deprecations or behavior changes.
Customization
Fine-tuning, embeddings, full weight access. Things APIs can't offer.
Offline / edge
Some deployments must run without internet (regulated industries, offline products).
Skill development
Working with open models builds organizational ML capabilities.
Major model families
Llama (Meta)
The dominant open family.
- Llama 3.1: 8B, 70B, 405B
- Llama 3.2: 1B, 3B, 11B, 90B (multimodal)
- Llama 3.3: 70B
- Code Llama: code-specialized
License allows commercial use with conditions.
Mistral
- Mistral 7B: small, strong
- Mixtral 8x7B / 8x22B: MoE
- Mistral Large: closed but available via API
- Codestral: code
Apache 2.0 for older models; custom for newer.
Qwen (Alibaba)
- Qwen 2.5: 0.5B, 1.5B, 3B, 7B, 14B, 32B, 72B
- Specialized: Coder, Math
- Strong at non-English languages
Apache 2.0 for many variants.
Gemma (Google)
- Gemma 2: 2B, 9B, 27B
- Distilled from Gemini
Custom Google license.
Phi (Microsoft)
- Phi-3 / Phi-4: small but strong
- Trained on heavily curated data
MIT license.
DeepSeek
- DeepSeek V2 / V3: large MoE
- DeepSeek-Coder
Strong technical capability.
Specialized models
- **Code**: Codestral, DeepSeek-Coder, StarCoder, Qwen-Coder
- **Embeddings**: BGE, E5, sentence-transformers, GTE
- **Vision**: CLIP, LLaVA, Qwen-VL
- **Speech**: Whisper, Parakeet
For specialized tasks, specialized models often beat large generalists.
Inference engines
llama.cpp
C++ library. CPU-friendly. Aggressive quantization.
The standard for running quantized LLMs on consumer hardware.
vLLM
Python/CUDA. Continuous batching, PagedAttention.
Standard for high-throughput GPU serving.
Text Generation Inference (TGI)
Hugging Face's serving framework. Similar capabilities to vLLM.
Ollama
Simplified deployment for local LLM use. Built on llama.cpp.
LM Studio
GUI for local inference. Good for non-technical users.
exLlamaV2 / TabbyAPI
Memory-efficient GPU inference, especially for quantized models.
MLC
Compiles models for many backends (mobile, browser, edge).
Quantization tools
GGUF (llama.cpp)
Format and quantization for CPU/edge inference.
Quants: Q2 (smallest, lowest quality) → Q8 (largest, highest quality).
Q4_K_M is a common balance.
GPTQ
GPU-friendly quantization. Common in HF model hub.
AWQ
Activation-aware. Often slightly better quality at same precision than GPTQ.
EXL2
For exllamav2. Variable per-layer precision.
bitsandbytes
Easiest path; native PyTorch support. 4-bit, 8-bit.
Fine-tuning
Full fine-tuning
Update all weights. Memory-intensive; not always best.
LoRA / QLoRA
Train small adapter weights. 100x fewer trainable parameters; quality nearly matches full fine-tuning for many tasks.
QLoRA: LoRA on quantized base. Runs on consumer GPUs.
Tools
- Hugging Face PEFT
- axolotl
- LLaMA-Factory
- Unsloth (faster training)
When fine-tuning beats prompting
- Lots of training data
- Specialized format / style
- Cost / latency requirements
- Repeated specific tasks
When prompting wins
- Limited examples
- Tasks change frequently
- Capability already exists in base model
Vector databases (for RAG)
Open source:
- **Qdrant**: Rust, strong feature set
- **Weaviate**: Go, integrated ML
- **Milvus**: scalable, mature
- **pgvector**: PostgreSQL extension; simplest if you have PG
- **Chroma**: Python-friendly, smaller scale
Commercial: Pinecone, Vespa.
Frameworks
LangChain
Python, JavaScript. Many integrations. Some criticism for over-abstraction.
LlamaIndex
Focused on RAG and indexing.
Haystack
Mature, modular pipelines.
DSPy
Programming model for LLM applications. Compile prompts.
Direct API calls
Often the right choice. Frameworks add abstraction overhead.
Hardware for self-hosting
Consumer
- RTX 4090: 24GB, runs 7B-13B comfortably, 30-70B with quantization
- M-series Mac: unified memory advantage; Mac Studio with 128GB+ runs 70B models
Workstation
- 2-4 RTX 4090: 70B at full precision
- A6000: 48GB
Server
- A100 (40/80GB), H100 (80GB)
- Multi-GPU for very large models
Cloud
- Hosted services (Replicate, Together, Modal, Fireworks)
- GPU rentals (Lambda, RunPod, Paperspace)
- Hyperscaler (AWS, GCP, Azure)
Quality vs closed APIs
The picture in late 2025/early 2026:
- **General reasoning**: closed (GPT-4, Claude) ahead but gap narrowing
- **Specific benchmarks**: open often competitive or better
- **Coding**: open (Qwen-Coder, Codestral) competitive
- **Long context**: closed ahead for very long context
- **Multimodal**: open catching up
For many production tasks, a 70B open model is sufficient.
Cost comparison
API
- GPT-4o: ~$2.50 / 1M input tokens
- Claude Sonnet: similar
- Smaller models: 10-100x cheaper per token
Self-hosted
Hardware + electricity + ops time.
A100 instance: ~$1-2/hour.
Throughput at 70B: ~1000-3000 tokens/sec with batching.
Breakeven: depends on volume. Often 10M+ tokens/day favors self-hosting if you have the ops capability.
Common failure patterns
Underestimating ops
Self-hosting LLMs is harder than running APIs.
Choosing model by benchmark
Benchmark performance doesn't always match your task.
Wrong quantization level
Q2 ruins quality. Q8 wastes memory. Q4-Q5 is usually right.
Frameworks over fundamentals
Reaching for LangChain when you need 50 lines of Python.
Skipping evaluation
Open models need evaluation on your specific task.
Outdated information
Ecosystem moves fast. 6-month-old advice may be stale.
Decision framework
Use open source when:
- High volume justifies hosting
- Privacy requirements
- Customization needed
- Cost-sensitive
Use closed APIs when:
- Need cutting-edge capability
- Low volume
- No ops capacity
- Prototyping
Many teams run hybrid: closed for hard tasks, open for high-volume ones.
Where to start
1. Try llama.cpp + GGUF quants on your laptop
2. Run a small model end-to-end
3. Evaluate on your task
4. Scale up only if needed
Further Reading
- [TransformerArchitecture](TransformerArchitecture) — Model foundation
- [AgentPromptEngineering](AgentPromptEngineering) — Agent patterns
- [PromptCaching](PromptCaching) — Cost optimization
- [Generative AI Hub](GenerativeAIHub) — Cluster index