Open Source LLM Ecosystem

The open source LLM ecosystem has matured rapidly. Open models are competitive with closed APIs for many use cases. Tooling is solid. The question for most teams is when open makes sense, not whether it does.

This page maps the landscape.

Why open source

Cost

At high volume, self-hosted open models cost less than API calls.

Privacy

Data stays on your infrastructure. No API provider sees it.

Control

You choose the model, configuration, deployment. No surprise deprecations or behavior changes.

Customization

Fine-tuning, embeddings, full weight access. Things APIs can't offer.

Offline / edge

Some deployments must run without internet (regulated industries, offline products).

Skill development

Working with open models builds organizational ML capabilities.

Major model families

Llama (Meta)

The dominant open family.

Llama 3.1: 8B, 70B, 405B
Llama 3.2: 1B, 3B, 11B, 90B (multimodal)
Llama 3.3: 70B
Code Llama: code-specialized

License allows commercial use with conditions.

Mistral

Mistral 7B: small, strong
Mixtral 8x7B / 8x22B: MoE
Mistral Large: closed but available via API
Codestral: code

Apache 2.0 for older models; custom for newer.

Qwen (Alibaba)

Qwen 2.5: 0.5B, 1.5B, 3B, 7B, 14B, 32B, 72B
Specialized: Coder, Math
Strong at non-English languages

Apache 2.0 for many variants.

Gemma (Google)

Gemma 2: 2B, 9B, 27B
Distilled from Gemini

Custom Google license.

Phi (Microsoft)

Phi-3 / Phi-4: small but strong
Trained on heavily curated data

MIT license.

DeepSeek

DeepSeek V2 / V3: large MoE
DeepSeek-Coder

Strong technical capability.

Specialized models

Code: Codestral, DeepSeek-Coder, StarCoder, Qwen-Coder
Embeddings: BGE, E5, sentence-transformers, GTE
Vision: CLIP, LLaVA, Qwen-VL
Speech: Whisper, Parakeet

For specialized tasks, specialized models often beat large generalists.

Inference engines

llama.cpp

C++ library. CPU-friendly. Aggressive quantization.

The standard for running quantized LLMs on consumer hardware.

vLLM

Python/CUDA. Continuous batching, PagedAttention.

Standard for high-throughput GPU serving.

Text Generation Inference (TGI)

Hugging Face's serving framework. Similar capabilities to vLLM.

Ollama

Simplified deployment for local LLM use. Built on llama.cpp.

LM Studio

GUI for local inference. Good for non-technical users.

exLlamaV2 / TabbyAPI

Memory-efficient GPU inference, especially for quantized models.

MLC

Compiles models for many backends (mobile, browser, edge).

Quantization tools

GGUF (llama.cpp)

Format and quantization for CPU/edge inference.

Quants: Q2 (smallest, lowest quality) → Q8 (largest, highest quality).

Q4_K_M is a common balance.

GPTQ

GPU-friendly quantization. Common in HF model hub.

AWQ

Activation-aware. Often slightly better quality at same precision than GPTQ.

EXL2

For exllamav2. Variable per-layer precision.

bitsandbytes

Easiest path; native PyTorch support. 4-bit, 8-bit.

Fine-tuning

Full fine-tuning

Update all weights. Memory-intensive; not always best.

LoRA / QLoRA

Train small adapter weights. 100x fewer trainable parameters; quality nearly matches full fine-tuning for many tasks.

QLoRA: LoRA on quantized base. Runs on consumer GPUs.

Tools

Hugging Face PEFT
axolotl
LLaMA-Factory
Unsloth (faster training)

When fine-tuning beats prompting

Lots of training data
Specialized format / style
Cost / latency requirements
Repeated specific tasks

When prompting wins

Limited examples
Tasks change frequently
Capability already exists in base model

Vector databases (for RAG)

Open source:

Qdrant: Rust, strong feature set
Weaviate: Go, integrated ML
Milvus: scalable, mature
pgvector: PostgreSQL extension; simplest if you have PG
Chroma: Python-friendly, smaller scale

Commercial: Pinecone, Vespa.

Frameworks

LangChain

Python, JavaScript. Many integrations. Some criticism for over-abstraction.

LlamaIndex

Focused on RAG and indexing.

Haystack

Mature, modular pipelines.

DSPy

Programming model for LLM applications. Compile prompts.

Direct API calls

Often the right choice. Frameworks add abstraction overhead.

Hardware for self-hosting

Consumer

RTX 4090: 24GB, runs 7B-13B comfortably, 30-70B with quantization
M-series Mac: unified memory advantage; Mac Studio with 128GB+ runs 70B models

Workstation

2-4 RTX 4090: 70B at full precision
A6000: 48GB

Server

A100 (40/80GB), H100 (80GB)
Multi-GPU for very large models

Cloud

Hosted services (Replicate, Together, Modal, Fireworks)
GPU rentals (Lambda, RunPod, Paperspace)
Hyperscaler (AWS, GCP, Azure)

Quality vs closed APIs

The picture in late 2025/early 2026:

General reasoning: closed (GPT-4, Claude) ahead but gap narrowing
Specific benchmarks: open often competitive or better
Coding: open (Qwen-Coder, Codestral) competitive
Long context: closed ahead for very long context
Multimodal: open catching up

For many production tasks, a 70B open model is sufficient.

Cost comparison

API

GPT-4o: ~$2.50 / 1M input tokens
Claude Sonnet: similar
Smaller models: 10-100x cheaper per token

Self-hosted

Hardware + electricity + ops time.

A100 instance: ~$1-2/hour.

Throughput at 70B: ~1000-3000 tokens/sec with batching.

Breakeven: depends on volume. Often 10M+ tokens/day favors self-hosting if you have the ops capability.

Common failure patterns

Underestimating ops

Self-hosting LLMs is harder than running APIs.

Choosing model by benchmark

Benchmark performance doesn't always match your task.

Wrong quantization level

Q2 ruins quality. Q8 wastes memory. Q4-Q5 is usually right.

Frameworks over fundamentals

Reaching for LangChain when you need 50 lines of Python.

Skipping evaluation

Open models need evaluation on your specific task.

Outdated information

Ecosystem moves fast. 6-month-old advice may be stale.

Decision framework

Use open source when:

High volume justifies hosting
Privacy requirements
Customization needed
Cost-sensitive

Use closed APIs when:

Need cutting-edge capability
Low volume
No ops capacity
Prototyping

Many teams run hybrid: closed for hard tasks, open for high-volume ones.

Where to start

Try llama.cpp + GGUF quants on your laptop
Run a small model end-to-end
Evaluate on your task
Scale up only if needed