Open Source LLMs

By 2026, open-weights LLMs are competitive with frontier commercial models on most tasks below the absolute frontier. The 6-12 month gap to GPT-class / Claude-class quality persists for the hardest reasoning tasks, but for most production work, open-weights is a credible choice.

This page is the state of the field, the decisions, and the operational reality of running these models.

"Open source" is overloaded

Three distinct things often called "open source":

Open weights — model parameters published, usable for inference. License terms vary.
Open training code — the code used to train the model is published; reproducible.
Open data — the training data is published.

Most "open source" LLMs are open-weights only. Truly open (weights + code + data) models exist (OLMo, Pythia) but lag in capability. For practical use, "open weights" is what matters.

License variations matter:

Apache 2.0 / MIT — fully permissive (Mistral original models, DeepSeek, Qwen). Use commercially without restriction.
Llama Community License — permissive for most commercial use; restrictions above 700M MAU; explicit AUP.
Custom non-commercial licenses — research only (e.g., older OpenAI Whisper variants for some uses).

Read the license. Treat "open weights" as a starting point; check the specific model.

The 2026 leaders

Meta Llama family

Llama 3.1, 3.2, 3.3 family — strong across sizes (8B, 70B, 405B). Decent reasoning, good multilingual.
Llama 4 (released 2025) — current frontier of the family.

Strengths: ecosystem maturity (every tool supports Llama format); broad community fine-tunes; strong baseline.

Weaknesses: license has commercial restrictions for very large deployments; not the absolute strongest at any given size.

Mistral

Mistral Small / Medium / Large — recent names; capable; Apache 2.0 for many variants.
Mistral Codestral — code-focused.
Mixtral — mixture-of-experts; strong quality per inference cost.

Strengths: efficient; Apache 2.0 (in most variants); European data sovereignty option.

Qwen (Alibaba)

Qwen 2.5, Qwen 3 — competitive frontier-adjacent. Strong on multilingual, code, math.
Qwen-VL, Qwen-Audio — multimodal variants.
Qwen Coder — code specialist.

Strengths: regularly tops open-weights leaderboards; strong Chinese capabilities; permissive license for most variants.

DeepSeek

DeepSeek V3 — large MoE; competitive with GPT-class on many benchmarks.
DeepSeek R1 — reasoning model; open weights for reasoning is rare.
DeepSeek Coder — strong code model.

Strengths: aggressive on reasoning; open R1 was a notable release. License is permissive.

Smaller specialists

Phi family (Microsoft) — small, surprisingly capable. Good for edge.
Gemma (Google) — small to mid-size; fast.
OLMo (AI2) — fully open including data and code.
StarCoder, StarCoder2 — code models.
Code Llama (deprecated; superseded by Llama with code training).

For a given size point, the leader rotates. Check leaderboards monthly during active eval.

Picking by use case

Use case	Reasonable picks (early 2026)
General chat, mid-quality	Mistral Small, Qwen 2.5 7B
General chat, high-quality	Llama 70B, Mistral Large, Qwen 72B
Frontier reasoning	DeepSeek R1, Qwen reasoning variants
Code generation	Qwen Coder, DeepSeek Coder, Codestral
Multimodal (image)	Qwen-VL, Llava family, MoE-VL
Edge / mobile	Gemma 2B, Phi-3.5-mini
On-device assistant	Llama 3.2 1B / 3B, Phi-3.5
Truly open (weights+code+data)	OLMo
Multilingual non-English	Qwen, Aya (Cohere)

These rotate every few months. Recheck before adopting.

When self-hosting wins

The economic break-even depends on traffic shape:

Sustained high volume — a model running 20+ RPS on dedicated hardware is cheaper than API calls at typical commercial pricing.
Latency-critical — colocated GPU inference returns in 100ms; commercial APIs vary; you control yours.
Privacy / compliance — data can't leave your VPC; on-prem is the only option.
Custom fine-tunes — you have a fine-tuned model; commercial APIs don't host yours.
Edge / on-device — small models on user devices; no remote inference.

For these, self-hosting is the right call.

When commercial APIs win

Bursty traffic — provisioning enough GPU for peak wastes the rest.
Frontier quality matters — open-weights still trails frontier commercial on hard reasoning.
You don't want to operate inference — it's real ops work.
Cross-cutting capabilities (vision, code, function calling, long context) bundled in one API.

For these, pay for the API. The self-host operational cost outweighs the API savings.

Hardware sizing

Approximate VRAM for inference:

Model	FP16	Int8	Int4
7B	~14 GB	~8 GB	~5 GB
13B	~26 GB	~14 GB	~8 GB
70B	~140 GB	~80 GB	~40 GB
405B	~810 GB	~440 GB	~220 GB
MoE 8x22B	~280 GB (all loaded; ~80 GB activated)	scaled

These are rough; concrete numbers depend on context size, KV cache, batching.

Hardware:

Consumer: RTX 4090 (24 GB) — 7B FP16, 13B INT8, 70B INT4 with offloading.
Workstation: A6000 (48 GB), or pair of 4090s — 70B INT4, mid-size FP16.
Datacenter: H100 (80 GB), A100 (40 / 80 GB), H200 (141 GB) — large models.
Apple Silicon: M2/M3/M4 Ultra with up to 192 GB unified memory — surprising option for medium models on macOS.

For most teams: an A100 80 GB or H100 serves a 70B at INT8 with reasonable batching. ~$3-5/hour cloud, or ~$30k purchase.

Serving stacks

vLLM — high-throughput; PagedAttention; the production default.
TGI (Text Generation Inference) — Hugging Face's serving stack; mature.
TensorRT-LLM — NVIDIA-optimised; fastest on NVIDIA hardware; more complex.
llama.cpp — CPU + Apple Silicon + lighter GPU; widely used for laptop / edge.
MLX — Apple Silicon; clean.
MLC-LLM — multi-platform; targets diverse hardware.
SGLang — newer; structured-output-friendly.

For most production GPU serving in 2026: vLLM is the safe default. TensorRT-LLM if you've maxed vLLM and need more throughput.

Fine-tuning open models

LoRA / QLoRA fine-tuning on top of open-weights base. Standard tooling: HuggingFace trl, peft, axolotl, unsloth. See LLMFineTuning.

The full-finetune path (modify all weights) is rare unless you have unusual budget. LoRA suffices for nearly every domain adaptation use case.

A LoRA adapter is small (tens of MB); deploy alongside the base model. Multiple LoRAs per base = multiple specialised models from one inference setup.

Quantisation in production

Most production self-hosted serving uses 4-bit or 8-bit quantisation. AWQ or GPTQ for GPU serving; GGUF for CPU / Apple Silicon. See ModelQuantization.

Quality cost: 1-3% on most benchmarks at 4-bit. Memory savings: 4×. Throughput gain: 2-4× (memory-bandwidth-bound LLM inference).

What's coming

Trends through 2026:

More efficient MoE — better routing, sparser activation. Quality per inference FLOP keeps improving.
Reasoning-mode open weights — DeepSeek R1 was the start; expect more.
Long context efficiency — Mamba / SSM variants; sparse attention; ring attention. Open-weights long-context becomes routine.
Vision-language merge — multimodal-by-default open models.
Stronger small models — 3-7B models reaching what 70B did a year ago. Edge inference becomes more capable.

The cadence of new releases is fast; lock in for 6 months at most before re-evaluating.

A pragmatic recipe

For a team adopting open-weights:

Pilot with a few models — Llama 70B, Mistral Large, Qwen 72B at minimum.
Run your eval on each.
Pick the strongest for your task (often surprises; not always the latest).
Quantise to 4-bit AWQ or 8-bit GPTQ.
Serve via vLLM on appropriate hardware.
Set up observability — see AiObservabilityInProduction.
Build a periodic re-eval habit — check new model releases monthly.

Two weeks of work; production-grade open-weights serving.