AI Evaluation and Benchmarks

Evaluating AI systems is harder than it looks. Public benchmarks tell you something but rarely what you actually want to know. The model that wins on MMLU might be useless for your task; the one that's mediocre on benchmarks might be perfect.

This page is the working set for 2026: benchmarks that still discriminate, why they're imperfect, and how to evaluate models for your actual use case.

Why benchmarks fail

Three reasons benchmarks become misleading:

Saturation

When the top models all score 95%+, the benchmark stops differentiating. Once everyone is at the ceiling, "Model A is 0.3 points better than Model B" is usually within measurement noise and says little about real use.

Examples by 2026:

HumanEval (basic Python coding) — saturated; almost every frontier model is above 90%.
HellaSwag — saturated.
GLUE — saturated for a while.
MMLU — close to saturated; top models in the high 80s to low 90s.

Benchmarks at saturation are noise. New ones (MMLU-Pro, GPQA Diamond) replace them.
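
To see why a sub-point gap at the ceiling carries little information, it helps to put a confidence interval around each score. The sketch below uses the Wilson score interval, which behaves better than the plain normal approximation when accuracy is close to 100%. The counts are hypothetical; the only real number is the benchmark size of 164 problems, which matches HumanEval.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for an accuracy of correct/total (z=1.96 ~ 95%)."""
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need 0 <= correct <= total and total > 0")
    p = correct / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))


def intervals_overlap(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True when two (low, high) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical scores on a 164-problem benchmark: 156 vs 155 correct.
model_a = wilson_interval(156, 164)
model_b = wilson_interval(155, 164)
print(model_a, model_b, intervals_overlap(model_a, model_b))
```

With one problem separating the two models, the intervals overlap almost entirely. Overlapping intervals do not prove the models are equal, but they are a reasonable signal that the leaderboard ordering should not drive a decision.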

Contamination

Benchmark questions leak into pretraining data. Models then "ace" the test by recall, not capability.

Internet-scraped pretraining data contains lots of public benchmark content.
Even paraphrased benchmark items leak.
Models tested in 2024 against benchmarks released in 2020 are, to a large degree, being tested on memorisation.

Mitigations:

Live benchmarks like LiveCodeBench, which keep adding problems published after model training cutoffs.
Private benchmarks held back by the benchmark authors.
Per-model contamination tests, for example checking whether a model can reproduce the canary strings that benchmark authors embed in their data.

In 2026, expect any benchmark older than a year to have at least partial contamination.
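
If you control some of the training or fine-tuning text, you can run a rough contamination check yourself. The sketch below is one simple approach, not a standard tool: it measures what fraction of a benchmark item's word 8-grams appear verbatim in a set of corpus documents, and looks for a canary string. The function names, the 8-gram window, and the canary value are all assumptions made for illustration.

```python
def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Lower-cased, whitespace-tokenised word n-grams of a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(item: str, corpus_docs: list[str], n: int = 8) -> float:
    """Fraction of the item's n-grams that occur verbatim in any corpus doc.

    Returns 0.0 when the item is shorter than n tokens.
    """
    item_grams = ngrams(item, n)
    if not item_grams:
        return 0.0
    corpus_grams: set[tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)


def contains_canary(doc: str, canary: str) -> bool:
    """True when a benchmark's canary string appears in a document."""
    return canary in doc


# Hypothetical canary value; real benchmarks publish their own.
EXAMPLE_CANARY = "EVAL-CANARY-hypothetical-0000"
```

Exact n-gram matching only catches verbatim leaks. Paraphrased items pass straight through, so a low overlap score is weak evidence of a clean benchmark rather than proof.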

Distribution mismatch

Public benchmarks test specific narrow distributions. Your application is a different distribution. Performance on benchmarks doesn't directly predict performance on your task.

Example: a model that aces MMLU (multiple-choice academic questions) might fail at "extract action items from a meeting transcript" because the latter requires entirely different skills.

This is the most common failure: assuming benchmark scores transfer directly to your application.

Benchmarks that still discriminate (2026)

Despite the above, some are still informative:

Knowledge / reasoning

MMLU-Pro — extension of MMLU; harder; less saturated. Distinguishes between mid-tier and frontier models.
GPQA Diamond — graduate-level science questions (biology, physics, chemistry) written to be hard to look up or memorise. Strong differentiator at the frontier.
MATH benchmark — competition math. Still has signal.
AIME — problems from the American Invitational Mathematics Examination, a US olympiad qualifier. Hard; reasoning-heavy.

Code

SWE-bench Verified — real-world coding tasks from GitHub issues. Hard to game; close to actual developer work. The benchmark for code agents.
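LiveCodeBench — competitive programming problems collected from LeetCode, AtCoder, and Codeforces, tagged by release date so models can be scored on problems published after their training cutoff. Resists contamination.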
HumanEval+ — extension of HumanEval with more rigorous tests. Less saturated than original.
BigCodeBench — broader code tasks beyond functional implementation.

Agent capability

τ-bench (TauBench) — agent capability in dialog tasks (retail, airline customer support). Tests real multi-turn use.
WebArena, VisualWebArena — agents acting on web pages. Tests real-world navigation and tool use.
SWE-bench Verified — also relevant here.
GAIA — general assistant tasks; multi-step reasoning + tool use.

Multimodal

MMMU — multi-discipline multimodal QA. Standard for vision+language.
MathVista — math reasoning with images.
DocVQA — document understanding.

Subjective / preference

LMArena (Chatbot Arena) — head-to-head human preference. Subject to style bias but captures something real about helpfulness.
MT-Bench — multi-turn conversation; LLM-judge with human-correlated calibration.

Hallucination / factuality

TruthfulQA — common misconceptions; tests refusal of plausible-but-wrong answers.
HaluEval — broader hallucination eval.
LongFact, FreshQA — for long-form and time-sensitive factuality.

Reasoning under context

NIAH (Needle in a Haystack) — find a fact buried in a long context. Tests long-context recall.
RULER, BABILong — more sophisticated long-context benchmarks.

How to read a benchmark result

When you see "Model X scores 78% on Y benchmark":

What's Y exactly? Most benchmarks have variants. MMLU 5-shot vs 0-shot vs CoT; numbers can differ by 5-10 points.
What's the contamination story? Was Y in the training data?
What's the methodology? Greedy decoding, sampled, best-of-N, with reasoning mode? Each gives different numbers.
What's the comparison? "78% vs the prior best of 75%" is more informative than the absolute.
Was the eval prompt strong? Bad prompts produce bad scores, so ask whether Model X got a fair shake.
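
The methodology question matters most for code benchmarks, where single-sample and multi-sample numbers are often quoted side by side. A common way to report sampled results is pass@k: the probability that at least one of k samples solves the problem, estimated from n generated samples of which c passed. The sketch below implements the usual unbiased estimator, 1 - C(n-c, k) / C(n, k). The example counts are made up.

```python
import math


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that passed, k: sample budget reported.
    """
    if not 0 <= c <= n:
        raise ValueError("need 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k samples include a pass.
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


# Hypothetical problem: 10 samples generated, 3 of them correct.
print(round(pass_at_k(10, 3, 1), 4))  # 0.3
print(round(pass_at_k(10, 3, 5), 4))  # 0.9167
```

The same model on the same problem reads as 30% or roughly 92% depending on k, which is why a score without its sampling setup is hard to compare.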

Vendors often report cherry-picked configurations. Independent reports (Stanford HELM, the Hugging Face Open LLM Leaderboard, LMArena) are more reliable than vendor blog posts.

Building your own eval

The single highest-leverage practice: maintain a held-out set of 100-500 real task inputs from your application, with expected outputs.

This is the eval that matters. Public benchmarks are for sanity checks; your eval is for decisions.

Steps:

Sample real production queries (or representative synthetic ones).
Label expected outputs. Human-labelled at first; later possibly LLM-judged with calibration.
Categorise into task subtypes for per-category breakdown.
Add adversarial cases — known-hard, known-edge-case examples.
Freeze and version. Every eval run goes against the same frozen version, so scores stay comparable over time.

Run on every prompt, model, or pipeline change. Track scores over time; alert on regressions.
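
One way to make "freeze and version" concrete is to give the eval set a content fingerprint and store it with every result, so a score is never compared across two different versions of the set. The sketch below is a minimal illustration; the `EvalCase` fields and the 12-character hash prefix are assumptions, not a prescribed schema.

```python
import hashlib
import json
from collections import Counter
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    category: str            # task subtype, used for per-category breakdown
    input_text: str
    expected: str
    adversarial: bool = False  # known-hard or known-edge-case example


def fingerprint(cases: list[EvalCase]) -> str:
    """Order-independent short hash of the whole eval set."""
    canonical = json.dumps(
        [asdict(c) for c in sorted(cases, key=lambda c: c.case_id)],
        sort_keys=True,
        ensure_ascii=False,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def category_counts(cases: list[EvalCase]) -> Counter:
    """How many cases each task subtype contributes."""
    return Counter(c.category for c in cases)
```

Any edit to an input, label, or category changes the fingerprint, which makes an accidental change to the "frozen" set visible in the results log.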

See LlmEvaluationMetrics for the metric details and AgentTesting for rollout-based variants for agents.

What "we evaluated against benchmarks" doesn't mean

Vendor claims that need scrutiny:

"State of the art on N benchmarks." On which benchmarks? What was the prior state of the art?
"Comparable to GPT-4." On what tasks? Under what conditions?
"Best for code." On HumanEval (saturated) or SWE-bench (more meaningful)?
"Surpasses humans." Almost always on a specific benchmark, often with caveats.

Be sceptical. Most marketing claims are about cherry-picked benchmark configurations.

Specific 2026 model selection guidance

When picking a model for production:

1. Run a small qualitative test on your task. Manually compare 5-10 outputs across candidates.
2. Consult LMArena for a general capability ordering, using recent rankings.
3. Check task-specific benchmarks if available: SWE-bench for code, τ-bench for agents.
4. Run your eval on top candidates. This is the deciding factor.
5. Pilot on shadow traffic before committing.

Skip steps 2-3 only if you have a strong prior. The eval (step 4) is non-negotiable.

Eval-in-production

Benchmarks and offline evals run before deployment. In production, you also want continuous eval:

Sampled human review of outputs.
LLM-as-judge on production samples (calibrated).
Eval-set replay nightly against the deployed system. Catches regressions before users do.
A/B testing for model or prompt changes with primary metrics.
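
The nightly replay is only useful if something compares tonight's numbers with a stored baseline. A minimal sketch of that comparison, assuming per-category scores are kept as plain dictionaries; the 0.02 tolerance and the category names are placeholders to tune for your own eval noise.

```python
def find_regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.02,
) -> dict[str, float]:
    """Return {category: drop} for categories that fell by more than tolerance.

    A category missing from the current run is treated as a score of 0.0,
    so it is always flagged.
    """
    regressions: dict[str, float] = {}
    for category, base_score in baseline.items():
        drop = base_score - current.get(category, 0.0)
        if drop > tolerance:
            regressions[category] = round(drop, 4)
    return regressions


# Hypothetical scores from the stored baseline and from last night's replay.
baseline = {"extraction": 0.91, "summarisation": 0.84}
nightly = {"extraction": 0.90, "summarisation": 0.78}
print(find_regressions(baseline, nightly))  # {'summarisation': 0.06}
```

A one-point wobble in extraction stays quiet while the six-point drop in summarisation is reported. With small eval sets the tolerance needs to be wide enough to absorb run-to-run variation.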

See AiObservabilityInProduction.

Failure modes in evaluation pipelines

Eval set in training. Items from your held-out set were reused as prompt examples, so the numbers lie.
Cherry-picked rollouts. "I tried it once and it worked." Single rollouts are meaningless; aggregate.
Judge bias not calibrated. The LLM judge rewards longer responses, or answers that match your own expectations, while disagreeing with human raters.
No cost tracking. Quality went up 5%; cost went up 50%; net loss.
Static eval set. Real distribution shifts; eval set becomes irrelevant. Refresh periodically.
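
For the judge-calibration failure in particular, a quick check is to have humans label a sample and measure how well the judge agrees with them beyond chance. The sketch below computes Cohen's kappa for binary pass/fail labels; the label lists are invented, and what counts as "good enough" agreement is a judgment call for your task.

```python
def cohens_kappa(human: list[bool], judge: list[bool]) -> float:
    """Chance-corrected agreement between human and LLM-judge pass/fail labels."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    p_human = sum(human) / n
    p_judge = sum(judge) / n
    expected = p_human * p_judge + (1 - p_human) * (1 - p_judge)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is formally
        # undefined here, and this sketch reports it as full agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical labels on eight sampled outputs.
human = [True, True, True, False, False, True, False, True]
judge = [True, True, False, False, True, True, False, True]
print(round(cohens_kappa(human, judge), 3))  # 0.467
```

Raw agreement here is 75%, but kappa is about 0.47 once chance agreement is removed, which is the more honest number to track when most outputs pass.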

A pragmatic eval stack

For a team starting fresh:

A 100-task eval set from real production queries; labelled; versioned.
A simple harness that runs the set on a model, logs results, computes metrics.
Track: success rate, cost, latency, per-category breakdown.
Run on every model / prompt change.
Quarterly refresh — add new edge cases, retire saturated ones.
LLM-judge for fuzzy tasks, calibrated against human labels.
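
The harness really can be simple. A minimal sketch, assuming the model is wrapped in a callable that returns text plus a cost figure, and that correctness can be decided by a function you supply (exact match, a rubric, or a calibrated judge). `Task`, `ModelReply`, and `run_eval` are hypothetical names, not part of any library.

```python
import time
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable


@dataclass
class Task:
    task_id: str
    category: str
    prompt: str
    expected: str


@dataclass
class ModelReply:
    text: str
    cost_usd: float


def run_eval(
    tasks: list[Task],
    model: Callable[[str], ModelReply],
    is_correct: Callable[[str, str], bool],
) -> dict:
    """Run every task once and report success rate, cost, latency, breakdown."""
    if not tasks:
        raise ValueError("eval set is empty")
    per_category = defaultdict(lambda: [0, 0])  # category -> [passed, total]
    total_cost = 0.0
    latencies = []
    for task in tasks:
        start = time.perf_counter()
        reply = model(task.prompt)
        latencies.append(time.perf_counter() - start)
        total_cost += reply.cost_usd
        per_category[task.category][0] += int(is_correct(reply.text, task.expected))
        per_category[task.category][1] += 1
    passed = sum(p for p, _ in per_category.values())
    return {
        "success_rate": passed / len(tasks),
        "total_cost_usd": total_cost,
        "mean_latency_s": sum(latencies) / len(latencies),
        "per_category": {c: p / t for c, (p, t) in per_category.items()},
    }
```

Keeping cost and latency in the same report as quality is what makes a "5% better, 50% more expensive" trade-off visible at decision time.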

A week of work; immediate decision-making clarity.