AI Evaluation and Benchmarks
Evaluating AI systems is harder than it looks. Public benchmarks tell you something but rarely what you actually want to know. The model that wins on MMLU might be useless for your task; the one that's mediocre on benchmarks might be perfect.
This page is the working set for 2026: benchmarks that still discriminate, why they're imperfect, and how to evaluate models for your actual use case.
Why benchmarks fail
Three reasons benchmarks become misleading:
Saturation
When the top models all score 95%+, the benchmark stops differentiating. Once everyone's at the ceiling, "Model A is 0.3 points better than Model B" doesn't translate to differentiation in real use.
Examples by 2026:
- HumanEval (basic Python coding) — saturated; almost every frontier model is above 90%.
- HellaSwag — saturated.
- GLUE — saturated for a while.
- MMLU — close to saturated; top models in the high 80s.
Benchmarks at saturation are noise. New ones (MMLU-Pro, GPQA Diamond) replace them.
Contamination
Benchmark questions leak into pretraining data. Models then "ace" the test by recall, not capability.
- Internet-scraped pretraining data contains lots of public benchmark content.
- Even paraphrased benchmark items leak.
- Models tested in 2024 against benchmarks released in 2020 are essentially being tested on memorisation.
Mitigations:
- **Live benchmarks** like LiveCodeBench (released after model cutoffs).
- **Private benchmarks** held back by the benchmark authors.
- **Per-model contamination tests** via canary strings.
In 2026, expect any benchmark older than a year to have at least partial contamination.
Distribution mismatch
Public benchmarks test specific narrow distributions. Your application is a different distribution. Performance on benchmarks doesn't directly predict performance on your task.
Example: a model that aces MMLU (multiple-choice trivia) might fail at "extract action items from a meeting transcript" because the latter requires entirely different skills.
This is the most common failure: assuming benchmark scores mean anything about your application.
Benchmarks that still discriminate (2026)
Despite the above, some are still informative:
Knowledge / reasoning
- **MMLU-Pro** — extension of MMLU; harder; less saturated. Distinguishes between mid-tier and frontier models.
- **GPQA Diamond** — graduate-physics-level questions; hard to memorise. Strong differentiator at the frontier.
- **MATH benchmark** — competition math. Still has signal.
- **AIME** — math olympiad problems. Hard; reasoning-heavy.
Code
- **SWE-bench Verified** — real-world coding tasks from GitHub issues. Hard to game; close to actual developer work. The benchmark for code agents.
- **LiveCodeBench** — competitive programming from Codeforces; released after training cutoffs. Resists contamination.
- **HumanEval+** — extension of HumanEval with more rigorous tests. Less saturated than original.
- **BigCodeBench** — broader code tasks beyond functional implementation.
Agent capability
- **τ-bench (TauBench)** — agent capability in dialog tasks (retail, airline customer support). Tests real multi-turn use.
- **WebArena, VisualWebArena** — agents acting on web pages. Tests real-world navigation and tool use.
- **SWE-bench Verified** — also relevant here.
- **GAIA** — general assistant tasks; multi-step reasoning + tool use.
Multimodal
- **MMMU** — multi-discipline multimodal QA. Standard for vision+language.
- **MathVista** — math reasoning with images.
- **DocVQA** — document understanding.
Subjective / preference
- **LMArena (Chatbot Arena)** — head-to-head human preference. Subject to style bias but captures something real about helpfulness.
- **MT-Bench** — multi-turn conversation; LLM-judge with human-correlated calibration.
Hallucination / factuality
- **TruthfulQA** — common misconceptions; tests refusal of plausible-but-wrong answers.
- **HaluEval** — broader hallucination eval.
- **LongFact, FreshQA** — for long-form and time-sensitive factuality.
Reasoning under context
- **NIAH (Needle in a Haystack)** — find a fact buried in a long context. Tests long-context recall.
- **RULER, BABILong** — more sophisticated long-context benchmarks.
How to read a benchmark result
When you see "Model X scores 78% on Y benchmark":
1. **What's Y exactly?** Most benchmarks have variants. MMLU 5-shot vs 0-shot vs CoT; numbers differ by 5-10 points.
2. **What's the contamination story?** Was Y in the training data?
3. **What's the methodology?** Greedy decoding, sampled, best-of-N, with reasoning mode? Each gives different numbers.
4. **What's the comparison?** "78% vs the prior best of 75%" is more informative than the absolute.
5. **Was the eval prompt strong?** Bad prompts produce bad scores; "did Model X get a fair shake."
Vendors often report cherry-picked configurations. Independent reports (Stanford HELM, the Hugging Face Open LLM Leaderboard, Lmsys Chatbot Arena) are more reliable than vendor blog posts.
Building your own eval
The single highest-leverage practice: maintain a held-out set of 100-500 real task inputs from your application, with expected outputs.
This is the eval that matters. Public benchmarks are for sanity checks; your eval is for decisions.
Steps:
1. **Sample** real production queries (or representative synthetic ones).
2. **Label** expected outputs. Human-labelled at first; later possibly LLM-judged with calibration.
3. **Categorise** into task subtypes for per-category breakdown.
4. **Add adversarial cases** — known-hard, known-edge-case examples.
5. **Freeze and version**. New eval runs always against the same set.
Run on every prompt, model, or pipeline change. Track scores over time; alert on regressions.
See [LlmEvaluationMetrics]() for the metric details and [AgentTesting]() for rollout-based variants for agents.
What "we evaluated against benchmarks" doesn't mean
Vendor claims that need scrutiny:
- "State of the art on N benchmarks." On *which* benchmarks; what was the prior state.
- "Comparable to GPT-4." On what tasks; in what conditions.
- "Best for code." On HumanEval (saturated) or SWE-bench (more meaningful)?
- "Surpasses humans." Almost always on a specific benchmark, often with caveats.
Be sceptical. Most marketing claims are about cherry-picked benchmark configurations.
Specific 2026 model selection guidance
When picking a model for production:
1. **Run a small qualitative test** on your task. Manually compare 5-10 outputs across candidates.
2. **Consult LMArena** for general capability ordering. Recent rankings.
3. **Check task-specific benchmarks** if available — SWE-bench for code, τ-bench for agents.
4. **Run your eval** on top candidates. This is the deciding factor.
5. **Pilot on shadow traffic** before committing.
Skip steps 2-3 only if you have a strong prior. The eval (step 4) is non-negotiable.
Eval-in-production
Benchmarks ship pre-deployment. In production, you also want continuous eval:
- **Sampled human review** of outputs.
- **LLM-as-judge on production samples** (calibrated).
- **Eval-set replay nightly** against the deployed system. Catches regressions before users do.
- **A/B testing for model or prompt changes** with primary metrics.
See [AiObservabilityInProduction]().
Failure modes in evaluation pipelines
- **Eval set in training.** Your held-out set was used for prompt examples; numbers lie.
- **Cherry-picked rollouts.** "I tried it once and it worked." Single rollouts are meaningless; aggregate.
- **Judge bias not calibrated.** LLM judge agrees with you / longer responses; disagree with humans.
- **No cost tracking.** Quality went up 5%; cost went up 50%; net loss.
- **Static eval set.** Real distribution shifts; eval set becomes irrelevant. Refresh periodically.
A pragmatic eval stack
For a team starting fresh:
1. **A 100-task eval set** from real production queries; labelled; versioned.
2. **A simple harness** that runs the set on a model, logs results, computes metrics.
3. **Track**: success rate, cost, latency, per-category breakdown.
4. **Run on every model / prompt change.**
5. **Quarterly refresh** — add new edge cases, retire saturated ones.
6. **LLM-judge** for fuzzy tasks, calibrated against human labels.
A week of work; immediate decision-making clarity.
Further reading
- [LlmEvaluationMetrics]() — metric details
- [AgentTesting]() — rollout-based eval for agents
- [AiHallucinationMitigation]() — factuality-specific evaluation