LLM Evaluation Metrics

Standard software testing relies on deterministic assertions. LLM testing relies on statistical distributions. Choosing the wrong metric leads to "benchmark chasing," where a model improves on paper but regresses in the hands of users.

The Hierarchy of Metrics

| Metric Type | Example | Use Case | Limitations |
|---|---|---|---|
| **Deterministic** | Exact Match, JSON Schema Validity | Code Gen, Extraction, Structured I/O | Too rigid for creative tasks. |
| **N-Gram Overlap** | ROUGE-L, BLEU | Summarization, Translation | Penalizes synonyms; blind to factual correctness. |
| **Model-Based** | BERTScore, G-Eval | Semantic alignment, nuance | High cost; potential "Judge Bias". |
| **Human-in-the-Loop** | Likert Scale, Pairwise Pref | Final product validation | Extremely slow and expensive. |
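To make the first two rows of the table concrete, here is a minimal sketch of a deterministic check and an n-gram style score. The function names and the whitespace tokenization are illustrative assumptions; production ROUGE tooling applies its own tokenization and stemming, and `is_valid_json` checks syntax only, not conformance to a schema.

```python
import json


def exact_match(prediction: str, reference: str) -> bool:
    """Case-insensitive exact match after trimming surrounding whitespace."""
    return prediction.strip().lower() == reference.strip().lower()


def is_valid_json(text: str) -> bool:
    """True if the text parses as JSON (syntax only, no schema check)."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, via dynamic programming."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction: str, reference: str) -> float:
    """ROUGE-L style F1 over lowercase, whitespace-split tokens (an assumption)."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    if not pred or not ref:
        return 0.0
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Scoring "the film was great" against "the movie was excellent" shares only "the" and "was", so the overlap score drops even though a human reader would call the two sentences equivalent. That is the synonym penalty listed in the table.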

Code-Specific Metrics

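For engineering tasks, n-gram overlap is a poor signal: code that looks nearly identical to the reference can fail, and code that looks nothing like it can be correct. We use **Pass@k**, which scores functional correctness against unit tests.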

A model generates $n$ samples for a coding problem, of which $c$ pass the unit tests. For $k \le n$, the estimated probability that at least one of $k$ samples drawn from those $n$ passes is:

$$ \text{Pass@k} = 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} $$
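The formula translates directly into a few lines of Python. This is a minimal sketch using exact integer binomials from the standard library; the function name and the input validation are illustrative choices, not part of any particular benchmark harness.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Pass@k for one problem: n samples generated, c of them passed."""
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # math.comb returns 0 when k > n - c, so the result is 1.0 whenever
    # every possible draw of k samples must include a passing one.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With `n=10` and `c=3`, `pass_at_k(10, 3, 1)` reduces to `c / n`. A benchmark-level score is usually the mean of this per-problem value across all problems.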

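*Practitioner Note: In production, always report Pass@1, because a user typically sees a single completion. Pass@10 and Pass@100 measure what the model can do with many attempts, and are often used to inflate results in academic papers.*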

RAG-Specific Evaluation (RAGAS)

Evaluating a Retrieval-Augmented Generation system requires breaking the problem into two parts: retrieval quality and generation quality. The first two metrics below score the generated answer; the third scores the retriever.

1. **Faithfulness:** Does the answer derive *only* from the retrieved context? (Detects hallucination.)

2. **Answer Relevance:** Does the answer actually address the user's prompt?

3. **Context Precision:** Is the retrieved context actually useful for answering the question?
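Once per-claim and per-chunk verdicts exist, two of the metrics above reduce to simple arithmetic. The sketch below assumes those boolean verdicts have already been produced by a judge model or a human labeler; the function names are hypothetical and this is not the RAGAS library's API. The rank-weighted form of context precision is one common formulation, stated here as an assumption.

```python
def faithfulness(claim_supported: list[bool]) -> float:
    """Fraction of the answer's claims that the retrieved context supports."""
    if not claim_supported:
        return 0.0
    return sum(claim_supported) / len(claim_supported)


def context_precision(chunk_relevant: list[bool]) -> float:
    """Rank-aware precision: mean of precision@i over the relevant ranks i.

    chunk_relevant is ordered as the retriever returned the chunks.
    """
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(chunk_relevant, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0
```

Under this formulation, a retriever that returns the same relevant chunk at rank 3 instead of rank 1 scores lower, which rewards putting useful context first.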

The LLM-as-Judge Pattern

Using a model like GPT-4o to grade a smaller model (e.g., Llama 3) is now the industry standard for subjective tasks.

```python
# Reference rubric for a Judge LLM
JUDGE_PROMPT = """
Evaluate the assistant's response based on Accuracy and Conciseness.
Score 1-5.
A score of 5 means the answer contains zero hallucinations and no fluff.

Context: {retrieved_context}
Question: {user_query}
Response: {assistant_response}
"""
```
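A rubric like this still needs glue code to fill in the placeholders and read back a number. The following is a minimal sketch under stated assumptions: `call_judge` is a hypothetical callable that sends a prompt to whichever judge model you use and returns its text reply, and the template passed in is assumed to use the three placeholder names from the rubric.

```python
import re
from typing import Callable, Optional


def parse_score(judge_output: str) -> Optional[int]:
    """Return the first standalone digit from 1 to 5 in the reply, else None."""
    match = re.search(r"\b([1-5])\b", judge_output)
    return int(match.group(1)) if match else None


def grade(
    call_judge: Callable[[str], str],  # hypothetical wrapper around a judge model
    prompt_template: str,
    retrieved_context: str,
    user_query: str,
    assistant_response: str,
) -> Optional[int]:
    """Fill the rubric template, query the judge, and parse its score."""
    prompt = prompt_template.format(
        retrieved_context=retrieved_context,
        user_query=user_query,
        assistant_response=assistant_response,
    )
    return parse_score(call_judge(prompt))
```

Parsing is deliberately strict: a reply with no digit between 1 and 5 yields `None` rather than a guessed score, so unparseable replies can be counted and inspected. The parser takes the first matching digit, so it works best when the rubric asks the judge to reply with the score alone.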

**CRITICAL: Judge Calibration.** You must manually grade 100 samples alongside the Judge LLM. If your agreement rate is below 80%, your rubric is too vague.
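The agreement check can be computed in a few lines once the human and judge scores sit in two parallel lists. This sketch reports the raw agreement rate the rule above refers to, plus Cohen's kappa as an optional extra that is not part of the rule itself; kappa corrects for agreement that would occur by chance when scores cluster on one or two values. The function names are illustrative.

```python
from collections import Counter


def agreement_rate(human: list[int], judge: list[int]) -> float:
    """Fraction of samples where the human and the judge gave the same score."""
    if len(human) != len(judge) or not human:
        raise ValueError("score lists must be non-empty and of equal length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human: list[int], judge: list[int]) -> float:
    """Chance-corrected agreement between two raters over the same samples."""
    n = len(human)
    observed = agreement_rate(human, judge)
    human_counts, judge_counts = Counter(human), Counter(judge)
    # Probability that both raters pick the same label independently.
    expected = sum(
        human_counts[label] * judge_counts[label] for label in human_counts
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)
```

If the raw agreement clears 80% but kappa is low, the scores are probably bunched at one end of the scale, and the calibration set may need harder or more varied samples.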

Public Benchmarks to Watch (2026)

- **MMLU-Pro:** A harder, cleaner version of the classic MMLU (Massive Multitask Language Understanding).

- **GPQA:** Graduate-level science questions that are Google-proof.

- **HumanEval:** The baseline for Python code generation.

- **SWE-bench:** Real GitHub issues that require the model to patch an existing repository, often across multiple files.

Further Reading

- [AgentTesting](AgentTesting) — Moving from single-call metrics to multi-turn trajectory eval.

- [AiEvaluationAndBenchmarks](AiEvaluationAndBenchmarks) — How to interpret Leaderboard scores.

- [RetrievalExperimentHarness](RetrievalExperimentHarness) — Building a local test suite for RAG.