Evaluating Retrieval Quality: From Vibes to MRR

Search quality is notoriously difficult to measure because "relevance" is subjective. Without a rigorous evaluation harness, every change to a search algorithm is just a "vibe check." Industrial systems replace the vibe check with a Gold Set: a fixed list of queries with known correct answers, scored the same way on every run so that the metrics are reproducible.

1. The Core Metrics

A. Recall@K

The percentage of queries for which the "ideal" document appears in the top $K$ results.

Why it matters: In a wiki or RAG system, if the answer isn't in the top 5 (Recall@5), the agent or human likely won't find it.
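The definition above can be sketched in a few lines of Python. This is a minimal illustration, not the search-eval tool's implementation: the function name `recall_at_k`, the dictionary layout, and the non-ideal page names (`LocalModels`, `ContainerBasics`) are assumptions made for the example. It assumes exactly one ideal page per query.

```python
def recall_at_k(results, ideal, k):
    """Fraction of queries whose ideal page appears in the top-k results.

    results: dict mapping query -> ranked list of page names (best first)
    ideal:   dict mapping query -> the single ideal page for that query
    """
    if not ideal:
        raise ValueError("gold set is empty")
    # A query counts as a hit when its ideal page is among the first k results.
    hits = sum(1 for query, page in ideal.items()
               if page in results.get(query, [])[:k])
    return hits / len(ideal)


# Gold answers; the ranked lists below are hypothetical search output.
ideal = {
    "Ollama Setup": "OllamaSetup",
    "Running wiki in a container": "WikantikOnDocker",
    "How do I deploy locally?": "JspwikiDeployment",
}
results = {
    "Ollama Setup": ["OllamaSetup", "LocalModels"],
    "Running wiki in a container": ["ContainerBasics", "WikantikOnDocker"],
    "How do I deploy locally?": ["OllamaSetup", "ContainerBasics"],
}

recall_1 = recall_at_k(results, ideal, 1)  # only the first query hits
recall_2 = recall_at_k(results, ideal, 2)  # the first two queries hit
print(f"Recall@1 = {recall_1:.3f}, Recall@2 = {recall_2:.3f}")
```

A query missing from `results` is treated as a miss rather than an error, so a search backend that returns nothing lowers the score instead of crashing the evaluation.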

B. Mean Reciprocal Rank (MRR)

MRR rewards the system for putting the correct answer as high as possible.

$$\text{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\text{rank}_i}$$

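Where $|Q|$ is the number of queries in the evaluation set and $\text{rank}_i$ is the position of the first relevant document for query $i$. If no relevant document is returned for a query, its reciprocal rank is conventionally taken as 0.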

Intuition: The score for each query decays as 1/rank, so getting the answer at rank 1 (score 1) is twice as good as getting it at rank 2 (score 0.5).
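As a worked example, the MRR formula can be computed as follows. This is a sketch under stated assumptions: the function name `mean_reciprocal_rank` and the ranked lists are hypothetical, each query has a single ideal page, and a query whose ideal page never appears contributes 0 to the sum (a common convention).

```python
def mean_reciprocal_rank(results, ideal):
    """Average of 1/rank of the ideal page over all gold queries.

    results: dict mapping query -> ranked list of page names (best first)
    ideal:   dict mapping query -> the single ideal page for that query
    """
    if not ideal:
        raise ValueError("gold set is empty")
    total = 0.0
    for query, page in ideal.items():
        ranked = results.get(query, [])
        if page in ranked:
            # list.index is 0-based; ranks are 1-based.
            total += 1.0 / (ranked.index(page) + 1)
        # Assumption: a miss adds 0 to the sum.
    return total / len(ideal)


ideal = {
    "Ollama Setup": "OllamaSetup",
    "Running wiki in a container": "WikantikOnDocker",
    "How do I deploy locally?": "JspwikiDeployment",
}
# Hypothetical search output: a hit at rank 1, a hit at rank 2, and a miss.
results = {
    "Ollama Setup": ["OllamaSetup", "LocalModels"],
    "Running wiki in a container": ["ContainerBasics", "WikantikOnDocker"],
    "How do I deploy locally?": ["OllamaSetup", "ContainerBasics"],
}

mrr = mean_reciprocal_rank(results, ideal)
print(mrr)  # (1/1 + 1/2 + 0) / 3 = 0.5
```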

2. Building a "Gold Set"

A retrieval evaluation set (e.g., eval/retrieval-queries.csv) consists of triples: (query, ideal_page, category).

Quality Categories:

Direct: "Ollama Setup" $\rightarrow$ OllamaSetup (tests lexical accuracy).
Synonym Drift: "Running wiki in a container" $\rightarrow$ WikantikOnDocker (tests semantic/vector accuracy).
Indirect/Hard: "How do I deploy locally?" $\rightarrow$ JspwikiDeployment (tests reasoning and conceptual mapping).
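A Gold Set of this shape could be loaded as sketched below. The column order follows the (query, ideal_page, category) triple described above, but the header row and the category labels (`direct`, `synonym`, `indirect`) are assumptions for illustration; the real eval/retrieval-queries.csv may differ. An in-memory sample stands in for the file so the snippet runs on its own.

```python
import csv
import io
from collections import Counter

# Assumed file layout: a header row followed by one triple per line.
SAMPLE_CSV = """query,ideal_page,category
Ollama Setup,OllamaSetup,direct
Running wiki in a container,WikantikOnDocker,synonym
How do I deploy locally?,JspwikiDeployment,indirect
"""

REQUIRED_COLUMNS = ("query", "ideal_page", "category")


def load_gold_set(handle):
    """Read gold-set rows from an open text file or file-like object."""
    reader = csv.DictReader(handle)
    missing = set(REQUIRED_COLUMNS) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"gold set is missing columns: {sorted(missing)}")
    # Keep only the expected columns and trim stray whitespace.
    return [{col: row[col].strip() for col in REQUIRED_COLUMNS} for row in reader]


gold = load_gold_set(io.StringIO(SAMPLE_CSV))
per_category = Counter(row["category"] for row in gold)
print(len(gold), dict(per_category))
```

Counting rows per category is a cheap sanity check: a Gold Set dominated by direct queries would say little about semantic or conceptual retrieval.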

3. The Continuous Evaluation Loop (CI/CD)

Search evaluation should be an automated part of the build pipeline, not a manual task.

Baseline: Run the evaluation set and commit the results (Recall@5, MRR).
Experiment: Modify a weight or add a reranking step.
Validate: Run the evaluation again and compare against the committed baseline. If MRR drops, the "improvement" is a regression.
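The baseline/validate comparison could be automated roughly as follows. This is a hedged sketch, not the actual pipeline: the function name `find_regressions`, the metric keys, the `tolerance` parameter, and all numbers are made up for illustration and are not measurements from any real system.

```python
def find_regressions(baseline, current, tolerance=0.0):
    """Return {metric: (old, new)} for every metric that dropped.

    baseline:  dict of metric name -> committed score
    current:   dict of metric name -> score from the latest run
    tolerance: largest drop that is still accepted (hypothetical knob)

    Raises KeyError if a baseline metric is absent from the current run.
    """
    return {
        name: (old, current[name])
        for name, old in baseline.items()
        if old - current[name] > tolerance
    }


# Illustrative scores only.
baseline = {"recall@5": 0.80, "mrr": 0.62}
current = {"recall@5": 0.82, "mrr": 0.58}

regressions = find_regressions(baseline, current)
for name, (old, new) in regressions.items():
    print(f"REGRESSION {name}: {old:.3f} -> {new:.3f}")

# In a CI job, a non-empty result would typically end the run with a
# non-zero exit status (for example via sys.exit(1)) to fail the build.
```

Here Recall@5 improved while MRR fell, so only `mrr` is reported. A small non-zero `tolerance` can be useful when the Gold Set is small and a single query moving by one rank would otherwise fail the build.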

External Deep Dive:

Evaluation Measures for IR (Wikipedia) — Deep dive into MAP and NDCG.
Mean Reciprocal Rank (Wikipedia) — Formal properties of MRR.

See Also:

Wikantik Search Refinement — How to run the search-eval tool.
Industrial Search Systems — The underlying architecture.