Running the Retrieval Quality Harness

Wikantik already has a retrieval experiment harness; what's pending is
the scheduler that turns it into a nightly CI gate. Until Phase 5
ships, the harness can be invoked manually for spot-checks.

When to use this runbook

Before merging a retrieval-touching change, or when investigating an
anecdotal regression.

Context

The harness lives in `wikantik-knowledge/src/main/java/com/wikantik/knowledge/eval/`
(or its current equivalent). It evaluates a query set against a chosen
retrieval mode, computes nDCG@k, Recall@k, and MRR, and returns
per-query and aggregate scores.
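
For readers unfamiliar with the three metrics, the following is a minimal sketch of how they are commonly computed for a single query. It is not the harness's code: the class `RetrievalMetricsSketch` and its method names are hypothetical, the nDCG variant uses linear gain with a log2 discount (one of several common conventions), and the ranked list is assumed to contain no duplicate document ids.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Illustrative only: hypothetical names, not the harness's actual API.
final class RetrievalMetricsSketch {

    private static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }

    /** Recall@k: fraction of the relevant documents that appear in the top k results. */
    static double recallAtK(List<String> ranked, Set<String> relevant, int k) {
        if (relevant.isEmpty()) {
            return 0.0;
        }
        long hits = ranked.stream().limit(k).distinct().filter(relevant::contains).count();
        return (double) hits / relevant.size();
    }

    /** Reciprocal rank: 1 / (1-based position of the first relevant result), or 0 if none. */
    static double reciprocalRank(List<String> ranked, Set<String> relevant) {
        for (int i = 0; i < ranked.size(); i++) {
            if (relevant.contains(ranked.get(i))) {
                return 1.0 / (i + 1);
            }
        }
        return 0.0;
    }

    /** nDCG@k with linear gain: DCG of the actual ranking divided by DCG of the ideal ranking. */
    static double ndcgAtK(List<String> ranked, Map<String, Integer> grades, int k) {
        double dcg = 0.0;
        int limit = Math.min(k, ranked.size());
        for (int i = 0; i < limit; i++) {
            // Documents missing from the grade map count as non-relevant (gain 0).
            dcg += grades.getOrDefault(ranked.get(i), 0) / log2(i + 2);
        }
        // The ideal ranking places the highest grades first.
        List<Integer> ideal = grades.values().stream()
                .sorted(Comparator.reverseOrder())
                .limit(k)
                .collect(Collectors.toList());
        double idcg = 0.0;
        for (int i = 0; i < ideal.size(); i++) {
            idcg += ideal.get(i) / log2(i + 2);
        }
        return idcg == 0.0 ? 0.0 : dcg / idcg;
    }
}
```

MRR is the mean of `reciprocalRank` over all queries in the set; the aggregate nDCG@k and Recall@k are likewise means of the per-query values.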

Phase 5 of `AgentGradeContentDesign` adds the missing pieces: a
scheduled runner, a database-backed query-set store
(`retrieval_query_sets` / `retrieval_queries` / `retrieval_runs`),
Prometheus gauges, and a `RetrievalQualitySmokeTest` that runs in CI
on every merge.

Walkthrough

The frontmatter `steps` are the canonical procedure for the manual
path. Once Phase 5 lands, the same logic runs nightly without operator
intervention.

Pitfalls

The frontmatter `pitfalls` capture the recurring methodology mistakes.
"One run is authoritative" is the most common: retrieval scores have
real variance across runs, so any single number is suspect.
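
One way to guard against that pitfall is to repeat the run several times and compare means against the observed spread. The sketch below is a crude heuristic under stated assumptions, not a formal significance test and not part of the harness: the class `RunVarianceSketch`, its method names, and the choice of a two-standard-deviation threshold in the usage note are all hypothetical.

```java
// Illustrative only: hypothetical helper, not part of the harness.
final class RunVarianceSketch {

    /** Arithmetic mean of the aggregate scores from repeated runs. */
    static double mean(double[] scores) {
        if (scores.length == 0) {
            throw new IllegalArgumentException("need at least one run");
        }
        double sum = 0.0;
        for (double s : scores) {
            sum += s;
        }
        return sum / scores.length;
    }

    /** Sample standard deviation (n - 1 denominator); undefined for a single run. */
    static double sampleStdDev(double[] scores) {
        if (scores.length < 2) {
            throw new IllegalArgumentException("a single run gives no estimate of variance");
        }
        double m = mean(scores);
        double sumSq = 0.0;
        for (double s : scores) {
            sumSq += (s - m) * (s - m);
        }
        return Math.sqrt(sumSq / (scores.length - 1));
    }

    /**
     * Returns true when the candidate mean differs from the baseline mean by more
     * than `sigmas` baseline standard deviations, in either direction.
     */
    static boolean exceedsNoise(double[] baseline, double[] candidate, double sigmas) {
        double delta = Math.abs(mean(candidate) - mean(baseline));
        return delta > sigmas * sampleStdDev(baseline);
    }
}
```

For example, `exceedsNoise(baselineRuns, candidateRuns, 2.0)` flags a change only when the shift in mean score is larger than twice the run-to-run spread of the baseline; smaller shifts are better treated as noise until more runs are collected.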