Running the Retrieval Quality Harness
Wikantik already has a retrieval experiment harness; what's pending is
the scheduler that turns it into a nightly CI gate. Until Phase 5
ships, the harness can be invoked manually for spot-checks.
When to use this runbook
Before merging a change that touches retrieval, or when investigating a
regression that has only been reported anecdotally.
Context
The harness lives in `wikantik-knowledge/src/main/java/com/wikantik/knowledge/eval/`
(or its current equivalent). It evaluates a query set against a chosen
retrieval mode, computes nDCG@k, Recall@k, and MRR, and returns both
per-query and aggregate scores.
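For readers who want to sanity-check a reported score by hand, the three metrics can be sketched as below. This is a minimal illustration, not the harness's own code: the class name `RetrievalMetricsSketch`, the method names, and the input shape (a ranked list of distinct document ids plus a map of graded relevance judgments) are all assumptions, and the nDCG variant shown (gain `2^grade - 1`, `log2` discount) is one common convention that the harness may or may not follow. MRR is the mean of the per-query reciprocal rank across the query set.

```java
import java.util.List;
import java.util.Map;

// Illustrative only: names and input shapes are hypothetical, not the harness API.
final class RetrievalMetricsSketch {

    private static double log2(double x) {
        return Math.log(x) / Math.log(2);
    }

    /** Recall@k: share of all relevant documents (grade > 0) found in the top k. */
    public static double recallAtK(List<String> ranked, Map<String, Integer> grades, int k) {
        long totalRelevant = grades.values().stream().filter(g -> g > 0).count();
        if (totalRelevant == 0) {
            return 0.0;
        }
        long hits = ranked.stream().limit(k)
                .filter(id -> grades.getOrDefault(id, 0) > 0)
                .count();
        return (double) hits / totalRelevant;
    }

    /** Reciprocal rank: 1 / (1-based rank of the first relevant result), or 0 if none. */
    public static double reciprocalRank(List<String> ranked, Map<String, Integer> grades) {
        for (int i = 0; i < ranked.size(); i++) {
            if (grades.getOrDefault(ranked.get(i), 0) > 0) {
                return 1.0 / (i + 1);
            }
        }
        return 0.0;
    }

    /** nDCG@k: DCG of the actual ranking divided by the DCG of the ideal ranking. */
    public static double ndcgAtK(List<String> ranked, Map<String, Integer> grades, int k) {
        double dcg = 0.0;
        for (int i = 0; i < Math.min(k, ranked.size()); i++) {
            int grade = grades.getOrDefault(ranked.get(i), 0);
            dcg += (Math.pow(2, grade) - 1) / log2(i + 2);
        }
        // Ideal ranking: all judged grades sorted from highest to lowest.
        int[] ascending = grades.values().stream().mapToInt(Integer::intValue).sorted().toArray();
        double idcg = 0.0;
        for (int i = 0; i < Math.min(k, ascending.length); i++) {
            int grade = ascending[ascending.length - 1 - i];
            idcg += (Math.pow(2, grade) - 1) / log2(i + 2);
        }
        return idcg == 0.0 ? 0.0 : dcg / idcg;
    }

    public static void main(String[] args) {
        List<String> ranked = List.of("a", "b", "c");
        Map<String, Integer> grades = Map.of("b", 1, "d", 1);
        System.out.printf("Recall@3=%.3f RR=%.3f nDCG@3=%.3f%n",
                recallAtK(ranked, grades, 3),
                reciprocalRank(ranked, grades),
                ndcgAtK(ranked, grades, 3));
    }
}
```

The aggregate score for a query set is then the mean of each per-query value, which is why a handful of hard queries can move the aggregate noticeably.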
Phase 5 of `AgentGradeContentDesign` adds the missing pieces: a
scheduled runner, a database-backed query-set store
(`retrieval_query_sets` / `retrieval_queries` / `retrieval_runs`),
Prometheus gauges, and a `RetrievalQualitySmokeTest` that runs in CI
on every merge.
Walkthrough
The frontmatter `steps` are the canonical procedure for the manual
path. Once Phase 5 lands, the same logic runs nightly without operator
intervention.
Pitfalls
The frontmatter `pitfalls` capture the recurring methodology mistakes.
"One run is authoritative" is the most common: retrieval scores vary
noticeably from run to run, so any single number is suspect.
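One way to avoid treating a single run as authoritative is to repeat the baseline and the candidate several times and compare the gap in mean score against the run-to-run noise. The sketch below is a rough screen under stated assumptions, not part of the harness: the class `RunVarianceSketch`, its method names, and the `sigmas` threshold are hypothetical, and the comparison uses a Welch-style standard error of the difference rather than a formal significance test.

```java
// Illustrative only: a heuristic screen for regressions, not the harness's method.
final class RunVarianceSketch {

    public static double mean(double[] xs) {
        double sum = 0.0;
        for (double x : xs) {
            sum += x;
        }
        return sum / xs.length;
    }

    /** Sample standard deviation (n - 1 denominator); needs at least two runs. */
    public static double sampleStdDev(double[] xs) {
        if (xs.length < 2) {
            throw new IllegalArgumentException("need at least two runs to estimate variance");
        }
        double m = mean(xs);
        double sumSq = 0.0;
        for (double x : xs) {
            sumSq += (x - m) * (x - m);
        }
        return Math.sqrt(sumSq / (xs.length - 1));
    }

    /**
     * Flags a likely regression when the candidate's mean score falls below the
     * baseline's mean by more than `sigmas` standard errors of the difference.
     */
    public static boolean looksLikeRegression(double[] baseline, double[] candidate, double sigmas) {
        double sdB = sampleStdDev(baseline);
        double sdC = sampleStdDev(candidate);
        double stdErr = Math.sqrt(sdB * sdB / baseline.length + sdC * sdC / candidate.length);
        double drop = mean(baseline) - mean(candidate);
        return drop > sigmas * stdErr;
    }

    public static void main(String[] args) {
        double[] baseline = {0.70, 0.72, 0.71};   // e.g. nDCG@10 from three baseline runs
        double[] candidate = {0.60, 0.61, 0.62};  // the same metric from three candidate runs
        System.out.println(looksLikeRegression(baseline, candidate, 2.0));
    }
}
```

With only a few runs per side the standard deviation estimate is itself noisy, so a flag from this check is a prompt to rerun and inspect per-query scores, not a verdict.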