Running the Retrieval Quality Harness

Wikantik already has a retrieval experiment harness; what's pending is the scheduler that turns it into a nightly CI gate. Until Phase 5 of AgentGradeContentDesign ships, the harness can be invoked manually for spot-checks.

When to use this runbook

Use this runbook before merging a change that touches retrieval, or when investigating an anecdotally reported regression.

Context

The harness lives in wikantik-knowledge/src/main/java/com/wikantik/knowledge/eval/ (or its current equivalent). It evaluates a query set against a chosen retrieval mode, computes nDCG@k, Recall@k, and MRR, and returns both per-query and aggregate scores.
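For readers who want to sanity-check the numbers the harness reports, the three metrics can be sketched as follows. This is a minimal illustration, not the harness's code: the class and method names are hypothetical, relevance is assumed to be binary (a document is either relevant or not), and the ranked list is assumed to contain no duplicate document IDs. The harness itself may use graded relevance or a different gain function.

```java
import java.util.List;
import java.util.Set;

// Illustrative sketch only: names are hypothetical, not the harness's API.
// Assumes binary relevance and a ranked list without duplicate document IDs.
class RetrievalMetricsSketch {

    /** Recall@k: fraction of the relevant documents found in the top k results. */
    static double recallAtK(List<String> ranked, Set<String> relevant, int k) {
        if (relevant.isEmpty()) return 0.0;
        long hits = ranked.stream().limit(k).filter(relevant::contains).count();
        return (double) hits / relevant.size();
    }

    /** Reciprocal rank: 1 / position of the first relevant result, 0 if none. */
    static double reciprocalRank(List<String> ranked, Set<String> relevant) {
        for (int i = 0; i < ranked.size(); i++) {
            if (relevant.contains(ranked.get(i))) return 1.0 / (i + 1);
        }
        return 0.0;
    }

    /** MRR: mean of the per-query reciprocal ranks. */
    static double meanReciprocalRank(List<Double> reciprocalRanks) {
        return reciprocalRanks.stream()
                .mapToDouble(Double::doubleValue)
                .average()
                .orElse(0.0);
    }

    /** nDCG@k with binary gains: DCG of this ranking divided by the ideal DCG. */
    static double ndcgAtK(List<String> ranked, Set<String> relevant, int k) {
        double dcg = 0.0;
        int limit = Math.min(k, ranked.size());
        for (int i = 0; i < limit; i++) {
            // The result at 0-based index i has rank i + 1 and discount log2(rank + 1).
            if (relevant.contains(ranked.get(i))) dcg += 1.0 / log2(i + 2);
        }
        // Ideal ranking: every relevant document placed first, up to k of them.
        double ideal = 0.0;
        int idealHits = Math.min(k, relevant.size());
        for (int i = 0; i < idealHits; i++) ideal += 1.0 / log2(i + 2);
        return ideal == 0.0 ? 0.0 : dcg / ideal;
    }

    private static double log2(double x) {
        return Math.log(x) / Math.log(2);
    }
}
```

The per-query values correspond to the per-query scores the harness returns; averaging them over the query set gives the aggregate.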

Phase 5 of AgentGradeContentDesign adds the missing pieces: a scheduled runner, a database-backed query-set store (retrieval_query_sets / retrieval_queries / retrieval_runs), Prometheus gauges, and a RetrievalQualitySmokeTest that runs in CI on every merge.

Walkthrough

The frontmatter steps are the canonical procedure for the manual path. Once Phase 5 lands, the same logic will run nightly without operator intervention.

Pitfalls

The frontmatter pitfalls capture the recurring methodology mistakes. The most common is "One run is authoritative": retrieval scores have real variance across runs, so any single number is suspect.
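One way to guard against the single-run mistake is to repeat the run several times for both the baseline and the candidate, then check whether the difference in means is larger than the run-to-run noise. The sketch below is a hypothetical helper, not part of the harness; the noise band (a multiple of the baseline's sample standard deviation) is an illustrative heuristic, and the right threshold for a given query set is a judgment call.

```java
import java.util.List;

// Hypothetical helper, not part of the harness. Compares repeated aggregate
// scores (for example nDCG@k) from a baseline and a candidate build.
class RunVarianceSketch {

    static double mean(List<Double> scores) {
        return scores.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
    }

    /** Sample standard deviation (n - 1 denominator); 0 for fewer than two runs. */
    static double sampleStdDev(List<Double> scores) {
        if (scores.size() < 2) return 0.0;
        double m = mean(scores);
        double sumSq = 0.0;
        for (double s : scores) sumSq += (s - m) * (s - m);
        return Math.sqrt(sumSq / (scores.size() - 1));
    }

    /**
     * True when the candidate's mean score falls below the baseline's mean by
     * more than `sigmas` baseline standard deviations. Refuses to judge from a
     * single baseline run, because no noise estimate exists in that case.
     */
    static boolean dropExceedsNoise(List<Double> baseline, List<Double> candidate, double sigmas) {
        if (baseline.size() < 2) {
            throw new IllegalArgumentException("need at least two baseline runs to estimate noise");
        }
        double drop = mean(baseline) - mean(candidate);
        return drop > sigmas * sampleStdDev(baseline);
    }
}
```

For example, with baseline runs of 0.70, 0.72, and 0.71, the standard deviation is about 0.01, so a candidate averaging 0.705 is within a two-sigma band while one averaging 0.61 is not.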