Wikantik Search Refinement
Retrieval quality — "did search put the right page in the top 5?" — is measured here by a seed set of `(query, ideal_page)` pairs run through the live wiki via `bin/search-eval`, reported as `recall@5`, `recall@20`, and `MRR`. Improvements (synonym handling, dense embeddings, rerankers) are validated against this set; without it, every future retrieval change is a vibe check.
The tool talks to whatever wiki you point it at — local dev, staging, production. It tests the *deployed* retriever, not an isolated in-process copy, so it picks up ACL filtering, index state, middleware, and any retrieval change that lands without needing code changes to the tool itself.
The three locations
- **Source of truth — edit here:**
`eval/retrieval-queries.csv`
Three columns: `query, ideal_page, notes`. Add, remove, or remap queries in your editor. The `notes` column carries a category label (`synonym-drift`, `indirect`, `multi-word-concept`, `specific`, `general`, `business-process`, `hard`) so the per-category breakdown in the report is meaningful.
- **Rolling output — regenerated each run:**
Whatever you pass to `--out`, or just stdout. Not committed; throwaway.
- **Frozen checkpoint — commit when it matters:**
`docs/superpowers/specs/2026-04-17-retrieval-eval-baseline.md`
A snapshot of the report at a known point in time. Update it when the query set is finalized, a new retriever lands, or you want a before-and-after reference to cite.
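As a rough illustration of the CSV shape described above, the sketch below parses a few rows and counts queries per category. It assumes the file starts with a `query,ideal_page,notes` header row; the helper names and the `Onboarding` / `Kubernetes` page names are made up for the example and are not taken from `bin/search-eval`.

```python
import csv
import io
from collections import Counter

# Category labels the doc lists for the `notes` column.
KNOWN_CATEGORIES = {
    "synonym-drift", "indirect", "multi-word-concept",
    "specific", "general", "business-process", "hard",
}

def load_queries(text):
    """Parse CSV text with a `query,ideal_page,notes` header into dict rows."""
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    rows = []
    for row in reader:
        # Strip stray whitespace; ignore any unnamed extra columns.
        rows.append({key.strip(): (value or "").strip()
                     for key, value in row.items() if key})
    return rows

def category_counts(rows):
    """Count queries per `notes` label; unknown labels are grouped under '?'."""
    return Counter(r["notes"] if r["notes"] in KNOWN_CATEGORIES else "?" for r in rows)

# Hypothetical sample rows (only AuthenticationManager appears in this doc).
sample = """query,ideal_page,notes
auth config,AuthenticationManager,synonym-drift
how do I get started as a new hire,Onboarding,indirect
k8s,Kubernetes,hard
"""
rows = load_queries(sample)
print(category_counts(rows))
```

A check like this is handy before a run: a typo in `notes` shows up as a `?` bucket instead of silently creating a new category in the report.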
Running the tool
Prerequisite: a wiki running somewhere you can reach. Local dev is easiest:
```bash
tomcat/tomcat-11/bin/startup.sh
```
Then:
```bash
# Default: localhost:8080, test.properties for credentials, all queries
bin/search-eval

# Write the report to a file while also printing it
bin/search-eval --out eval/reports/2026-04-18.txt

# Point at staging, anonymous user
bin/search-eval --base-url https://staging.example.com --anonymous

# JSON output, for diffing runs programmatically
bin/search-eval --json --out run.json

# Quieter: summary + missed-query list only, no full per-query table
bin/search-eval --quiet

# Full help
bin/search-eval --help
```
No Python dependencies — standard library only. Just `python3`.
**Exit codes:** `0` clean run, `1` at least one query produced an HTTP/network error (report enumerates them), `2` configuration error (missing queries file, unreachable endpoint).
The typical loop
1. Edit `eval/retrieval-queries.csv`. Fix a bad `ideal_page`, add a query your team actually asks, drop one that turned out to have no good answer in the corpus.
2. Run `bin/search-eval --out eval/reports/try-N.txt`.
3. Read the report. Compare the summary line (recall@5, recall@20, MRR) and the missed-query list against the baseline doc.
4. Decide per query:
   - If a query moved from rank 193 to rank 4: curation win.
   - If recall@5 dropped: is the query actually harder than you thought, is the `ideal_page` wrong, or did retrieval regress?
   - If a query always misses despite clearly having a good page: worth a bug.
5. Commit the CSV when satisfied. Update the baseline doc at natural milestones: end of a curation session, before swapping BM25 for hybrid retrieval, after a reranker lands.
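Step 3 of the loop, comparing a run against the baseline, can be sketched as a diff over per-query ranks. The `--json` schema is not described here, so this example assumes you have already reduced each run to a plain `{query: rank}` map; that shape, the function name, and the sample ranks are all illustrative.

```python
def rank_changes(before, after):
    """Compare two {query: rank} maps; a rank <= 0 means missing or error.

    Returns (improved, regressed) lists of (query, old_rank, new_rank).
    """
    def sort_key(rank):
        # Treat a miss (-1) or an error (-2) as worse than any real rank.
        return rank if rank > 0 else float("inf")

    improved, regressed = [], []
    for query in sorted(before.keys() & after.keys()):
        old, new = before[query], after[query]
        if sort_key(new) < sort_key(old):
            improved.append((query, old, new))
        elif sort_key(new) > sort_key(old):
            regressed.append((query, old, new))
    return improved, regressed

# Hypothetical ranks for three queries, before and after a curation pass.
baseline = {"auth config": 193, "k8s": -1, "local LLM hardware": 3}
latest = {"auth config": 4, "k8s": -1, "local LLM hardware": 7}
improved, regressed = rank_changes(baseline, latest)
print("improved:", improved)
print("regressed:", regressed)
```

Queries that miss in both runs appear in neither list, which keeps the output focused on what actually moved.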
What the tool does
`bin/search-eval` is a standalone Python 3 script under `bin/`:
1. Loads `eval/retrieval-queries.csv` (or whatever `--queries` points at).
2. Reads credentials from `test.properties` at the repo root, or `--user/--password`, or skips auth with `--anonymous`.
3. For each query, issues `GET /api/search?q=<query>&limit=<N>` against the configured base URL.
4. Finds the 1-based rank of the `ideal_page` in the returned results (or `-1` if missing, `-2` on HTTP error).
5. Computes `recall@5`, `recall@20`, MRR, and per-category breakdowns.
6. Writes a formatted text report (default) or JSON (`--json`) to stdout, plus optionally to a file.
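Steps 4 and 5 above reduce to a few lines of arithmetic. The following is a minimal sketch, not the tool's actual code: it assumes search results arrive as an ordered list of page names, and that misses (`-1`) and errors (`-2`) both count in the denominator and contribute zero. The real script may treat errors differently.

```python
def rank_of(ideal_page, results):
    """1-based rank of ideal_page in results, or -1 if it is absent."""
    for position, page in enumerate(results, start=1):
        if page == ideal_page:
            return position
    return -1

def recall_at(ranks, k):
    """Fraction of queries whose ideal page ranked in the top k."""
    return sum(1 for r in ranks if 1 <= r <= k) / len(ranks)

def mean_reciprocal_rank(ranks):
    """Mean of 1/rank; misses (-1) and errors (-2) contribute 0."""
    return sum(1.0 / r for r in ranks if r >= 1) / len(ranks)

# Four hypothetical queries: ranks 1, 4, 12, and one miss.
ranks = [1, 4, 12, -1]
print(recall_at(ranks, 5), recall_at(ranks, 20), round(mean_reciprocal_rank(ranks), 3))
```

With these four ranks, two of four land in the top 5, three of four in the top 20, and the reciprocal ranks average to one third.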
No threshold assertions — the tool is informational. Wire it into CI as a separate nightly or pre-deploy job when you're ready to alert on regressions.
Query-writing heuristics
The value of the eval set is in query *variety*, not volume. Include:
- **Synonym drift.** Words in the query don't literally appear in the title. "Auth config" → a page titled `AuthenticationManager`. If every query is a title keyword hunt, the eval will never surface the problem that dense embeddings fix.
- **Indirect phrasing.** Natural how-to language instead of title-ish language. "How do I get started as a new hire" instead of "onboarding."
- **Multi-word concepts.** Three-to-five-word technical concepts the page discusses without necessarily title-matching.
- **Specific vs. general.** Both "Gemma 4 VRAM budget" and "local LLM hardware" should surface reasonable answers.
- **Business entities.** Named clients, products, internal codes — the queries that motivate the consulting-wiki use case.
- **Hard cases.** Queries your team has seen the LLM fumble. Abbreviations ("k8s"), short common terms ("ai"), or ambiguous phrasings.
Anti-patterns:
- **Title-copy queries.** "WikantikDevelopment" as a query against `WikantikDevelopment` is trivially passing.
- **Queries with multiple equally right answers.** The tool scores only one `ideal_page` per query, so these get forced into a single-answer framing and a perfectly good result counts as a miss.
- **Queries with no good answer in the corpus.** Ungradable; drop them.
Interpreting the per-category breakdown
The report groups recall by the `notes` label. Categories tell you *where* retrieval is weak, not just how weak overall:
- `specific` near 1.0: keyword BM25 is doing its job on literal terms.
- `multi-word-concept` near 1.0: probably too many title-ish queries — add drift.
- `synonym-drift` and `indirect` low: the expected gap that dense embeddings should close.
- `business-process` low: the wiki hasn't built much terminological structure around these topics yet, or the queries need rewriting.
- `hard` stubbornly low: these are the intentionally-hard cases; don't over-optimize for them at the expense of the easier categories.
Track the categories over time. A retrieval change that lifts `specific` from 0.9 to 1.0 while dragging `synonym-drift` from 0.4 to 0.2 is a regression worth rolling back.
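The per-category view is the same recall calculation grouped by label. A small sketch, assuming each evaluated query has been reduced to a `(category, rank)` pair (the function name and sample rows are illustrative):

```python
from collections import defaultdict

def recall_by_category(rows, k=5):
    """rows: iterable of (category, rank); returns {category: recall@k}."""
    hits, totals = defaultdict(int), defaultdict(int)
    for category, rank in rows:
        totals[category] += 1
        if 1 <= rank <= k:
            hits[category] += 1
    return {category: hits[category] / totals[category] for category in totals}

# Hypothetical ranks; -1 marks a query whose ideal page was not returned.
rows = [
    ("specific", 1), ("specific", 2),
    ("synonym-drift", 9), ("synonym-drift", 3),
    ("hard", -1),
]
print(recall_by_category(rows))
```

Keeping these dictionaries from successive runs side by side is enough to spot the kind of trade-off described above, where one category improves at another's expense.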
Next step: hybrid BM25 + dense retrieval
The baseline report makes the gap visible: BM25 is already near-ceiling on `specific` and `multi-word-concept` (recall@5 ≈ 0.80, recall@20 = 1.00) and weak on `indirect` (0.50), `general` (0.20), and `business-process` (0.40). The standard move here is a second retriever that reads meaning rather than surface tokens, fused with BM25 by Reciprocal Rank Fusion (RRF). RRF scores each page by summing 1/(k + rank) over the two result lists, so it needs no score calibration and no threshold; the only knob is the constant k, conventionally left at 60. A page that either retriever ranks near the top lands near the top of the fused list; queries where both fail stay failed and surface in the eval as genuine gaps.
Mechanically: an embedding model turns each chunk (and each query) into a vector; pgvector stores the chunk vectors with an HNSW index; retrieval returns BM25 top-K and vector top-K, then RRF combines them. The existing `ChunkInspector` admin tab already shows what a chunk looks like, so the chunking step is not new work.
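For reference, the fusion step itself is short. This is a generic RRF sketch rather than Wikantik code: the function name, the `top_n` parameter, and the sample page names are illustrative, and `k=60` is the conventional constant rather than anything tuned for this corpus.

```python
def rrf_fuse(rankings, k=60, top_n=10):
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    # Highest score first; ties broken by name so the output is deterministic.
    ordered = sorted(scores, key=lambda d: (-scores[d], d))
    return ordered[:top_n]

bm25 = ["PageA", "PageB", "PageC"]    # lexical top-K, best first
dense = ["PageC", "PageA", "PageD"]   # vector top-K, best first
print(rrf_fuse([bm25, dense]))
```

Pages that appear in both lists (`PageA`, `PageC`) outrank pages that appear in only one, which is the behavior the hybrid design relies on.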
Picking an embedding model
Three questions decide the model: **deployment target** (GPU available or CPU-only), **quality floor** (how much recall lift you need), and **latency budget** (query-time embedding adds to every search).
**Quality tiers, open-source, as of 2026:**
| Tier | Model | Params | Dim | Notes |
|------|-------|--------|-----|-------|
| Top | Qwen3-Embedding-8B | 8 B | ≤ 4096 | SOTA open, multilingual + code |
| Strong | Qwen3-Embedding-4B | 4 B | ≤ 2560 | GPU-class; int8 makes it CPU-viable with patience |
| Balanced | bge-m3 | 568 M | 1024 | Dense + sparse + ColBERT in one model |
| Balanced | Qwen3-Embedding-0.6B | 600 M | up to 1024 | Small-but-strong, code-aware |
| Balanced | nomic-embed-text-v1.5 | 137 M | 768 (Matryoshka 64–768) | Fully open, 8K context |
| Light | bge-small-en-v1.5 | 33 M | 384 | Small, fast, solid English |
| Light | mxbai-embed-large-v1 | 335 M | 1024 | Good English baseline |
Configuration points that apply to all of them:
1. **Prompt prefixes matter.** nomic uses `search_document: …` / `search_query: …`. Qwen3 uses instruction prompts per the model card. bge-m3 is designed to work without a prefix, so do not add one unless the eval shows it helps. Wrong or missing prefixes silently tank recall; verify with two or three known-good queries after any model change.
2. **Normalize vectors and use cosine distance** (`<=>` in pgvector). All the models above are trained with cosine; using `<->` (L2) on un-normalized vectors is a common silent bug.
3. **Index**: `hnsw` over `ivfflat` at this corpus size (~1K pages, few-K chunks). Start `m = 16, ef_construction = 64`; tune `ef_search` per query for the recall/latency tradeoff. Rebuild the index only when the embedding model changes.
4. **Matryoshka truncation** (nomic, Qwen3) lets you store full-dim and query-truncate for a real speedup with small recall cost — apply only after measuring the default.
5. **Quantization**: fp16 is the default; int8 roughly halves memory and doubles CPU throughput for < 0.5 % MTEB loss. Use it on CPU; usually unnecessary on GPU.
6. **Chunk size**: 512 tokens with 64-token overlap is a good default; raise to 1024 for long design docs. The chunker already exists; `ChunkInspector` is the debugging tool for this.
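Points 2 and 4 are easy to get subtly wrong, so here is a small pure-Python illustration of both. It is a toy sketch with two-dimensional vectors, not production code. The helper names are illustrative, and the two distance functions compute what pgvector's `<=>` and `<->` operators return.

```python
import math

def l2_normalize(vec):
    """Scale vec to unit length (the zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def cosine_distance(a, b):
    """1 - cosine similarity, the quantity pgvector's <=> returns."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def l2_distance(a, b):
    """Euclidean distance, the quantity pgvector's <-> returns."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def truncate(vec, dim):
    """Matryoshka-style truncation: keep the first dim values, re-normalize."""
    return l2_normalize(vec[:dim])

query = [1.0, 0.0]
doc_a = [3.0, 0.0]   # same direction as the query, but not unit length
doc_b = [0.8, 0.6]   # unit length, different direction

# Cosine prefers doc_a; L2 on the raw vectors wrongly prefers doc_b.
print(cosine_distance(query, doc_a), cosine_distance(query, doc_b))
print(l2_distance(query, doc_a), l2_distance(query, doc_b))
# After normalization the L2 ordering agrees with cosine again.
print(l2_distance(query, l2_normalize(doc_a)), l2_distance(query, l2_normalize(doc_b)))
```

The middle line is the silent bug from point 2: nothing errors, the ranking is just wrong. `truncate` shows why a truncated Matryoshka vector needs re-normalizing before it is compared.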
Deployment target A: GPU box (reference: RTX 4060 Ti 16 GB)
Fits comfortably: Qwen3-Embedding-0.6B, bge-m3, nomic, any light-tier model. Fits tightly: Qwen3-Embedding-4B (fp16 ≈ 8 GB, int8 ≈ 4 GB). Does not fit with any serious local LLM also resident: Qwen3-Embedding-8B or NV-Embed-v2.
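The fit estimates follow from a back-of-the-envelope rule: weight memory is roughly parameter count times bytes per parameter. The sketch below covers weights only; activations, the tokenizer, and runtime overhead come on top, so treat the results as lower bounds. The function name is illustrative.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0}

def weight_gb(params_billion, dtype):
    """Approximate weight memory in GB (1 GB = 1e9 bytes), weights only."""
    return params_billion * BYTES_PER_PARAM[dtype]

models = [
    ("Qwen3-Embedding-0.6B", 0.6),
    ("Qwen3-Embedding-4B", 4.0),
    ("Qwen3-Embedding-8B", 8.0),
]
for name, params in models:
    print(name, {dtype: weight_gb(params, dtype) for dtype in ("fp16", "int8")})
```

At fp16 the 8B model's weights alone take about 16 GB, which is the whole card; that is why it is ruled out once anything else needs to be resident.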
Recommended default for a dev-wiki use case: **Qwen3-Embedding-0.6B, fp16, normalized, cosine, HNSW**. Code-aware, strong benchmarks, plenty of VRAM headroom if an LLM lands on the same card later. Switch to bge-m3 (similar size) if you want the built-in sparse head, which could eventually replace Lucene BM25 and collapse the hybrid stack into a single model. Move up to Qwen3-4B only if the eval shows a persistent quality ceiling on `indirect` / `business-process`.
Expected query-embed latency (single query, fp16):
| Model | VRAM | Query latency |
|-------|------|---------------|
| nomic | ~0.3 GB | 3–5 ms |
| bge-m3 | ~1.2 GB | 10–20 ms |
| Qwen3-0.6B | ~1.5 GB | 15–30 ms |
| Qwen3-4B | ~8 GB (fp16) | 60–120 ms |
Deployment target B: CPU-only box
This is the interesting case — customer-site deployments, ops boxes without a GPU, or a dedicated mini-PC. The reference we're planning against:
**NiPoGi AM06 PRO** — AMD Ryzen 7 7730U (Zen 3, 8C/16T, AVX2 + FMA3, **no AVX-512**, **no AMD NPU**), 32 GB RAM, 512 GB M.2 SSD, integrated Vega 8 iGPU (not useful for ML — ROCm on Barcelo is a dead end), dual GbE, configurable cTDP 10–25 W. Roughly 60–70 % of an Intel AVX-512/VNNI box of similar core count on int8 embedding workloads — slower than a discrete GPU by a factor of ~10–30×, but fast enough for a ~1K-page wiki.
Top picks for CPU-only, ordered by the usual tradeoff:
| Model | Params | Dim | CPU latency/query (Zen 3, int8) | Notes |
|-------|--------|-----|--------------------------------|-------|
| bge-small-en-v1.5 | 33 M | 384 | 10–20 ms | Tiny, fast, English-strong |
| nomic-embed-text-v1.5 | 137 M | 768 | 30–80 ms | 8K context, Matryoshka, fully open |
| bge-m3 | 568 M | 1024 | 50–90 ms | Unified dense + sparse + ColBERT |
| Qwen3-Embedding-0.6B | 600 M | up to 1024 | 80–200 ms | Upper-bound CPU quality |
Recommended default for this box: **bge-m3, int8 ONNX, served by Hugging Face `text-embeddings-inference` (TEI)**. Multi-functional dense+sparse in one model, production-grade HTTP server with batching and concurrent request handling, built-in Prometheus metrics, OpenAI-compatible `/v1/embeddings` endpoint, stable Docker image.
Setup sketch (Ubuntu Server 24.04, Docker, bge-m3 int8):
```bash
# BIOS: set cTDP to 25 W if thermals hold, 20 W otherwise.
# Verify with `stress-ng --cpu 16 --timeout 600s` while watching `sensors`.

mkdir -p ~/tei/data
cat > ~/tei/docker-compose.yml <<'YAML'
services:
  tei:
    image: ghcr.io/huggingface/text-embeddings-inference:cpu-1.6
    restart: unless-stopped
    ports: ["8001:80"]
    volumes: ["./data:/data"]
    environment:
      OMP_NUM_THREADS: "8"   # physical cores, not logical
      RUST_LOG: "info"
    # float16 targets GPU backends; float32 is the safe choice on the CPU image.
    command: >
      --model-id BAAI/bge-m3
      --dtype float32
      --max-batch-tokens 16384
      --max-concurrent-requests 64
      --pooling cls
YAML

docker compose -f ~/tei/docker-compose.yml up -d

# Smoke test: embed one string and print the vector length.
curl -s localhost:8001/embed \
  -H 'content-type: application/json' \
  -d '{"inputs": ["hello wikantik"]}' | jq '.[0] | length'
# Expect 1024.
```
Pre-export int8 once if the image build doesn't auto-quantize:
```bash
optimum-cli export onnx --model BAAI/bge-m3 \
  --task feature-extraction --device cpu ./data/bge-m3-onnx
optimum-cli onnxruntime quantize --onnx_model ./data/bge-m3-onnx \
  --avx2 -o ./data/bge-m3-onnx-int8
# Then change --model-id to /data/bge-m3-onnx-int8.
```
Expected throughput on this box:
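- **Full reindex** (~1K pages × ~5 chunks = ~5K chunks): ~40–60 s if batching reaches roughly 100 chunks/s; at the ~60 ms per chunk implied by the incremental figure it would be closer to 5 minutes. Cold, one-shot, and rare either way, so measure it once on the real box.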
- **Incremental on page save** (~5 chunks): ~300 ms — synchronous on save is fine.
- **Query embed**: ~60–90 ms + pgvector HNSW ~5 ms + transport → end-to-end retrieval stays under ~150 ms.
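The arithmetic behind these estimates can be checked directly. In the sketch below the corpus shape and the embed and HNSW timings come from the figures above; the 10–20 ms transport range is an assumption for a LAN or tunnel hop, and the function names are illustrative.

```python
PAGES, CHUNKS_PER_PAGE = 1000, 5            # corpus shape quoted above
total_chunks = PAGES * CHUNKS_PER_PAGE      # about 5K chunks

def implied_chunks_per_second(chunks, seconds):
    """Throughput the embedder must sustain to finish `chunks` in `seconds`."""
    return chunks / seconds

def end_to_end_ms(embed_ms, hnsw_ms, transport_ms):
    """Query-time retrieval latency: embed + vector search + network hop."""
    return embed_ms + hnsw_ms + transport_ms

# A 40-60 s full reindex implies this sustained throughput range (chunks/s).
print(implied_chunks_per_second(total_chunks, 60), implied_chunks_per_second(total_chunks, 40))
# Best and worst case per query, with an ASSUMED 10-20 ms of transport.
print(end_to_end_ms(60, 5, 10), end_to_end_ms(90, 5, 20))
```

The per-query budget stays well under 150 ms even at the slow end, which leaves room for the BM25 leg and the fusion step to run alongside it.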
Wikantik integration shape
The wiki and PostgreSQL stay where they live today. Only the embedding transform moves to the GPU or mini-PC box. Concretely:
1. **Schema migration** adds an `embeddings` table keyed on `(page, chunk_id)` with a `vector(dim)` column and an `hnsw` index. Empty table, reversible.
2. **`EmbeddingClient`** in `wikantik-main` — small HTTP client that `POST`s to TEI's `/embed` (or local Ollama / whatever backend), with a connection pool, timeout, and retry. No heavyweight SDK needed.
3. **Indexer hook** on the page-save pipeline: chunk the page, embed each chunk, upsert into `embeddings`. One-shot backfill script for the existing corpus.
4. **Retrieval path**: `SearchResource` grows a hybrid branch — BM25 top-K (existing Lucene path) + vector top-K (pgvector) fused by RRF.
5. **Feature flag** `wikantik.search.hybrid.enabled`. When off or the embedding service is unreachable, fall back to pure BM25. The flag decides whether the vector path runs; BM25 is always wired up as the safety net.
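The control flow in steps 4 and 5 can be sketched independently of the Java implementation. The example below is written in Python for brevity and is only an outline of the intended behavior: `hybrid_search`, the stub retrievers, and the choice to treat any `OSError` (connection refused, timeout) as "service unreachable" are all assumptions, not existing Wikantik code.

```python
def hybrid_search(query, bm25_search, vector_search, hybrid_enabled, top_k=10, rrf_k=60):
    """Return RRF-fused results, or plain BM25 when the flag is off or dense fails."""
    lexical = bm25_search(query, top_k)      # BM25 always runs: it is the safety net
    if not hybrid_enabled:
        return lexical
    try:
        dense = vector_search(query, top_k)
    except OSError:
        # Embedding service unreachable or timed out: fall back to pure BM25.
        return lexical
    scores = {}
    for ranking in (lexical, dense):
        for rank, page in enumerate(ranking, start=1):
            scores[page] = scores.get(page, 0.0) + 1.0 / (rrf_k + rank)
    return sorted(scores, key=lambda p: (-scores[p], p))[:top_k]

# Stub retrievers standing in for the Lucene and pgvector paths.
def fake_bm25(query, top_k):
    return ["PageA", "PageB"][:top_k]

def fake_dense(query, top_k):
    return ["PageC", "PageA"][:top_k]

def broken_dense(query, top_k):
    raise ConnectionError("embedding service unreachable")

print(hybrid_search("auth config", fake_bm25, fake_dense, hybrid_enabled=True))
print(hybrid_search("auth config", fake_bm25, broken_dense, hybrid_enabled=True))
print(hybrid_search("auth config", fake_bm25, fake_dense, hybrid_enabled=False))
```

Running BM25 first and unconditionally means a failure in the vector path costs latency but never results.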
Security and ops for the embedding service
- **Auth**: pass `--api-key` to TEI, put the shared secret in `wikantik-custom.properties`, send `Authorization: Bearer …` from `EmbeddingClient`.
- **Network**: WireGuard or Tailscale tunnel between the wiki host and the embedding box; don't expose TEI on the LAN without auth.
- **Metrics**: TEI exposes `/metrics` in Prometheus format — scrape it from the `wikantik-observability` stack. Alert on p95 request latency and container restarts.
- **Updates**: pin the TEI image tag in `docker-compose.yml`; `docker compose pull && up -d` for upgrades.
- **Thermal sanity**: small mini-PCs throttle under sustained load. Verify with `stress-ng` for 10+ minutes and watch `sensors` before committing a cTDP setting.
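To make the auth bullet concrete, here is roughly what an authenticated call to TEI's `/embed` looks like on the wire. The sketch only builds the request without sending it. The host address and secret are placeholders, the function name is illustrative, and the real `EmbeddingClient` would be Java code reading the secret from `wikantik-custom.properties`.

```python
import json
import urllib.request

def build_embed_request(base_url, texts, api_key):
    """Build (but do not send) a POST to TEI's /embed with a bearer token."""
    body = json.dumps({"inputs": texts}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/embed",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )

# Placeholder tunnel address and secret; substitute your own.
request = build_embed_request("http://10.0.0.2:8001", ["hello wikantik"], "example-secret")
print(request.full_url, request.get_method())
# Sending it would look like: urllib.request.urlopen(request, timeout=2.0)
```

Always pass an explicit timeout when sending, so a stalled embedding box degrades search to BM25 instead of hanging it.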
Gating the change on the eval
Any dense / hybrid rollout commits to re-running `bin/search-eval` and diffing against the baseline in `docs/superpowers/specs/2026-04-17-retrieval-eval-baseline.md`. The merge criterion is **`indirect` and `general` recall@5 lift without regression on `specific` or `multi-word-concept`**. If `specific` drops more than ~0.05 while `indirect` lifts, the fusion weighting is wrong (or BM25 is being overridden) and the change isn't ready.
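The merge criterion can be written down as a small check over two per-category recall@5 maps. This is one possible reading of the rule, not an existing script. It assumes that "lift" means a strict improvement and that the ~0.05 tolerance applies to both protected categories. The baseline values are the ones quoted earlier in this document, and the candidate values are invented.

```python
def merge_ready(baseline, candidate,
                must_lift=("indirect", "general"),
                must_hold=("specific", "multi-word-concept"),
                tolerance=0.05):
    """Apply the merge criterion to two {category: recall@5} maps."""
    lifted = all(candidate[c] > baseline[c] for c in must_lift)
    held = all(candidate[c] >= baseline[c] - tolerance for c in must_hold)
    return lifted and held

baseline = {"specific": 0.80, "multi-word-concept": 0.80, "indirect": 0.50, "general": 0.20}
# Hypothetical results from two candidate hybrid runs.
good_run = {"specific": 0.80, "multi-word-concept": 0.78, "indirect": 0.70, "general": 0.40}
bad_run = {"specific": 0.70, "multi-word-concept": 0.80, "indirect": 0.70, "general": 0.40}

print(merge_ready(baseline, good_run))
print(merge_ready(baseline, bad_run))
```

The second run lifts the weak categories but gives up too much on `specific`, so it fails the gate even though its overall recall might look better.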
Why a standalone tool (not a JUnit test)
Previous versions of this harness lived inside `wikantik-main` as a `@Disabled` JUnit test that built its own `TestEngine` and indexed `docs/wikantik-pages/` into a scratch directory. That worked but wasn't the shape an eval tool wants:
- Reindexed from scratch on every run (~30–90 s of overhead).
- Tested the test-path retriever, not whatever's actually deployed.
- Enabling it required `-DincludeDisabled=true -Dtest=RetrievalEvalTest` — a magic invocation rather than a direct command.
- Tied to the JVM and Maven.
`bin/search-eval` replaces that with an HTTP-based tool that:
- Runs in a few seconds against whatever base URL you provide.
- Sees the real deployed retriever (indexed state, ACLs, any middleware).
- Is a direct command, not a testing gymnastics invocation.
- Requires no Maven, no JVM — just `python3`.
The `eval/retrieval-queries.csv` file is the same format as before; moving from test-classpath resources to a top-level `eval/` directory was purely a home-finding exercise. The committed baseline in `docs/superpowers/specs/2026-04-17-retrieval-eval-baseline.md` is still valid as a reference point — the numbers it reports came from the same underlying Lucene BM25, produced by the JUnit harness at the time.
See also
- The baseline numbers and missed-query list:
`docs/superpowers/specs/2026-04-17-retrieval-eval-baseline.md`
- The tool:
`bin/search-eval`
- Seed query set:
`eval/retrieval-queries.csv`
- The search endpoint it calls:
`wikantik-rest/src/main/java/com/wikantik/rest/SearchResource.java` → `GET /api/search?q=…&limit=…`