Wikantik Search Refinement

Retrieval quality ("did search put the right page in the top 5?") is measured here by a seed set of (query, ideal_page) pairs run through the live wiki via bin/search-eval and reported as recall@5, recall@20, and mean reciprocal rank (MRR). Improvements (synonym handling, dense embeddings, rerankers) are validated against this set; without it, every future retrieval change is a vibe check.

The tool talks to whatever wiki you point it at — local dev, staging, production. It tests the deployed retriever, not an isolated in-process copy, so it picks up ACL filtering, index state, middleware, and any retrieval change that lands without needing code changes to the tool itself.

The three locations

Source of truth (edit here): eval/retrieval-queries.csv. Three columns: query, ideal_page, notes. Add, remove, or remap queries in your editor. The notes column carries a category label (synonym-drift, indirect, multi-word-concept, specific, general, business-process, hard) so the per-category breakdown in the report is meaningful.
Rolling output (regenerated each run): whatever you pass to --out, or just stdout. Not committed; throwaway.
Frozen checkpoint (commit when it matters): docs/superpowers/specs/2026-04-17-retrieval-eval-baseline.md. A snapshot of the report at a known point in time. Update it when the query set is finalized, a new retriever lands, or you want a before-and-after reference to cite.
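
A minimal sketch of how the query file could be loaded and sanity-checked before a run. It assumes a header row naming the three columns and a notes value that is exactly one category label. load_queries, KNOWN_CATEGORIES, and the LocalLlmHardware page name are illustrative, not part of bin/search-eval.

```python
import csv
import io

# Category labels listed above. Treating the notes value as exactly one
# of these labels is an assumption of this sketch.
KNOWN_CATEGORIES = {
    "synonym-drift", "indirect", "multi-word-concept", "specific",
    "general", "business-process", "hard",
}

def load_queries(stream):
    """Read (query, ideal_page, notes) rows; return (rows, problems)."""
    rows, problems = [], []
    reader = csv.DictReader(stream)
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        query = (row.get("query") or "").strip()
        ideal = (row.get("ideal_page") or "").strip()
        notes = (row.get("notes") or "").strip()
        if not query or not ideal:
            # Ungradable row: report it and leave it out of the run.
            problems.append(f"line {line_no}: empty query or ideal_page")
            continue
        if notes not in KNOWN_CATEGORIES:
            # Still gradable, but it would skew the per-category breakdown.
            problems.append(f"line {line_no}: unknown category {notes!r}")
        rows.append({"query": query, "ideal_page": ideal, "notes": notes})
    return rows, problems

# In-memory stand-in for eval/retrieval-queries.csv.
sample = io.StringIO(
    "query,ideal_page,notes\n"
    "auth config,AuthenticationManager,synonym-drift\n"
    "k8s,,hard\n"
    "local LLM hardware,LocalLlmHardware,generl\n"
)
rows, problems = load_queries(sample)
print(len(rows), "usable rows")
for problem in problems:
    print(problem)
```

Catching a mistyped label at load time keeps a stray category from appearing as its own one-query row in the report.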

Running the tool

Prerequisite: a wiki running somewhere you can reach. Local dev is easiest:

tomcat/tomcat-11/bin/startup.sh

Then:

# Default: localhost:8080, test.properties for credentials, all queries
bin/search-eval

# Write the report to a file while also printing it
bin/search-eval --out eval/reports/2026-04-18.txt

# Point at staging, anonymous user
bin/search-eval --base-url https://staging.example.com --anonymous

# JSON output, for diffing runs programmatically
bin/search-eval --json --out run.json

# Quieter — summary + missed-query list only, no full per-query table
bin/search-eval --quiet

# Full help
bin/search-eval --help

No Python dependencies — standard library only. Just python3.

Exit codes: 0 clean run, 1 at least one query produced an HTTP/network error (report enumerates them), 2 configuration error (missing queries file, unreachable endpoint).

The typical loop

Edit eval/retrieval-queries.csv. Fix a bad ideal_page, add a query your team actually asks, drop one that turned out to have no good answer in the corpus.
Run bin/search-eval --out eval/reports/try-N.txt.
Read the report. Compare the summary line (recall@5, recall@20, MRR) and the missed-query list against the baseline doc.
Decide per query:
- If a query moved from rank 193 to rank 4 — curation win.
- If recall@5 dropped — is the query actually harder than you thought, is the ideal_page wrong, or did retrieval regress?
- If a query always misses despite clearly having a good page — worth a bug.
Commit the CSV when satisfied. Update the baseline doc at natural milestones: end of a curation session, before swapping BM25 for hybrid retrieval, after a reranker lands.
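
The per-query comparison in this loop can be sketched as a small diff between two runs. The layout of the --json output is not described here, so the sketch takes plain query-to-rank dicts and leaves the extraction from run.json to the reader. compare_runs and the sample ranks are illustrative.

```python
def compare_runs(before, after, k=5):
    """Classify per-query rank movement between two runs.

    `before` and `after` map query -> 1-based rank, with -1 for a miss,
    matching the rank convention the tool reports.
    """
    def hit(rank):
        return 1 <= rank <= k

    gained, lost = [], []
    # Only queries present in both runs can be compared.
    for query in sorted(before.keys() & after.keys()):
        was, now = before[query], after[query]
        if hit(now) and not hit(was):
            gained.append((query, was, now))
        elif hit(was) and not hit(now):
            lost.append((query, was, now))
    return gained, lost

baseline = {"auth config": 193, "k8s": -1, "onboarding": 2}
try_n = {"auth config": 4, "k8s": -1, "onboarding": 7}
gained, lost = compare_runs(baseline, try_n)
print("entered top 5:", gained)
print("left top 5:", lost)
```

Queries in the lost list are the ones to triage first: wrong ideal_page, harder query than expected, or a retrieval regression.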

What the tool does

bin/search-eval is a standalone Python 3 script under bin/:

Loads eval/retrieval-queries.csv (or whatever --queries points at).
Reads credentials from test.properties at the repo root, or --user/--password, or skips auth with --anonymous.
For each query, issues GET /api/search?q=<query>&limit=<N> against the configured base URL.
Finds the 1-based rank of the ideal_page in the returned results (or -1 if missing, -2 on HTTP error).
Computes recall@5, recall@20, MRR, and per-category breakdowns.
Writes a formatted text report (default) or JSON (--json) to stdout, plus optionally to a file.

No threshold assertions — the tool is informational. Wire it into CI as a separate nightly or pre-deploy job when you're ready to alert on regressions.
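
The rank and metric steps above can be sketched as follows. This is not the tool's source. It assumes exact string matching on page names and counts HTTP errors (-2) as misses in the denominators; the real script may handle either differently.

```python
def rank_of(ideal_page, results):
    """1-based rank of ideal_page in an ordered result list, or -1 if absent."""
    for position, page in enumerate(results, start=1):
        if page == ideal_page:
            return position
    return -1

def recall_at(ranks, k):
    """Fraction of queries whose ideal page landed in the top k."""
    if not ranks:
        return 0.0
    return sum(1 for r in ranks if 1 <= r <= k) / len(ranks)

def mean_reciprocal_rank(ranks):
    """Mean of 1/rank; misses (-1) and errors (-2) contribute 0."""
    if not ranks:
        return 0.0
    return sum(1.0 / r for r in ranks if r >= 1) / len(ranks)

# Four queries: hits at ranks 1, 4, and 12, plus one miss.
ranks = [1, 4, 12, -1]
print("recall@5 ", recall_at(ranks, 5))    # 2 of 4 queries in the top 5
print("recall@20", recall_at(ranks, 20))   # 3 of 4 queries in the top 20
print("MRR      ", mean_reciprocal_rank(ranks))
```

MRR rewards moving a hit from rank 4 to rank 1 even when recall@5 does not change, which is why the report carries both.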

Query-writing heuristics

The value of the eval set is in query variety, not volume. Include:

Synonym drift. Words in the query don't literally appear in the title. "Auth config" → a page titled AuthenticationManager. If every query is a title keyword hunt, the eval will never surface the problem that dense embeddings fix.
Indirect phrasing. Natural how-to language instead of title-ish language. "How do I get started as a new hire" instead of "onboarding."
Multi-word concepts. Three-to-five-word technical concepts the page discusses without necessarily title-matching.
Specific vs. general. Both "Gemma 4 VRAM budget" and "local LLM hardware" should surface reasonable answers.
Business entities. Named clients, products, internal codes — the queries that motivate the consulting-wiki use case.
Hard cases. Queries your team has seen the LLM fumble. Abbreviations ("k8s"), short common terms ("ai"), or ambiguous phrasings.

Anti-patterns:

Title-copy queries. "WikantikDevelopment" as a query against WikantikDevelopment is trivially passing.
Multiple equally-right answers. The current tool scores only one ideal_page per query, so a query with several equally good pages is penalized whenever search ranks one of the other pages first.
Queries with no good answer in the corpus. Ungradable; drop them.

Interpreting the per-category breakdown

The report groups recall by the notes label. Categories tell you where retrieval is weak, not just how weak overall:

specific near 1.0: keyword BM25 is doing its job on literal terms.
multi-word-concept near 1.0: probably too many title-ish queries — add drift.
synonym-drift and indirect low: the expected gap that dense embeddings should close.
business-process low: the wiki hasn't built much terminological structure around these topics yet, or the queries need rewriting.
hard stubbornly low: these are the intentionally-hard cases; don't over-optimize for them at the expense of the easier categories.

Track the categories over time. A retrieval change that lifts specific from 0.9 to 1.0 while dragging synonym-drift from 0.4 to 0.2 is a regression worth rolling back.
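
A sketch of the per-category grouping, assuming each evaluated query has been reduced to a (category, rank) pair. recall_by_category and the sample rows are illustrative; the tool's own report code may be organized differently.

```python
from collections import defaultdict

def recall_by_category(rows, k=5):
    """rows: iterable of (category, rank) pairs; returns {category: recall@k}."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for category, rank in rows:
        totals[category] += 1
        if 1 <= rank <= k:  # -1 (miss) and -2 (error) never count as hits
            hits[category] += 1
    return {category: hits[category] / totals[category] for category in totals}

rows = [
    ("specific", 1),
    ("specific", 3),
    ("synonym-drift", -1),
    ("synonym-drift", 2),
    ("synonym-drift", 40),
]
for category, recall in sorted(recall_by_category(rows).items()):
    print(f"{category:<15} recall@5 = {recall:.2f}")
```

With only a handful of queries per category, one query changes the number by a large step, so read small categories as a list of misses rather than as a percentage.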

Next step: hybrid BM25 + dense retrieval

The baseline report makes the gap visible: BM25 is already near-ceiling on specific and multi-word-concept (recall@5 ≈ 0.80, recall@20 = 1.00) and weak on indirect (0.50), general (0.20), and business-process (0.40). The standard move here is a second retriever that reads meaning rather than surface tokens, fused with BM25 by Reciprocal Rank Fusion (RRF). RRF scores each page by summing 1/(k + rank) over the lists it appears in, so it needs no score calibration and no threshold, only the ranks. Any query where one retriever nails it carries the fused result; queries where both fail stay failed and surface in the eval as genuine gaps.

Mechanically: an embedding model turns each chunk (and each query) into a vector; pgvector stores the chunk vectors with an HNSW index; retrieval returns BM25 top-K and vector top-K, then RRF combines them. The existing ChunkInspector admin tab already shows what a chunk looks like, so the chunking step is not new work.
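
The fusion step is small enough to show whole. This sketch uses the conventional RRF constant k = 60; whether Wikantik keeps that value is an assumption, and the page names are placeholders.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of page names: score(p) = sum of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, page in enumerate(ranking, start=1):
            scores[page] += 1.0 / (k + rank)
    # Highest score first; ties broken by name so the output is deterministic.
    return sorted(scores, key=lambda page: (-scores[page], page))

bm25 = ["PageA", "PageB", "PageC"]    # BM25 top-K, best first
dense = ["PageD", "PageA", "PageB"]   # vector top-K, best first
print(reciprocal_rank_fusion([bm25, dense]))
# ['PageA', 'PageB', 'PageD', 'PageC']
```

Pages that appear in both lists rise to the top, while a page ranked first by only one retriever (PageD) still lands ahead of a page ranked third by only one (PageC). Raw BM25 scores and cosine distances never enter the computation, so the two scales do not need to be reconciled.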

Picking an embedding model

Three questions decide the model: deployment target (GPU available or CPU-only), quality floor (how much recall lift you need), and latency budget (query-time embedding adds to every search).

Quality tiers, open-source, as of 2026:

Tier	Model	Params	Dim	Notes
Top	Qwen3-Embedding-8B	8 B	≤ 4096	SOTA open, multilingual + code
Strong	Qwen3-Embedding-4B	4 B	≤ 2560	GPU-class; int8 makes it CPU-viable with patience
Balanced	bge-m3	568 M	1024	Dense + sparse + ColBERT in one model
Balanced	Qwen3-Embedding-0.6B	600 M	up to 1024	Small-but-strong, code-aware
Balanced	nomic-embed-text-v1.5	137 M	768 (Matryoshka 64–768)	Fully open, 8K context
Light	bge-small-en-v1.5	33 M	384	Small, fast, solid English
Light	mxbai-embed-large-v1	335 M	1024	Good English baseline

Configuration points that apply to all of them:

Prompt prefixes matter. nomic uses search_document: … / search_query: …. Qwen3 uses instruction prompts per the model card. For bge-m3 a prefix is optional; its model card does not require one. Wrong or missing prefixes silently tank recall, so verify with two or three known-good queries after any model change.
Normalize vectors and use cosine distance (<=> in pgvector). All the models above are trained with cosine; using <-> (L2) on un-normalized vectors is a common silent bug.
Index: hnsw over ivfflat at this corpus size (~1K pages, few-K chunks). Start m = 16, ef_construction = 64; tune ef_search per query for the recall/latency tradeoff. Rebuild the index only when the embedding model changes.
Matryoshka truncation (nomic, Qwen3) lets you store full-dim and query-truncate for a real speedup with small recall cost — apply only after measuring the default.
Quantization: fp16 is the default; int8 roughly halves memory and doubles CPU throughput for < 0.5 % MTEB loss. Use it on CPU; usually unnecessary on GPU.
Chunk size: 512 tokens with 64-token overlap is a good default; raise to 1024 for long design docs. The chunker already exists; ChunkInspector is the debugging tool for this.
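
The normalization and truncation points above are easy to get wrong without noticing, so here is a dependency-free sketch with toy 2-D and 3-D vectors. The helper names are illustrative. Plain truncate-then-renormalize is the generic Matryoshka recipe; a model card may specify an extra step before truncation, so check it for the model you deploy.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]

def cosine_distance(a, b):
    """1 - cosine similarity, the quantity pgvector's <=> operator returns."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def l2_distance(a, b):
    """Euclidean distance, the quantity pgvector's <-> operator returns."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def truncate_matryoshka(vec, dim):
    """Keep the first `dim` components, then re-normalize to unit length."""
    return l2_normalize(vec[:dim])

query = [1.0, 0.0]
doc_a = [10.0, 1.0]   # nearly the same direction as the query, large norm
doc_b = [0.6, 0.8]    # unit norm, but pointing elsewhere

# Cosine prefers doc_a; L2 on the raw vectors prefers doc_b.
print(cosine_distance(query, doc_a) < cosine_distance(query, doc_b))
print(l2_distance(query, doc_a) < l2_distance(query, doc_b))
# After normalization, L2 agrees with cosine again.
print(l2_distance(query, l2_normalize(doc_a)) < l2_distance(query, l2_normalize(doc_b)))
```

The middle comparison is the silent bug: the nearest neighbor under L2 on raw vectors is not the nearest under cosine, and nothing errors out.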

Deployment target A: GPU box (reference: RTX 4060 Ti 16 GB)

Fits comfortably: Qwen3-Embedding-0.6B, bge-m3, nomic, any light-tier model. Fits tightly: Qwen3-Embedding-4B (fp16 ≈ 8 GB, int8 ≈ 4 GB). Does not fit with any serious local LLM also resident: Qwen3-Embedding-8B or NV-Embed-v2.

Recommended default for a dev-wiki use case: Qwen3-Embedding-0.6B, fp16, normalized, cosine, HNSW. Code-aware, strong benchmarks, plenty of VRAM headroom if an LLM lands on the same card later. Switch to bge-m3 if you want the built-in sparse head, which could eventually replace Lucene BM25 and collapse the hybrid stack into a single model. Move up to Qwen3-Embedding-4B only if the eval shows a persistent quality ceiling on indirect / business-process.

Expected query-embed latency (single query, fp16):

Model	VRAM	Query latency
nomic	~0.3 GB	3–5 ms
bge-m3	~1.2 GB	10–20 ms
Qwen3-0.6B	~1.5 GB	15–30 ms
Qwen3-4B	~8 GB (fp16)	60–120 ms

Deployment target B: CPU-only box

This is the interesting case — customer-site deployments, ops boxes without a GPU, or a dedicated mini-PC. The reference we're planning against:

NiPoGi AM06 PRO: AMD Ryzen 7 7730U (Zen 3, 8C/16T, AVX2 + FMA3, no AVX-512, no AMD NPU), 32 GB RAM, 512 GB M.2 SSD, integrated Vega 8 iGPU (not useful for ML; ROCm on Barcelo is a dead end), dual GbE, configurable cTDP 10–25 W. Expect roughly 60–70 % of the int8 embedding throughput of an Intel AVX-512/VNNI box of similar core count, and roughly 10–30× slower than a discrete GPU, but fast enough for a ~1K-page wiki.

Top picks for CPU-only, ordered by the usual tradeoff:

Model	Params	Dim	CPU latency/query (Zen 3, int8)	Notes
bge-small-en-v1.5	33 M	384	10–20 ms	Tiny, fast, English-strong
nomic-embed-text-v1.5	137 M	768	30–80 ms	8K context, Matryoshka, fully open
bge-m3	568 M	1024	50–90 ms	Unified dense + sparse + ColBERT
Qwen3-Embedding-0.6B	600 M	up to 1024	80–200 ms	Upper-bound CPU quality

Recommended default for this box: bge-m3, int8 ONNX, served by Hugging Face text-embeddings-inference (TEI). Multi-functional dense+sparse in one model, production-grade HTTP server with batching and concurrent request handling, built-in Prometheus metrics, OpenAI-compatible /v1/embeddings endpoint, stable Docker image.

Setup sketch (Ubuntu Server 24.04, Docker, bge-m3 int8):

# BIOS: set cTDP to 25 W if thermals hold, 20 W otherwise.
# Verify with `stress-ng --cpu 16 --timeout 600s` while watching `sensors`.

mkdir -p ~/tei/data
cat > ~/tei/docker-compose.yml <<'YAML'
services:
  tei:
    image: ghcr.io/huggingface/text-embeddings-inference:cpu-1.6
    restart: unless-stopped
    ports: ["8001:80"]
    volumes: ["./data:/data"]
    environment:
      OMP_NUM_THREADS: "8"          # physical cores, not logical
      RUST_LOG: "info"
    command: >
      --model-id BAAI/bge-m3
      --dtype float32
      --max-batch-tokens 16384
      --max-concurrent-requests 64
      --pooling cls
YAML
docker compose -f ~/tei/docker-compose.yml up -d

curl -s localhost:8001/embed \
  -H 'content-type: application/json' \
  -d '{"inputs": ["hello wikantik"]}' | jq '.[0] | length'
# Expect 1024.

Pre-export int8 once if the image build doesn't auto-quantize:

optimum-cli export onnx --model BAAI/bge-m3 \
    --task feature-extraction --device cpu ./data/bge-m3-onnx
optimum-cli onnxruntime quantize --onnx_model ./data/bge-m3-onnx \
    --avx2 -o ./data/bge-m3-onnx-int8
# Then change --model-id to /data/bge-m3-onnx-int8.

Expected throughput on this box:

Full reindex (~1K pages × ~5 chunks = ~5K chunks): ~40–60 s, cold, one-shot. Rare event.
Incremental on page save (~5 chunks): ~300 ms — synchronous on save is fine.
Query embed: ~60–90 ms + pgvector HNSW ~5 ms + transport → end-to-end retrieval stays under ~150 ms.

Wikantik integration shape

The wiki and PostgreSQL stay where they live today. Only the embedding transform moves to the GPU or mini-PC box. Concretely:

Schema migration adds an embeddings table keyed on (page, chunk_id) with a vector(dim) column and an hnsw index. Empty table, reversible.
EmbeddingClient in wikantik-main — small HTTP client that POSTs to TEI's /embed (or local Ollama / whatever backend), with a connection pool, timeout, and retry. No heavyweight SDK needed.
Indexer hook on the page-save pipeline: chunk the page, embed each chunk, upsert into embeddings. One-shot backfill script for the existing corpus.
Retrieval path: SearchResource grows a hybrid branch — BM25 top-K (existing Lucene path) + vector top-K (pgvector) fused by RRF.
Feature flag wikantik.search.hybrid.enabled. When off or the embedding service is unreachable, fall back to pure BM25. The flag decides whether the vector path runs; BM25 is always wired up as the safety net.
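
The flag-and-fallback behavior in the last item can be sketched as below. The real branch would live in SearchResource; this Python version only illustrates the control flow. hybrid_search, the two search callables, and the stubs are hypothetical, and k = 60 is the conventional RRF constant rather than a confirmed setting.

```python
def hybrid_search(query, bm25_search, vector_search, hybrid_enabled,
                  top_k=20, rrf_k=60):
    """BM25 always runs; the vector path is optional and failure-tolerant."""
    bm25_hits = bm25_search(query, top_k)
    if not hybrid_enabled:
        return bm25_hits
    try:
        vector_hits = vector_search(query, top_k)
    except OSError:
        # Covers refused connections and timeouts: degrade to pure BM25.
        return bm25_hits
    scores = {}
    for hits in (bm25_hits, vector_hits):
        for rank, page in enumerate(hits, start=1):
            scores[page] = scores.get(page, 0.0) + 1.0 / (rrf_k + rank)
    fused = sorted(scores, key=lambda page: (-scores[page], page))
    return fused[:top_k]

# Stubs standing in for the Lucene path and the pgvector path.
def fake_bm25(query, top_k):
    return ["A", "B"]

def fake_dense(query, top_k):
    return ["C", "A"]

def dense_down(query, top_k):
    raise ConnectionError("embedding service unreachable")

print(hybrid_search("auth config", fake_bm25, fake_dense, hybrid_enabled=True))
print(hybrid_search("auth config", fake_bm25, dense_down, hybrid_enabled=True))
print(hybrid_search("auth config", fake_bm25, fake_dense, hybrid_enabled=False))
```

Running BM25 before touching the embedding service means an outage costs at most the vector path's timeout, never the search itself.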

Security and ops for the embedding service

Auth: pass --api-key to TEI, put the shared secret in wikantik-custom.properties, send Authorization: Bearer … from EmbeddingClient.
Network: WireGuard or Tailscale tunnel between the wiki host and the embedding box; don't expose TEI on the LAN without auth.
Metrics: TEI exposes /metrics in Prometheus format — scrape it from the wikantik-observability stack. Alert on p95 request latency and container restarts.
Updates: pin the TEI image tag in docker-compose.yml; run docker compose pull && docker compose up -d for upgrades.
Thermal sanity: small mini-PCs throttle under sustained load. Verify with stress-ng for 10+ minutes and watch sensors before committing a cTDP setting.
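
To make the Auth item concrete, here is a standard-library sketch of the request an embedding client would send to TEI's /embed with the shared secret. The EmbeddingClient described above would be part of wikantik-main; this Python version only shows the wire shape. build_embed_request, embed, the timeout, and the retry count are illustrative choices.

```python
import json
import urllib.error
import urllib.request

def build_embed_request(base_url, texts, api_key=None):
    """Build a POST to /embed; nothing is sent until urlopen is called."""
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    body = json.dumps({"inputs": list(texts)}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/embed",
        data=body,
        headers=headers,
        method="POST",
    )

def embed(base_url, texts, api_key=None, timeout=5.0, retries=2):
    """Return one vector per input text, retrying on transport errors."""
    request = build_embed_request(base_url, texts, api_key)
    last_error = None
    for _ in range(retries + 1):
        try:
            with urllib.request.urlopen(request, timeout=timeout) as response:
                return json.loads(response.read())
        except urllib.error.HTTPError:
            # The server answered (for example 401 on a bad key): do not retry.
            raise
        except OSError as error:
            # Refused connection, DNS failure, or timeout: try again.
            last_error = error
    raise last_error

request = build_embed_request(
    "http://localhost:8001/", ["hello wikantik"], api_key="example-secret"
)
print(request.get_method(), request.full_url)
```

Calling embed("http://localhost:8001", ["hello wikantik"]) against the container from the setup sketch should return a list holding one 1024-element vector, matching the curl check.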

Gating the change on the eval

Any dense / hybrid rollout commits to re-running bin/search-eval and diffing against the baseline in docs/superpowers/specs/2026-04-17-retrieval-eval-baseline.md. The merge criterion is a recall@5 lift on indirect and general with no regression on specific or multi-word-concept. If specific drops more than ~0.05 while indirect lifts, the fusion weighting is wrong (or BM25 is being overridden) and the change isn't ready.
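
The merge criterion can be written down as a check over two per-category recall@5 dicts. The baseline numbers are the ones quoted earlier in this document; the candidate numbers are invented for illustration. Applying the ~0.05 tolerance to both protected categories is an assumption, as is the gate function itself.

```python
def gate(baseline, candidate,
         lift=("indirect", "general"),
         protect=("specific", "multi-word-concept"),
         max_drop=0.05):
    """Return (ok, reasons) for per-category recall@5 dicts."""
    reasons = []
    for category in protect:
        drop = baseline[category] - candidate[category]
        if drop > max_drop:
            reasons.append(f"{category} regressed by {drop:.2f}")
    for category in lift:
        if candidate[category] <= baseline[category]:
            reasons.append(f"{category} did not lift")
    return (not reasons, reasons)

baseline = {"specific": 0.80, "multi-word-concept": 0.80,
            "indirect": 0.50, "general": 0.20}
# Hypothetical hybrid runs, not measured results.
good = {"specific": 0.80, "multi-word-concept": 0.80,
        "indirect": 0.70, "general": 0.40}
bad = {"specific": 0.70, "multi-word-concept": 0.80,
       "indirect": 0.70, "general": 0.20}

print(gate(baseline, good))
print(gate(baseline, bad))
```

Since bin/search-eval makes no threshold assertions of its own, a check like this belongs in the CI job that wraps it.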

Why a standalone tool (not a JUnit test)

Previous versions of this harness lived inside wikantik-main as a @Disabled JUnit test that built its own TestEngine and indexed docs/wikantik-pages/ into a scratch directory. That worked but wasn't the shape an eval tool wants:

Reindexed from scratch on every run (~30–90 s of overhead).
Tested the test-path retriever, not whatever's actually deployed.
Enabling it required -DincludeDisabled=true -Dtest=RetrievalEvalTest — a magic invocation rather than a direct command.
Tied to the JVM and Maven.

bin/search-eval replaces that with an HTTP-based tool that:

Runs in a few seconds against whatever base URL you provide.
Sees the real deployed retriever (indexed state, ACLs, any middleware).
Is a direct command, not a testing gymnastics invocation.
Requires no Maven, no JVM — just python3.

The eval/retrieval-queries.csv file is the same format as before; moving from test-classpath resources to a top-level eval/ directory was purely a home-finding exercise. The committed baseline in docs/superpowers/specs/2026-04-17-retrieval-eval-baseline.md is still valid as a reference point — the numbers it reports came from the same underlying Lucene BM25, produced by the JUnit harness at the time.