Retrieval Experiment Harness

The harness at com.wikantik.search.embedding.experiment scores three candidate embedding models against a page-level ground-truth CSV, so we can pick a winner before committing to a pgvector schema on the production search path. It runs entirely outside the wiki's serving code — no WikiEngine, no SearchManager, no REST wiring — and talks to the running wiki only for BM25 via /api/search.

This page is the operating manual. For the design (why three retrievers, why chunk-level dense with max-score-per-page, why sandbox BYTEA instead of vector(n)) see the file-level javadoc in the experiment package.

1. What gets compared

Retriever	Source	Notes
BM25-only	`GET /api/search?q=…` on the running wiki	Lucene lexical baseline
Dense-only	Cosine similarity over per-chunk vectors, aggregated to pages by max-score	One run per candidate model
Hybrid	Reciprocal Rank Fusion (k=60) of the two rankings above
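
The hybrid row can be illustrated with a minimal sketch of plain Reciprocal Rank Fusion. This is not the harness's ReciprocalRankFusion class; the class name RrfSketch, the method signature, and the alphabetical tie-break are assumptions made for illustration. Each page scores 1 / (k + rank) per list it appears in, with ranks starting at 1.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of plain RRF: score(page) = sum over lists of 1 / (k + rank). */
final class RrfSketch {
    static List<String> fuse(List<List<String>> rankings, int k) {
        Map<String, Double> scores = new HashMap<>();
        for (List<String> ranking : rankings) {
            for (int i = 0; i < ranking.size(); i++) {
                // i is 0-based, so the 1-based rank is i + 1
                scores.merge(ranking.get(i), 1.0 / (k + i + 1), Double::sum);
            }
        }
        List<String> fused = new ArrayList<>(scores.keySet());
        // Highest fused score first; ties broken by name (an assumption) for determinism
        fused.sort((a, b) -> {
            int byScore = Double.compare(scores.get(b), scores.get(a));
            return byScore != 0 ? byScore : a.compareTo(b);
        });
        return fused;
    }
}
```

Calling RrfSketch.fuse(List.of(bm25Pages, densePages), 60) returns page names ordered by fused score; a page present in both lists accumulates two terms and so tends to rise above pages found by only one retriever.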

Three candidate models (all served by Ollama at inference.jakefear.com:11434):

Code	Ollama tag	Dimension	Asymmetric prefix
`nomic-embed-v1.5`	`nomic-embed-text:v1.5`	768	`search_query:` / `search_document:`
`bge-m3`	`bge-m3:latest`	1024	none
`qwen3-embedding-0.6b`	`qwen3-embedding:0.6b`	1024	instruction prompt on queries only
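
A rough sketch of how the asymmetric-prefix column might translate into code. The class PrefixSketch, the single space after each nomic prefix, and the QWEN3_QUERY_INSTRUCTION placeholder are all assumptions; the actual instruction text is not given on this page.

```java
/** Hypothetical sketch: applies the per-model query/document prefixes from the table above. */
final class PrefixSketch {
    // Placeholder only: the real qwen3 instruction prompt is not documented here.
    static final String QWEN3_QUERY_INSTRUCTION = "<instruction prompt>\n";

    static String forQuery(String modelCode, String text) {
        switch (modelCode) {
            case "nomic-embed-v1.5":
                return "search_query: " + text;
            case "qwen3-embedding-0.6b":
                return QWEN3_QUERY_INSTRUCTION + text; // instruction on queries only
            default:
                return text; // bge-m3: no prefix
        }
    }

    static String forDocument(String modelCode, String text) {
        // Only nomic prefixes documents; qwen3 and bge-m3 embed the chunk text as-is
        return "nomic-embed-v1.5".equals(modelCode) ? "search_document: " + text : text;
    }
}
```

Keeping both directions in one place makes it harder for an indexer and an evaluator to disagree about the prefix, which would silently depress dense scores.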

Each run produces eval/report-<model>.txt with overall, per-category, and per-query metrics (recall@5, recall@20, MRR). ExperimentCompare then prints a side-by-side table across all three reports.
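
For readers who want the metric definitions spelled out, here is a minimal sketch. It assumes one ideal page per query (so recall@k is a hit rate), ranks that start at 1 with 0 meaning a miss, and an MRR that is not truncated at any cutoff. MetricsSketch is a hypothetical name, not a class in the experiment package.

```java
import java.util.List;

/** Hypothetical sketch of the report metrics; ranks are 1-based, 0 = ideal page not retrieved. */
final class MetricsSketch {
    /** Fraction of queries whose ideal page appears at rank k or better. */
    static double recallAtK(List<Integer> ranks, int k) {
        long hits = ranks.stream().filter(r -> r > 0 && r <= k).count();
        return (double) hits / ranks.size();
    }

    /** Mean reciprocal rank; a miss contributes 0 to the sum. */
    static double mrr(List<Integer> ranks) {
        double sum = 0.0;
        for (int r : ranks) {
            if (r > 0) {
                sum += 1.0 / r;
            }
        }
        return sum / ranks.size();
    }
}
```

With 40 queries, each additional hit moves recall by 0.025, which is why the reported values fall on multiples of 0.025.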

2. Prerequisites (first-time setup)

Models pulled on the Ollama host. Check with:
```
curl -s http://inference.jakefear.com:11434/api/tags | jq '.models[].name'
```
All three tags above must be present.
Wiki running locally at http://localhost:8080. /api/health should show engine: UP.
kg_content_chunks populated. On a fresh checkout this table is empty — nothing to embed. Populate it by triggering the async rebuild:
```
bin/trigger-rebuild-indexes.sh
```
This posts to /admin/content/rebuild-indexes with the testbot credentials embedded in the (gitignored) script. Rebuild is async; poll until chunking finishes:
```
curl -s -u testbot:<pw> http://localhost:8080/admin/content/index-status | jq
```
Expect ~30K chunks from ~1K markdown pages after a few minutes.
Sandbox DDL. eval/experiment-embeddings.sql creates the dimension-agnostic experiment_embeddings(chunk_id, model_code, dim, vec) table. The runner applies it idempotently.
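
One way a dimension-agnostic vec column could be filled is by packing each float vector into raw bytes and keeping dim alongside it. The sketch below is an assumption about the general approach, not the harness's VectorCodec; the class name and the big-endian byte order (ByteBuffer's default) are illustrative choices.

```java
import java.nio.ByteBuffer;

/** Hypothetical sketch: float[] <-> byte[] for a BYTEA column, 4 bytes per component. */
final class VectorCodecSketch {
    static byte[] encode(float[] vec) {
        ByteBuffer buf = ByteBuffer.allocate(vec.length * Float.BYTES);
        for (float v : vec) {
            buf.putFloat(v); // big-endian by default
        }
        return buf.array();
    }

    static float[] decode(byte[] bytes, int dim) {
        // The stored dim doubles as an integrity check on the blob length
        if (bytes.length != dim * Float.BYTES) {
            throw new IllegalArgumentException(
                "expected " + dim * Float.BYTES + " bytes, got " + bytes.length);
        }
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        float[] vec = new float[dim];
        for (int i = 0; i < dim; i++) {
            vec[i] = buf.getFloat();
        }
        return vec;
    }
}
```

Because the column carries no dimension of its own, a 768-dim and a 1024-dim model can share one table, at the cost of doing similarity search in application code rather than in the database.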

3. One-shot run

The full pipeline (DDL → indexer × 3 → evaluator × 3 → compare) is wrapped by bin/run-embedding-experiment.sh:

source <(grep -v '^#' test.properties | sed 's/^test.user.//' | sed 's/=/="/' | sed 's/$/"/')

DB_PASSWORD='<jspwiki db pw>' \
WIKI_USER="${login}" WIKI_PASSWORD="${password}" \
    bin/run-embedding-experiment.sh

Required env: DB_PASSWORD, WIKI_USER, WIKI_PASSWORD.

Optional env:

Var	Default	Purpose
`MODELS`	all three codes	Space-separated subset to test
`DB_HOST` / `DB_NAME` / `DB_USER`	`localhost` / `jspwiki` / `jspwiki`
`WIKI_URL`	`http://localhost:8080`
`OUTPUT_DIR`	`eval`	Where reports land
`SKIP_DDL=1`	off	Skip the DDL step
`SKIP_INDEX=1`	off	Skip indexer (re-score existing embeddings)
`MVN_QUIET=1`	off	`-q` on Maven (cuts chatter)

The indexer fails fast with a clear message if kg_content_chunks is empty — you'll see it immediately rather than after a silent 0-row run.

4. Running pieces individually

Each stage is a main() reachable via mvn exec:java. The runner script is just shorthand for these.

Apply the sandbox DDL

PGPASSWORD='<pw>' psql -h localhost -U jspwiki -d jspwiki \
    -f eval/experiment-embeddings.sql

Indexer (once per model)

mvn -pl wikantik-main -am -q exec:java \
    -Dexec.mainClass=com.wikantik.search.embedding.experiment.ExperimentIndexer \
    -Dexec.args="nomic-embed-v1.5" \
    -Dwikantik.experiment.db.password='<pw>'

Writes embeddings into experiment_embeddings with ON CONFLICT DO NOTHING, so reruns only top up what's missing. Batches of 32 chunks per HTTP call.
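
The batching itself is simple to sketch. The helper below is hypothetical (BatchSketch is not part of the harness) and only shows how a list of chunk ids could be split into groups of 32 before each embedding call; the HTTP request and the insert are left out.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: split pending chunks into fixed-size batches, last one possibly shorter. */
final class BatchSketch {
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < items.size(); start += batchSize) {
            int end = Math.min(start + batchSize, items.size());
            // Copy the view so each batch is independent of the source list
            batches.add(new ArrayList<>(items.subList(start, end)));
        }
        return batches;
    }
}
```

Combined with ON CONFLICT DO NOTHING, a run interrupted mid-way can simply be restarted: batches that were already written are skipped by the database rather than tracked by the client.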

Evaluator (once per model)

mvn -pl wikantik-main -am -q exec:java \
    -Dexec.mainClass=com.wikantik.search.embedding.experiment.ExperimentEvaluator \
    -Dexec.args="nomic-embed-v1.5 eval/retrieval-queries.csv eval/report-nomic-embed-v1.5.txt" \
    -Dwikantik.experiment.db.password='<pw>' \
    -Dwikantik.experiment.wiki.user=testbot \
    -Dwikantik.experiment.wiki.password='<pw>'

Side-by-side comparison

mvn -pl wikantik-main -am -q exec:java \
    -Dexec.mainClass=com.wikantik.search.embedding.experiment.ExperimentCompare \
    -Dexec.args="eval/report-nomic-embed-v1.5.txt eval/report-bge-m3.txt eval/report-qwen3-embedding-0.6b.txt"

5. Output

Per-model report (eval/report-<model>.txt):

Retrieval evaluation — model: <model>  dim=<n>
Date: 2026-04-18T…
Queries: 40

Overall:
  retriever  recall@5  recall@20  MRR
  bm25         0.550     0.800   0.519
  dense        <model-dependent>
  hybrid       <model-dependent>

Per-category:
  <7 categories × 3 retrievers>

Per-query (rank of ideal_page; 0 = miss):
  <40 rows>

For a given chunking of the corpus, the BM25 baseline is fixed at recall@5=0.550, recall@20=0.800, MRR=0.519 (40 queries, 7 categories). It does not move with the embedding model, because Lucene indexes the chunk table, not any vector store.

ExperimentCompare consolidates overall lines across model reports:

model                        retriever  recall@5  recall@20  MRR
nomic-embed-v1.5             bm25         0.550     0.800    0.519
nomic-embed-v1.5             dense        0.625     0.800    0.474
nomic-embed-v1.5             hybrid       0.650     0.900    0.530
bge-m3                       bm25         0.550     0.800    0.519
bge-m3                       dense        0.700     0.875    0.503
bge-m3                       hybrid       0.750     0.900    0.615
qwen3-embedding-0.6b         bm25         0.550     0.800    0.519
qwen3-embedding-0.6b         dense        0.750     0.900    0.490
qwen3-embedding-0.6b         hybrid       0.750     0.925    0.602

Exact numbers from the 2026-04-18 run — the decision-making run documented in Section 7 below.

6. Related files

Path	Purpose
`bin/trigger-rebuild-indexes.sh`	Populate `kg_content_chunks` (gitignored — embeds testbot creds)
`bin/run-embedding-experiment.sh`	End-to-end driver
`eval/experiment-embeddings.sql`	Sandbox DDL (not a migration)
`eval/retrieval-queries.csv`	40-query, 7-category ground truth
`wikantik-main/src/main/java/com/wikantik/search/embedding/experiment/`	`ExperimentIndexer`, `ExperimentEvaluator`, `ExperimentCompare`, `ExperimentAggSweep`, `ExperimentRrfSweep`, `ExperimentFinalSweep`, `ExperimentGrandFinale`, `ExperimentHarness`, `Bm25Client`, `ExperimentDb`, `QueryCorpus`, `ReciprocalRankFusion`, `CosineSimilarity`, `VectorCodec`
`wikantik-main/src/main/java/com/wikantik/search/embedding/`	Production-side client + config (now `enabled=true` by default; feature-flag remains as the kill switch)

The experiment code stays in place after each decision so regression runs stay one Maven command away.

7. Model selection — the 2026-04-18 decision

This is the run that picked qwen3-embedding-0.6b as the production embedding model. All three candidates were indexed over the same ~30k-chunk corpus and scored against the same BM25 baseline and the same 40-query / 7-category ground truth, using the max-score page aggregation that was in place at that point.

Raw results

Model	dim	bm25 r@5	dense r@5	dense r@20	dense MRR	hybrid r@5	hybrid r@20	hybrid MRR
`nomic-embed-v1.5`	768	0.550	0.625	0.800	0.474	0.650	0.900	0.530
`bge-m3`	1024	0.550	0.700	0.875	0.503	0.750	0.900	0.615
`qwen3-embedding-0.6b`	1024	0.550	0.750	0.900	0.490	0.750	0.925	0.602

Reports on disk: eval/report-nomic-embed-v1.5.txt, eval/report-bge-m3.txt, eval/report-qwen3-embedding-0.6b.txt.

Decision rationale

qwen3 leads on dense recall at both cutoffs. Dense recall@5 is +0.050 over bge-m3 and +0.125 over nomic. Dense recall@20 is +0.025 over bge-m3 and +0.100 over nomic. Hybrid recall@20 (0.925) is the best of the three; hybrid recall@5 ties bge-m3 at 0.750.
bge-m3 leads narrowly on MRR (dense 0.503 vs 0.490; hybrid 0.615 vs 0.602). This was known, weighed, and accepted: for a RAG pipeline where the retrieved set feeds a generative answer model that handles its own reranking, "did the right page make it into the top-K?" is more load-bearing than "exactly where in the top-K did it land?" Recall > MRR for this workload.
nomic-embed-v1.5 does not win on any overall metric. The 768-dim model trails the 1024-dim models on every dense and hybrid number except hybrid recall@20, where it ties bge-m3 at 0.900; the asymmetric prefix (search_query: / search_document:) did not make up the gap on this corpus.

Per-category highlights

Where qwen3's dense recall pulled away:

synonym-drift (7 queries like "blue green release strategy" → BlueGreenDeployments): qwen3 and bge-m3 both hit 1.000 dense recall@5; nomic only 0.571.
hard (5 queries including "k8s" → KubernetesBasics and "ai" → ArtificialIntelligence): qwen3 held 0.600 dense recall@5; bge-m3 also 0.600; nomic only 0.400.
specific (5 queries): qwen3 dense recall@5 = 0.800, level with bge-m3 (0.800); nomic is recorded at 1.000. All three handle exact-name queries well, so this bucket didn't move the decision.

Aggregation sweep — `ExperimentAggSweep`

With qwen3 locked in, the next question was which chunk → page aggregation to use. ExperimentAggSweep produced eval/agg-sweep-qwen3-embedding-0.6b.txt:

aggregation	dense r@5	dense r@20	dense MRR
MAX	0.750	0.900	0.490
MEAN_TOP_3	0.750	0.950	0.576
MEAN_TOP_5	0.750	0.900	0.589
SUM_TOP_3	0.800	0.975	0.602
SUM_TOP_5	0.775	0.925	0.612
MEAN_TOP_3_LOG_NORM	0.350	0.850	0.175

SUM_TOP_3 has the best recall at both cutoffs (0.800 / 0.975); only SUM_TOP_5 beats it on MRR (0.612 vs 0.602). MEAN_TOP_3_LOG_NORM is the sanity-check negative result: log-normalising chunk scores before summing kills signal.
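
To make the difference between the strategies concrete, here is a minimal sketch of MAX, MEAN_TOP_n and SUM_TOP_n over the chunk scores of a single page. It is not the production PageAggregator; the class name is hypothetical, and dividing the mean by the number of chunks actually available (rather than always by n) is an assumption.

```java
import java.util.Arrays;

/** Hypothetical sketch of chunk-to-page aggregation over one page's chunk similarity scores. */
final class PageAggregationSketch {
    /** Sum of the n highest scores; pages with fewer than n chunks use all of them. */
    static double sumTop(double[] chunkScores, int n) {
        double[] sorted = chunkScores.clone();
        Arrays.sort(sorted); // ascending, so the best scores are at the end
        double sum = 0.0;
        for (int i = sorted.length - 1; i >= Math.max(0, sorted.length - n); i--) {
            sum += sorted[i];
        }
        return sum;
    }

    static double max(double[] chunkScores) {
        return sumTop(chunkScores, 1);
    }

    static double meanTop(double[] chunkScores, int n) {
        int used = Math.min(n, chunkScores.length);
        return used == 0 ? 0.0 : sumTop(chunkScores, n) / used;
    }
}
```

In this sketch a page with chunk scores 0.9, 0.7 and 0.6 beats a page with a single 0.95 chunk under SUM_TOP_3 but loses to it under MAX, which is one plausible reading of why summation helps pages that cover a topic in depth.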

Joint sweep — `ExperimentFinalSweep` and `ExperimentGrandFinale`

The final sweeps fan every aggregation across every fusion strategy (dense-only, RRF with three weighting variants, plain score averaging with dense-heavy / bm25-heavy / equal weights) to confirm the winner survives hyperparameter interaction.

Best combination per model (eval/grand-finale.txt):

Model	Best aggregation	Best fusion	r@5	r@20	MRR
nomic-embed-v1.5	SUM_TOP_5	RRF_RECALL	0.750	0.925	0.558
bge-m3	MEAN_TOP_3	SCORE_DENSE_HEAVY	0.775	0.900	0.622
qwen3-embedding-0.6b	SUM_TOP_3	dense-only	0.800	0.975	0.602

qwen3 + SUM_TOP_3 + dense-only wins r@5 and r@20 outright. bge-m3's best-case MRR (0.622) edges qwen3's (0.602) by 0.020, but qwen3 is ahead at r@5 (0.800 vs 0.775) and r@20 (0.975 vs 0.900). The decision stood: qwen3 with SUM_TOP_3 aggregation.

Production defaults picked from this data

wikantik-main/.../search/hybrid/HybridConfig.java:

DEFAULT_PAGE_AGGREGATION = PageAggregation.SUM_TOP_3;
DEFAULT_RRF_K            = 60;
DEFAULT_BM25_WEIGHT      = 1.0;
DEFAULT_DENSE_WEIGHT     = 1.5;   // dense-heavy, matches SCORE_DENSE_HEAVY
DEFAULT_RRF_TRUNCATE     = 20;
DEFAULT_DENSE_CHUNK_TOP  = 500;
DEFAULT_DENSE_PAGE_TOP   = 100;

Dense is weighted 1.5× vs BM25 in the RRF fusion because the grand finale shows dense-leaning hybrids consistently within 0.05 of the top recall while beating BM25-heavy variants on MRR. RRF k=60 is the standard Cormack/Clarke/Büttcher default carried from the literature — the sweep confirmed no nearby value beat it on our corpus by enough to justify a non-conventional choice.
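
As a reading aid, a weighted RRF score consistent with these defaults could be written as below. This is an assumed form, not copied from HybridConfig or HybridFuser: r_bm25(p) and r_dense(p) denote the 1-based rank of page p in each list, and a list that does not contain p contributes nothing.

```latex
\mathrm{score}(p) \;=\; \frac{w_{\mathrm{bm25}}}{k + r_{\mathrm{bm25}}(p)} \;+\; \frac{w_{\mathrm{dense}}}{k + r_{\mathrm{dense}}(p)},
\qquad k = 60,\quad w_{\mathrm{bm25}} = 1.0,\quad w_{\mathrm{dense}} = 1.5
```

Under that assumption, a page ranked first by dense alone scores 1.5/61 ≈ 0.0246, while a page ranked first by BM25 alone scores 1/61 ≈ 0.0164.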

8. Chunker improvement results (2026-04-19)

Two targeted changes to the chunking + embedding pipeline landed together and were evaluated against the frozen qwen3-embedding-0.6b baseline. Both are structural — no retriever, fusion weights, or query-side logic changed.

What changed

Atomic list chunking. ContentChunker.isAtomic(Node) now treats BulletList and OrderedList as indivisible up to maxTokens × 4 (≈ 2048 tokens). Previously, Flexmark emitted each list item as a separate block and the merge pass sometimes split related items across chunks. Lists of command flags, step-by-step instructions, and config options now live in one chunk with their siblings.
Heading path prepended at embed time. A new EmbeddingTextBuilder renders "<Top> > <Mid> > <Leaf>\n\n<body>" and is the single rendering point for both EmbeddingIndexService (production) and ExperimentIndexer (sandbox). The stored chunk text in kg_content_chunks.text stays body-only; the heading-path-aware string only exists on the wire to the embedder. Chunk identity (content_hash = sha256(heading_path + text)) is unchanged.
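
The two string operations described above can be sketched as follows. The class EmbeddingTextSketch is hypothetical and stands in for the real EmbeddingTextBuilder; how heading_path is serialised before hashing, the UTF-8 encoding, and the lowercase hex output are assumptions.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;

/** Hypothetical sketch: heading-path-aware embed text and the chunk identity hash. */
final class EmbeddingTextSketch {
    /** Renders "<Top> > <Mid> > <Leaf>\n\n<body>"; a chunk with no headings is sent body-only. */
    static String render(List<String> headingPath, String body) {
        if (headingPath.isEmpty()) {
            return body;
        }
        return String.join(" > ", headingPath) + "\n\n" + body;
    }

    /** sha256(heading_path + text), hex-encoded. */
    static String contentHash(String headingPath, String text) {
        try {
            MessageDigest sha = MessageDigest.getInstance("SHA-256");
            byte[] digest = sha.digest((headingPath + text).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is mandatory on every Java platform, so this should not happen
            throw new IllegalStateException(e);
        }
    }
}
```

Because the hash already covers heading_path, changing only how the embed text is rendered leaves chunk identity untouched; only the vectors need to be regenerated.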

Corpus impact

	Before	After
`kg_content_chunks` rows	23,656	39,264
Avg tokens / chunk	~230	103

More, smaller chunks — atomic lists stop the merge pass from gluing unrelated blocks together, so prose paragraphs are no longer inflated by adjacent list content.

Retrieval metrics (40 queries, 7 categories)

Overall:

retriever	recall@5	recall@20	MRR
bm25	0.550 → 0.775 (+0.225)	0.800 → 0.975 (+0.175)	0.519 → 0.650 (+0.131)
dense	0.750 → 0.750 (+0.000)	0.900 → 0.950 (+0.050)	0.490 → 0.627 (+0.137)
hybrid	0.750 → 0.850 (+0.100)	0.925 → 0.975 (+0.050)	0.602 → 0.667 (+0.065)

Categories that moved the most:

business-process bm25 recall@5: 0.400 → 0.800
hard bm25 recall@20: 0.600 → 1.000; dense recall@20: 0.800 → 1.000
indirect bm25 MRR: 0.458 → 0.614 (every retriever now recall@20 = 1.000)
general dense MRR: 0.461 → 0.667

Why the gains break down this way

BM25 jumped more than expected. BM25 in this harness indexes over the chunk table, not pages. When a query's answer is "item 3 of the --force options list," splitting that list across chunks left BM25 with a diluted hit. Atomic lists put every related term in one chunk and BM25 lights up cleanly. The hard, indirect, and business-process categories — all heavy on flag/step/option queries — moved the most.
Dense recall@5 held at 0.750, but dense MRR jumped +0.137. Heading-path context doesn't widen the net; it pulls the correct chunk up in rank. That's exactly the precision-not-recall win the change was designed to produce.
Hybrid is the winner at recall@5 = 0.850 and recall@20 = 0.975 (the latter tied with BM25), with the best MRR at 0.667. Because BM25 and dense both improved but on different axes (recall vs. rank), RRF compounds the gains.

What this means for overlap

Overlap (replaying the last N tokens of chunk i as the first N tokens of chunk i+1) was the obvious next lever — until these two changes absorbed most of what overlap was meant to fix:

Boundary bleed on list items → gone; lists are atomic.
Context-poor chunks → gone; heading path is on the wire.

Hybrid recall@20 is 0.975: 39 of 40 queries are recovered, so the miss budget for overlap to improve against is one query. If overlap is worth revisiting, the signal will show up in dense recall@5 (stuck at 0.750), not in the hybrid overall.

Reports on disk

Path	What
`eval/report-qwen3-embedding-0.6b-baseline-prechunk.txt`	Before, 2026-04-18
`eval/report-qwen3-embedding-0.6b-2026-04-19T20-01-22-331504860Z.txt`	After, 2026-04-19

Reproduce with:

mvn -pl wikantik-main exec:java \
    -Dexec.mainClass=com.wikantik.search.embedding.experiment.ExperimentCompare \
    -Dexec.args="<before.txt> <after.txt>"

9. Chunker rebuild — merge-forward floor raised (2026-04-23)

During the Phase 2 entity-extractor benchmark work it became clear that chunk count, not chunk size, was the bottleneck on full-corpus batch extraction: at ~39k chunks the projected extractor wall-clock was ~95 h on the shipping model. Inspection of ContentChunker.java revealed two advertised config keys (target_tokens, min_tokens) that were never referenced in the class — dead knobs — and one lever, merge_forward_tokens, that actually controls chunk consolidation.

What changed

Removed the dead target_tokens and min_tokens fields from ContentChunker.Config and from wikantik.properties.
Raised the wikantik.chunker.merge_forward_tokens default from 8 to 150. Sections whose accumulated text is below this threshold are now held and merged into the next section rather than emitted immediately, so short sibling sections (typical wiki "Overview", "Introduction", "Notes" blocks) coalesce into chunks of reasonable size.
Commit: 2acccf102 feat(kg-rag): phase 1-3 uplift.
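
The hold-and-merge behaviour can be sketched in a few lines. This is a simplification under stated assumptions, not ContentChunker itself: the class name is hypothetical, tokens are approximated by whitespace splitting, and the max-token budget and atomic-block rules are ignored.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of merge-forward: short sections are held and glued onto the next one. */
final class MergeForwardSketch {
    /** Whitespace token count, a stand-in for whatever tokenizer the chunker really uses. */
    static int tokens(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    static List<String> merge(List<String> sections, int mergeForwardTokens) {
        List<String> chunks = new ArrayList<>();
        StringBuilder held = new StringBuilder();
        for (String section : sections) {
            if (held.length() > 0) {
                held.append("\n\n");
            }
            held.append(section);
            // Emit only once the accumulated text reaches the floor
            if (tokens(held.toString()) >= mergeForwardTokens) {
                chunks.add(held.toString());
                held.setLength(0);
            }
        }
        if (held.length() > 0) {
            chunks.add(held.toString()); // a trailing short section has nothing to merge into
        }
        return chunks;
    }
}
```

With a floor of 8 almost every section clears the threshold on its own, which matches the earlier behaviour of many small chunks; raising the floor makes runs of short sections collapse into one chunk.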

Corpus impact

	Before	After
`kg_content_chunks` rows	39,264	23,256 (−41%)
`content_chunk_embeddings` rows	39,264	23,256
Mean tokens / chunk	103	174 (+69%)
p50 tokens / chunk	77	166
p95 tokens / chunk	261	335
Max tokens / chunk	1,963	1,963 (atomic blocks unchanged)
Embedding re-index wall-clock (qwen3-embedding-0.6b)	—	6m 18s

Retrieval quality — live search top-10 diff

The retrieval harness was not re-run against the new chunks (the extractor benchmark was the day's priority and the harness requires a full corpus re-embed loop plus the 40-query evaluation pass). Instead, we ran a spot-check comparison on live /api/search for two high-traffic queries, with graph rerank effectively disabled (mentions=0, because the rebuild cascaded them all away):

"knowledge graph" — top 10:

Rank	Old 39k chunks	New 23k chunks
1	WikantikKnowledgeGraphAdmin	InventionOfKnowledgeGraph
2	InventionOfKnowledgeGraph	WikantikKnowledgeGraphAdmin
3	KnowledgeGraphCore	KnowledgeGraphCore
4	KnowledgeGraphDogfooding	KnowledgeGraphVsRelationalDatabase
5	KnowledgeGraphVsRelationalDatabase	GraphRAG
6	GraphRAG	KnowledgeGraphsAndManagement
7	IndustrialKnowledgeGraphUseCases	IndustrialKnowledgeGraphUseCases
8	KnowledgeGraphsAndManagement	FederatedKnowledgeGraphs
9	KnowledgeGraphCompletion	KnowledgeGraphCompletion
10	FederatedKnowledgeGraphs	KnowledgeGraphConstructionPipeline
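
The two columns above can be compared mechanically. The sketch below is a hypothetical helper (TopKDiffSketch is not part of the harness) that computes the shared set and the pages unique to each side; the two lists are copied from the table.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch: set overlap and dropped/added pages between two top-10 lists. */
final class TopKDiffSketch {
    static final List<String> OLD_39K = List.of(
        "WikantikKnowledgeGraphAdmin", "InventionOfKnowledgeGraph", "KnowledgeGraphCore",
        "KnowledgeGraphDogfooding", "KnowledgeGraphVsRelationalDatabase", "GraphRAG",
        "IndustrialKnowledgeGraphUseCases", "KnowledgeGraphsAndManagement",
        "KnowledgeGraphCompletion", "FederatedKnowledgeGraphs");
    static final List<String> NEW_23K = List.of(
        "InventionOfKnowledgeGraph", "WikantikKnowledgeGraphAdmin", "KnowledgeGraphCore",
        "KnowledgeGraphVsRelationalDatabase", "GraphRAG", "KnowledgeGraphsAndManagement",
        "IndustrialKnowledgeGraphUseCases", "FederatedKnowledgeGraphs",
        "KnowledgeGraphCompletion", "KnowledgeGraphConstructionPipeline");

    /** Pages present in both lists, in the order of the first list. */
    static Set<String> shared(List<String> before, List<String> after) {
        Set<String> result = new LinkedHashSet<>(before);
        result.retainAll(after);
        return result;
    }

    /** Pages present in a but not in b. */
    static Set<String> onlyIn(List<String> a, List<String> b) {
        Set<String> result = new LinkedHashSet<>(a);
        result.removeAll(b);
        return result;
    }
}
```

shared(OLD_39K, NEW_23K).size() gives the overlap count, onlyIn(OLD_39K, NEW_23K) the dropped pages and onlyIn(NEW_23K, OLD_39K) the added ones. Set overlap ignores position, so rank swaps inside the shared set still need to be read off the table.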

Set overlap: 9/10. Dropped: KnowledgeGraphDogfooding. Added: KnowledgeGraphConstructionPipeline. Top-3 set preserved, positions 1–2 swapped (both highly relevant).

"GraphRAG": top 10 essentially identical, set overlap 9/10 with one substitution at position 8 (AiFunctionCallingAndToolUse → AiMemoryAndPersistence).

Why it's not a regression

BM25 is chunk-indexed in this harness but page-indexed in production Lucene. The 04-19 +0.225 BM25 recall@5 gain came from atomic-list chunking giving lexical matches tighter scope; today's merge-forward change moves in the opposite direction (coarser chunks) and would lose some of that BM25 gain in the harness. On the production path, Lucene indexes whole pages, so merge-forward has zero BM25 effect.
Dense on 174-token chunks is slightly less peaky than on 103-token chunks. Chunks now containing multiple related sub-topics can win queries they previously missed; chunks narrowly winning on a focused topic sentence may slip a rank or two. Net: 90% top-10 overlap on both spot-check queries (one substitution each), a slight ordering shuffle, no new misses observed.
The chunker rebuild was motivated by extraction cost, not retrieval quality. It trades a small dense-retrieval peakiness hit for a 41% reduction in chunk count (39,264 → 23,256, about 1.7× fewer) and a roughly proportional reduction in extractor batch wall-clock. For the corpus we have, that's the right trade; the graph rerank layer that's about to get populated is expected to more than make up for the dense-retrieval peakiness.

If we want the harness number

Re-running the harness against the new chunks takes:

bin/trigger-rebuild-indexes.sh          # already done — 23k chunks live
# Delete qwen3's sandbox embeddings (the prior vectors are still keyed by
# the old chunk ids, which were cascaded away by V011):
PGPASSWORD='…' psql -h localhost -U jspwiki -d jspwiki -c "DELETE FROM experiment_embeddings WHERE model_code='qwen3-embedding-0.6b'"
# Re-index + evaluate:
mvn -pl wikantik-main exec:java \
    -Dexec.mainClass=com.wikantik.search.embedding.experiment.ExperimentIndexer \
    -Dexec.args="qwen3-embedding-0.6b" \
    -Dwikantik.experiment.db.password='…'
mvn -pl wikantik-main exec:java \
    -Dexec.mainClass=com.wikantik.search.embedding.experiment.ExperimentEvaluator \
    -Dexec.args="qwen3-embedding-0.6b eval/retrieval-queries.csv \
                 eval/report-qwen3-embedding-0.6b-2026-04-23-postmerge.txt" \
    -Dwikantik.experiment.db.password='…' \
    -Dwikantik.experiment.wiki.user=testbot \
    -Dwikantik.experiment.wiki.password='…'

Pending: not blocking extraction work, but the right next retrieval-side regression run to close the loop.

10. Evolution — from scratch to production

Condensed git-log narrative for anyone who needs to understand how each piece got here.

Date	Commit	What
2026-04-14	`1da3a7dce`	Initial `Chunk` record and minimal `ContentChunker`
2026-04-14	`64a8de3ff`	Heading-aware splitting with `heading_path`
2026-04-14	`185dd80cc`	Token budget, atomic blocks, merge-forward
2026-04-14	`f141524fe`	Explicit `mergeForwardTokens` Config field
2026-04-14	`9b411eeea`	`ContentChunkRepository` with diff apply + stats
2026-04-15	`c386f21b0`	Save-time `ChunkProjector` page filter
2026-04-15	`632737f58`	Prometheus metrics for chunker and rebuild
2026-04-15	`e8966adb`	Async page-save listener for incremental embedding reindex
2026-04-16	`a9a199041`	Stub `TextEmbeddingClient` + `EmbeddingKind` for Phase 1
2026-04-16	`46ad1441e`	Caffeine dep + `TextEmbeddingClient` stub for Phase 4
2026-04-16	`a9f2ae876`	`EmbeddingIndexService` — production chunk-embedding data layer
2026-04-17	`c0dabd53f`	`DenseRetriever` + placeholder `ChunkVectorIndex`
2026-04-17	`c169604ce`	`HybridFuser` for weighted RRF of BM25 and dense lists
2026-04-17	`6b6527572`	`PageAggregation` + `PageAggregator`
2026-04-17	`8717e533f`	`QueryEmbedderConfig`, `CircuitState`, metrics snapshot
2026-04-17	`4e3fe9921`	Hand-rolled CLOSED/OPEN/HALF_OPEN circuit breaker
2026-04-17	`4156d7a1f`	`QueryEmbedder` wraps embedding client with cache + timeout + breaker
2026-04-17	`e9bf62d1e`	`InMemoryChunkVectorIndex` for dense top-k
2026-04-17	`0825bb97b`	Ollama embedding client and model registry
2026-04-18	`5211ea391`	Retrieval experiment harness + first model-comparison reports
2026-04-18	`373a024a2`	`HybridConfig` with defaults matching the winning experiment
2026-04-18	`376cdb877`	Phase 3: hybrid retrieval core (PageAggregation, HybridFuser, DenseRetriever)
2026-04-19	`b6a86fba7`	Hybrid perf pass: parallelized embedding, incremental index, heading-aware context
2026-04-19	`c4447350d`	Release v1.1.6: hybrid retrieval, MCP access hardening, admin content ops
2026-04-23	`2acccf102`	KG-RAG Phase 1-3: unified embeddings, extractor pipeline, graph-aware rerank + chunker merge-forward 8 → 150
2026-04-23	`c289fbdd7`	Standalone extract-CLI for Tomcat-less batch runs

The two inflection points:

2026-04-18 — the experiment harness ran, qwen3-embedding-0.6b won, HybridConfig was frozen with SUM_TOP_3 aggregation and dense-heavy RRF. That's the moment "dense search" stopped being an experiment and started being the production search path.
2026-04-23 — the Phase 3 graph rerank layer added mentions as a new input signal on top of hybrid, and the chunker floor was raised to make whole-corpus entity extraction operationally affordable. That turns the harness's page-ranking story from "what's our best hybrid score" into "what's our best hybrid score plus graph-derived proximity boost" — and re-running the harness after the extractor batch populates mentions is the next regression checkpoint.
