Retrieval Experiment Harness

The harness at `com.wikantik.search.embedding.experiment` scores three candidate

embedding models against a page-level ground-truth CSV, so we can pick a winner

before committing to a pgvector schema on the production search path. It runs

entirely outside the wiki's serving code — no `WikiEngine`, no `SearchManager`,

no REST wiring — and talks to the running wiki only for BM25 via

`/api/search`.

This page is the operating manual. For the *design* (why three retrievers, why

chunk-level dense with max-score-per-page, why sandbox BYTEA instead of

`vector(n)`) see the file-level javadoc in the `experiment` package.

---

1. What gets compared

| Retriever | Source | Notes |

|---|---|---|

| **BM25-only** | `GET /api/search?q=…` on the running wiki | Lucene lexical baseline |

| **Dense-only** | Cosine similarity over per-chunk vectors, aggregated to pages by max-score | One run per candidate model |

| **Hybrid** | Reciprocal Rank Fusion (k=60) of the two rankings above | |
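The dense-only row can be sketched as follows. This is a minimal illustration, assuming chunk vectors are already loaded as `float[]` and each chunk carries its page name; `DenseMaxScoreSketch` and its method names are hypothetical and do not reflect the harness's actual `CosineSimilarity` API.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: cosine scoring per chunk, then max-score-per-page. */
class DenseMaxScoreSketch {

    /** Cosine similarity of two equal-length vectors; 0 if either has zero norm. */
    static double cosine(float[] a, float[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return (na == 0 || nb == 0) ? 0.0 : dot / Math.sqrt(na * nb);
    }

    /** Scores every chunk against the query and keeps the best chunk score per page. */
    static Map<String, Double> maxScorePerPage(float[] query, String[] chunkPages, float[][] chunkVecs) {
        Map<String, Double> best = new HashMap<>();
        for (int i = 0; i < chunkVecs.length; i++) {
            best.merge(chunkPages[i], cosine(query, chunkVecs[i]), Math::max);
        }
        return best;
    }
}
```

Sorting the resulting map by descending score gives the page ranking that the dense-only retriever would report.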

Three candidate models (all served by Ollama at `inference.jakefear.com:11434`):

| Code | Ollama tag | Dimension | Asymmetric prefix |

|---|---|---|---|

| `nomic-embed-v1.5` | `nomic-embed-text:v1.5` | 768 | `search_query:` / `search_document:` |

| `bge-m3` | `bge-m3:latest` | 1024 | none |

| `qwen3-embedding-0.6b` | `qwen3-embedding:0.6b` | 1024 | instruction prompt on queries only |

Each run produces `eval/report-<model>.txt` with overall, per-category, and

per-query metrics (recall@5, recall@20, MRR). `ExperimentCompare` then prints

a side-by-side table across all three reports.
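The three metrics can be derived from the per-query rank of the ideal page. The sketch below assumes the report's convention (1-based rank, 0 for a miss) and assumes a miss contributes zero to MRR with no depth cutoff; `RetrievalMetricsSketch` is a hypothetical name, not the evaluator's real class.

```java
/** Illustrative only: recall@k and MRR from per-query ranks of the ideal page. */
class RetrievalMetricsSketch {

    /** Fraction of queries whose ideal page landed at rank 1..k (rank 0 = miss). */
    static double recallAtK(int[] ranks, int k) {
        if (ranks.length == 0) return 0.0;
        int hits = 0;
        for (int r : ranks) {
            if (r >= 1 && r <= k) hits++;
        }
        return (double) hits / ranks.length;
    }

    /** Mean reciprocal rank; a miss (rank 0) contributes 0 (an assumption). */
    static double mrr(int[] ranks) {
        if (ranks.length == 0) return 0.0;
        double sum = 0;
        for (int r : ranks) {
            if (r >= 1) sum += 1.0 / r;
        }
        return sum / ranks.length;
    }
}
```

With 40 queries, each query is worth 0.025 of recall, which is why the reported values move in steps of 0.025.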

---

2. Prerequisites (first-time setup)

1. **Models pulled on the Ollama host.** Check with:

```

curl -s http://inference.jakefear.com:11434/api/tags | jq '.models[].name'

```

All three tags above must be present.

2. **Wiki running locally** at `http://localhost:8080`. `/api/health` should

show `engine: UP`.

3. **`kg_content_chunks` populated.** On a fresh checkout this table is empty

— nothing to embed. Populate it by triggering the async rebuild:

```

bin/trigger-rebuild-indexes.sh

```

This posts to `/admin/content/rebuild-indexes` with the `testbot`

credentials embedded in the (gitignored) script. Rebuild is async; poll

until chunking finishes:

```

curl -s -u testbot:<pw> http://localhost:8080/admin/content/index-status | jq

```

Expect tens of thousands of chunks (roughly 23k to 39k in the runs documented below) from ~1K markdown pages after a few minutes.

4. **Sandbox DDL.** `eval/experiment-embeddings.sql` creates the

dimension-agnostic `experiment_embeddings(chunk_id, model_code, dim, vec)`

table. The runner applies it idempotently.
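Storing vectors as BYTEA is what keeps the table dimension-agnostic: a 768-dim and a 1024-dim vector are both just byte arrays, with `dim` recorded alongside. A minimal sketch of such a codec follows. The little-endian float32 layout is an assumption; the actual `VectorCodec` layout is not documented on this page, and `VectorCodecSketch` is a hypothetical name.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Illustrative only: float[] <-> byte[] for a BYTEA column (little-endian float32 assumed). */
class VectorCodecSketch {

    static byte[] encode(float[] vec) {
        ByteBuffer buf = ByteBuffer.allocate(vec.length * Float.BYTES).order(ByteOrder.LITTLE_ENDIAN);
        for (float v : vec) {
            buf.putFloat(v);
        }
        return buf.array();
    }

    static float[] decode(byte[] bytes) {
        if (bytes.length % Float.BYTES != 0) {
            throw new IllegalArgumentException("byte length is not a multiple of 4: " + bytes.length);
        }
        ByteBuffer buf = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN);
        float[] vec = new float[bytes.length / Float.BYTES];
        for (int i = 0; i < vec.length; i++) {
            vec[i] = buf.getFloat();
        }
        return vec;
    }
}
```

Under this layout a 1024-dim vector occupies 4096 bytes, and the stored `dim` can be cross-checked against `bytes.length / 4` on read.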

---

3. One-shot run

The full pipeline (DDL → indexer × 3 → evaluator × 3 → compare) is wrapped by

`bin/run-embedding-experiment.sh`:

```
source <(grep -v '^#' test.properties | sed 's/^test.user.//' | sed 's/=/="/' | sed 's/$/"/')
DB_PASSWORD='<jspwiki db pw>' \
WIKI_USER="${login}" WIKI_PASSWORD="${password}" \
  bin/run-embedding-experiment.sh
```

Required env: `DB_PASSWORD`, `WIKI_USER`, `WIKI_PASSWORD`.

Optional env:

| Var | Default | Purpose |

|---|---|---|

| `MODELS` | all three codes | Space-separated subset to test |

| `DB_HOST` / `DB_NAME` / `DB_USER` | `localhost` / `jspwiki` / `jspwiki` | |

| `WIKI_URL` | `http://localhost:8080` | |

| `OUTPUT_DIR` | `eval` | Where reports land |

| `SKIP_DDL=1` | off | Skip the DDL step |

| `SKIP_INDEX=1` | off | Skip indexer (re-score existing embeddings) |

| `MVN_QUIET=1` | off | `-q` on Maven (cuts chatter) |

The indexer fails fast with a clear message if `kg_content_chunks` is empty —

you'll see it immediately rather than after a silent 0-row run.

---

4. Running pieces individually

Each stage is a `main()` reachable via `mvn exec:java`. The runner script is

just shorthand for these.

Apply the sandbox DDL

```
PGPASSWORD='<pw>' psql -h localhost -U jspwiki -d jspwiki \
  -f eval/experiment-embeddings.sql
```

Indexer (once per model)

```
mvn -pl wikantik-main -am -q exec:java \
  -Dexec.mainClass=com.wikantik.search.embedding.experiment.ExperimentIndexer \
  -Dexec.args="nomic-embed-v1.5" \
  -Dwikantik.experiment.db.password='<pw>'
```

Writes embeddings into `experiment_embeddings` with `ON CONFLICT DO NOTHING`,

so reruns only top up what's missing. Batches of 32 chunks per HTTP call.
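The batching and top-up behaviour can be sketched as below. The class name and the SQL string are illustrative: the column list comes from the sandbox table described above, but the exact statement `ExperimentIndexer` issues is not shown on this page, and `ON CONFLICT DO NOTHING` only skips duplicates if the table has a unique constraint (assumed here to cover `chunk_id` and `model_code`).

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: split pending chunks into fixed-size batches for the embed call. */
class BatchSketch {

    /** Assumed shape of the idempotent insert; not copied from the harness. */
    static final String INSERT_SQL =
        "INSERT INTO experiment_embeddings (chunk_id, model_code, dim, vec) "
        + "VALUES (?, ?, ?, ?) ON CONFLICT DO NOTHING";

    /** Partitions items into consecutive batches of at most {@code size} elements. */
    static <T> List<List<T>> batches(List<T> items, int size) {
        if (size <= 0) throw new IllegalArgumentException("size must be positive");
        List<List<T>> out = new ArrayList<>();
        for (int i = 0; i < items.size(); i += size) {
            out.add(new ArrayList<>(items.subList(i, Math.min(i + size, items.size()))));
        }
        return out;
    }
}
```

With a batch size of 32, each batch maps to one HTTP call to the embedder, and the final batch may be shorter.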

Evaluator (once per model)

```
mvn -pl wikantik-main -am -q exec:java \
  -Dexec.mainClass=com.wikantik.search.embedding.experiment.ExperimentEvaluator \
  -Dexec.args="nomic-embed-v1.5 eval/retrieval-queries.csv eval/report-nomic-embed-v1.5.txt" \
  -Dwikantik.experiment.db.password='<pw>' \
  -Dwikantik.experiment.wiki.user=testbot \
  -Dwikantik.experiment.wiki.password='<pw>'
```

Side-by-side comparison

```
mvn -pl wikantik-main -am -q exec:java \
  -Dexec.mainClass=com.wikantik.search.embedding.experiment.ExperimentCompare \
  -Dexec.args="eval/report-nomic-embed-v1.5.txt eval/report-bge-m3.txt eval/report-qwen3-embedding-0.6b.txt"
```

---

5. Output

Per-model report (`eval/report-<model>.txt`):

```
Retrieval evaluation — model: <model> dim=<n>
Date: 2026-04-18T…
Queries: 40

Overall:
retriever  recall@5  recall@20  MRR
bm25       0.550     0.800      0.519
dense      <model-dependent>
hybrid     <model-dependent>

Per-category:
<7 categories × 3 retrievers>

Per-query (rank of ideal_page; 0 = miss):
<40 rows>
```

BM25 baseline is fixed at **recall@5=0.550, recall@20=0.800, MRR=0.519**

(40 queries, 7 categories) and does not move with the embedding model —

Lucene indexes the chunk table, not any vector store.

`ExperimentCompare` consolidates overall lines across model reports:

```
model                 retriever  recall@5  recall@20  MRR
nomic-embed-v1.5      bm25       0.550     0.800      0.519
nomic-embed-v1.5      dense      0.625     0.800      0.474
nomic-embed-v1.5      hybrid     0.650     0.900      0.530
bge-m3                bm25       0.550     0.800      0.519
bge-m3                dense      0.700     0.875      0.503
bge-m3                hybrid     0.750     0.900      0.615
qwen3-embedding-0.6b  bm25       0.550     0.800      0.519
qwen3-embedding-0.6b  dense      0.750     0.900      0.490
qwen3-embedding-0.6b  hybrid     0.750     0.925      0.602
```

Exact numbers from the 2026-04-18 run — the decision-making run documented

in Section 7 below.

---

6. Related files

| Path | Purpose |

|---|---|

| `bin/trigger-rebuild-indexes.sh` | Populate `kg_content_chunks` (gitignored — embeds testbot creds) |

| `bin/run-embedding-experiment.sh` | End-to-end driver |

| `eval/experiment-embeddings.sql` | Sandbox DDL (not a migration) |

| `eval/retrieval-queries.csv` | 40-query, 7-category ground truth |

| `wikantik-main/src/main/java/com/wikantik/search/embedding/experiment/` | `ExperimentIndexer`, `ExperimentEvaluator`, `ExperimentCompare`, `ExperimentAggSweep`, `ExperimentRrfSweep`, `ExperimentFinalSweep`, `ExperimentGrandFinale`, `ExperimentHarness`, `Bm25Client`, `ExperimentDb`, `QueryCorpus`, `ReciprocalRankFusion`, `CosineSimilarity`, `VectorCodec` |

| `wikantik-main/src/main/java/com/wikantik/search/embedding/` | Production-side client + config (now `enabled=true` by default; feature-flag remains as the kill switch) |

The experiment code stays in place after each decision so regression runs

stay one Maven command away.

---

7. Model selection — the 2026-04-18 decision

This is the run that picked `qwen3-embedding-0.6b` as the production

embedding model. All three candidates indexed the same ~30k-chunk corpus,

same BM25 baseline, same 40-query / 7-category ground truth, same

max-score page aggregation at that point.

Raw results

| Model | dim | bm25 r@5 | dense r@5 | dense r@20 | dense MRR | hybrid r@5 | hybrid r@20 | hybrid MRR |

|---|---|---|---|---|---|---|---|---|

| `nomic-embed-v1.5` | 768 | 0.550 | 0.625 | 0.800 | 0.474 | 0.650 | 0.900 | 0.530 |

| `bge-m3` | 1024 | 0.550 | 0.700 | 0.875 | 0.503 | 0.750 | 0.900 | 0.615 |

| **`qwen3-embedding-0.6b`** | 1024 | 0.550 | **0.750** | **0.900** | 0.490 | **0.750** | **0.925** | 0.602 |

Reports on disk: `eval/report-nomic-embed-v1.5.txt`,

`eval/report-bge-m3.txt`, `eval/report-qwen3-embedding-0.6b.txt`.

Decision rationale

- **qwen3 leads on recall at both cutoffs.** Dense recall@5 is a full

+0.050 over bge-m3 and +0.125 over nomic. Dense recall@20 is +0.025

over bge-m3 and +0.100 over nomic. Hybrid recall@20 (0.925) is the

best across the three.

- **bge-m3 leads narrowly on MRR** (dense 0.503 vs 0.490; hybrid 0.615

vs 0.602). This was known, weighed, and accepted: for a RAG pipeline

where the retrieved set feeds a generative answer model that handles

its own reranking, "did the right page make it into the top-K?" is

more load-bearing than "exactly where in the top-K did it land?"

Recall > MRR for this workload.

- **nomic-embed-v1.5 loses on every dimension.** The 768-dim model

trails the 1024-dim models by a consistent margin; the asymmetric

prefix (`search_query:` / `search_document:`) did not make up the

gap on this corpus.

Per-category highlights

Where qwen3's dense recall pulled away:

- `synonym-drift` (7 queries like "blue green release strategy" →

`BlueGreenDeployments`): qwen3 and bge-m3 both hit 1.000 dense

recall@5; nomic only 0.571.

- `hard` (5 queries including `"k8s"` → `KubernetesBasics` and `"ai"`

→ `ArtificialIntelligence`): qwen3 held 0.600 dense recall@5;

bge-m3 also 0.600; nomic only 0.400.

- `specific` (5 queries): qwen3 dense recall@5 = 0.800, same as
nomic (0.800) and bge-m3 (0.800). All three handle exact-name
queries fine, so this bucket didn't move the decision.

Aggregation sweep — `ExperimentAggSweep`

With qwen3 locked in, the next question was which chunk → page

aggregation to use. `ExperimentAggSweep` produced

`eval/agg-sweep-qwen3-embedding-0.6b.txt`:

| aggregation | dense r@5 | dense r@20 | dense MRR |

|---|---|---|---|

| MAX | 0.750 | 0.900 | 0.490 |

| MEAN_TOP_3 | 0.750 | 0.950 | 0.576 |

| MEAN_TOP_5 | 0.750 | 0.900 | 0.589 |

| **SUM_TOP_3** | **0.800** | **0.975** | **0.602** |

| SUM_TOP_5 | 0.775 | 0.925 | 0.612 |

| MEAN_TOP_3_LOG_NORM | 0.350 | 0.850 | 0.175 |

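`SUM_TOP_3` wins both recall cutoffs; `SUM_TOP_5` edges it on MRR (0.612 vs
0.602), which matters less here because recall outranks MRR for this workload.
MEAN_TOP_3_LOG_NORM is the sanity-check negative result: log-normalising chunk
scores before aggregating kills signal.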
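The top-N aggregations in the sweep can be sketched as follows. This is a hedged illustration, not the production `PageAggregator`: the class and method names are hypothetical, and dividing the mean by the number of chunks actually used (rather than always by N) is an assumption.

```java
import java.util.Arrays;

/** Illustrative only: collapse one page's chunk scores into a single page score. */
class PageAggregationSketch {

    /** Sum of the n highest chunk scores (fewer if the page has fewer chunks). */
    static double sumTop(double[] chunkScores, int n) {
        double[] sorted = chunkScores.clone();
        Arrays.sort(sorted); // ascending, so walk from the end
        double sum = 0;
        for (int i = sorted.length - 1, taken = 0; i >= 0 && taken < n; i--, taken++) {
            sum += sorted[i];
        }
        return sum;
    }

    /** Mean of the n highest chunk scores, dividing by the count actually used. */
    static double meanTop(double[] chunkScores, int n) {
        int used = Math.min(n, chunkScores.length);
        return used == 0 ? 0.0 : sumTop(chunkScores, n) / used;
    }

    /** MAX is the n = 1 special case. */
    static double max(double[] chunkScores) {
        return sumTop(chunkScores, 1);
    }
}
```

The difference shows up on pages with several strong chunks: a page with three chunks scoring 0.6 each loses to a page with a single 0.9 chunk under MAX (0.6 vs 0.9), but wins under a sum-of-top-3 rule (1.8 vs 0.9).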

Joint sweep — `ExperimentFinalSweep` and `ExperimentGrandFinale`

The final sweeps fan every aggregation across every fusion strategy

(dense-only, RRF with three weighting variants, plain score averaging

with dense-heavy / bm25-heavy / equal weights) to confirm the winner

survives hyperparameter interaction.

Best combination per model (`eval/grand-finale.txt`):

| Model | Best aggregation | Best fusion | r@5 | r@20 | MRR |

|---|---|---|---|---|---|

| nomic-embed-v1.5 | SUM_TOP_5 | RRF_RECALL | 0.750 | 0.925 | 0.558 |

| bge-m3 | MEAN_TOP_3 | SCORE_DENSE_HEAVY | 0.775 | 0.900 | 0.622 |

| **qwen3-embedding-0.6b** | **SUM_TOP_3** | **dense-only** | **0.800** | **0.975** | **0.602** |

qwen3 + SUM_TOP_3 + dense-only wins r@5 and r@20 outright. bge-m3's

best-case MRR (0.622) edges qwen3's (0.602), but qwen3 comes

within 0.020 at r@5=0.800 vs bge-m3's 0.775. The decision stood:

**qwen3 with SUM_TOP_3 aggregation**.

Production defaults picked from this data

`wikantik-main/.../search/hybrid/HybridConfig.java`:

```
DEFAULT_PAGE_AGGREGATION = PageAggregation.SUM_TOP_3;
DEFAULT_RRF_K = 60;
DEFAULT_BM25_WEIGHT = 1.0;
DEFAULT_DENSE_WEIGHT = 1.5; // dense-heavy, matches SCORE_DENSE_HEAVY
DEFAULT_RRF_TRUNCATE = 20;
DEFAULT_DENSE_CHUNK_TOP = 500;
DEFAULT_DENSE_PAGE_TOP = 100;
```

Dense is weighted 1.5× vs BM25 in the RRF fusion because the grand

finale shows dense-leaning hybrids consistently within 0.05 of the

top recall while beating BM25-heavy variants on MRR. `RRF k=60` is

the standard Cormack/Clarke/Büttcher default carried from the

literature — the sweep confirmed no nearby value beat it on our

corpus by enough to justify a non-conventional choice.
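Weighted RRF with these defaults can be sketched as below. The names are hypothetical and this is not the production `HybridFuser`; in particular, treating the truncate setting as "only the first 20 entries of each input list contribute" is an assumption about what `DEFAULT_RRF_TRUNCATE` means.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: weighted Reciprocal Rank Fusion of two page rankings. */
class WeightedRrfSketch {

    /** Each list adds weight / (k + rank) per page, with rank starting at 1. */
    static List<String> fuse(List<String> bm25, List<String> dense,
                             double bm25Weight, double denseWeight, int k, int truncate) {
        Map<String, Double> score = new HashMap<>();
        accumulate(score, bm25, bm25Weight, k, truncate);
        accumulate(score, dense, denseWeight, k, truncate);
        List<String> pages = new ArrayList<>(score.keySet());
        pages.sort((a, b) -> {
            int byScore = Double.compare(score.get(b), score.get(a)); // descending
            return byScore != 0 ? byScore : a.compareTo(b);           // stable tie-break
        });
        return pages;
    }

    private static void accumulate(Map<String, Double> score, List<String> ranking,
                                   double weight, int k, int truncate) {
        int depth = Math.min(truncate, ranking.size());
        for (int i = 0; i < depth; i++) {
            score.merge(ranking.get(i), weight / (k + i + 1), Double::sum);
        }
    }
}
```

Calling `fuse(bm25, dense, 1.0, 1.5, 60, 20)` mirrors the defaults listed above; setting both weights to 1.0 reduces it to plain RRF.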

---

8. Chunker improvement results (2026-04-19)

Two targeted changes to the chunking + embedding pipeline landed together and

were evaluated against the frozen `qwen3-embedding-0.6b` baseline. Both are

structural — no retriever, fusion weights, or query-side logic changed.

What changed

1. **Atomic list chunking.** `ContentChunker.isAtomic(Node)` now treats

`BulletList` and `OrderedList` as indivisible up to `maxTokens × 4` (≈ 2048

tokens). Previously, Flexmark emitted each list item as a separate block

and the merge pass sometimes split related items across chunks. Lists of

command flags, step-by-step instructions, and config options now live in

one chunk with their siblings.

2. **Heading path prepended at embed time.** A new `EmbeddingTextBuilder`

renders `"<Top> > <Mid> > <Leaf>\n\n<body>"` and is the single rendering

point for both `EmbeddingIndexService` (production) and `ExperimentIndexer`

(sandbox). The stored chunk text in `kg_content_chunks.text` stays

body-only; the heading-path-aware string only exists on the wire to the

embedder. Chunk identity (`content_hash` = sha256(heading_path + text))

is unchanged.
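The rendering and the chunk identity described above can be sketched as follows. `EmbeddingTextSketch` is a hypothetical stand-in for `EmbeddingTextBuilder`; returning the bare body when there is no heading path, and hashing a plain string concatenation of heading path and text as UTF-8, are both assumptions.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;

/** Illustrative only: heading-path-aware embed text and a sha256 content hash. */
class EmbeddingTextSketch {

    /** Renders "Top > Mid > Leaf\n\nbody"; the stored chunk text stays body-only. */
    static String render(List<String> headingPath, String body) {
        if (headingPath.isEmpty()) return body; // assumption: no prefix without headings
        return String.join(" > ", headingPath) + "\n\n" + body;
    }

    /** Lowercase hex sha256 of headingPath + text (serialization is an assumption). */
    static String contentHash(String headingPath, String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest((headingPath + text).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 is required on every Java platform", e);
        }
    }
}
```

Because the hash depends only on heading path and body, changing how the embed text is rendered does not change chunk identity; only the vector sent to the embedder differs.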

Corpus impact

| | Before | After |

|---|---|---|

| `kg_content_chunks` rows | 23,656 | 39,264 |

| Avg tokens / chunk | ~230 | 103 |

More, smaller chunks — atomic lists stop the merge pass from gluing unrelated

blocks together, so prose paragraphs are no longer inflated by adjacent list

content.

Retrieval metrics (40 queries, 7 categories)

**Overall:**

| retriever | recall@5 | recall@20 | MRR |

|---|---|---|---|

| bm25 | 0.550 → **0.775** (+0.225) | 0.800 → **0.975** (+0.175) | 0.519 → **0.650** (+0.131) |

| dense | 0.750 → 0.750 (+0.000) | 0.900 → **0.950** (+0.050) | 0.490 → **0.627** (+0.137) |

| hybrid | 0.750 → **0.850** (+0.100) | 0.925 → **0.975** (+0.050) | 0.602 → **0.667** (+0.065) |

**Categories that moved the most:**

- `business-process` bm25 recall@5: 0.400 → 0.800

- `hard` bm25 recall@20: 0.600 → 1.000; dense recall@20: 0.800 → 1.000

- `indirect` bm25 MRR: 0.458 → 0.614 (every retriever now recall@20 = 1.000)

- `general` dense MRR: 0.461 → 0.667

Why the gains break down this way

- **BM25 jumped more than expected.** BM25 in this harness indexes over the

chunk table, not pages. When a query's answer is "item 3 of the `--force`

options list," splitting that list across chunks left BM25 with a diluted

hit. Atomic lists put every related term in one chunk and BM25 lights up

cleanly. The `hard`, `indirect`, and `business-process` categories — all

heavy on flag/step/option queries — moved the most.

- **Dense recall@5 was already saturated at 0.750 and didn't move, but dense

MRR jumped +0.137.** Heading-path context doesn't widen the net; it pulls

the correct chunk up in rank. That's exactly the precision-not-recall win

the change was designed to produce.

- **Hybrid is the winner** at recall@5 = 0.850 and recall@20 = 0.975 — best

across every category. Because BM25 and dense both improved but on different

axes (recall vs. rank), RRF compounds the gains.

What this means for overlap

Overlap (replaying the last N tokens of chunk *i* as the first N tokens of

chunk *i+1*) was the obvious next lever — until these two changes absorbed

most of what overlap was meant to fix:

- Boundary bleed on list items → gone; lists are atomic.

- Context-poor chunks → gone; heading path is on the wire.

Recall@20 hybrid is 0.975. There are 39 of 40 queries recovered; the miss

budget for overlap to improve against is one query. If overlap is worth

revisiting, the signal will show up in **dense recall@5** (stuck at 0.750),

not in the hybrid overall.

Reports on disk

| Path | What |

|---|---|

| `eval/report-qwen3-embedding-0.6b-baseline-prechunk.txt` | Before, 2026-04-18 |

| `eval/report-qwen3-embedding-0.6b-2026-04-19T20-01-22-331504860Z.txt` | After, 2026-04-19 |

Reproduce with:

```
mvn -pl wikantik-main exec:java \
  -Dexec.mainClass=com.wikantik.search.embedding.experiment.ExperimentCompare \
  -Dexec.args="<before.txt> <after.txt>"
```

---

9. Chunker rebuild — merge-forward floor raised (2026-04-23)

During the Phase 2 entity-extractor benchmark work it became clear that

chunk *count*, not chunk *size*, was the bottleneck on full-corpus batch

extraction: at ~39k chunks the projected extractor wall-clock was ~95 h

on the shipping model. Inspection of `ContentChunker.java` revealed two

advertised config keys (`target_tokens`, `min_tokens`) that were never

referenced in the class — dead knobs — and one lever,

`merge_forward_tokens`, that actually controls chunk consolidation.

What changed

- Removed the dead `target_tokens` and `min_tokens` fields from

`ContentChunker.Config` and from `wikantik.properties`.

- Raised `wikantik.chunker.merge_forward_tokens` default from **8 → 150**.

Sections whose accumulated text is below this threshold are now held

and merged into the next section rather than emitted immediately, so

short sibling sections (typical wiki "Overview", "Introduction",

"Notes" blocks) coalesce into chunks of reasonable size.

- Commit: `2acccf102 feat(kg-rag): phase 1-3 uplift`.
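The hold-and-merge behaviour of the threshold can be sketched as below. This is a simplified illustration, not `ContentChunker`: it counts whitespace-separated words as tokens, ignores the max-token cap and atomic blocks, and emits any short trailing remainder as its own chunk. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: hold short sections and merge them forward into the next one. */
class MergeForwardSketch {

    /** Crude token count: whitespace-separated words (an assumption). */
    static int tokens(String text) {
        String t = text.trim();
        return t.isEmpty() ? 0 : t.split("\\s+").length;
    }

    /** Emits a chunk only once the accumulated text reaches the threshold. */
    static List<String> mergeForward(List<String> sections, int mergeForwardTokens) {
        List<String> chunks = new ArrayList<>();
        StringBuilder held = new StringBuilder();
        for (String section : sections) {
            if (held.length() > 0) held.append("\n\n");
            held.append(section);
            if (tokens(held.toString()) >= mergeForwardTokens) {
                chunks.add(held.toString());
                held.setLength(0);
            }
        }
        if (held.length() > 0) chunks.add(held.toString()); // short trailing remainder
        return chunks;
    }
}
```

Raising the threshold never increases the chunk count in this model, which matches the direction of the corpus change: fewer, larger chunks.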

Corpus impact

| | Before | After |

|---|---|---|

| `kg_content_chunks` rows | 39,264 | **23,256** (−41%) |

| `content_chunk_embeddings` rows | 39,264 | 23,256 |

| Mean tokens / chunk | 103 | **174** (+69%) |

| p50 tokens / chunk | 77 | 166 |

| p95 tokens / chunk | 261 | 335 |

| Max tokens / chunk | 1,963 | 1,963 (atomic blocks unchanged) |

| Embedding re-index wall-clock (qwen3-embedding-0.6b) | — | 6m 18s |

Retrieval quality — live search top-10 diff

The retrieval harness was not re-run against the new chunks (the extractor

benchmark was the day's priority and the harness requires a full corpus

re-embed loop plus the 40-query evaluation pass). Instead, we ran a spot-check
comparison on live `/api/search` for two high-traffic queries, with
graph rerank disabled (mentions=0 after the rebuild cascaded them all):

**`"knowledge graph"`** — top 10:

| Rank | Old 39k chunks | New 23k chunks |

|---|---|---|

| 1 | WikantikKnowledgeGraphAdmin | InventionOfKnowledgeGraph |

| 2 | InventionOfKnowledgeGraph | WikantikKnowledgeGraphAdmin |

| 3 | KnowledgeGraphCore | KnowledgeGraphCore |

| 4 | KnowledgeGraphDogfooding | KnowledgeGraphVsRelationalDatabase |

| 5 | KnowledgeGraphVsRelationalDatabase | GraphRAG |

| 6 | GraphRAG | KnowledgeGraphsAndManagement |

| 7 | IndustrialKnowledgeGraphUseCases | IndustrialKnowledgeGraphUseCases |

| 8 | KnowledgeGraphsAndManagement | FederatedKnowledgeGraphs |

| 9 | KnowledgeGraphCompletion | KnowledgeGraphCompletion |

| 10 | FederatedKnowledgeGraphs | KnowledgeGraphConstructionPipeline |

Set overlap: **9/10**. Dropped: `KnowledgeGraphDogfooding`. Added:
`KnowledgeGraphConstructionPipeline`. Top-3 set preserved, positions
1–2 swapped (both highly relevant).

**`"GraphRAG"`**: top 10 essentially identical, set overlap 9/10 with
one substitution at position 8 (`AiFunctionCallingAndToolUse` →
`AiMemoryAndPersistence`).
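The set-overlap figure for a top-10 diff can be recomputed mechanically. The sketch below uses the two `"knowledge graph"` columns from the table above; `TopKOverlapSketch` is a hypothetical helper and is not part of the harness.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustrative only: how many pages two top-k lists share, ignoring position. */
class TopKOverlapSketch {

    static int overlap(List<String> before, List<String> after) {
        Set<String> shared = new HashSet<>(before);
        shared.retainAll(new HashSet<>(after));
        return shared.size();
    }

    public static void main(String[] args) {
        List<String> oldTop = List.of(
            "WikantikKnowledgeGraphAdmin", "InventionOfKnowledgeGraph", "KnowledgeGraphCore",
            "KnowledgeGraphDogfooding", "KnowledgeGraphVsRelationalDatabase", "GraphRAG",
            "IndustrialKnowledgeGraphUseCases", "KnowledgeGraphsAndManagement",
            "KnowledgeGraphCompletion", "FederatedKnowledgeGraphs");
        List<String> newTop = List.of(
            "InventionOfKnowledgeGraph", "WikantikKnowledgeGraphAdmin", "KnowledgeGraphCore",
            "KnowledgeGraphVsRelationalDatabase", "GraphRAG", "KnowledgeGraphsAndManagement",
            "IndustrialKnowledgeGraphUseCases", "FederatedKnowledgeGraphs",
            "KnowledgeGraphCompletion", "KnowledgeGraphConstructionPipeline");
        System.out.println("overlap = " + overlap(oldTop, newTop) + "/" + oldTop.size());
    }
}
```

Set overlap ignores rank changes, so it complements rather than replaces the harness's rank-aware metrics.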

Why it's not a regression

- **BM25 is chunk-indexed in this harness but page-indexed in production

Lucene.** The 04-19 +0.225 BM25 recall@5 gain was from atomic-list

chunking giving lexical matches tighter scope; today's merge-forward
change moves in the opposite direction (coarser chunks) and would *lose* some of

that BM25 gain in the harness. On the production path, Lucene indexes

whole pages, so merge-forward has zero BM25 effect.

- **Dense on 174-token chunks is slightly less peaky than on 103-token

chunks.** Chunks now containing multiple related sub-topics can win

queries they previously missed; chunks narrowly winning on a focused

topic sentence may slip a rank or two. Net: 90% top-10 overlap, slight
reordering, no new misses observed.

- **The chunker rebuild was motivated by extraction cost, not retrieval

quality.** It trades a small dense-retrieval peakiness hit for a ~41%
reduction in chunk count and a proportional reduction in extractor batch

wall-clock. For the corpus we have, that's the right trade; the

graph rerank layer that's about to get populated more than makes up

for the dense-retrieval peakiness.

If we want the harness number

Re-running the harness against the new chunks takes:

```
bin/trigger-rebuild-indexes.sh   # already done; 23k chunks live

# Clear the stale sandbox embeddings (qwen3's prior vectors are still keyed
# by the old chunk ids, which were cascaded away by V011):
PGPASSWORD='…' psql -h localhost -U jspwiki -d jspwiki \
  -c "DELETE FROM experiment_embeddings WHERE model_code='qwen3-embedding-0.6b'"

# Re-index + evaluate:
mvn -pl wikantik-main exec:java \
  -Dexec.mainClass=com.wikantik.search.embedding.experiment.ExperimentIndexer \
  -Dexec.args="qwen3-embedding-0.6b" \
  -Dwikantik.experiment.db.password='…'

mvn -pl wikantik-main exec:java \
  -Dexec.mainClass=com.wikantik.search.embedding.experiment.ExperimentEvaluator \
  -Dexec.args="qwen3-embedding-0.6b eval/retrieval-queries.csv eval/report-qwen3-embedding-0.6b-2026-04-23-postmerge.txt" \
  -Dwikantik.experiment.db.password='…' \
  -Dwikantik.experiment.wiki.user=testbot \
  -Dwikantik.experiment.wiki.password='…'
```

Pending: not blocking extraction work, but the right next retrieval-side

regression run to close the loop.

---

10. Evolution — from scratch to production

Condensed git-log narrative for anyone who needs to understand how each

piece got here.

| Date | Commit | What |

|---|---|---|

| 2026-04-14 | `1da3a7dce` | Initial `Chunk` record and minimal `ContentChunker` |

| 2026-04-14 | `64a8de3ff` | Heading-aware splitting with `heading_path` |

| 2026-04-14 | `185dd80cc` | Token budget, atomic blocks, merge-forward |

| 2026-04-14 | `f141524fe` | Explicit `mergeForwardTokens` Config field |

| 2026-04-14 | `9b411eeea` | `ContentChunkRepository` with diff apply + stats |

| 2026-04-15 | `c386f21b0` | Save-time `ChunkProjector` page filter |

| 2026-04-15 | `632737f58` | Prometheus metrics for chunker and rebuild |

| 2026-04-15 | `e8966adb` | Async page-save listener for incremental embedding reindex |

| 2026-04-16 | `a9a199041` | Stub `TextEmbeddingClient` + `EmbeddingKind` for Phase 1 |

| 2026-04-16 | `46ad1441e` | Caffeine dep + `TextEmbeddingClient` stub for Phase 4 |

| 2026-04-16 | `a9f2ae876` | `EmbeddingIndexService` — production chunk-embedding data layer |

| 2026-04-17 | `c0dabd53f` | `DenseRetriever` + placeholder `ChunkVectorIndex` |

| 2026-04-17 | `c169604ce` | `HybridFuser` for weighted RRF of BM25 and dense lists |

| 2026-04-17 | `6b6527572` | `PageAggregation` + `PageAggregator` |

| 2026-04-17 | `8717e533f` | `QueryEmbedderConfig`, `CircuitState`, metrics snapshot |

| 2026-04-17 | `4e3fe9921` | Hand-rolled CLOSED/OPEN/HALF_OPEN circuit breaker |

| 2026-04-17 | `4156d7a1f` | `QueryEmbedder` wraps embedding client with cache + timeout + breaker |

| 2026-04-17 | `e9bf62d1e` | `InMemoryChunkVectorIndex` for dense top-k |

| 2026-04-17 | `0825bb97b` | Ollama embedding client and model registry |

| 2026-04-18 | `5211ea391` | Retrieval experiment harness + first model-comparison reports |

| 2026-04-18 | `373a024a2` | `HybridConfig` with defaults matching the winning experiment |

| 2026-04-18 | `376cdb877` | Phase 3: hybrid retrieval core (PageAggregation, HybridFuser, DenseRetriever) |

| 2026-04-19 | `b6a86fba7` | Hybrid perf pass: parallelized embedding, incremental index, heading-aware context |

| 2026-04-19 | `c4447350d` | Release v1.1.6: hybrid retrieval, MCP access hardening, admin content ops |

| 2026-04-23 | `2acccf102` | KG-RAG Phase 1-3: unified embeddings, extractor pipeline, graph-aware rerank + chunker merge-forward 8 → 150 |

| 2026-04-23 | `c289fbdd7` | Standalone extract-CLI for Tomcat-less batch runs |

The two inflection points:

- **2026-04-18** — the experiment harness ran, qwen3-embedding-0.6b won,

`HybridConfig` was frozen with SUM_TOP_3 aggregation and dense-heavy

RRF. That's the moment "dense search" stopped being an experiment and

started being the production search path.

- **2026-04-23** — the Phase 3 graph rerank layer added mentions as a

new input signal on top of hybrid, and the chunker floor was raised

to make whole-corpus entity extraction operationally affordable. That

turns the harness's page-ranking story from "what's our best hybrid

score" into "what's our best hybrid score *plus* graph-derived

proximity boost" — and re-running the harness after the extractor

batch populates mentions is the next regression checkpoint.