Knowledge-Graph Extraction Benchmarks — April 2026

This page records the benchmark runs performed on 2026-04-23 while choosing

the shipping configuration for the Phase 2 entity extractor and the

save-time chunker. The goal was to pick a model/concurrency/chunk-size

combination that could extract the full ~40k-chunk corpus in an operator-friendly

wall-clock without swamping the review queue with noisy proposals.

The final choices in production:

- Extractor backend: **Ollama** against `inference.jakefear.com:11434`

- Model: **`gemma4-assist:latest`**

- Concurrency: **2**

- Chunker merge-forward floor: **150 tokens** (up from 8)

Everything below shows why — and why each of the alternatives that looked

promising on paper lost on actually-measured numbers.

Corpus baseline

| Metric | Before rebuild | After rebuild |
|---|---|---|
| `kg_content_chunks` | 39,264 | 23,256 |
| `content_chunk_embeddings` | 39,264 | 23,256 |
| `chunk_entity_mentions` | 344 | 0 (FK-cascaded on chunk delete) |
| Mean tokens / chunk | 103 | 174 |
| p50 / p95 tokens / chunk | 77 / 261 | 166 / 335 |
| Max tokens / chunk | 1,963 | 1,963 (atomic blocks unchanged) |
| Embedding rebuild wall-clock | n/a | 6m 18s |

The chunker rebuild was driven by one configuration change: raising

`wikantik.chunker.merge_forward_tokens` from 8 to 150. That change also

exposed two previously-advertised-but-unwired knobs (`target_tokens`,

`min_tokens`) which were removed from the `ContentChunker.Config` record.
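As a rough illustration of what a merge-forward floor does, here is a minimal sketch. It is not the `ContentChunker` implementation, which this page does not show: the `Chunk` type, the `merge_forward` function, and the way atomic blocks are passed through are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Chunk:
    text: str
    tokens: int
    atomic: bool = False  # hypothetical flag: a block that is never merged


def merge_forward(chunks: list[Chunk], floor: int) -> list[Chunk]:
    """Merge every chunk below `floor` tokens into the chunk that follows it."""
    out: list[Chunk] = []
    pending: Optional[Chunk] = None  # undersized chunk waiting for a successor
    for chunk in chunks:
        if chunk.atomic:
            # Assumption: atomic blocks pass through untouched, so any
            # pending undersized chunk is flushed as-is before them.
            if pending is not None:
                out.append(pending)
                pending = None
            out.append(chunk)
            continue
        if pending is not None:
            chunk = Chunk(pending.text + "\n" + chunk.text,
                          pending.tokens + chunk.tokens)
            pending = None
        if chunk.tokens < floor:
            pending = chunk  # still too small, keep merging forward
        else:
            out.append(chunk)
    if pending is not None:
        out.append(pending)  # trailing remainder has nothing to merge into
    return out


sample = [Chunk("a", 5), Chunk("b", 60), Chunk("c", 100),
          Chunk("d", 200), Chunk("e", 40)]
sizes_floor_8 = [c.tokens for c in merge_forward(sample, floor=8)]
sizes_floor_150 = [c.tokens for c in merge_forward(sample, floor=150)]
```

With a floor of 8 only the 5-token fragment is absorbed; with a floor of 150 the first three chunks collapse into one, which is the kind of effect that moves the mean chunk size upward.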

Summary of extractor runs

All runs use the same system prompt, same chunker output (with differing

chunk sizes between "old" and "new"), and the same post-processing

(`chunk_entity_mentions` upsert, `kg_proposals` insert, `kg_rejections`

suppression). Per-chunk RPC mean is the wall-clock latency of a single

extraction call; effective s/chunk is per-chunk-RPC divided by concurrency.

| Run | Model | Prompt | Concurrency | Chunks | Per-chunk RPC (mean) | Effective s/chunk | Full-corpus projection |
|---|---|---|---|---|---|---|---|
| 1 | gemma4-assist:latest | verbose | 1 | 39k | 8.5 s | 8.5 s | ≈ 93 h |
| 2 | qwen2.5:7b-instruct-q5_K_M | verbose | 1 | 39k | 14.2 s | 14.2 s | ≈ 155 h |
| 3 | qwen2.5:7b-instruct-q5_K_M | tightened | 2 | 39k | 18.8 s | 9.4 s | ≈ 104 h |
| 4 | gemma4:e2b | tightened | 3 | 39k | 20.2 s | 6.7 s | ≈ 73 h |
| 5 | gemma4-assist:latest | tightened | 4 | 39k | 41.0 s | 10.3 s | ≈ 112 h |
| **6** | **gemma4-assist:latest** | **tightened** | **2** | **23k** | **25.3 s** | **12.7 s** | **≈ 82 h** |
| 7 | gemma4-assist:latest | tightened | 1 | 23k | 13.6 s | 13.6 s | ≈ 88 h |

**Run 6 is the shipping configuration.** Runs 3 and 4 have a lower
effective per-chunk time, and run 4 also has a shorter full-corpus
projection, but both lose to run 6 on quality (detailed below). Run 5
(c=4) suggests that scaling past c=2 has negative returns on a 4060 Ti
serving a 7–8B-class model, because the GPU is already bandwidth-bound.
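The two derived columns in the table above follow from simple arithmetic. The sketch below reproduces them for runs 1 and 6, assuming the projection is just effective seconds per chunk multiplied by the exact chunk counts from the corpus baseline table; the function names are made up for the example.

```python
OLD_CORPUS_CHUNKS = 39_264  # kg_content_chunks before the rebuild
NEW_CORPUS_CHUNKS = 23_256  # kg_content_chunks after the rebuild


def effective_seconds_per_chunk(rpc_seconds: float, concurrency: int) -> float:
    """Per-chunk RPC latency divided by the number of requests in flight."""
    return rpc_seconds / concurrency


def projected_hours(rpc_seconds: float, concurrency: int, chunk_count: int) -> float:
    """Full-corpus wall-clock estimate in hours."""
    seconds = effective_seconds_per_chunk(rpc_seconds, concurrency) * chunk_count
    return seconds / 3600


run1_hours = projected_hours(8.5, 1, OLD_CORPUS_CHUNKS)   # verbose prompt, c=1
run6_hours = projected_hours(25.3, 2, NEW_CORPUS_CHUNKS)  # shipping config, c=2
```

Because the projection scales with chunk count, a run on the 23k corpus can finish sooner than a run with a lower per-chunk time on the 39k corpus.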

Proposal volume comparison

On the same page (AbstractAlgebra, 42 chunks in the old corpus / 27 in the

new):

| Model / prompt | Mentions | Proposals | Mentions:Proposals |
|---|---|---|---|
| gemma4-assist + verbose (run 1) | 8 | 108 | 1 : 13 |
| qwen2.5:7b + verbose (run 2) | 1 | 213 | 1 : 213 |
| qwen2.5:7b + tight (run 3) | 2 | 158 | 1 : 79 |
| gemma4:e2b + tight (run 4) | 4 | 129 | 1 : 32 |
| gemma4-assist + tight c=4 (run 5) | 15 | 68 | 1 : 4.5 |
| gemma4-assist + tight c=2 (run 6) | 12 | 48 | 1 : 4 |

The tighter prompt reduced gemma4-assist's proposal-per-chunk rate from
~2.6 to ~1.3 while simultaneously doubling the mention rate. That leaves
roughly half the review queue to work through for equivalent coverage.
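The Mentions:Proposals column can be recomputed from the raw counts, which also gives a quick way to rank the runs by review noise. This is a small sketch using only the AbstractAlgebra numbers from the table; the dictionary layout and helper name are illustrative.

```python
# (mentions, proposals) for the AbstractAlgebra page, copied from the table.
ABSTRACT_ALGEBRA = {
    "run 1": (8, 108),
    "run 2": (1, 213),
    "run 3": (2, 158),
    "run 4": (4, 129),
    "run 5": (15, 68),
    "run 6": (12, 48),
}


def proposals_per_mention(mentions: int, proposals: int) -> float:
    """How many proposals a reviewer sees for each confirmed mention."""
    return proposals / mentions


ratios = {run: proposals_per_mention(m, p)
          for run, (m, p) in ABSTRACT_ALGEBRA.items()}
# Lowest ratio first: the least review noise per mention.
ranking = sorted(ratios, key=ratios.get)
```

A lower ratio means less queue to review per useful mention, which is the sense in which runs 5 and 6 stand apart from the rest.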

Quality observations — same page, different models

Sample of top entities emitted for **AbstractAlgebra** (first chunks):

**gemma4-assist, verbose prompt (run 1)**

```

rings, Ore rings, localization procedures, Ore condition, quantum groups,

C*-algebras, Ring theory, algebraic structures, integers, polynomials,

algebraic geometry, representation theory, non-commutative geometry,

categorical frameworks, geometric frameworks

```

All domain-appropriate named concepts. Reasoning grounded in specific

chunk text.

**qwen2.5:7b, verbose prompt (run 2)**

```

Ring, Algebraic structures $(R, +, \x08oldsymbol{⋅})$, ← LaTeX corruption

Integers ($\x08oldsymbol{ℤ}$), ← LaTeX corruption

AdvancedTechniques, ← CamelCase phrase

SharedRigorousUnderstanding, ← phrase-as-entity

FoundationalObject, ← meta-term

R, +, ⋅, R, abean, ← operators, typo

ring, unital ring, non-unital ring

```

The `\x08oldsymbol` string is a JSON escape collision: the model emitted
LaTeX `\boldsymbol` without doubling the backslash, so the JSON parser
decoded `\b` as the ASCII backspace control character (0x08). The `abean`
entry is a hallucinated typo of "abelian". Both are pre-existing qwen
JSON-mode issues on technical prose.
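The backspace collision is easy to reproduce with any standards-compliant JSON parser. The sketch below uses Python's `json` module on a hand-written payload; the payload and the `has_control_chars` filter are illustrative and are not taken from the extractor.

```python
import json

# Model output in which the LaTeX backslash was NOT doubled.
raw_unescaped = r'{"name": "$\boldsymbol{Z}$"}'
# The same entity with the backslash escaped as JSON requires.
raw_escaped = r'{"name": "$\\boldsymbol{Z}$"}'

# "\b" is a valid JSON escape for backspace, so this parses without error
# and silently swallows the "b" of "boldsymbol".
corrupted = json.loads(raw_unescaped)["name"]
intact = json.loads(raw_escaped)["name"]


def has_control_chars(name: str) -> bool:
    """Flag entity names that contain ASCII control characters."""
    return any(ord(ch) < 0x20 for ch in name)
```

A check like `has_control_chars` is one possible way to catch such names before they reach the review queue, since the parser itself raises no error.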

**qwen2.5:7b, tightened prompt (run 3)**

```

Ring theory, algebraic structures $(R, +, ⋅)$, arithmetic of integers ($ℤ$),

polynomials ($k[x]$), advanced techniques, foundational object,

R, +, ⋅, R, 0, Unity, FunctionalAnalysis (type=Article),

commutative rings, non-commutative rings

```

LaTeX corruption resolved. Still emits operators and single letters.

`FunctionalAnalysis` correctly tagged as `Article` — the one win.

**gemma4:e2b, tightened prompt (run 4)**

```

Ring, R, +, ×, 0, a, b, c, FunctionalAnalysis, Rings, Commutativity,

Non-commutative, Ring, ideal, Prime Ideals, R, P, Spec(R), M

```

Single letters (`a`, `b`, `c`, `R`, `M`, `P`) appear frequently — the 2B

model is the least discriminating about what qualifies as a named entity.

Does correctly pick up `Spec(R)`, `Prime Ideals`, `FunctionalAnalysis`.

**gemma4-assist, tightened prompt (runs 5–7)**

Quality is equivalent to run 1 but with the proposal-volume and

mention-rate improvements from the tighter prompt.

Why concurrency doesn't help gemma4-assist

Adding concurrent requests to Ollama on a single 4060 Ti slows each
in-flight request almost proportionally, because the GPU is
bandwidth-bound on a 7–8B-class model. The net effect for gemma4-assist:

| Concurrency | Effective s/chunk | Relative |
|---|---|---|
| c=1 | 13.6 s | baseline |
| c=2 | 12.7 s | **7% faster** |
| c=4 | 10.3 s (old-chunks), degraded on new chunks | modest gain then regression |

Concurrency does help the 2B `gemma4:e2b` — there's enough VRAM headroom

for the GPU scheduler to actually run requests in parallel. But the

quality regression on small models is severe enough that we didn't ship

it.
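One way to read the gemma4-assist numbers is to separate the per-request slowdown from the net throughput gain. The sketch below does that for runs 7 and 6 (c=1 and c=2 on the new chunks); the function name is made up, and the c=4 run is left out because it was measured on the old chunks.

```python
def concurrency_effect(rpc_c1: float, rpc_cn: float, n: int) -> tuple[float, float]:
    """Return (per-request slowdown, net throughput speedup) at concurrency n.

    A slowdown equal to n means the requests were effectively serialized,
    which gives a speedup of exactly 1.0 (no gain).
    """
    slowdown = rpc_cn / rpc_c1
    speedup = n / slowdown
    return slowdown, speedup


# Per-chunk RPC means from the run table: 13.6 s at c=1, 25.3 s at c=2.
slowdown_c2, speedup_c2 = concurrency_effect(13.6, 25.3, 2)
```

Each request takes roughly 1.86 times as long at c=2, so doubling the requests in flight buys only a single-digit percentage of extra throughput.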

Per-page timing detail (first 11 pages, run 6)

Reference for anyone tuning a sample: first 11 pages of run 6 (shipping

config, 23k-chunk corpus, gemma4-assist c=2):

| Page | Chunks | Total (ms) | Per-chunk (ms) | Mentions | Proposals |
|---|---|---|---|---|---|
| 2026IranWar | 4 | 73,691 | 18,423 | 8 | 27 |
| AbstractAlgebra | 27 | 322,274 | 11,936 | 12 | 48 |
| AcceleratingAiLearning | 13 | 140,737 | 10,826 | 8 | 6 |
| AccountTypeStrategy | 12 | 165,757 | 13,813 | 12 | 24 |
| AcidTransactionsAndIsolation | 29 | 493,587 | 17,020 | 1 | 77 |
| ActorModelProgramming | 26 | 322,939 | 12,421 | 8 | 43 |
| AdapterPattern | 28 | 370,073 | 13,217 | 18 | 45 |
| AdjustmentOfStatusProcess | 13 | 130,194 | 10,015 | 7 | 21 |
| AdminSecurityUi | 2 | 30,972 | 15,486 | 2 | 8 |
| AdvancedSkillPatterns | 24 | 251,145 | 10,464 | 3 | 22 |
| AdventureTravelPlanning | 23 | 293,852 | 12,776 | 2 | 39 |
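For anyone checking a sample of their own against these figures, the per-chunk column is total time divided by chunk count, and a chunk-weighted mean over the sample is a better sanity check than averaging the per-page values. A small sketch using the rows above (variable names are illustrative):

```python
# (page, chunks, total_ms) copied from the table above.
SAMPLE = [
    ("2026IranWar", 4, 73_691),
    ("AbstractAlgebra", 27, 322_274),
    ("AcceleratingAiLearning", 13, 140_737),
    ("AccountTypeStrategy", 12, 165_757),
    ("AcidTransactionsAndIsolation", 29, 493_587),
    ("ActorModelProgramming", 26, 322_939),
    ("AdapterPattern", 28, 370_073),
    ("AdjustmentOfStatusProcess", 13, 130_194),
    ("AdminSecurityUi", 2, 30_972),
    ("AdvancedSkillPatterns", 24, 251_145),
    ("AdventureTravelPlanning", 23, 293_852),
]

total_chunks = sum(chunks for _, chunks, _ in SAMPLE)
total_ms = sum(ms for _, _, ms in SAMPLE)

# Weight by chunk count so two-chunk pages do not skew the average.
weighted_mean_ms = total_ms / total_chunks

# Page with the highest per-chunk latency in the sample.
slowest_page = max(SAMPLE, key=lambda row: row[2] / row[1])[0]
```

Over these 11 pages the weighted mean works out to about 12.9 s per chunk, in line with the 12.7 s effective figure reported for run 6.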

Chunker rebuild — search-ranking impact

Before the full extractor run we verified that the 39k → 23k chunker

rebuild didn't regress search quality. The test compared top-10 results

for the same queries before and after the rebuild, with graph rerank

disabled (no mentions populated yet).

Methodology

For each query, capture top-10 page names from `/api/search` against the

old corpus (39k chunks, stored baseline from 07:25 local), and again

against the new corpus (23k chunks, captured immediately after the

rebuild completed). Compare set overlap and ordering.

Results — `"knowledge graph"` top 10

| Rank | Old chunks | New chunks |
|---|---|---|
| 1 | WikantikKnowledgeGraphAdmin | InventionOfKnowledgeGraph |
| 2 | InventionOfKnowledgeGraph | WikantikKnowledgeGraphAdmin |
| 3 | KnowledgeGraphCore | KnowledgeGraphCore |
| 4 | KnowledgeGraphDogfooding | KnowledgeGraphVsRelationalDatabase |
| 5 | KnowledgeGraphVsRelationalDatabase | GraphRAG |
| 6 | GraphRAG | KnowledgeGraphsAndManagement |
| 7 | IndustrialKnowledgeGraphUseCases | IndustrialKnowledgeGraphUseCases |
| 8 | KnowledgeGraphsAndManagement | FederatedKnowledgeGraphs |
| 9 | KnowledgeGraphCompletion | KnowledgeGraphCompletion |
| 10 | FederatedKnowledgeGraphs | KnowledgeGraphConstructionPipeline |

Set overlap: **9 of 10** preserved. Dropped: `KnowledgeGraphDogfooding`.
Added: `KnowledgeGraphConstructionPipeline`. Positions 1–2 swapped; both
are highly relevant.
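The overlap comparison described in the methodology can be scripted. The sketch below takes the two ranked lists from the `"knowledge graph"` table as plain Python lists; it does not call `/api/search`, and the `compare_top_k` helper is a hypothetical name.

```python
old_top10 = [
    "WikantikKnowledgeGraphAdmin", "InventionOfKnowledgeGraph",
    "KnowledgeGraphCore", "KnowledgeGraphDogfooding",
    "KnowledgeGraphVsRelationalDatabase", "GraphRAG",
    "IndustrialKnowledgeGraphUseCases", "KnowledgeGraphsAndManagement",
    "KnowledgeGraphCompletion", "FederatedKnowledgeGraphs",
]
new_top10 = [
    "InventionOfKnowledgeGraph", "WikantikKnowledgeGraphAdmin",
    "KnowledgeGraphCore", "KnowledgeGraphVsRelationalDatabase",
    "GraphRAG", "KnowledgeGraphsAndManagement",
    "IndustrialKnowledgeGraphUseCases", "FederatedKnowledgeGraphs",
    "KnowledgeGraphCompletion", "KnowledgeGraphConstructionPipeline",
]


def compare_top_k(old: list[str], new: list[str]) -> dict:
    """Summarize set overlap and rank stability between two result lists."""
    old_set, new_set = set(old), set(new)
    return {
        "overlap": len(old_set & new_set),
        "dropped": sorted(old_set - new_set),
        "added": sorted(new_set - old_set),
        # Pages that kept exactly the same rank.
        "same_rank": sum(1 for a, b in zip(old, new) if a == b),
    }


result = compare_top_k(old_top10, new_top10)
```

Reporting set overlap and same-rank counts separately keeps membership changes apart from reordering, which matters here because most of the movement is reordering.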

Results — `"GraphRAG"` top 10

| Rank | Old chunks | New chunks |
|---|---|---|
| 1 | GraphRAG | GraphRAG |
| 2 | KnowledgeGraphsAndManagement | KnowledgeGraphsAndManagement |
| 3 | KnowledgeGraphsAndGenAIWorkflows | KnowledgeGraphsAndGenAIWorkflows |
| 4 | InventionOfKnowledgeGraph | InventionOfKnowledgeGraph |
| 5 | RagImplementationPatterns | RagImplementationPatterns |
| 6 | AdvancedSearchTermEngineering | AdvancedSearchTermEngineering |
| 7 | IndustrialKnowledgeGraphUseCases | IndustrialKnowledgeGraphUseCases |
| 8 | AiFunctionCallingAndToolUse | AiMemoryAndPersistence |
| 9 | WikantikKnowledgeGraphAdmin | ResourceDescriptionFramework |
| 10 | ResourceDescriptionFramework | WikantikKnowledgeGraphAdmin |

Set overlap: **9 of 10** preserved. Top 7 identical. Dropped:
`AiFunctionCallingAndToolUse`. Added: `AiMemoryAndPersistence`.

Interpretation

- **BM25 is mostly unaffected** — Lucene indexes full pages, not chunks,

so the BM25 half of the hybrid score doesn't move with chunk

boundaries.

- **Dense retrieval sees the bigger chunks.** A chunk at 174 tokens
averages a slightly less peaky embedding than one at 103 tokens, so
chunks that were narrowly winning on a laser-focused topic sentence
can slip a rank or two; chunks now containing multiple related
sub-topics can win queries they previously missed. Net effect: 90%
overlap in top-10 for both sample queries, with a slight reordering.

- **Graph rerank remains a no-op** until mentions are populated. The

ranking improvement this stack was designed for has to wait on the

full extractor run.

Takeaways

1. **Quality dominates throughput** on this corpus. A "perfect throughput,

low signal-to-noise" model is worse than "modest throughput, high

signal-to-noise" because the admin review queue is the real bottleneck.

Shipping gemma4-assist cost about 12 % more wall-clock than gemma4:e2b

but produces about half the review noise.

2. **Prompt tightening beat model swapping.** The same gemma4-assist

model went from 2.6 proposals/chunk (verbose) to 1.3 (tight) while

doubling its mention rate. No model change matched that on

signal-to-noise.

3. **Concurrency is model-dependent.** c=2 helps 7–8B models marginally
(~7 %); c=4 regresses on the new chunks. Small 2B models scale to c=3
with real gains. One knob doesn't fit all.

4. **Chunker floor is the biggest lever we pulled.** 39k → 23k chunks is

a 41 % reduction in extraction RPCs for ~0 search-quality cost,

against a configuration change that took seconds.

5. **Claude Haiku 4.5 remains the "definitely finishes in about a day"
fallback** at ≈ $75 and ≈ 25 h for the full corpus, if the ~82 h
local-only projection ever becomes operationally unacceptable.

Further reading

- [docs/KnowledgeGraphRerank.md](KnowledgeGraphRerank) — configuration and

tuning guide.

- [docs/superpowers/plans/2026-04-22-kg-rag-uplift.md](KgRagUpliftPlan) —

the original three-phase plan that this page benchmarks.

- `wikantik-extract-cli/` — standalone CLI that runs the batch without a

Tomcat instance, added so long runs survive local-development

restarts.