Knowledge-Graph Extraction Benchmarks — April 2026

This page records the benchmark runs performed on 2026-04-23 while choosing the shipping configuration for the Phase 2 entity extractor and the save-time chunker. The goal was to pick a model/concurrency/chunk-size combination that could extract the full ~40k-chunk corpus in an operator-friendly wall-clock without swamping the review queue with noisy proposals.

The final choices in production:

Extractor backend: Ollama against inference.jakefear.com:11434
Model: gemma4-assist:latest
Concurrency: 2
Chunker merge-forward floor: 150 tokens (up from 8)

Everything below explains why, and why each alternative that looked promising on paper lost on measured numbers.

Corpus baseline

Metric	Before rebuild	After rebuild
`kg_content_chunks`	39,264	23,256
`content_chunk_embeddings`	39,264	23,256
`chunk_entity_mentions`	344	0 (FK-cascaded on chunk delete)
Mean tokens / chunk	103	174
p50 / p95 tokens / chunk	77 / 261	166 / 335
Max tokens / chunk	1,963	1,963 (atomic blocks unchanged)
Embedding rebuild wall-clock	—	6m 18s

The chunker rebuild was driven by one configuration change: raising wikantik.chunker.merge_forward_tokens from 8 to 150. That change also exposed two previously-advertised-but-unwired knobs (target_tokens, min_tokens) which were removed from the ContentChunker.Config record.
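To make the effect of the floor concrete, here is a minimal sketch of a merge-forward pass over per-chunk token counts. It is an illustration under assumptions, not the ContentChunker implementation: the function name `merge_forward`, the sample sizes, and the handling of a trailing remainder are all hypothetical, and atomic blocks are not modelled.

```python
from typing import List


def merge_forward(token_counts: List[int], floor: int) -> List[int]:
    """Fold any chunk shorter than `floor` tokens into the chunk that follows it.

    Hypothetical sketch: works on token counts only, ignores atomic blocks.
    """
    merged: List[int] = []
    carry = 0  # tokens from undersized chunks waiting to be merged forward
    for count in token_counts:
        size = carry + count
        if size < floor:
            carry = size  # still below the floor: keep carrying forward
        else:
            merged.append(size)
            carry = 0
    if carry:
        # Nothing follows the trailing remainder, so attach it to the last
        # emitted chunk (or keep it alone if no chunk was emitted at all).
        if merged:
            merged[-1] += carry
        else:
            merged.append(carry)
    return merged


sizes = [5, 40, 120, 300, 60, 90, 20]  # made-up token counts
low_floor = merge_forward(sizes, floor=8)
high_floor = merge_forward(sizes, floor=150)
print(len(low_floor), low_floor)
print(len(high_floor), high_floor)
```

With a floor of 8 only the tiniest fragment is absorbed; with a floor of 150 the same input collapses into far fewer, larger chunks, which is the direction of the 39k → 23k change in the table above.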

Summary of extractor runs

All runs use the same system prompt, the same chunker (runs 1–5 ran against the old 39k-chunk corpus, runs 6–7 against the rebuilt 23k-chunk corpus), and the same post-processing (chunk_entity_mentions upsert, kg_proposals insert, kg_rejections suppression). Per-chunk RPC is the mean wall-clock latency of a single extraction call; effective s/chunk is per-chunk RPC divided by concurrency.

Run	Model	Prompt	Concurrency	Chunks	Per-chunk RPC	Effective s/chunk	Full-corpus projection
1	gemma4-assist:latest	verbose	1	39k	8.5 s	8.5 s	≈ 93 h
2	qwen2.5:7b-instruct-q5_K_M	verbose	1	39k	14.2 s	14.2 s	≈ 155 h
3	qwen2.5:7b-instruct-q5_K_M	tightened	2	39k	18.8 s	9.4 s	≈ 104 h
4	gemma4:e2b	tightened	3	39k	20.2 s	6.7 s	≈ 73 h
5	gemma4-assist:latest	tightened	4	39k	41.0 s	10.3 s	≈ 112 h
6	gemma4-assist:latest	tightened	2	23k	25.3 s	12.7 s	≈ 82 h
7	gemma4-assist:latest	tightened	1	23k	13.6 s	13.6 s	≈ 88 h

Run 6 is the shipping configuration. Runs 3 and 4 post a lower effective s/chunk but lose to run 6 on quality (detailed below). Run 5 (c=4) shows that scaling past c=2 has negative returns on a 4060 Ti serving a 7–8B-class model: the GPU is already bandwidth-bound.
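The derived columns in the table can be reproduced with a few lines of arithmetic. This sketch assumes the full-corpus projection is simply effective s/chunk multiplied by the chunk count from the corpus baseline (39,264 old, 23,256 new). The page does not state that formula explicitly, but it lands within rounding of the figures shown for these runs.

```python
OLD_CHUNKS = 39_264  # kg_content_chunks before the rebuild
NEW_CHUNKS = 23_256  # kg_content_chunks after the rebuild


def effective_seconds(per_chunk_rpc_s: float, concurrency: int) -> float:
    """Effective s/chunk: per-chunk RPC latency divided by concurrency."""
    return per_chunk_rpc_s / concurrency


def projected_hours(per_chunk_rpc_s: float, concurrency: int, chunks: int) -> float:
    """Assumed projection: effective s/chunk times chunk count, in hours."""
    return effective_seconds(per_chunk_rpc_s, concurrency) * chunks / 3600.0


# (per-chunk RPC seconds, concurrency, corpus size) taken from the table.
runs = {
    1: (8.5, 1, OLD_CHUNKS),
    4: (20.2, 3, OLD_CHUNKS),
    6: (25.3, 2, NEW_CHUNKS),
    7: (13.6, 1, NEW_CHUNKS),
}

for run_id, (rpc_s, c, chunks) in runs.items():
    eff = effective_seconds(rpc_s, c)
    hours = projected_hours(rpc_s, c, chunks)
    print(f"run {run_id}: effective {eff:.1f} s/chunk, projection {hours:.0f} h")
```

Small differences against the table come from rounding effective s/chunk to one decimal before multiplying.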

Proposal volume comparison

On the same page (AbstractAlgebra, 42 chunks in the old corpus / 27 in the new):

Model / prompt	Mentions	Proposals	Mentions:Proposals
gemma4-assist + verbose (run 1)	8	108	1 : 13
qwen2.5:7b + verbose (run 2)	1	213	1 : 213
qwen2.5:7b + tight (run 3)	2	158	1 : 79
gemma4:e2b + tight (run 4)	4	129	1 : 32
gemma4-assist + tight c=4 (run 5)	15	68	1 : 4.5
gemma4-assist + tight c=2 (run 6)	12	48	1 : 4

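On this page, the tighter prompt reduced gemma4-assist's proposal-per-chunk rate from ~2.6 (108 / 42, run 1) to ~1.8 (48 / 27, run 6) while more than doubling the mention rate (8 / 42 ≈ 0.19 to 12 / 27 ≈ 0.44 per chunk). In absolute terms the page's proposals fell from 108 to 48, so there is roughly half the review queue to work through for equivalent coverage.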

Quality observations — same page, different models

Sample of top entities emitted for AbstractAlgebra (first chunks):

gemma4-assist, verbose prompt (run 1)

rings, Ore rings, localization procedures, Ore condition, quantum groups,
C*-algebras, Ring theory, algebraic structures, integers, polynomials,
algebraic geometry, representation theory, non-commutative geometry,
categorical frameworks, geometric frameworks

All domain-appropriate named concepts. Reasoning grounded in specific chunk text.

qwen2.5:7b, verbose prompt (run 2)

Ring, Algebraic structures $(R, +, \x08oldsymbol{⋅})$,   ← LaTeX corruption
Integers ($\x08oldsymbol{ℤ}$),                            ← LaTeX corruption
AdvancedTechniques,                                        ← CamelCase phrase
SharedRigorousUnderstanding,                               ← phrase-as-entity
FoundationalObject,                                        ← meta-term
R, +, ⋅, R, abean,                                         ← operators, typo
ring, unital ring, non-unital ring

The \x08oldsymbol string is a JSON escape-character collision: the model emitted \boldsymbol with a single backslash inside a JSON string, and the parser decoded the leading \b as the ASCII backspace control character (0x08), leaving oldsymbol behind. The abean entry is a hallucinated typo of "abelian". Both are pre-existing qwen JSON-mode issues on technical prose.
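The collision is easy to reproduce with any standards-compliant JSON parser. The sketch below uses Python's standard `json` module; the payload shape (a single `name` field) and the `has_control_chars` helper are illustrative assumptions, not the extractor's actual schema or validation code.

```python
import json

# What the model effectively emitted: a single backslash before "boldsymbol".
# In JSON, \b is the escape for backspace, so the parser consumes it.
raw_unescaped = r'{"name": "Integers ($\boldsymbol{Z}$)"}'

# What a correctly escaped payload looks like: the backslash itself is escaped.
raw_escaped = r'{"name": "Integers ($\\boldsymbol{Z}$)"}'

corrupted = json.loads(raw_unescaped)["name"]
intact = json.loads(raw_escaped)["name"]

print(repr(corrupted))  # contains \x08 followed by "oldsymbol"
print(repr(intact))     # contains a literal backslash followed by "boldsymbol"


def has_control_chars(text: str) -> bool:
    """Flag strings containing control characters other than tab/newline/CR."""
    return any(ord(ch) < 0x20 and ch not in "\t\n\r" for ch in text)


print(has_control_chars(corrupted), has_control_chars(intact))
```

A check like `has_control_chars` is one cheap way to reject corrupted entity names before they reach the review queue, if such filtering is wanted.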

qwen2.5:7b, tightened prompt (run 3)

Ring theory, algebraic structures $(R, +, ⋅)$, arithmetic of integers ($ℤ$),
polynomials ($k[x]$), advanced techniques, foundational object,
R, +, ⋅, R, 0, Unity, FunctionalAnalysis (type=Article),
commutative rings, non-commutative rings

LaTeX corruption resolved. Still emits operators and single letters. FunctionalAnalysis correctly tagged as Article — the one win.

gemma4:e2b, tightened prompt (run 4)

Ring, R, +, ×, 0, a, b, c, FunctionalAnalysis, Rings, Commutativity,
Non-commutative, Ring, ideal, Prime Ideals, R, P, Spec(R), M

Single letters (a, b, c, R, M, P) appear frequently — the 2B model is the least discriminating about what qualifies as a named entity. Does correctly pick up Spec(R), Prime Ideals, FunctionalAnalysis.

gemma4-assist, tightened prompt (runs 5–7)

Quality is equivalent to run 1 but with the proposal-volume and mention-rate improvements from the tighter prompt.

Why concurrency doesn't help gemma4-assist

Adding concurrent requests to Ollama on a single 4060 Ti slows each in-flight request proportionally, because the GPU is bandwidth-bound on a 7–8B-class model. The net-net for gemma4-assist:

Concurrency	Effective s/chunk	Relative
c=1	13.6 s	baseline
c=2	12.7 s	7% faster
c=4	10.3 s (old-chunks), degraded on new chunks	modest gain then regression

Concurrency does help the 2B gemma4:e2b — there's enough VRAM headroom for the GPU scheduler to actually run requests in parallel. But the quality regression on small models is severe enough that we didn't ship it.
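For the gemma4-assist rows above, the "slows each in-flight request proportionally" effect can be quantified from the per-chunk RPC latencies of runs 7 (c=1) and 6 (c=2). This is a small worked calculation; the function name `scaling` and the split into "inflation" and "speedup" are just one hypothetical way to present it.

```python
from typing import Tuple


def scaling(latency_c1_s: float, latency_cn_s: float, concurrency: int) -> Tuple[float, float]:
    """Return (latency inflation, throughput speedup) relative to c=1.

    Inflation of 1.0 means perfect parallelism; inflation equal to the
    concurrency means requests are effectively serialised (speedup 1.0).
    """
    inflation = latency_cn_s / latency_c1_s
    speedup = concurrency / inflation
    return inflation, speedup


# Run 7: 13.6 s per chunk at c=1. Run 6: 25.3 s per chunk at c=2.
inflation, speedup = scaling(13.6, 25.3, 2)
print(f"per-request latency x{inflation:.2f}, throughput x{speedup:.2f}")
```

Doubling the concurrency nearly doubles per-request latency, so almost all of the nominal parallelism is lost and only a single-digit percentage gain remains.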

Per-page timing detail (first 11 pages, run 6)

Reference for anyone tuning a sample: first 11 pages of run 6 (shipping config, 23k-chunk corpus, gemma4-assist c=2):

Page	Chunks	Total (ms)	Per-chunk (ms)	Mentions	Proposals
2026IranWar	4	73,691	18,423	8	27
AbstractAlgebra	27	322,274	11,936	12	48
AcceleratingAiLearning	13	140,737	10,826	8	6
AccountTypeStrategy	12	165,757	13,813	12	24
AcidTransactionsAndIsolation	29	493,587	17,020	1	77
ActorModelProgramming	26	322,939	12,421	8	43
AdapterPattern	28	370,073	13,217	18	45
AdjustmentOfStatusProcess	13	130,194	10,015	7	21
AdminSecurityUi	2	30,972	15,486	2	8
AdvancedSkillPatterns	24	251,145	10,464	3	22
AdventureTravelPlanning	23	293,852	12,776	2	39
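As a sanity check when tuning a sample, the per-chunk column and a chunk-weighted mean can be recomputed from the Chunks and Total (ms) columns. The rows below are copied from the table above; the variable names are illustrative.

```python
# (page, chunks, total wall-clock ms) copied from the run 6 table.
rows = [
    ("2026IranWar", 4, 73_691),
    ("AbstractAlgebra", 27, 322_274),
    ("AcceleratingAiLearning", 13, 140_737),
    ("AccountTypeStrategy", 12, 165_757),
    ("AcidTransactionsAndIsolation", 29, 493_587),
    ("ActorModelProgramming", 26, 322_939),
    ("AdapterPattern", 28, 370_073),
    ("AdjustmentOfStatusProcess", 13, 130_194),
    ("AdminSecurityUi", 2, 30_972),
    ("AdvancedSkillPatterns", 24, 251_145),
    ("AdventureTravelPlanning", 23, 293_852),
]

# Per-page per-chunk latency, rounded to whole milliseconds.
per_chunk_ms = {page: round(total_ms / chunks) for page, chunks, total_ms in rows}

total_chunks = sum(chunks for _, chunks, _ in rows)
total_ms = sum(ms for _, _, ms in rows)
weighted_mean_ms = total_ms / total_chunks  # chunk-weighted, not page-weighted

print(f"{total_chunks} chunks, {total_ms} ms total")
print(f"chunk-weighted mean: {weighted_mean_ms / 1000:.1f} s/chunk")
```

The chunk-weighted mean over these 11 pages sits close to the 12.7 s effective s/chunk reported for run 6 in the summary table.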

Chunker rebuild — search-ranking impact

Before the full extractor run we verified that the 39k → 23k chunker rebuild didn't regress search quality. The test compared top-10 results for the same queries before and after the rebuild, with graph rerank disabled (no mentions populated yet).

Methodology

For each query, capture top-10 page names from /api/search against the old corpus (39k chunks, stored baseline from 07:25 local), and again against the new corpus (23k chunks, captured immediately after the rebuild completed). Compare set overlap and ordering.

Results — `"knowledge graph"` top 10

Rank	Old chunks	New chunks
1	WikantikKnowledgeGraphAdmin	InventionOfKnowledgeGraph
2	InventionOfKnowledgeGraph	WikantikKnowledgeGraphAdmin
3	KnowledgeGraphCore	KnowledgeGraphCore
4	KnowledgeGraphDogfooding	KnowledgeGraphVsRelationalDatabase
5	KnowledgeGraphVsRelationalDatabase	GraphRAG
6	GraphRAG	KnowledgeGraphsAndManagement
7	IndustrialKnowledgeGraphUseCases	IndustrialKnowledgeGraphUseCases
8	KnowledgeGraphsAndManagement	FederatedKnowledgeGraphs
9	KnowledgeGraphCompletion	KnowledgeGraphCompletion
10	FederatedKnowledgeGraphs	KnowledgeGraphConstructionPipeline

Set overlap: 9 of 10 preserved. Dropped: KnowledgeGraphDogfooding. Added: KnowledgeGraphConstructionPipeline. Positions 1–2 swapped; both are highly relevant.

Results — `"GraphRAG"` top 10

Rank	Old chunks	New chunks
1	GraphRAG	GraphRAG
2	KnowledgeGraphsAndManagement	KnowledgeGraphsAndManagement
3	KnowledgeGraphsAndGenAIWorkflows	KnowledgeGraphsAndGenAIWorkflows
4	InventionOfKnowledgeGraph	InventionOfKnowledgeGraph
5	RagImplementationPatterns	RagImplementationPatterns
6	AdvancedSearchTermEngineering	AdvancedSearchTermEngineering
7	IndustrialKnowledgeGraphUseCases	IndustrialKnowledgeGraphUseCases
8	AiFunctionCallingAndToolUse	AiMemoryAndPersistence
9	WikantikKnowledgeGraphAdmin	ResourceDescriptionFramework
10	ResourceDescriptionFramework	WikantikKnowledgeGraphAdmin

Set overlap: 9 of 10 preserved. Top 7 identical. Dropped: AiFunctionCallingAndToolUse. Added: AiMemoryAndPersistence.
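The set-overlap and ordering comparison described in the methodology can be scripted so it is repeatable after future chunker changes. The sketch below hard-codes the two "GraphRAG" columns from the table above; the function name `compare_top_k` and the returned keys are hypothetical, and fetching the lists from /api/search is left out because the response shape is not documented here.

```python
from typing import Dict, List


def compare_top_k(old: List[str], new: List[str]) -> Dict[str, object]:
    """Compare two ranked result lists by set overlap and by position."""
    old_set, new_set = set(old), set(new)
    return {
        "preserved": [page for page in old if page in new_set],
        "dropped": [page for page in old if page not in new_set],
        "added": [page for page in new if page not in old_set],
        # Number of ranks holding the same page in both lists.
        "same_rank": sum(1 for a, b in zip(old, new) if a == b),
    }


old_graphrag = [
    "GraphRAG", "KnowledgeGraphsAndManagement", "KnowledgeGraphsAndGenAIWorkflows",
    "InventionOfKnowledgeGraph", "RagImplementationPatterns",
    "AdvancedSearchTermEngineering", "IndustrialKnowledgeGraphUseCases",
    "AiFunctionCallingAndToolUse", "WikantikKnowledgeGraphAdmin",
    "ResourceDescriptionFramework",
]
new_graphrag = [
    "GraphRAG", "KnowledgeGraphsAndManagement", "KnowledgeGraphsAndGenAIWorkflows",
    "InventionOfKnowledgeGraph", "RagImplementationPatterns",
    "AdvancedSearchTermEngineering", "IndustrialKnowledgeGraphUseCases",
    "AiMemoryAndPersistence", "ResourceDescriptionFramework",
    "WikantikKnowledgeGraphAdmin",
]

report = compare_top_k(old_graphrag, new_graphrag)
print(len(report["preserved"]), "of", len(old_graphrag), "preserved")
print("dropped:", report["dropped"])
print("added:", report["added"])
print("identical ranks:", report["same_rank"])
```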

Interpretation

BM25 is mostly unaffected — Lucene indexes full pages, not chunks, so the BM25 half of the hybrid score doesn't move with chunk boundaries.
Dense retrieval sees the bigger chunks. A chunk at 174 tokens averages a slightly less peaky embedding than one at 103 tokens, so chunks that were narrowly winning on a laser-focused topic sentence can slip a rank or two; chunks now containing multiple related sub-topics can win queries they previously missed. Net effect: ~90% overlap in top-10 (9 of 10 pages for both queries), with a slight reordering.
Graph rerank remains a no-op until mentions are populated. The ranking improvement this stack was designed for has to wait on the full extractor run.

Takeaways

Quality dominates throughput on this corpus. A "perfect throughput, low signal-to-noise" model is worse than "modest throughput, high signal-to-noise" because the admin review queue is the real bottleneck. Shipping gemma4-assist cost about 12 % more wall-clock than gemma4:e2b but produces about half the review noise.
Prompt tightening beat model swapping. On the AbstractAlgebra sample, the same gemma4-assist model went from ~2.6 proposals/chunk (verbose, 108 / 42) to ~1.8 (tight, 48 / 27) while more than doubling its mention rate. No model change matched that on signal-to-noise.
Concurrency is model-dependent. c=2 helps 7–8B models marginally (~7 %); c≥3 regresses. Small 2B models scale to c=3 with real gains. One knob doesn't fit all.
Chunker floor is the biggest lever we pulled. 39k → 23k chunks is a 41 % reduction in extraction RPCs for ~0 search-quality cost, against a configuration change that took seconds.
Claude Haiku 4.5 remains the "definitely finishes in a workday" fallback at ≈ $75 and ≈ 25 h for the full corpus, if the ~82 h local-only projection ever becomes operationally unacceptable.