Running the Judge Experiment Harness

Phase 6 of the KG-extraction redesign added an opt-in proposal-judge
stage that runs after `ProposalConsolidator` and before the
`kg_proposals` upsert. The harness in this runbook lets you preview
what enabling the judge would actually do to the queue, without
flipping the production default.

When to use this runbook

Three concrete situations:

1. The pending `kg_proposals` queue is large and you want to know how
   much a real judge would prune.
2. A new judge model has appeared (newer Qwen, newer Claude, a local
   fine-tune) and you want to compare it against the current default.
3. You are about to flip the production extractor's `--judge` flag
   from `none` to a real judge and want a calibration check first.

Quick start

```bash
# Cheap, local: about 25–30 seconds per proposal, model-dependent.
# Default --judge-model is gemma4-assist:latest; pass --judge-model to override.
bin/kg-judge-experiment.sh \
  --judge ollama \
  --sample 30 \
  --output reports/judge-gemma.json

# Gated, billed: only after the local one looks promising.
export ANTHROPIC_API_KEY=sk-…
bin/kg-judge-experiment.sh \
  --judge claude \
  --judge-model claude-haiku-4-5 \
  --anthropic-key-env ANTHROPIC_API_KEY \
  --sample 30 \
  --output reports/judge-claude.json
```

The script:

- rebuilds `wikantik-extract-cli/target/wikantik-extract-cli.jar` if the
  jar is missing or older than any source under
  `wikantik-extract-cli/src` or
  `wikantik-main/src/main/java/com/wikantik/knowledge/extraction/`;
- pulls JDBC creds from
  `tomcat/tomcat-11/conf/Catalina/localhost/ROOT.xml` (falls back to
  `PG_JDBC_URL` / `PG_USER` / `PG_PASSWORD`);
- forwards every other flag straight through to `JudgeExperimentCli`,
  so `--help` shows the full flag set.
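
The rebuild rule in the first bullet can be approximated with `find -newer`. The following is a minimal sketch of that kind of check, not the wrapper's actual code; the function name `jar_is_stale` is hypothetical, and only the paths come from the list above.

```shell
# Sketch only: succeeds (exit 0) when the jar is missing, or when any file
# under the given source directories has a newer mtime than the jar.
# `jar_is_stale` is a hypothetical helper, not part of the real wrapper.
jar_is_stale() {
  jar="$1"
  shift
  # A missing jar always needs a rebuild.
  [ -f "$jar" ] || return 0
  # -newer compares modification times; head stops after the first hit.
  hit="$(find "$@" -type f -newer "$jar" 2>/dev/null | head -n 1)"
  [ -n "$hit" ]
}

# Example call, using the paths named in this runbook:
#   jar_is_stale wikantik-extract-cli/target/wikantik-extract-cli.jar \
#     wikantik-extract-cli/src \
#     wikantik-main/src/main/java/com/wikantik/knowledge/extraction \
#     && echo "jar is stale; rebuild needed"
```

A check like this only looks at file modification times, so a fresh checkout or a `touch` on a source file would trigger a rebuild even when nothing has changed.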

Reading the report

The JSON output has three top-level groups:

- **`noopVerdicts`**: what the production default
  (`NoOpProposalJudge`) would have done. Always `accepted=N,
  rejected=0`. This is the baseline.
- **`comparatorVerdicts`**: what the requested judge actually did.
  Counts: `accepted`, `rejected`, `rewritten`, `judge_failed_accepts`
  (the fail-open accepts caused by HTTP errors / parse failures;
  these are *not* real verdicts), and `reject_reasons` keyed by the
  closed-enum reason code.
- **`examples`**: the full per-row diff. Each row has the
  `signature`, the proposal's `kind` / `displayName` / `type` (or
  edge `source`/`target`/`predicate`), the verdicts from both judges,
  and the comparator's rationale string.

A typical accept-rate read:

```bash
jq '.comparatorVerdicts | "accept=\(.accepted) reject=\(.rejected) " +
    "judge_failed=\(.judge_failed_accepts) reasons=\(.reject_reasons)"' \
  reports/judge-gemma.json
```

Comparing two judge models

The harness only takes one comparator per run. To compare models, run
twice with different `--judge-model` / `--output`:

```bash
bin/kg-judge-experiment.sh --judge ollama --judge-model gemma4-assist:latest \
  --sample 30 --output reports/judge-gemma.json

bin/kg-judge-experiment.sh --judge ollama --judge-model qwen3.5:9b \
  --timeout-ms 180000 \
  --sample 30 --output reports/judge-qwen.json

diff <(jq -S '.comparatorVerdicts' reports/judge-gemma.json) \
     <(jq -S '.comparatorVerdicts' reports/judge-qwen.json)
```

Two runs see *different* rows (the sample is `ORDER BY random()` and
not seeded), so per-row comparisons across models are noise. Read
the aggregates and the reject-reason histogram instead. With
`--sample` set to 30 or more, the relative accept/reject rates are
stable enough to drive a default-flip decision; below that, variance
dominates.

Observed model behaviour (2026-05-02)

A 30-row comparison on the local `inference.jakefear.com` Ollama
endpoint at the default 60s per-call timeout produced:

| Model                  | Real accepts | Real rejects | `judge_failed` (timeouts) |
|------------------------|-------------:|-------------:|--------------------------:|
| `qwen3.5:9b`           |            4 |            7 |               **19 / 30** |
| `gemma4-assist:latest` |           23 |            7 |                    0 / 30 |

Both models rejected at the same rate (7/30 ≈ 23%), and the
rationales from both look sound on the rejects: they caught genuinely
ungrounded claims like "Ralph Gomory" and "Vision Pro" with no
supporting page text. The distinguishing factor is reliability:
`qwen3.5:9b` consistently exceeds 60s on real proposals on this
endpoint and produces unusable runs at default settings. If you want
to use it anyway, pass `--timeout-ms 180000` (or larger) and re-test
before drawing quality conclusions.

This evidence drove the 2026-05-02 default flip in
`BootstrapExtractionCli.Args.judgeModel` and
`JudgeExperimentCli.Args.judgeModel` from `qwen3.5:9b` to
`gemma4-assist:latest`. The same-model self-judging concern (the page
extractor also uses `gemma4-assist:latest`) is a real but distant
second to a 63% timeout rate; revisit if a future Ollama deployment
makes qwen reliable at the default budget.

Calibration before flipping the production default

Run accuracy is the bar, not run latency. Before flipping
`BootstrapExtractionCli`'s default `--judge` from `none` to a real
judge, hand-label a sample of ~50 pending rows (drawn the same way
the harness draws them) and compare the labels against the judge's
verdicts. The design's "do not flip" threshold is judge-vs-human
accuracy below ~80% on any single reject reason; the most common
failure mode at the time of writing is the judge over-rejecting
sparsely-supported but legitimate domain entities as `weak_support`.
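
One way to run that comparison is to tabulate, per reject reason, how often the hand label agrees with the judge's reject. The sketch below assumes a hypothetical tab-separated file with four columns (`signature`, human label, judge verdict, reject reason); neither the file format nor the function name `per_reason_agreement` comes from the harness, and reading "accuracy on a reject reason" as agreement among the judge's rejects is only one interpretation of the threshold.

```shell
# Sketch only. ASSUMPTION: the input is a hypothetical TSV with columns
#   signature <TAB> human(accept|reject) <TAB> judge(accept|reject) <TAB> reason
# For every reason the judge used on a reject, print
#   reason <TAB> agreed/total <TAB> percent <TAB> ok|BELOW
# where BELOW marks reasons under the ~80% threshold.
per_reason_agreement() {
  awk -F '\t' -v threshold=80 '
    $3 == "reject" {
      total[$4]++
      if ($2 == "reject") agree[$4]++
    }
    END {
      for (reason in total) {
        pct = 100 * agree[reason] / total[reason]
        flag = (pct < threshold) ? "BELOW" : "ok"
        printf "%s\t%d/%d\t%.0f%%\t%s\n", reason, agree[reason], total[reason], pct, flag
      }
    }' "$1" | sort
}

# Example call (labels.tsv is a hypothetical file name):
#   per_reason_agreement labels.tsv
```

With only ~50 labelled rows, some reasons will have just a handful of rejects, so a `BELOW` flag on a reason with two or three rows is a prompt to label more rows rather than a verdict on the judge.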

Pitfalls

The frontmatter `pitfalls` capture the recurring methodology
mistakes. The common one is reading
`comparatorVerdicts.accepted` without separating
`judge_failed_accepts`: fail-open accepts inflate the apparent
accept rate and hide quality problems.
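
To make the size of that distortion concrete, here is a small sketch that separates the two rates. It assumes `accepted` already includes the `judge_failed_accepts` (the pitfall wording suggests this, but the report schema is not spelled out here), it ignores `rewritten`, and the function name `real_accept_rate` is hypothetical.

```shell
# Sketch only. ASSUMPTION: `accepted` includes the fail-open accepts.
# Arguments: accepted, judge_failed_accepts, rejected.
real_accept_rate() {
  awk -v a="$1" -v f="$2" -v r="$3" 'BEGIN {
    real = a - f          # accepts the judge actually issued
    total = a + r         # every sampled row
    decided = real + r    # rows that received a real verdict
    if (total == 0 || decided == 0) { print "no real verdicts"; exit 1 }
    printf "naive=%.1f%% real=%.1f%% failed=%.1f%%\n",
           100 * a / total, 100 * real / decided, 100 * f / total
  }'
}

# Counts from the qwen3.5:9b row above (4 real accepts + 19 fail-open = 23):
#   real_accept_rate 23 19 7
#   prints: naive=76.7% real=36.4% failed=63.3%
```

On a real report, the three counts could be pulled with `jq -r '.comparatorVerdicts | "\(.accepted) \(.judge_failed_accepts) \(.rejected)"'` and passed as arguments. Note that the "real" rate uses only rows with a real verdict as its denominator, which differs from the per-30 figures quoted in the table.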

Where the code lives

- CLI entry point:
  `wikantik-extract-cli/src/main/java/com/wikantik/extractcli/JudgeExperimentCli.java`
- Shell wrapper: `bin/kg-judge-experiment.sh`
- Judge implementations:
  `wikantik-main/src/main/java/com/wikantik/knowledge/extraction/{Ollama,Claude}ProposalJudge.java`
- Production wiring (`--judge` switch):
  `wikantik-extract-cli/src/main/java/com/wikantik/extractcli/BootstrapExtractionCli.java`