Running the Judge Experiment Harness

Phase 6 of the KG-extraction redesign added an opt-in proposal-judge stage that runs after ProposalConsolidator and before the kg_proposals upsert. The harness in this runbook lets you preview what enabling the judge would actually do to the queue — without flipping the production default.

When to use this runbook

Three concrete situations:

The pending kg_proposals queue is large and you want to know how much a real judge would prune.
A new judge model has appeared (newer Qwen, newer Claude, a local fine-tune) and you want to compare it against the current default.
You are about to flip the production extractor's --judge flag from none to a real judge and want a calibration check first.

Quick start

# Cheap, local — about 25–30 seconds per proposal, model-dependent.
# Default --judge-model is gemma4-assist:latest; pass --judge-model to override.
bin/kg-judge-experiment.sh \
    --judge ollama \
    --sample 30 \
    --output reports/judge-gemma.json

# Gated, billed — only after the local one looks promising.
export ANTHROPIC_API_KEY=sk-…
bin/kg-judge-experiment.sh \
    --judge claude \
    --judge-model claude-haiku-4-5 \
    --anthropic-key-env ANTHROPIC_API_KEY \
    --sample 30 \
    --output reports/judge-claude.json

The script:

rebuilds wikantik-extract-cli/target/wikantik-extract-cli.jar if the jar is missing or older than any source under wikantik-extract-cli/src or wikantik-main/src/main/java/com/wikantik/knowledge/extraction/;
pulls JDBC creds from tomcat/tomcat-11/conf/Catalina/localhost/ROOT.xml (falls back to PG_JDBC_URL / PG_USER / PG_PASSWORD);
forwards every other flag straight through to JudgeExperimentCli, so --help shows the full flag set.
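
The rebuild rule in the first item is essentially a modification-time comparison. A minimal Python sketch of that check follows, assuming the wrapper compares file mtimes; `jar_is_stale` is a hypothetical name, and the real logic lives in bin/kg-judge-experiment.sh and may differ in detail:

```python
from pathlib import Path
from typing import Iterable


def jar_is_stale(jar: Path, source_roots: Iterable[Path]) -> bool:
    """Return True when the jar is missing or older than any file under the roots."""
    if not jar.exists():
        return True
    jar_mtime = jar.stat().st_mtime
    for root in source_roots:
        if not root.exists():
            continue  # a missing source root cannot make the jar stale
        for path in root.rglob("*"):
            if path.is_file() and path.stat().st_mtime > jar_mtime:
                return True
    return False


if __name__ == "__main__":
    # Paths taken from the list above; run from the repository root.
    stale = jar_is_stale(
        Path("wikantik-extract-cli/target/wikantik-extract-cli.jar"),
        [
            Path("wikantik-extract-cli/src"),
            Path("wikantik-main/src/main/java/com/wikantik/knowledge/extraction"),
        ],
    )
    print("rebuild needed" if stale else "jar is up to date")
```

Running this before a long experiment is a quick way to tell whether the first invocation will pay for a rebuild.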

Reading the report

The JSON output has three top-level groups:

noopVerdicts — what the production default (NoOpProposalJudge) would have done. Always accepted=N, rejected=0. This is the baseline.
comparatorVerdicts — what the requested judge actually did. Counts: accepted, rejected, rewritten, judge_failed_accepts (the fail-open accepts caused by HTTP errors / parse failures — these are not real verdicts), and reject_reasons keyed by the closed-enum reason code.
examples — the full per-row diff. Each row has the signature, the proposal's kind / displayName / type (or edge source/target/predicate), the verdicts from both judges, and the comparator's rationale string.

A typical accept-rate read:

jq '.comparatorVerdicts | "accept=\(.accepted) reject=\(.rejected) " +
    "judge_failed=\(.judge_failed_accepts) reasons=\(.reject_reasons)"' \
   reports/judge-gemma.json

Comparing two judge models

The harness only takes one comparator per run. To compare models, run twice with different --judge-model / --output:

bin/kg-judge-experiment.sh --judge ollama --judge-model gemma4-assist:latest \
    --sample 30 --output reports/judge-gemma.json
bin/kg-judge-experiment.sh --judge ollama --judge-model qwen3.5:9b \
    --timeout-ms 180000 \
    --sample 30 --output reports/judge-qwen.json
diff <(jq -S '.comparatorVerdicts' reports/judge-gemma.json) \
     <(jq -S '.comparatorVerdicts' reports/judge-qwen.json)

Two runs see different rows (the sample is ORDER BY random() and not seeded), so per-row comparisons across models are noise — read the aggregates and the reject-reason histogram. With --sample 30+, the relative accept/reject rates are stable enough to drive a default-flip decision; below that, variance dominates.
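
To get a feel for how much sampling noise a single run carries, one option is a Wilson score interval on an observed rate. This is a sketch and not part of the harness: `wilson_interval` is a hypothetical helper, and the 7-of-30 input is used only as an illustrative count.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must be between 0 and n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


low, high = wilson_interval(7, 30)
print(f"7/30 rejected: interval {low:.2f} to {high:.2f}")
```

For 7 rejects in 30 rows the interval spans roughly 12% to 41%, so differences between two models that are smaller than that width are hard to separate from sampling noise. Raising --sample narrows the interval.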

Observed model behaviour (2026-05-02)

A 30-row comparison on the local inference.jakefear.com Ollama endpoint at the default 60s per-call timeout produced:

Model	Real accepts	Real rejects	`judge_failed` (timeouts)
`qwen3.5:9b`	4	7	19 / 30
`gemma4-assist:latest`	23	7	0 / 30

Both models rejected 7 of 30 sampled rows (≈ 23%), although qwen3.5:9b returned a real verdict on only 11 rows, so its 7 rejects come from a much smaller effective sample. Rationales from both models look sound on the rejects (they caught genuinely ungrounded claims like "Ralph Gomory" and "Vision Pro" with no supporting page text). The distinguishing factor is reliability: qwen3.5:9b exceeded the 60s budget on 19 of 30 proposals on this endpoint, which makes runs at default settings unusable. If you want to use it anyway, pass --timeout-ms 180000 (or larger) and re-test before drawing quality conclusions.

This evidence drove the 2026-05-02 default flip in BootstrapExtractionCli.Args.judgeModel and JudgeExperimentCli.Args.judgeModel from qwen3.5:9b to gemma4-assist:latest. The same-model self-judging concern (the page extractor also uses gemma4-assist:latest) is a real but distant second to a 63% timeout rate; revisit if a future Ollama deployment makes qwen reliable at the default budget.

Calibration before flipping the production default

Run accuracy is the bar — not run latency. Before flipping BootstrapExtractionCli's default --judge from none to a real judge, hand-label a sample of ~50 pending rows (drawn the same way the harness draws them) and compare the labels against the judge's verdicts. The design's "do not flip" threshold is judge-vs-human accuracy below ~80% on any single reject reason; the most common failure mode at the time of writing is the judge over-rejecting sparsely-supported but legitimate domain entities as weak_support.
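
One way to run that comparison is sketched below. It assumes each hand-labelled row has been reduced to the judge's verdict, the judge's reason code (for rejects), and the human's verdict. The record layout, the helper names, and the reading of "accuracy on a reject reason" as "share of the judge's rejects with that reason that the human also rejects" are all assumptions rather than the design's definition. The reason code `other_reason` is a placeholder, not a real enum value.

```python
from collections import defaultdict

FLIP_THRESHOLD = 0.80  # the "~80%" bar described above


def per_reason_accuracy(rows):
    """Map each reject reason to the share of judge rejects the human agrees with.

    rows: iterable of dicts with keys 'judge' and 'human' ('accept' or 'reject')
    and 'reason' (the judge's reason code, or None for accepts).
    """
    agree = defaultdict(int)
    total = defaultdict(int)
    for row in rows:
        if row["judge"] != "reject":
            continue  # only judge rejects carry a reason code
        total[row["reason"]] += 1
        if row["human"] == "reject":
            agree[row["reason"]] += 1
    return {reason: agree[reason] / total[reason] for reason in total}


def reasons_blocking_flip(rows, threshold=FLIP_THRESHOLD):
    """Reason codes whose judge-vs-human agreement falls below the threshold."""
    return sorted(
        reason for reason, acc in per_reason_accuracy(rows).items() if acc < threshold
    )


# Hypothetical hand labels, for illustration only.
labelled = [
    {"judge": "reject", "reason": "weak_support", "human": "accept"},
    {"judge": "reject", "reason": "weak_support", "human": "reject"},
    {"judge": "reject", "reason": "other_reason", "human": "reject"},
    {"judge": "accept", "reason": None, "human": "accept"},
]
print(per_reason_accuracy(labelled))
print(reasons_blocking_flip(labelled))
```

With only ~50 labelled rows, some reason codes will have just a handful of rejects, so treat a per-reason figure built on two or three rows as a prompt to label more, not as a verdict.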

Pitfalls

The frontmatter pitfalls capture the recurring methodology mistakes. The common one is reading comparatorVerdicts.accepted without separating judge_failed_accepts — fail-open accepts inflate the apparent accept rate and hide quality problems.
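
A small sketch of that separation follows. It assumes `accepted` in comparatorVerdicts includes the fail-open accepts, which is what the pitfall implies but is worth verifying against a real report. It also ignores the `rewritten` count, since the text does not say whether rewrites are counted inside `accepted`. The `real_rates` helper is hypothetical; the example counts are the qwen3.5:9b row from the table above under the same assumption (4 real accepts plus 19 fail-open accepts gives 23).

```python
def real_rates(verdicts: dict) -> dict:
    """Split the apparent accept rate from the rate over real verdicts only."""
    accepted = verdicts.get("accepted", 0)
    rejected = verdicts.get("rejected", 0)
    failed = verdicts.get("judge_failed_accepts", 0)
    total = accepted + rejected
    real_accepts = accepted - failed      # assumption: failed accepts are inside `accepted`
    real_total = real_accepts + rejected  # rows where the judge actually answered
    return {
        "apparent_accept_rate": accepted / total if total else None,
        "real_accept_rate": real_accepts / real_total if real_total else None,
        "failed_share": failed / total if total else None,
    }


# qwen3.5:9b row from the table, under the assumption stated above.
rates = real_rates({"accepted": 23, "rejected": 7, "judge_failed_accepts": 19})
for name, value in rates.items():
    print(f"{name}: {value:.2f}")
```

To apply it to a report, load the file with `json.load` and pass its `comparatorVerdicts` object to `real_rates`.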

Where the code lives

CLI entry point — wikantik-extract-cli/src/main/java/com/wikantik/extractcli/JudgeExperimentCli.java
Shell wrapper — bin/kg-judge-experiment.sh
Judge implementations — wikantik-main/src/main/java/com/wikantik/knowledge/extraction/{Ollama,Claude}ProposalJudge.java
Production wiring (--judge switch) — wikantik-extract-cli/src/main/java/com/wikantik/extractcli/BootstrapExtractionCli.java