Knowledge Graph Completion

Knowledge Graph Completion (KGC) is the task of inferring missing triples `(s, r, o)` in a graph. In a production Wikantik instance, this isn't an academic exercise; it's the mechanism that turns a sparse set of extracted entities into a dense reasoning substrate for agents.

A complete KG allows an agent to answer "What is the security posture of the authentication service?" even if no single document explicitly links `AuthenticationService` to `OAuth2`.

1. The Geometry of Link Prediction

Link prediction assumes that entities and relations can be mapped into a continuous vector space in which a score function $f_r(s, o)$ measures how plausible a triple is: true triples should score higher than false ones.

Translational Distance Models (TransE, RotatE)

In **TransE**, the relation is a translation vector: $\mathbf{s} + \mathbf{r} \approx \mathbf{o}$.

* **Failure mode:** Cannot handle 1-to-N relations. If both `(USA, has_state, NewYork)` and `(USA, has_state, California)` hold, TransE pulls `NewYork` and `California` toward the same vector, because both must approximate $\mathbf{s} + \mathbf{r}$.

* **Production Fix (RotatE):** Maps entities to complex vectors in $\mathbb{C}^d$ and relations to rotations: $\mathbf{o} = \mathbf{s} \circ \mathbf{r}$, where $\circ$ is the element-wise product and $|\mathbf{r}_i| = 1$. This handles symmetry, antisymmetry, inversion, and composition.
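
As a rough illustration of the two scoring functions above, the sketch below uses plain Python with toy 2-dimensional embeddings. The embedding values and the helper names `transe_score` and `rotate_score` are invented for this example and are not part of any Wikantik module. Scores are negative distances, so higher means more plausible.

```python
import cmath
import math

def transe_score(s, r, o):
    """Negative Euclidean distance between s + r and o (higher = more plausible)."""
    return -math.sqrt(sum((si + ri - oi) ** 2 for si, ri, oi in zip(s, r, o)))

def rotate_score(s, phases, o):
    """Negative distance between the rotated subject and the object.

    Each relation component is exp(i * phase), which has modulus 1, so it
    rotates the matching subject coordinate without rescaling it.
    """
    return -sum(abs(si * cmath.exp(1j * p) - oi) for si, p, oi in zip(s, phases, o))

# TransE: only one object vector can fit (USA, has_state, ?) perfectly.
usa, has_state = [1.0, 0.0], [0.5, 0.5]
new_york = [1.5, 0.5]     # exactly usa + has_state
california = [1.5, 0.9]   # any distinct vector scores strictly lower
print(transe_score(usa, has_state, new_york), transe_score(usa, has_state, california))

# RotatE: a rotation by pi is its own inverse, so it models a symmetric relation.
symmetric = [math.pi, math.pi]
a = [1 + 1j, 2 - 1j]
b = [-x for x in a]       # a rotated by pi
print(rotate_score(a, symmetric, b), rotate_score(b, symmetric, a))
```

Both RotatE scores come out at (numerically) zero, which is the symmetric pattern a single translation vector cannot express unless it is the zero vector.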

Bilinear Models (ComplEx)

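**ComplEx** uses the Hermitian dot product in complex space:

$$f_r(s, o) = \text{Re}(\langle \mathbf{w}_r, \mathbf{e}_s, \bar{\mathbf{e}}_o \rangle)$$

Here $\mathbf{w}_r$ is the relation embedding, $\mathbf{e}_s$ and $\mathbf{e}_o$ are the entity embeddings, and $\bar{\mathbf{e}}_o$ is the complex conjugate of $\mathbf{e}_o$. ComplEx is a strong, widely used baseline for large-scale KGs because its parameter count grows linearly with the number of entities and it captures asymmetric relations (e.g., `parent_of`) effectively.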
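
To make the asymmetry concrete, here is a minimal sketch of the ComplEx score using Python's built-in complex numbers. The embeddings and the relation name `sibling_of` are made-up toy values, not trained parameters. The point it illustrates: a relation vector with a nonzero imaginary part scores `(s, r, o)` and `(o, r, s)` differently, while a purely real relation vector scores them identically.

```python
def complex_score(w_r, e_s, e_o):
    """Re(<w_r, e_s, conj(e_o)>): real part of the trilinear product."""
    return sum(w * s * o.conjugate() for w, s, o in zip(w_r, e_s, e_o)).real

# Toy 2-dimensional complex embeddings (illustrative values only).
alice = [1 + 2j, 0.5 - 1j]
bob = [2 - 1j, 1 + 1j]
parent_of = [0.3 + 1j, -0.2 + 0.8j]   # nonzero imaginary part: asymmetric
sibling_of = [0.7 + 0j, 0.4 + 0j]     # purely real: symmetric

forward = complex_score(parent_of, alice, bob)
backward = complex_score(parent_of, bob, alice)
print(forward, backward)               # the two directions differ

print(complex_score(sibling_of, alice, bob), complex_score(sibling_of, bob, alice))
```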

2. LLM-Augmented Extraction (The Production Path)

While embedding models predict links from *existing* structure, LLMs extract links from *unstructured evidence*. In Wikantik, we use a verification loop:

```python
# Assumed to be provided by the surrounding application:
# kg, kg_embedding_model, llm, and THRESHOLD_ANOMALY.

def verify_extracted_triple(subject, relation, obj, context_chunk):
    # 1. Structural Check
    if not kg.has_relation_type(relation):
        return False, "INVALID_RELATION"

    # 2. Embedding Consensus (using RotatE score)
    score = kg_embedding_model.score(subject, relation, obj)
    if score < THRESHOLD_ANOMALY:
        # If the model is shocked by this triple, require higher LLM confidence
        min_confidence = 0.95
    else:
        min_confidence = 0.70

    # 3. LLM Multi-Pass Verification
    return llm.verify(
        f"Does '{context_chunk}' prove ({subject}, {relation}, {obj})?",
        min_confidence=min_confidence,
    )
```
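
The snippet above leaves the behavior of `llm.verify` unspecified. One possible reading of "multi-pass" in step 3 is to ask the same question several times and accept the triple only when a majority of passes clear the confidence bar. The sketch below is an assumption, not the actual Wikantik implementation: `multi_pass_verify`, the `ask_once` callable, the reason strings other than the shape `(bool, reason)`, and the `uses` relation in the toy prompt are all hypothetical.

```python
from typing import Callable, Tuple

def multi_pass_verify(
    ask_once: Callable[[str], float],
    prompt: str,
    min_confidence: float,
    passes: int = 3,
) -> Tuple[bool, str]:
    """Accept only if a strict majority of passes reach min_confidence.

    `ask_once` is a hypothetical callable returning the model's confidence
    (0.0 to 1.0) that the context supports the triple.
    """
    confidences = [ask_once(prompt) for _ in range(passes)]
    votes = sum(1 for c in confidences if c >= min_confidence)
    if votes * 2 > passes:
        return True, "VERIFIED"
    return False, "LOW_CONFIDENCE"

# Stub standing in for a real LLM call: returns canned confidences in order.
canned = iter([0.97, 0.91, 0.99])
result = multi_pass_verify(
    lambda _prompt: next(canned),
    "Does 'AuthenticationService uses OAuth2' prove (AuthenticationService, uses, OAuth2)?",
    min_confidence=0.95,
)
print(result)
```

With the canned values, two of three passes reach 0.95, so the triple is accepted; at the relaxed 0.70 threshold all three would pass.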

3. Evaluation: Moving Beyond Hits@10

Standard academic benchmarks such as FB15k-237 have often leaked into LLM training sets, so scores on them can overstate real performance. For production KGC, you must measure:

| Metric | Why it matters | Calculation |
|---|---|---|
| **MRR (Mean Reciprocal Rank)** | Rewards the model for ranking the true answer at #1 rather than #10. | $\frac{1}{\lvert Q \rvert} \sum_{i=1}^{\lvert Q \rvert} \frac{1}{\text{rank}_i}$ |
| **Filtered Hits@1** | Strict accuracy. "Filtered" means other known-true triples are removed from the candidate list before ranking, so we don't penalize the model for picking a *different* true triple that isn't the ground truth for this specific test case. | Queries with the true answer at rank 1 / total queries |
| **Relation-Specific Precision** | Some relations (e.g., `is_a`) are easier than others (e.g., `impacts`). | $\frac{TP}{TP + FP}$ per relation type |
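
The first two metrics in the table can be sketched in a few lines of plain Python. The function names, the score dictionary, and the extra ranks `2` and `4` are illustrative assumptions; a real evaluation would rank every candidate entity for every test query.

```python
def filtered_rank(scores, target, known_true):
    """Rank of `target` among candidates, ignoring other known-true answers.

    scores: dict mapping candidate entity -> model score (higher is better).
    known_true: set of entities that are correct answers for this query.
    """
    target_score = scores[target]
    better = sum(
        1
        for entity, score in scores.items()
        if score > target_score and entity not in known_true
    )
    return better + 1

def mrr(ranks):
    """Mean reciprocal rank over a list of 1-based ranks."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def hits_at_k(ranks, k=1):
    """Fraction of queries whose true answer is ranked at k or better."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

# Toy query (USA, has_state, ?) where the test answer is California.
scores = {"NewYork": 0.9, "California": 0.8, "Paris": 0.4}
raw = filtered_rank(scores, "California", known_true=set())                   # 2
filtered = filtered_rank(scores, "California", {"NewYork", "California"})     # 1

ranks = [filtered, 2, 4]   # ranks for three hypothetical test queries
print(mrr(ranks), hits_at_k(ranks, k=1))
```

Without filtering, `NewYork` outranks `California` and the model is penalized for a correct answer; with filtering, the same prediction counts as a hit at rank 1.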

**Critical Trap:** Beware of "Entity Leakage" through inverse relations. If your training set contains `(A, part_of, B)` and your test set contains `(B, contains, A)`, a simple model will "predict" the link via inversion without understanding the semantics, which inflates test scores.
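
A quick way to check a train/test split for this trap is to look for test triples whose entity pair already appears reversed in the training set. The sketch below is one simple heuristic, with the hypothetical helper name `find_inverse_leaks`; it flags candidates for review rather than proving that two relations are true inverses.

```python
from collections import Counter

def find_inverse_leaks(train, test):
    """Return test triples whose reversed entity pair already appears in train.

    Also counts (train_relation, test_relation) co-occurrences, which helps
    spot inverse relation pairs such as part_of / contains.
    """
    train_by_pair = {}
    for s, r, o in train:
        train_by_pair.setdefault((s, o), set()).add(r)

    leaks = []
    relation_pairs = Counter()
    for s, r, o in test:
        reversed_relations = train_by_pair.get((o, s), set())
        if reversed_relations:
            leaks.append((s, r, o))
        for train_relation in reversed_relations:
            relation_pairs[(train_relation, r)] += 1
    return leaks, relation_pairs

train = [("A", "part_of", "B"), ("C", "part_of", "D")]
test = [("B", "contains", "A"), ("E", "contains", "F")]
leaks, relation_pairs = find_inverse_leaks(train, test)
print(leaks)            # [('B', 'contains', 'A')]
print(relation_pairs)   # Counter({('part_of', 'contains'): 1})
```

Relation pairs with a high count are candidates for being merged, or for having their inverse triples removed from the test split.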

4. Implementation Checklist

1. **Entity Resolution First:** If `OpenAI` and `OpenAI Inc.` are separate nodes, KGC will fail, because the evidence for one real-world entity is split across two embeddings. Perform deterministic string normalization and LLM-based fuzzy matching before training embeddings.

2. **Negative Sampling:** To train, you need "false" triples. Generate these by corrupting true triples (replace `o` with a random entity `o'`, or `s` with a random `s'`), and discard any corrupted triple that already exists in the graph.

3. **Graph Neural Networks (CompGCN):** If your graph has high-order dependencies (A affects B, B affects C, therefore A affects C), use a GCN layer to propagate features before the scoring function.

4. **Schema Enforcement:** Never allow an LLM to invent a relation type. Use a `RelationRegistry` to map LLM-produced strings to canonical IDs.
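
Items 1, 2, and 4 of the checklist can be sketched with the standard library alone. Everything here is illustrative: `normalize_entity`, `corrupt_object`, the suffix list, and the alias table are assumptions, and this `RelationRegistry` is a minimal stand-in for whatever class of that name exists in your codebase. LLM-based fuzzy matching and the GNN layer are out of scope for the sketch.

```python
import random
import re

def normalize_entity(name):
    """Deterministic normalization: lowercase, strip punctuation and a few suffixes."""
    cleaned = re.sub(r"[^\w\s]", "", name.strip().lower())
    cleaned = re.sub(r"\b(inc|ltd|llc|corp)\b", "", cleaned)
    return " ".join(cleaned.split())

def corrupt_object(triple, entities, known_triples, rng):
    """Replace the object with a random entity, skipping accidental true triples."""
    s, r, o = triple
    candidates = [e for e in entities if e != o and (s, r, e) not in known_triples]
    if not candidates:
        return None  # no valid negative exists for this triple
    return (s, r, rng.choice(candidates))

class RelationRegistry:
    """Maps free-form relation strings to canonical IDs; unknown strings are rejected."""

    def __init__(self, aliases):
        self._aliases = {key.lower(): value for key, value in aliases.items()}

    def resolve(self, raw):
        # None means the string is not in the schema and must not be inserted.
        return self._aliases.get(raw.strip().lower())

print(normalize_entity("OpenAI Inc.") == normalize_entity("OpenAI"))   # True

known = {("USA", "has_state", "NewYork"), ("USA", "has_state", "California")}
entities = ["California", "NewYork", "Paris"]
negative = corrupt_object(("USA", "has_state", "NewYork"), entities, known, random.Random(0))
print(negative)   # ('USA', 'has_state', 'Paris'), the only candidate not already true

registry = RelationRegistry({"part of": "part_of", "is part of": "part_of"})
print(registry.resolve("Is Part Of"), registry.resolve("invented_by"))  # part_of None
```

Filtering against `known_triples` matters for the same reason as filtered evaluation: without it, `(USA, has_state, California)` could be sampled as a "false" triple and the model would be trained against a true fact.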

Further Reading

* [EntityResolutionTechniques](EntityResolutionTechniques)

* [GraphRAG](GraphRAG)

* [KnowledgeGraphVsRelationalDatabase](KnowledgeGraphVsRelationalDatabase)