Knowledge Extraction: From Text to Triples
Knowledge Extraction (KE) is the multi-stage process of transforming unstructured text into structured, machine-readable facts, typically represented as triples: `(Subject, Predicate, Object)`.
1. Named Entity Recognition (NER)
The first step is identifying the entities, which become the nodes of the graph.
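* **Models:** Modern NER systems typically fine-tune Transformer encoders (e.g., SpanBERT or RoBERTa-large) to classify tokens or spans.
* **BIO Tagging:** The standard sequence-labeling format: `B-PER` (beginning of a person mention), `I-PER` (inside a person mention), `O` (outside any entity).
* **Entity Linking (EL):** Mapping a mention to a knowledge-base identifier based on context, e.g., resolving "Apple" to the Wikidata ID `Q312` (the company Apple Inc.) rather than to the entry for the fruit.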
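To make the BIO format concrete, here is a minimal sketch that groups tagged tokens back into entity spans. The function name `bio_to_spans` and the `ORG` tag are illustrative assumptions; in a real system the tags would come from a trained tagger rather than being written by hand.

```python
from typing import List, Optional, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_text, entity_type) pairs."""
    spans: List[Tuple[str, str]] = []
    current: List[str] = []
    current_type: Optional[str] = None

    for token, tag in zip(tokens, tags):
        # An I- tag of the same type extends the span that is currently open.
        if tag.startswith("I-") and current and tag[2:] == current_type:
            current.append(token)
            continue
        # Any other tag closes the open span, if there is one.
        if current:
            spans.append((" ".join(current), current_type))
            current, current_type = [], None
        # A B- tag opens a new span. "O" and stray I- tags are skipped
        # (a simplification chosen for this sketch).
        if tag.startswith("B-"):
            current, current_type = [token], tag[2:]

    if current:
        spans.append((" ".join(current), current_type))
    return spans


# Hand-written tags standing in for the output of a NER model.
tokens = ["Elon", "Musk", "founded", "SpaceX", "."]
tags = ["B-PER", "I-PER", "O", "B-ORG", "O"]
spans = bio_to_spans(tokens, tags)
print(spans)  # [('Elon Musk', 'PER'), ('SpaceX', 'ORG')]
```

The resulting spans are the mentions that entity linking would then map to knowledge-base identifiers.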
2. Relation Extraction (RE)
The second step is identifying the relationships between entities, which become the edges of the graph.
* **Sentence-Level RE:** Predicting the relation between two entities in a single sentence.
* **Concrete Example:** From "Elon Musk founded SpaceX," the system extracts `(Elon Musk, founded, SpaceX)`.
* **Model Approach:** Concatenate the two entity embeddings with the sentence embedding and feed the result to a classification head (e.g., a linear layer followed by `softmax`) over a fixed set of known relations (`works_at`, `located_in`, `author_of`).
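The concatenate-then-classify idea can be sketched in plain Python. Everything numeric here is a placeholder: the embeddings and weights are random values, the dimension of 4 is arbitrary, and `classify_relation` is a hypothetical helper name. A trained model would supply encoder outputs and learned weights instead.

```python
import math
import random
from typing import List

RELATIONS = ["works_at", "located_in", "author_of"]


def softmax(logits: List[float]) -> List[float]:
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_relation(subj_vec, obj_vec, sent_vec, weights, bias):
    """Linear layer plus softmax over the concatenated feature vector."""
    features = subj_vec + obj_vec + sent_vec  # list concatenation
    logits = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)


# Random stand-ins for encoder outputs and learned parameters.
rng = random.Random(0)
dim = 4
subj_vec = [rng.uniform(-1, 1) for _ in range(dim)]
obj_vec = [rng.uniform(-1, 1) for _ in range(dim)]
sent_vec = [rng.uniform(-1, 1) for _ in range(dim)]
weights = [[rng.uniform(-1, 1) for _ in range(3 * dim)] for _ in RELATIONS]
bias = [0.0] * len(RELATIONS)

probs = classify_relation(subj_vec, obj_vec, sent_vec, weights, bias)
predicted = RELATIONS[probs.index(max(probs))]
print(predicted, [round(p, 3) for p in probs])
```

Because the head scores only the relations in `RELATIONS`, a relation outside that set cannot be predicted; closed-set classifiers of this kind usually add an explicit "no relation" class for that case.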
3. Event Extraction
Events are more complex than static relations; they include **triggers** and **arguments**.
* **Trigger:** The word indicating the event (e.g., "acquired").
* **Arguments:** The participants (e.g., Buyer, Target, Price, Date).
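* **Schema:** An event type with named argument slots, e.g., `Acquisition(Buyer: Google, Target: Fitbit, Date: 2019)`. Slots that the text does not fill, such as Price here, are left empty.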
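One way to hold an extracted event, shown here as a sketch, is a small record with a trigger and a role-to-value mapping. The `Event` class, the `to_triples` method, and the `event_1` identifier are illustrative names, and turning the event into its own node so that it can be stored as triples is one possible design, not the only one.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Event:
    event_type: str                      # e.g. "Acquisition"
    trigger: str                         # the word that signals the event
    arguments: Dict[str, str] = field(default_factory=dict)  # role -> value

    def to_triples(self, event_id: str) -> List[Tuple[str, str, str]]:
        """Represent the event as a node with one edge per argument."""
        triples = [
            (event_id, "type", self.event_type),
            (event_id, "trigger", self.trigger),
        ]
        triples += [
            (event_id, role.lower(), value)
            for role, value in self.arguments.items()
        ]
        return triples


event = Event(
    event_type="Acquisition",
    trigger="acquired",
    arguments={"Buyer": "Google", "Target": "Fitbit", "Date": "2019"},
)
triples = event.to_triples("event_1")
for triple in triples:
    print(triple)
```

Giving the event its own node keeps all arguments attached to one occurrence, which a single `(Subject, Predicate, Object)` triple cannot do for events with more than two participants.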
4. LLM-Based Extraction
With Large Language Models, the multi-stage pipeline can often be collapsed into a single prompt whose output is constrained to a fixed structure using **JSON Schema enforcement**.
* **Prompt Pattern:** *"Extract all company acquisitions from the text. Return a JSON list of objects with keys: buyer, target, year."*
* **Validation:** Use a library such as `pydantic` (or `instructor`, which builds on it) to check that the LLM output parses and conforms to the expected fields and data types before saving it to the Knowledge Graph.
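The validation step might look like the following sketch, which assumes `pydantic` is installed. The `Acquisition` model mirrors the keys from the prompt pattern; `parse_acquisitions` is a hypothetical helper, and `llm_output` is a hand-written stand-in for a model reply (no LLM is called, and the second record is a made-up malformed entry).

```python
import json
from typing import List

from pydantic import BaseModel, ValidationError


class Acquisition(BaseModel):
    buyer: str
    target: str
    year: int


def parse_acquisitions(raw: str) -> List[Acquisition]:
    """Parse a JSON list and keep only the records that match the schema."""
    valid: List[Acquisition] = []
    for item in json.loads(raw):
        try:
            valid.append(Acquisition(**item))
        except (ValidationError, TypeError):
            # Missing keys, wrong types, or non-object items are dropped.
            # A real pipeline might log them or re-prompt the model instead.
            continue
    return valid


# Stand-in for an LLM reply: one well-formed record, one with a bad year.
llm_output = json.dumps([
    {"buyer": "Google", "target": "Fitbit", "year": 2019},
    {"buyer": "Acme", "target": "Widgets", "year": "unknown"},
])

records = parse_acquisitions(llm_output)
print(records)
```

Only the records that survive validation would be converted to triples and written to the graph, so a malformed reply cannot introduce wrongly typed facts.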
---
**See Also:**
- [Knowledge Graph Construction Pipeline](KnowledgeGraphConstructionPipeline) — Assembling the extracted triples.
- [Natural Language Processing](NaturalLanguageProcessing) — The core linguistic toolset.
- [Embeddings In Gen AI](EmbeddingsInGenAI) — Vectorizing extracted entities.