Entity Resolution: Record Linkage and Deduplication

Entity Resolution (ER) is the task of identifying and merging records that refer to the same real-world entity, either within a single dataset (deduplication) or across disparate datasets (record linkage).

1. The Multi-Stage ER Pipeline

Comparing every record against every other record requires $O(N^2)$ comparisons, which is impractical for large datasets. ER systems therefore use a staged pipeline:

1. **Standardization:** Normalizing names (e.g., "Corp." $\to$ "Corporation"), addresses, and phone numbers.

2. **Blocking:** Partitioning the dataset into "blocks" using a shared key (e.g., Zip Code + first 3 letters of Last Name), so that only records in the same block are compared.

3. **Matching:** Calculating detailed similarity scores within blocks.

4. **Clustering:** Grouping matched pairs into single entities using algorithms like Connected Components or Hierarchical Clustering.
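The four stages above can be sketched end to end on a toy dataset. This is a minimal illustration, not a production design: the records, the `ABBREVIATIONS` table, the averaged `SequenceMatcher` score, and the `0.85` threshold are all assumptions chosen for the example, and clustering is done with a small union-find that yields connected components.

```python
import re
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical toy records: (id, last_name, company, zip_code)
RECORDS = [
    (1, "Smith", "Acme Corp.", "94105"),
    (2, "Smith", "ACME Corporation", "94105"),
    (3, "Smithe", "Acme Corporation", "94105"),
    (4, "Jones", "Globex Corp.", "10001"),
    (5, "Smiley", "Initech Inc", "94105"),
]

# Illustrative abbreviation table (an assumption, not an exhaustive list).
ABBREVIATIONS = {"corp": "corporation", "inc": "incorporated"}


def standardize(text):
    """Stage 1: lowercase, strip punctuation, expand known abbreviations."""
    tokens = re.sub(r"[^a-z0-9 ]", "", text.lower()).split()
    return " ".join(ABBREVIATIONS.get(t, t) for t in tokens)


def blocking_key(record):
    """Stage 2: zip code + first 3 letters of the standardized last name."""
    _, last_name, _, zip_code = record
    return zip_code + standardize(last_name)[:3]


def similarity(a, b):
    """Stage 3: average string similarity over last name and company."""
    pairs = [(a[1], b[1]), (a[2], b[2])]
    scores = [
        SequenceMatcher(None, standardize(x), standardize(y)).ratio()
        for x, y in pairs
    ]
    return sum(scores) / len(scores)


def resolve(records, threshold=0.85):
    """Run blocking, matching, and connected-components clustering."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec)

    # Union-find over record ids; each root identifies one entity.
    parent = {rec[0]: rec[0] for rec in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Only pairs inside the same block are ever compared.
    for block in blocks.values():
        for a, b in combinations(block, 2):
            if similarity(a, b) >= threshold:
                parent[find(a[0])] = find(b[0])

    # Stage 4: group ids by their root.
    clusters = defaultdict(set)
    for rec in records:
        clusters[find(rec[0])].add(rec[0])
    return sorted(sorted(c) for c in clusters.values())
```

Calling `resolve(RECORDS)` groups records 1, 2, and 3 into one entity and leaves 4 and 5 on their own. Record 5 shares a block with the "Smith" records but falls below the threshold, which shows that blocking only proposes candidates and matching makes the final call.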

2. Advanced Indexing: Locality-Sensitive Hashing (LSH)

LSH performs "fuzzy blocking": it hashes similar items into the same bucket with high probability, so candidate pairs can be found even when no exact blocking key agrees.

* **MinHash LSH:** Efficient for Jaccard similarity. Each record is converted to a set of $k$-shingles, and that set is summarized by a MinHash signature.

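* **Concrete Example:** To find duplicate customer records with a Jaccard similarity $> 0.8$, use MinHash with 100 permutations split into bands of 5 rows each (20 bands). Two records become a candidate pair if they agree on every row of at least one band. Depending on the data, this can shrink the number of compared pairs by roughly three orders of magnitude ($\sim 1000\times$) relative to a full pairwise scan.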
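A minimal standard-library sketch of the banding scheme in the example above (100 hash functions, 5 rows per band). Seeded SHA-1 hashes stand in for random permutations here, and the function names are illustrative rather than taken from any particular LSH library. `candidate_probability` evaluates the standard banding formula $1 - (1 - s^r)^b$, which gives the chance that a pair with Jaccard similarity $s$ collides in at least one of $b$ bands of $r$ rows.

```python
import hashlib


def shingles(text, k=3):
    """Set of character k-shingles of a lowercased string (len(text) >= k)."""
    text = text.lower()
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def _hash(seed, item):
    """64-bit hash of `item` under hash function number `seed`."""
    digest = hashlib.sha1(f"{seed}:{item}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(items, num_perm=100):
    """Keep the minimum hash value under each of `num_perm` hash functions."""
    return [min(_hash(seed, item) for item in items) for seed in range(num_perm)]


def band_keys(signature, rows_per_band=5):
    """Split the signature into bands; each (band index, rows) is a bucket key."""
    return {
        (i // rows_per_band, tuple(signature[i:i + rows_per_band]))
        for i in range(0, len(signature), rows_per_band)
    }


def is_candidate_pair(sig_a, sig_b, rows_per_band=5):
    """Two records are candidates if they share at least one bucket key."""
    return bool(band_keys(sig_a, rows_per_band) & band_keys(sig_b, rows_per_band))


def candidate_probability(s, rows_per_band=5, num_perm=100):
    """P(candidate) = 1 - (1 - s^r)^b for a pair with Jaccard similarity s."""
    bands = num_perm // rows_per_band
    return 1 - (1 - s ** rows_per_band) ** bands
```

With these settings, `candidate_probability(0.8)` is above 0.999, so true duplicates at the target similarity are very rarely missed, while `candidate_probability(0.2)` is below 0.01. Pairs of intermediate similarity still become candidates fairly often, which is why a detailed matching step follows blocking.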

3. Probabilistic Matching: Fellegi-Sunter

The Fellegi-Sunter model weights each field agreement by how discriminating that field is, as captured by two conditional probabilities per field.

* **m-probability:** Probability that a field agrees given that the two records are a MATCH (high for accurately recorded fields).

* **u-probability:** Probability that a field agrees given that the two records are a NON-MATCH (low for near-unique fields like SSN).

* **Concrete Logic:** A match on "Last Name = Smith" provides low evidence (high $u$), while a match on "Social Security Number" provides high evidence (extremely low $u$).
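The logic above is usually expressed as log-likelihood-ratio weights: $\log_2(m/u)$ when a field agrees and $\log_2\big((1-m)/(1-u)\big)$ when it disagrees, summed across fields under a conditional-independence assumption. The sketch below applies those formulas; the `m` and `u` values in `FIELDS` are made up for illustration and would normally be estimated from data (for example with EM).

```python
import math

# Hypothetical m/u values, chosen only to illustrate the contrast.
FIELDS = {
    "last_name": {"m": 0.95, "u": 0.01},    # a common surname: relatively high u
    "ssn": {"m": 0.99, "u": 0.000001},      # near-unique identifier: tiny u
}


def agreement_weight(m, u):
    """Evidence (in bits) contributed when the field agrees."""
    return math.log2(m / u)


def disagreement_weight(m, u):
    """Evidence (in bits) contributed when the field disagrees; negative if m > u."""
    return math.log2((1 - m) / (1 - u))


def match_score(agreements, fields=FIELDS):
    """Sum field weights, assuming fields are conditionally independent."""
    total = 0.0
    for name, agrees in agreements.items():
        m, u = fields[name]["m"], fields[name]["u"]
        total += agreement_weight(m, u) if agrees else disagreement_weight(m, u)
    return total
```

Under these assumed values, agreeing on `last_name` contributes about 6.6 bits while agreeing on `ssn` contributes about 19.9 bits. A pair is then classified as a match, a non-match, or a candidate for manual review by comparing `match_score` against upper and lower thresholds.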

4. Machine Learning for ER

Modern ER systems often use **Siamese Networks** to learn dense embeddings of records.

* **Architecture:** Two weight-sharing BERT-style encoders (in practice, one encoder applied twice) process Record A and Record B.

* **Loss Function:** Triplet Loss or Contrastive Loss, which trains the encoder so that matched records have a high **Cosine Similarity** (e.g., $> 0.9$) while non-matches are pushed apart in the vector space.
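To make the training objective concrete, here is a small sketch of one cosine-based contrastive loss for a single pair of embeddings. It assumes the embeddings have already been produced by the shared encoder. The `margin` value of 0.5 is an arbitrary choice for illustration, and real training code would compute this over batches in a deep-learning framework rather than in plain Python.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_contrastive_loss(emb_a, emb_b, is_match, margin=0.5):
    """Per-pair loss on cosine similarity.

    Matches are penalized for any similarity below 1.
    Non-matches are penalized only while similarity exceeds `margin`.
    """
    cos = cosine_similarity(emb_a, emb_b)
    if is_match:
        return 1.0 - cos
    return max(0.0, cos - margin)
```

For a matched pair the loss falls to zero only as the embeddings align, and for a non-matched pair it is already zero once the similarity drops to the margin, so the encoder is not pushed to separate pairs that are far enough apart.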

---

**See Also:**

- [Normalization And Denormalization](NormalizationAndDenormalization) — Preparing data for ER.

- [Embeddings In Gen AI](EmbeddingsInGenAI) — Using vectors for semantic matching.

- [Knowledge Graph Construction Pipeline](KnowledgeGraphConstructionPipeline) — Integrating resolved entities into a graph.