Entity Resolution: Record Linkage and Deduplication

Entity Resolution (ER) is the task of identifying and merging records that refer to the same real-world entity across disparate datasets.

1. The Multi-Stage ER Pipeline

Comparing every record against every other record takes $O(N^2)$ comparisons, which is impractical for large datasets. ER systems therefore use a staged pipeline:

Standardization: Normalizing names (e.g., "Corp." $\to$ "Corporation"), addresses, and phone numbers.
Blocking: Partitioning the dataset into "blocks" using a shared key (e.g., Zip Code + first 3 letters of Last Name), so that only records in the same block are compared.
Matching: Calculating detailed similarity scores for record pairs within each block.
Clustering: Grouping matched pairs into single entities using algorithms like Connected Components or Hierarchical Clustering.
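
The four stages above can be sketched in a few lines of Python. This is a minimal illustration, not a production design: the toy `RECORDS`, the `0.85` threshold, and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions made for the example, and clustering is done with a small union-find, which yields connected components.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical toy data: (record id, last name, zip code).
RECORDS = [
    (1, "Smith", "10001"),
    (2, " smith", "10001"),
    (3, "Smithe", "10001"),
    (4, "Smithson", "10001"),
    (5, "Jones", "94105"),
]


def standardize(name):
    """Lowercase and trim; real pipelines also expand abbreviations."""
    return name.strip().lower()


def block_key(name, zip_code):
    """Blocking key: zip code + first three letters of the last name."""
    return zip_code + name[:3]


def resolve(records, threshold=0.85):
    """Return clusters of record ids (connected components of matched pairs)."""
    names = {rid: standardize(name) for rid, name, _ in records}

    # Blocking: only records sharing a key are ever compared.
    blocks = defaultdict(list)
    for rid, _, zip_code in records:
        blocks[block_key(names[rid], zip_code)].append(rid)

    # Union-find over record ids, used for the clustering step.
    parent = {rid: rid for rid in names}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Matching: detailed pairwise comparison inside each block.
    for ids in blocks.values():
        for a, b in combinations(ids, 2):
            score = SequenceMatcher(None, names[a], names[b]).ratio()
            if score >= threshold:
                parent[find(a)] = find(b)  # merge the two clusters

    clusters = defaultdict(set)
    for rid in names:
        clusters[find(rid)].add(rid)
    return sorted(sorted(c) for c in clusters.values())


print(resolve(RECORDS))
```

Connected components merge transitively: if A matches B and B matches C, all three land in one entity even when A and C were never compared directly. That keeps the sketch simple, but a single false-positive pair can chain unrelated clusters together.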

2. Advanced Indexing: Locality-Sensitive Hashing (LSH)

LSH performs "fuzzy blocking": it hashes similar items into the same bucket with high probability and dissimilar items into the same bucket with low probability, so no exact shared key is required.

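MinHash LSH: Efficient for Jaccard similarity. Each record is converted to a set of $k$-shingles, and the set is summarized by a MinHash signature.
Concrete Example: To find duplicate customer records with a Jaccard similarity $> 0.8$, use MinHash with 100 permutations split into 20 bands of 5 rows each (a band size of 5). Two records become a candidate pair if they agree on every row of at least one band, which happens with probability $1 - (1 - s^5)^{20}$ for Jaccard similarity $s$ (about 0.9996 at $s = 0.8$). Only candidate pairs go on to detailed matching. The resulting reduction in comparisons depends on the data and can reach $1000\times$ or more compared to a full scan.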
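
The following sketch shows how the banding scheme in the example might be implemented with only the standard library. It reads "band size of 5" as 5 rows per band (so 20 bands for 100 permutations), which is an interpretation rather than something the text states. The helper names, the seed, and the choice of `zlib.crc32` plus random linear hash functions are assumptions for illustration. Production systems would typically use a dedicated MinHash library.

```python
import random
import zlib

PRIME = (1 << 61) - 1  # Mersenne prime, larger than any 32-bit shingle hash


def shingles(text, k=3):
    """Set of character k-shingles of a lowercased string."""
    text = text.lower()
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def make_hash_params(num_perm=100, seed=42):
    """Random (a, b) pairs defining h(x) = (a * x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(PRIME))
            for _ in range(num_perm)]


def minhash(shingle_set, params):
    """MinHash signature: the minimum hash value under each hash function.

    Assumes a non-empty shingle set (text of at least k characters).
    """
    xs = [zlib.crc32(s.encode("utf-8")) for s in shingle_set]
    return [min((a * x + b) % PRIME for x in xs) for a, b in params]


def band_keys(signature, rows=5):
    """Split a signature into bands of `rows` values; each band is a bucket key."""
    return [(i // rows, tuple(signature[i:i + rows]))
            for i in range(0, len(signature), rows)]


def candidate_probability(s, rows=5, bands=20):
    """Probability that two records with Jaccard similarity s share a bucket."""
    return 1.0 - (1.0 - s ** rows) ** bands


params = make_hash_params()
sig_a = minhash(shingles("Acme Corporation, 12 Main St"), params)
sig_b = minhash(shingles("Acme Corporation, 12 Main Street"), params)

# Two records are a candidate pair if any band (index + values) coincides.
is_candidate = bool(set(band_keys(sig_a)) & set(band_keys(sig_b)))
print(is_candidate, round(candidate_probability(0.8), 4))
```

Under these settings, pairs at similarity 0.8 collide in at least one band with probability of roughly 0.9996, while pairs at similarity 0.3 collide less than 5% of the time. Fewer rows per band raises recall and admits more false candidates. More rows per band does the opposite.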

3. Probabilistic Matching: Fellegi-Sunter

The Fellegi-Sunter model assigns each field a weight based on how much more likely agreement on that field is among true matches than among non-matches.

m-probability: Probability that a field agrees given that the two records are a MATCH (high for reliably recorded fields).
u-probability: Probability that a field agrees given that the two records are a NON-MATCH (low for highly discriminating fields like SSN).
Concrete Logic: Agreement on "Last Name = Smith" provides weak evidence (high $u$), while agreement on "Social Security Number" provides strong evidence (extremely low $u$).
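
In the standard formulation, agreement on a field contributes a weight of $\log_2(m/u)$ and disagreement contributes $\log_2((1-m)/(1-u))$. Assuming the fields are conditionally independent, the weights are summed into a total score. The sketch below computes these weights. The $m$ and $u$ values in `FIELD_PARAMS` are invented for illustration; in practice they are estimated from data (for example with expectation-maximization).

```python
import math

# Illustrative, made-up probabilities; not taken from any real dataset.
FIELD_PARAMS = {
    "last_name": {"m": 0.95, "u": 0.01},
    "ssn": {"m": 0.99, "u": 0.000001},
}


def agreement_weight(m, u):
    """Evidence (in bits) gained when a field agrees."""
    return math.log2(m / u)


def disagreement_weight(m, u):
    """Evidence (in bits, usually negative) when a field disagrees."""
    return math.log2((1 - m) / (1 - u))


def match_score(agreements, params=FIELD_PARAMS):
    """Sum field weights, assuming conditional independence between fields.

    `agreements` maps a field name to True (agrees) or False (disagrees).
    """
    total = 0.0
    for field, agrees in agreements.items():
        m, u = params[field]["m"], params[field]["u"]
        if agrees:
            total += agreement_weight(m, u)
        else:
            total += disagreement_weight(m, u)
    return total


for field, p in FIELD_PARAMS.items():
    print(field, round(agreement_weight(p["m"], p["u"]), 2))
```

With these assumed values, agreement on the SSN is worth about three times as many bits as agreement on the last name. The total score is then compared against an upper and a lower threshold to classify a pair as a match, a non-match, or a case for manual review.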

4. Machine Learning for ER

Modern ER systems often use Siamese networks to learn dense embeddings of records.

Architecture: Two BERT-style encoders with shared weights (in practice, one encoder applied twice) process Record A and Record B.
Loss Function: Triplet Loss or Contrastive Loss. Training pulls matched records toward a high Cosine Similarity (e.g., above a decision threshold such as 0.9) while pushing non-matches apart in the vector space.
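
To make the two loss functions concrete, here is a small sketch that evaluates them on already-computed embeddings, using cosine distance ($1 -$ cosine similarity). The encoder is omitted, and the margins (`0.5` and `0.2`) are arbitrary assumptions, as are the function names. A real training loop would compute these losses over batches in a deep learning framework and backpropagate through the shared encoder.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def contrastive_loss(a, b, is_match, margin=0.5):
    """Pairwise loss: pull matches together, push non-matches past the margin."""
    distance = 1.0 - cosine_similarity(a, b)
    if is_match:
        return distance ** 2
    return max(0.0, margin - distance) ** 2


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Loss is zero once the negative is farther than the positive by `margin`."""
    d_pos = 1.0 - cosine_similarity(anchor, positive)
    d_neg = 1.0 - cosine_similarity(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Hypothetical 2-d embeddings standing in for encoder outputs.
record_a = [1.0, 0.0]
record_b = [1.0, 0.0]   # same entity as record_a
record_c = [0.0, 1.0]   # different entity

print(contrastive_loss(record_a, record_b, is_match=True))
print(contrastive_loss(record_a, record_c, is_match=False))
print(triplet_loss(record_a, record_b, record_c))
```

Contrastive loss works on labeled pairs, whereas triplet loss only needs relative judgments (the anchor is closer to the positive than to the negative). At inference time, pairs whose cosine similarity exceeds a chosen threshold are treated as matches.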

See Also:

Normalization And Denormalization — Preparing data for ER.
Embeddings In Gen AI — Using vectors for semantic matching.
Knowledge Graph Construction Pipeline — Integrating resolved entities into a graph.
