Authors: Cheng Cao, Jay Pillai, Sara Daraei, Sina Ghadermarzi (all Truveta, Inc., Bellevue, WA)
Key points
- The peer-reviewed paper "Linking patient records at scale with a hybrid approach combining contrastive learning and deterministic rules" describes Truveta’s hybrid approach to patient record linkage, which combines a transformer-based embedding model with deterministic matching rules.
- The embedding model converts identifying fields (such as name, date of birth, and ZIP code) into numeric representations, enabling records to be linked even when the information contains inconsistencies or minor errors.
- Deterministic rules are applied when high-confidence identifiers are available, balancing flexibility with precision.
- The approach outperformed baseline methods and is now deployed in production across more than 200 million records, supporting real-world use cases including linking EHR, claims, and mortality data.
Abstract
Linking patient records across disparate healthcare systems is essential to create comprehensive views of patient health, yet this task is complicated by inconsistent identifiers and data quality issues.
Although traditional deterministic and probabilistic record linkage methods have long been used for this purpose, deterministic approaches are brittle in the presence of noisy personally identifiable information (PII), while probabilistic approaches are often difficult to scale. As a result, large-scale linkage commonly relies on restrictive matching strategies that limit recall.
This work presents a hybrid record linkage approach that integrates a deep embedding model with deterministic rules, leveraging both the flexibility and noise robustness of soft embeddings and the reliability and predictable baseline performance of deterministic rules. Using a large-scale real-world dataset, a BERT-based embedding model is fine-tuned in a Siamese network with contrastive loss to encode PII fields as numeric vectors. De-duplicated identifiers (Fuzzy IDs) are then obtained through a blocking-and-clustering step using the embedding vectors.
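To make the two steps in this paragraph concrete, the following is a minimal sketch rather than the paper's implementation. It assumes the classic pairwise form of contrastive loss, Euclidean distance between embeddings, a single blocking key per record, and single-linkage clustering via union-find; the abstract confirms none of these details. The names `contrastive_loss`, `assign_fuzzy_ids`, `margin`, and `threshold` are hypothetical, and the embeddings are toy two-dimensional vectors standing in for the output of the fine-tuned model.

```python
import math
from collections import defaultdict
from itertools import combinations


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def contrastive_loss(u, v, same_person, margin=1.0):
    """Pairwise contrastive loss (assumed form, not taken from the paper).

    Matching pairs are penalized by their squared distance; non-matching
    pairs are penalized only when they fall inside the margin.
    """
    d = euclidean(u, v)
    if same_person:
        return d ** 2
    return max(0.0, margin - d) ** 2


def assign_fuzzy_ids(records, threshold=0.5):
    """Assign a cluster ID to each record.

    records: list of (record_id, blocking_key, embedding) tuples.
    Only records sharing a blocking key are compared (blocking); pairs
    closer than `threshold` are merged with union-find (clustering).
    """
    parent = {rid: rid for rid, _, _ in records}

    def find(x):
        # Follow parent links to the cluster root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    blocks = defaultdict(list)
    for rid, key, emb in records:
        blocks[key].append((rid, emb))

    for members in blocks.values():
        for (rid_a, emb_a), (rid_b, emb_b) in combinations(members, 2):
            if euclidean(emb_a, emb_b) <= threshold:
                parent[find(rid_a)] = find(rid_b)

    return {rid: find(rid) for rid, _, _ in records}


# Toy records: (record_id, hypothetical blocking key, toy embedding).
records = [
    ("r1", "1980", [0.0, 0.0]),
    ("r2", "1980", [0.1, 0.0]),
    ("r3", "1980", [2.0, 2.0]),
    ("r4", "1975", [0.0, 0.0]),
]
fuzzy_ids = assign_fuzzy_ids(records)
```

In this toy run, `r1` and `r2` share a cluster because they sit in the same block and are close together, while `r4` stays separate despite an identical embedding because it falls in a different block. That illustrates the trade-off blocking makes: far fewer comparisons, at the cost of missing matches whose blocking keys disagree.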
The approach is evaluated using multiple signals (Social Security Number, phone, and email) and is shown to outperform baseline methods. A post-processing step based on deterministic rules allows embedding-based linkage to be overridden in a subset of cases where high-confidence rules apply, such as when a high-quality identifier is available. The system is deployed on a commercial database consisting of more than 200 million PII records, demonstrating scalability in a real-world healthcare setting.
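The deterministic post-processing step might look roughly like the sketch below. It assumes one particular rule: records that share the same high-confidence identifier are merged even if the embedding step placed them in different clusters. The abstract does not specify which rules are used or whether they merge, split, or both, so this rule, the function name `apply_deterministic_override`, and the `strong_ids` mapping are all hypothetical.

```python
def apply_deterministic_override(fuzzy_ids, strong_ids):
    """Merge embedding-based clusters that share a high-confidence identifier.

    fuzzy_ids:  record_id -> cluster ID from the embedding step.
    strong_ids: record_id -> high-confidence identifier, or None if absent.
    Returns record_id -> final cluster ID.
    """
    parent = {cid: cid for cid in fuzzy_ids.values()}

    def find(x):
        # Follow parent links to the cluster root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_cluster = {}  # strong identifier -> first cluster seen with it
    for rid, cid in fuzzy_ids.items():
        sid = strong_ids.get(rid)
        if not sid:
            continue  # no high-confidence identifier: keep the embedding result
        if sid in first_cluster:
            parent[find(cid)] = find(first_cluster[sid])
        else:
            first_cluster[sid] = cid

    return {rid: find(cid) for rid, cid in fuzzy_ids.items()}


# Toy input: r1 and r3 were split by the embedding step but share identifier "X1".
embedding_clusters = {"r1": "A", "r2": "A", "r3": "B", "r4": "C"}
identifiers = {"r1": "X1", "r3": "X1", "r4": None}
final_ids = apply_deterministic_override(embedding_clusters, identifiers)
```

Records without a high-confidence identifier keep their embedding-based assignment, so the rule affects only the subset of cases where it applies.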
Read the full paper in Biology Methods and Protocols.

