Truveta Language Model unlocks EHR data for research

Truveta’s clinical expert-led AI delivers the cleanest healthcare data, including structured concepts from clinician notes

BELLEVUE, Wash. – April 12, 2023 – Today Truveta introduces the Truveta Language Model (TLM), a large-language, multi-modal AI model for transforming electronic health record (EHR) data into billions of clean and accurate data points for health research on any drug, disease, or device. TLM is trained on the largest collection of complete medical records representing the full diversity of the United States. It is the first large language model specifically designed to empower researchers to study all patient care and outcomes.

As healthcare considers the potential of AI and real-world data, the opportunities and potential consequences are real. General large language models understand language but are inaccurate within the medical domain because they are trained on the public internet, which contains no real medical records. In contrast, TLM combines pre-trained open large language models with additional training on the most complete and representative clinical data set to achieve accuracy above 90% on diagnoses, medications, lab results, lab values, clinical observations, and more.

While claims data are the standard of data used in health research today, they are created by normalizing electronic health record (EHR) data to maximize revenue reimbursement for encounters, medications, and labs, resulting in commercial bias in all claims data-based health research. Instead, TLM normalizes EHR data to maximize clinical accuracy and is trained without commercial bias, helping ensure research is conducted with data focused on clinical outcomes, not billing.

With TLM, Truveta’s community of healthcare and life science customers is currently studying concepts that were previously inaccessible in messy clinician notes but are now structured for analytics, such as seizure frequency, changes in treatment regimen, and adverse reactions to medication.

“Other industries are benefiting from the advancement of AI, but the private, fragmented, and unstructured nature of healthcare data has made applying AI to patient data extremely challenging to this point,” said Jay Nanduri, chief technology officer, Truveta. “Accurate AI requires the most advanced technology matched with an incredible volume of data trained by the best expertise. By using clinical expert-led AI to unlock the power of rich healthcare data, researchers can now ask and answer complex medical questions of a real-time, fully transparent view of U.S. health.”

Delivering the cleanest healthcare data

Healthcare data is recorded in heterogeneous systems, and clinicians, hospitals, and health systems express observations, diagnoses, medication plans, and more in millions of different ways. Clinicians use different terms based on their location, training, and expertise. “Acute COVID-19,” “COVID,” “COVID-19,” “COVID infection,” and “COVID19 _ acute infection” (and hundreds of other variations) all refer to COVID-19, and “600mg Ibuprofen” and “Ibuprofen 600mg” are the same thing. To analyze healthcare data, this diverse medical language, including misspellings and abbreviations, must be normalized to medical information ontologies (e.g., LOINC for lab tests and GUDID for medical devices).
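To make the normalization problem concrete, the short Python sketch below maps the term variations quoted above to a single concept and parses the two ibuprofen strings into the same structured form. It is a toy, rule-based illustration only and is not how TLM works: TLM is a trained language model, while the synonym table, function names, and concept labels here are hypothetical and stand in for real ontology mappings.

```python
import re
from typing import Optional, Tuple

# Hypothetical synonym table for illustration; a production system would
# map to ontology concepts and would not rely on a hand-written lookup.
CONCEPT_SYNONYMS = {
    "covid": "COVID-19",
    "covid19": "COVID-19",
    "covid infection": "COVID-19",
    "acute covid19": "COVID-19",
    "covid19 acute infection": "COVID-19",
}

# Matches a numeric dose followed by a unit, e.g. "600mg" or "600 mg".
DOSE_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*(mg|g|ml)", re.IGNORECASE)


def canonical_key(term: str) -> str:
    """Lowercase, drop hyphens, and collapse punctuation and whitespace."""
    text = term.lower().replace("-", "")
    text = re.sub(r"[^a-z0-9]+", " ", text)
    return text.strip()


def normalize_diagnosis(term: str) -> Optional[str]:
    """Return the concept label for a raw term, or None if it is unknown."""
    return CONCEPT_SYNONYMS.get(canonical_key(term))


def normalize_medication(term: str) -> Optional[Tuple[str, float, str]]:
    """Split a raw medication string into (name, dose, unit)."""
    match = DOSE_PATTERN.search(term)
    if match is None:
        return None
    # Everything outside the dose expression is treated as the drug name.
    name = (term[:match.start()] + " " + term[match.end():]).strip().lower()
    return (name, float(match.group(1)), match.group(2).lower())


raw_terms = ["Acute COVID-19", "COVID", "COVID-19", "COVID infection",
             "COVID19 _ acute infection"]
print({term: normalize_diagnosis(term) for term in raw_terms})
print(normalize_medication("600mg Ibuprofen"))
print(normalize_medication("Ibuprofen 600mg"))
```

A lookup table like this breaks down quickly once misspellings, abbreviations, and hundreds of further variations appear, which is the gap that a model trained on expert-labeled clinical terms is meant to close.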

AI models are only as good as the data they are trained upon. TLM is trained upon data from Truveta’s health system members currently representing more than 80 million patient journeys, including 5.5 billion diagnoses, 3.1 billion encounters, and 2.4 billion medication orders. Updated daily, Truveta Data combines this EHR data with insurance claims, mortality, and social drivers of health data, for unmatched breadth and depth of data for research. Using this unprecedented data, Truveta’s clinical expert annotation team labels tens of thousands of raw clinical terms to train TLM to normalize healthcare data for clinical accuracy, and then checks the results of the model as it runs.

“Truveta Language Model has been trained to understand medical terminology and concepts,” said Cezary Marcjan, vice president of AI technology at Truveta. “Our approach removes commercial bias in today’s claims data normalization. The system is solving billions of normalization problems, mapping millions of medical concepts every day with high confidence.”

Unlocking the depth of information within clinician notes

Clinician notes hold critical information about the patient journey, such as disease stages, adverse events, medication change rationales, and disease symptoms, that is found neither in claims data sets nor in most structured EHR analytics data. For example, a structured dataset might include a medication and later a diagnosis of a rash, but the clinician note is the only place where those two concepts are connected, showing the rash as an adverse reaction to the medication.

TLM combines general large language models that understand English with rich medical expertise to structure these concepts from clinician notes. Truveta Data today includes more than 2.5 billion notes, and that number grows every day. TLM can identify and normalize clinical concepts within a clinician note, detect negation (e.g., “patient denies feeling fatigued”), and map relationships between detected concepts to increase the accuracy of the structured concepts. TLM reasons over the entire medical record, accounting for changes over time, to ensure the most accurate and complete information is structured. With TLM, a researcher studying cancer would be able to see when a therapy is no longer working or when updated images indicate new disease progression that requires a change in treatment.
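As an illustration of what negation detection means in practice, the Python sketch below flags a concept as negated when a cue word such as "denies" appears shortly before it in the sentence. This is a simplified, rule-based stand-in and not TLM's method; the cue list, the five-word window, the `Mention` class, and the concept labels are all assumptions made for the example.

```python
import re
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical cue list; real clinical text needs a far richer set.
NEGATION_CUES = ("denies", "no", "not", "without", "negative for")


@dataclass
class Mention:
    concept: str   # normalized concept label
    negated: bool  # True if a negation cue precedes the mention


def detect_mentions(sentence: str, concept_terms: Dict[str, str],
                    window: int = 5) -> List[Mention]:
    """Find known concept terms in a sentence and mark negated ones.

    concept_terms maps a surface form (e.g. "fatigued") to a concept
    label (e.g. "fatigue"). A mention counts as negated when any cue
    occurs within the `window` words immediately before it.
    """
    text = sentence.lower()
    mentions = []
    for surface, concept in concept_terms.items():
        match = re.search(r"\b" + re.escape(surface) + r"\b", text)
        if match is None:
            continue
        preceding = re.findall(r"[a-z']+", text[:match.start()])[-window:]
        context = " ".join(preceding)
        negated = any(
            re.search(r"\b" + re.escape(cue) + r"\b", context)
            for cue in NEGATION_CUES
        )
        mentions.append(Mention(concept, negated))
    return mentions


terms = {"fatigued": "fatigue", "rash": "rash"}
print(detect_mentions("Patient denies feeling fatigued", terms))
print(detect_mentions("Patient reports feeling fatigued and has a rash", terms))
```

Fixed word windows miss long-range and double negations, and they cannot link a rash to the medication that caused it, which is why the relationship mapping and whole-record reasoning described above matter.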

“COVID-19 showed many gaps in clinical data and the grim impact they can have. Initially, COVID-19 symptoms and details weren’t put in the structured data, but often captured in the clinician notes, making it hard to detect the virus’ patterns quickly. Yet, COVID-19 isn’t the only once-rare disease or pandemic we’ll face as a society,” said Michael Lucas, principal machine learning engineer at Truveta. “Clinician notes capture missed medications, diagnoses, surgeries, symptoms, adverse events, and so much more. By adopting state of the art AI, Truveta makes critical insights available for research to find cures faster.”

To learn more and schedule a demo, visit truveta.com or contact us at info@truveta.com.

About Truveta

Truveta was formed and is governed by U.S. health systems with a shared vision of saving lives with data. Truveta now offers the world’s first health data and analytics solution to study patient care and outcomes. To learn more, please follow us on LinkedIn and visit truveta.com.

About Truveta’s Members

Truveta’s 28 members provide 16% of patient care in the United States in more than 20,000 clinics and 700 hospitals. De-identified data from this care is provided to Truveta daily. Truveta membership includes Providence, Advocate Health, Trinity Health, Tenet Healthcare, Northwell Health, AdventHealth, Baptist Health of Northeast Florida, Baylor Scott & White Health, Bon Secours Mercy Health, Centura Health, CommonSpirit Health, Hawaii Pacific Health, HealthPartners, Henry Ford Health System, HonorHealth, MedStar Health, Memorial Hermann Health System, MetroHealth, Novant Health, Ochsner Health, Premier Health, Saint Luke’s Health System, Sentara Healthcare, Texas Health Resources, TriHealth, UnityPoint Health, Virtua Health, and WellSpan Health.
