Clinical trials and registries have traditionally been the gold standard for high-quality healthcare data, offering complete longitudinal records across data domains in a clean, analysis-ready format. However, achieving this level of quality involves significant manual labor, and the resulting data are often poorly representative of the broader population.

Truveta aims to deliver high-quality data that represent a larger and more diverse population than clinical trials and registries do. Truveta Data includes complete EHR data for more than 100 million patients, collected daily from more than 20,000 clinics and 800 hospitals. These medical records are then linked with claims, SDOH, and mortality data, providing clean, analytics-ready real-world data.

Ingesting data from member health systems

The first step in delivering the highest-quality data for researchers is building processes to ingest data from member health systems. Members send raw patient data daily to a secure, cloud-based environment called a Truveta Embassy. These data are processed and continually assessed for quality, and Truveta’s direct connection with member health systems creates a high-velocity feedback loop for data quality improvement. This loop is strengthened because member health systems, in turn, use Truveta Data and Truveta Studio for research and analytics at their own institutions.

Cleaning the data for usability

Next, these data, which arrive in enormously heterogeneous formats and schemas, are transformed into a single data model called the Truveta Data Model (TDM). We refer to this process of data model unification as syntactic normalization. We receive many hundreds of disparate tables from health systems across EHR vendor schemas, and syntactic normalization requires meticulously aligning all of these data to the 40 tables of TDM.
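To make the idea of syntactic normalization concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not Truveta's implementation: the vendor names, source column names, and target column names are all hypothetical, since this article does not list the actual TDM tables or columns.

```python
# Hypothetical column mappings from two source schemas to one unified table.
# None of these names come from the Truveta Data Model; they are placeholders.
SOURCE_MAPPINGS = {
    "vendor_a": {
        "pat_id": "person_id",
        "dx_code": "condition_code",
        "dx_date": "recorded_date",
    },
    "vendor_b": {
        "PatientKey": "person_id",
        "DiagnosisCd": "condition_code",
        "DiagnosisDt": "recorded_date",
    },
}


def to_unified_row(source, row):
    """Rename a source row's columns to the unified schema.

    Raises KeyError if the row contains a column with no mapping, so that
    unmapped data are surfaced instead of being silently dropped.
    """
    mapping = SOURCE_MAPPINGS[source]
    unmapped = set(row) - set(mapping)
    if unmapped:
        raise KeyError(f"unmapped columns from {source}: {sorted(unmapped)}")
    return {mapping[col]: value for col, value in row.items()}


row_a = to_unified_row(
    "vendor_a", {"pat_id": "p1", "dx_code": "E11.9", "dx_date": "2024-01-05"}
)
row_b = to_unified_row(
    "vendor_b", {"PatientKey": "p2", "DiagnosisCd": "I10", "DiagnosisDt": "2024-02-11"}
)
```

Both rows end up with the same column names, which is the property that lets downstream steps treat data from different source schemas uniformly.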

Measuring data quality

Truveta has adopted a standardized approach to ensuring data quality, measured across four industry-standard categories:

Representativeness measures how the patient diversity of Truveta Data compares to the overall US population. We benchmark representativeness at the state level: a state meets the benchmark when we have data coverage for at least 10% of its population. This benchmark is set by the Centers for Medicare & Medicaid Services (CMS) under its qualified entity framework. As of publication, Truveta has data from 32 states that meet this benchmark.
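The state-level benchmark reduces to a simple ratio check. The sketch below assumes coverage is computed as covered patients divided by state population; the state labels and counts are made up for illustration, and the function name is hypothetical.

```python
def states_meeting_benchmark(covered, population, threshold=0.10):
    """Return the states whose coverage ratio is at least `threshold`.

    covered:    mapping of state -> number of patients with data
    population: mapping of state -> total state population
    """
    return sorted(
        state
        for state, patients in covered.items()
        if population.get(state, 0) > 0
        and patients / population[state] >= threshold
    )


# Made-up numbers for illustration only; these are not Truveta figures.
population = {"AA": 1_000_000, "BB": 5_000_000, "CC": 2_000_000}
covered = {"AA": 150_000, "BB": 400_000, "CC": 300_000}

meeting = states_meeting_benchmark(covered, population)
```

In this example, "AA" and "CC" each have 15% coverage and meet the benchmark, while "BB" has 8% coverage and does not.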

Completeness assesses the presence of expected data fields and values in the linked, longitudinal patient record, spanning semi-structured data from EHRs (medication administration, conditions, procedures, encounters) as well as unstructured data (notes, images, genomics). To ensure a consistent longitudinal view of a patient across multiple health systems, Truveta uses the Truveta token, which preserves privacy while enabling precise patient linkage. Beyond core clinical data domains (e.g., conditions, medications), Truveta measures and ensures record-level quality, feeding information on any data gaps back to member health systems so they can improve their data feeds to Truveta.
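One simple way to quantify completeness is the share of records that carry a non-empty value for each expected field. The sketch below assumes that definition; the field names and sample records are hypothetical, and the article does not state how Truveta computes its own completeness metrics.

```python
def field_completeness(records, expected_fields):
    """Return, per expected field, the fraction of records with a value.

    A value counts as missing when the key is absent, None, or an empty string.
    """
    total = len(records)
    if total == 0:
        return {field: 0.0 for field in expected_fields}
    return {
        field: sum(1 for r in records if r.get(field) not in (None, "")) / total
        for field in expected_fields
    }


# Hypothetical medication records with some gaps.
medication_records = [
    {"code": "A", "dose": "10 mg", "start_date": "2024-01-01"},
    {"code": "B", "dose": "5 mg", "start_date": None},
    {"code": "C", "dose": "", "start_date": "2024-03-09"},
    {"code": "D", "dose": "20 mg"},
]

completeness = field_completeness(medication_records, ["code", "dose", "start_date"])
```

Fields with low scores point to the kind of data gap that can be reported back to the contributing health system.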

Timeliness measures how quickly data is delivered to Truveta and made available for research. Truveta’s member health systems contribute daily data feeds, providing up-to-date encounter, condition, medication, and other data for immediate ingestion, normalization, deidentification, and provision to researchers in Truveta Studio.

Cleanliness measures whether the data are accurate and plausible, and thus usable for research analytics. Creating a foundation for clean data requires three distinct processes:

    • Semantic normalization, measuring how data is translated from source strings to target ontologies (e.g., SNOMED, LOINC), ensuring data are standardized and readily usable for research.
    • Unit of measure normalization, standardizing the units of all measurements, such as patient weights.
    • Clinical validity, measuring whether values, disease prevalence, and other metrics meet clinical expectations.
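The three processes above can be sketched together for a single measurement type, patient weight. This is a toy illustration, not Truveta's pipeline: the term-to-concept table uses a placeholder identifier instead of a real ontology code, and the plausibility range is an assumption chosen for the example.

```python
# Semantic normalization: map source strings to one target concept.
# "WEIGHT_CONCEPT" is a placeholder, not a real SNOMED or LOINC code.
TERM_TO_CONCEPT = {
    "body weight": "WEIGHT_CONCEPT",
    "weight": "WEIGHT_CONCEPT",
    "wt": "WEIGHT_CONCEPT",
}

# Unit of measure normalization: conversion factors to kilograms.
TO_KG = {"kg": 1.0, "g": 0.001, "lb": 0.45359237}

# Clinical validity: an assumed plausible range for human body weight in kg.
PLAUSIBLE_KG = (0.2, 700.0)


def normalize_weight(source_term, value, unit):
    """Return a normalized weight record, or None if the term or unit is unknown."""
    concept = TERM_TO_CONCEPT.get(source_term.strip().lower())
    factor = TO_KG.get(unit.strip().lower())
    if concept is None or factor is None:
        return None
    kg = value * factor
    return {
        "concept": concept,
        "value_kg": round(kg, 2),
        "plausible": PLAUSIBLE_KG[0] <= kg <= PLAUSIBLE_KG[1],
    }


ok = normalize_weight("Wt", 154, "lb")
suspect = normalize_weight("weight", 70000, "kg")
unknown = normalize_weight("height", 170, "cm")
```

Here `ok` is converted to kilograms and flagged plausible, `suspect` is flagged implausible (for example, a value recorded in grams but labeled as kilograms), and `unknown` returns `None` because neither the term nor the unit is mapped.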

To support transparency on data quality, every patient population being studied includes a “population datasheet” that reports each of these metrics.

Sample population datasheet for a GLP-1 comparative effectiveness study
Screenshot of Truveta Studio showing a population datasheet for patients taking GLP-1 medications. Image shows a map of coverage by state, as well as a count of patient lives and total encounters, diagnoses, and medication orders.
Screenshot from Truveta Studio showing demographic details for a population of GLP-1 medication users, including bar graphs comparing Truveta Data to the US Census along the following dimensions: race, age, ethnicity, sex.

Commitment to quality enables data that can be used in regulatory submissions

In addition to stringent data quality standards, Truveta holds itself to the most rigorous standards for building quality systems – including those to support regulatory submissions. These standards include:

      • ISO 9001 certification: We are developing our quality management system (QMS) in line with ISO 9001. This certification will attest that Truveta’s system for processing data end-to-end (from receipt of data from health system to delivery of the data in Truveta Studio) meets the highest bar for reliability.
      • FDA audit examination: We are investing in examinations to assure a submission with Truveta Data contains all needed elements to comply with FDA recommendations, as published through guidance documentation by the FDA’s Center for Biologics Evaluation and Research (CBER) and Center for Drug Evaluation and Research (CDER).

A continually evolving system

As with every system in Truveta, continuous improvement of our data quality process is a top priority. Through ongoing feedback to member health systems, we encourage iterative improvement in the daily EHR data sent to Truveta. For more in-depth insight into our data quality process, download our whitepaper.