Truveta recently won the SXSW Innovation Award for Artificial Intelligence, which recognized the groundbreaking work we have been doing to apply modern generative AI technology to healthcare data and analytics. This comes just a little under a year after our announcement sharing more details about the Truveta Language Model and its use in normalization and mapping of unstructured data in electronic health records (EHR) for enabling cutting-edge clinical research. These achievements would not have been possible without the incredible efforts of the entire Truveta team who have worked tirelessly to further our mission of Saving Lives with Data.

This has been an exciting and eventful journey that has given us many crucial insights and taught us how to avoid potential pitfalls. In my conversations with industry colleagues, I am often asked to share our learnings. So, I thought it would be worthwhile to write a series of articles capturing the arc of our AI journey.

Six challenges

Our AI journey is the story of six fundamental engineering challenges and our approach to solving them. In subsequent articles, I hope to delve deeply into each area and give you a more detailed account of how Truveta has applied AI to deliver a range of solutions for customers. We had to innovate and think outside the box. We had to bring the mindset of a challenger rather than an incumbent because most existing solutions and frameworks did not sufficiently serve customers’ needs. Not everything we tried worked, and there were many twists, turns, and cul-de-sacs. Fortunately, we eventually converged on a strategic approach that we believe will be the foundation for our future growth. Failing fast and learning fast is the only reliable way to innovate, and we will continue to be loyal to that creed.

Creating data gravity

How to generate and sustain data gravity in a consortium of healthcare data

Our mission is Saving Lives with Data. We wanted to create an AI-driven data and analytics solution for life science and pharmaceutical enterprises that will further humankind’s most profound aspiration: enabling high-quality and timely healthcare for all. We aimed to serve the needs of all people, including underserved communities. We wanted to gather data from diverse health systems across the country and make it available for research. When we brought this data together, we needed to aim not just for quantity, but also for completeness and representativeness.

Our solution to getting broad-based, representative healthcare data was to create a trusted network via a consortium of healthcare members. Today, Truveta is a growing collective of more than 30 health systems, providing over 18% of the daily clinical care across the US. Each member sends us data that is potentially in different schemas and is, in large part, unstructured. Our core innovation was bringing this diverse data into a consistent schema called the Truveta Data Model (TDM), through generative AI-driven transformations, semantic and syntactic normalization, Protected Health Information (PHI) detection, tokenization, and patient matching. This enabled use of the data at scale by healthcare and life science researchers across the population of the whole country, which created an ongoing incentive for bringing in more data.
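To make the idea of schema harmonization concrete, here is a minimal sketch of mapping records from two member systems into one shared schema. The member names, field names, and mapping approach are entirely hypothetical, not the actual TDM; real harmonization involves far richer, AI-driven transforms.

```python
# Hypothetical per-member field mappings into a shared schema.
# Real-world normalization also handles value semantics, units,
# terminologies, and unstructured text, not just field renames.
MEMBER_MAPPINGS = {
    "member_a": {"pt_id": "patient_id", "dob": "birth_date", "dx": "diagnosis_code"},
    "member_b": {"PatientID": "patient_id", "BirthDate": "birth_date", "ICD10": "diagnosis_code"},
}

def to_common_schema(member: str, record: dict) -> dict:
    """Rename a member-specific record's fields into the shared schema,
    dropping fields the mapping does not know about."""
    mapping = MEMBER_MAPPINGS[member]
    return {mapping[k]: v for k, v in record.items() if k in mapping}

a = to_common_schema("member_a", {"pt_id": "123", "dob": "1980-01-01", "dx": "E11.9"})
b = to_common_schema("member_b", {"PatientID": "456", "ICD10": "I10"})
print(a["diagnosis_code"])  # E11.9
```

Once every member's records share one schema, downstream analytics can be written once against that schema rather than once per member.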

Ensuring privacy and compliance

How to make sensitive healthcare data, which may include protected health information, available for AI models for clinical and life science research in a safe and compliant manner

Hospital systems provide us with patient healthcare data under strict requirements of compliance with HIPAA privacy and secure use policies. We want to protect the privacy of the patients as well as the business interests of our members. We ensure that the data that is provided to researchers is fully de-identified and stripped of business-sensitive information through the application of advanced AI-driven redaction and de-identification algorithms. These algorithms are applied across multiple modalities of data, such as tabular EHR data, clinical notes, and images. Moreover, we continuously monitor and test the quality of these algorithms as they operate at scale.
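As a toy illustration of the redaction idea, the sketch below masks a few obvious identifier formats with pattern matching. Production de-identification relies on trained models plus expert validation; the patterns and replacement tokens here are invented for this example.

```python
import re

# Toy pattern-based PHI masking. Real de-identification uses trained
# NER models and human review; this only illustrates the concept.
PATTERNS = [
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),    # e.g. 123-45-6789
    (re.compile(r"\b\d{2}/\d{2}/\d{4}\b"), "[DATE]"),   # e.g. 03/14/2024
    (re.compile(r"\b\d{3}-\d{3}-\d{4}\b"), "[PHONE]"),  # e.g. 555-867-5309
]

def redact(text: str) -> str:
    """Replace each matched identifier with a category token."""
    for pattern, token in PATTERNS:
        text = pattern.sub(token, text)
    return text

note = "Seen on 03/14/2024; callback 555-867-5309; SSN 123-45-6789."
print(redact(note))  # Seen on [DATE]; callback [PHONE]; SSN [SSN].
```

Even this toy version shows why continuous quality monitoring matters: pattern lists miss free-text names and uncommon formats, which is exactly where model-based detection and ongoing testing earn their keep.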

Truveta’s de-identification process has been certified by external experts as meeting HIPAA Privacy Rule standards. Along with state-of-the-art de-identification, we have additional security and privacy controls, protocols, and processes in place to store and manage PHI, earning Truveta SOC 2 Type 2 attestation and ISO 27001 certification with ISO 27018 and ISO 27701 extensions.

Delivering accuracy and explainability

How to ensure that the AI technologies used in our platform have requisite accuracy and explainability while assuring fairness and avoidance of bias

Healthcare and life science customers would like to rely on our normalized real-world data (RWD) and analytics platform for their clinical research. They want to publish their research in reputable journals and, in some cases, submit it to regulatory authorities. To be able to do these things with confidence, they require guarantees of accuracy and explainability of the AI models and algorithms used across our data and analytics platform.

We provide these guarantees by purposefully investing in model quality assessment and continuous model upkeep, through a combination of automated testing and expert human supervision. For all our models, whether for semantic normalization, concept extraction from clinical notes, personally identifiable information (PII) redaction from images, or patient matching, we measure and publish model quality reports and track them over time. We continuously tune and update our models on the principles of reinforcement learning from human feedback (RLHF). We have invested deeply in a Quality Management System (QMS, per ISO 9001) that applies to the data processing system as well as the AI models.
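To illustrate what tracking quality reports over time can look like, here is a minimal sketch that flags a release whose metric regressed beyond a tolerance. The model name, metric, versions, and threshold are all hypothetical, not Truveta's actual QMS tooling.

```python
from dataclasses import dataclass

@dataclass
class QualityReport:
    """One published quality measurement for one model release."""
    model: str
    version: str
    f1: float  # illustrative metric; real reports track many metrics

def regressed(history: list["QualityReport"], tolerance: float = 0.02) -> bool:
    """Flag the newest report if its F1 dropped more than `tolerance`
    below the best of all prior releases."""
    if len(history) < 2:
        return False
    best_prior = max(r.f1 for r in history[:-1])
    return history[-1].f1 < best_prior - tolerance

reports = [
    QualityReport("concept-extraction", "v1", 0.91),
    QualityReport("concept-extraction", "v2", 0.93),
    QualityReport("concept-extraction", "v3", 0.88),
]
print(regressed(reports))  # True: v3 fell more than 0.02 below v2
```

A check like this, run automatically on every release, turns "track quality over time" from a report on a shelf into a gate that blocks a degraded model from shipping.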

Earning trust and adoption by the research community

How to drive trust and adoption of these advanced AI technologies with our customers

Generating trust in our data and AI is a precursor to driving high levels of adoption by customers. We have a mechanism to create a “fit for purpose” report about our AI models for our customers that assesses the utility of our data for their specific research requirements. Another major pillar of trust is providing run-time evidence of the valid operation of our platform, including its AI models, which can be used in regulatory-grade submissions.

We have an internal Truveta Research team that uses Truveta Data and Truveta Studio to deliver scientifically rigorous research, publishing frequently on our website and in reputable journals. Their work (studies, code, and data definitions) is available in Truveta Library for all customers to use. We also have a dedicated customer success team that helps customers understand and use our capabilities.

To drive adoption of Truveta Studio, and especially some of its advanced user capabilities, we recognized that we needed to invest in improving usability. Here, generative AI will be a big asset to help customers. We also invest heavily in customer education and enablement materials.

Scaling AI models for growth

How to scale AI technologies for many researchers, many studies, large populations, and increasingly complex analyses, while controlling costs and ensuring stability

When dealing with large models like LLMs and large foundation models, it is imperative that we scale in a smart way. Pretraining an LLM demands enormous time, money, and energy, so we undertake it judiciously. We make strategic decisions about what knowledge to retain in the model and what can be forgotten during pretraining. Where possible, we use prompt engineering, model fine-tuning, and retrieval-augmented generation (RAG), which typically have a high ROI.
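The RAG idea can be sketched minimally as "retrieve relevant context, then splice it into the prompt." In this toy version, token overlap stands in for embedding search and the LLM call is omitted; the corpus and query are invented for illustration.

```python
# Minimal RAG sketch: score documents by token overlap with the query,
# keep the top k, and build a context-augmented prompt. Production
# systems use embedding-based retrieval and send the prompt to an LLM.

def score(query: str, doc: str) -> int:
    """Crude relevance: count of shared lowercase tokens."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def build_prompt(query: str, corpus: list[str], k: int = 2) -> str:
    top = sorted(corpus, key=lambda d: score(query, d), reverse=True)[:k]
    context = "\n".join(f"- {d}" for d in top)
    return f"Context:\n{context}\n\nQuestion: {query}"

corpus = [
    "Metformin is a first-line therapy for type 2 diabetes.",
    "ICD-10 code E11.9 denotes type 2 diabetes without complications.",
    "HIPAA governs the use of protected health information.",
]
prompt = build_prompt("What code denotes type 2 diabetes?", corpus)
```

The ROI argument is visible even here: the base model never has to memorize the corpus, because the relevant facts are fetched at query time, so updating knowledge means updating documents rather than retraining.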

Another important innovation is the use of a network of specialized agents where each agent is of modest size and is trained to do a specific task very well. By appropriately chaining and orchestrating such agents for solving a complex task, we can achieve high-quality results without having to train large models for every complex task.
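The chaining idea can be sketched with plain functions standing in for specialized agents; each does one small task, and an orchestrator composes them. The tasks, regex, and names below are illustrative only; real agents would each wrap a tuned model.

```python
import re

def extract_concepts(note: str) -> list[str]:
    """Toy 'extraction agent': pull strings shaped like ICD-10 codes."""
    return re.findall(r"\b[A-Z]\d{2}(?:\.\d+)?\b", note)

def normalize(codes: list[str]) -> list[str]:
    """Toy 'normalization agent': deduplicate and order the codes."""
    return sorted(set(codes))

def summarize(codes: list[str]) -> str:
    """Toy 'summarization agent': produce a one-line report."""
    return f"{len(codes)} distinct diagnosis code(s): {', '.join(codes)}"

def orchestrate(note: str) -> str:
    """Chain the agents: extract -> normalize -> summarize."""
    return summarize(normalize(extract_concepts(note)))

result = orchestrate("History of I10 and E11.9; E11.9 managed with metformin.")
print(result)  # 2 distinct diagnosis code(s): E11.9, I10
```

Because each stage has a narrow contract, each can be tested, measured, and upgraded independently, which is precisely what makes a network of modest agents cheaper to maintain than one monolithic model per task.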

Supporting adjacencies

How to rapidly support new AI scenarios and adjacencies through reuse and transfer learning

Investing in a large-scale AI platform is a costly and time-consuming proposition. Therefore, we want to ensure we build a system that is modular and generic enough that it can serve newly emerging adjacent applications. In healthcare, there are specialized domains such as safety and effectiveness, drug discovery, label expansion, precision medicine, and design of clinical trials. All these special domains can hugely benefit from a core set of common knowledge and AI assets, and we want to be able to cross-leverage those assets effectively.

Similarly, while Truveta can be used for clinical research with real-world evidence, it can also be leveraged as a “registry” or feed into a registry, thanks to the power of the TDM. It could also be used to optimize healthcare delivery operations and quality of care.


The challenges described above are roughly ordered by the priority with which we had to tackle them. For example, solving the question of compliant use of data was obviously something we needed to tackle first: without big data there can be no AI! Similarly, driving trust and adoption of core scenarios had to precede plans for scale or application to adjacencies.

In subsequent articles, I will go deeper into each of the areas above, in terms of technology, implementation, and future evolution. Please stay tuned.

Of course, our journey is not over and there is much to learn. Nevertheless, we feel that our industry peers, who are themselves at various stages of this journey, may benefit from our experience. Conversely, we would love to hear about your experience and benefit from your insights.

Join the conversation on LinkedIn.