Generating synthetic clinical notes to improve biomedical AI

Jun 12, 2026

Authors: Cheng Cao (Truveta, Inc., Bellevue, WA); Ying Wei (Truveta, Inc., Bellevue, WA); Qi Li (Iowa State University, Ames, IA); Jay Pillai (Truveta, Inc., Bellevue, WA)

Extracting clinical concepts from notes is essential for making electronic health record (EHR) data useful in research. But building models that do this well depends on high-quality annotated examples that are hard to produce at scale. Expert-labeled data can be expensive, limited, or unavailable, especially for rare diseases or highly specific clinical concepts.

In the third paper in Truveta’s series on using generative AI to improve encoder-based models, researchers from Truveta and Iowa State University tested whether AI-generated clinical text could help fill that gap. Earlier studies in the series explored how large language models (LLMs) can strengthen training data when labeled notes are limited or when many unlabeled notes are available. This study moves one step earlier in the data pipeline: What if researchers do not have enough human-generated notes to train a strong named entity recognition model?

The study, “Leveraging SFT and RL for Fine-Tuning LLMs to Generate Supplementary Annotated Data for Entity Recognition,” introduces a framework that uses supervised fine-tuning (SFT) and reinforcement learning (RL) to train LLMs to generate realistic clinical and biomedical text with structured labels. Those synthetic labeled examples were then used to improve BERT-based named entity recognition (NER) models, which identify clinical concepts such as diseases, symptoms, and other attributes in text.

Table titled “Three clinical note challenges, three AI solutions” showing three scenarios where generative AI can help improve biomedical text models: expanding labeled notes with limited variety, annotating unlabeled notes when labels are scarce, and generating synthetic notes and annotations when almost no labeled notes are available. Each scenario is paired with the related problem, Truveta solution, and reference publication.

Building on a three-part AI research series

This paper completes a series from Truveta engineers on how generative AI can help get reliable, clinically accurate signals from notes when labeled data are scarce or inconsistent.

The first paper focused on situations where researchers have human-labeled notes but limited variety, using LLMs to generate realistic alternatives and strengthen models. The second paper focused on situations where researchers have many notes but few labels, using LLMs to generate high-quality annotations at scale. This third paper focuses on the most constrained scenario, where researchers have very few labeled notes.

In this study, the team fine-tuned LLMs with supervised fine-tuning (SFT) and reinforcement learning (RL) to generate new synthetic notes and annotations. The LLM is not used as the final extraction model. Instead, it creates additional training examples that help smaller BERT-based models learn to identify clinical concepts in text.
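To make the idea of "notes and annotations" as training examples concrete, here is a minimal sketch of how a generated note with labeled spans could be turned into token-level BIO tags, a common input format for BERT-based NER models. The output format (text plus character-offset spans), the SYMPTOM label, the helper names, and the whitespace tokenization are all assumptions for illustration; the study's actual data format and tokenizer are not described here.

```python
import re

def find_span(text, phrase, label):
    """Hypothetical helper: (start, end, label) of the first match, end exclusive."""
    start = text.index(phrase)
    return (start, start + len(phrase), label)

def to_bio(text, spans):
    """Convert character-offset entity spans into token-level BIO tags.

    Tokens are split on whitespace for simplicity; a real pipeline would
    align the labels to the model's own subword tokenizer instead.
    """
    tokens, tags = [], []
    for match in re.finditer(r"\S+", text):
        tok_start, tok_end = match.start(), match.end()
        tag = "O"  # default: token is outside any entity
        for start, end, label in spans:
            # A token belongs to an entity if the two character ranges overlap.
            if tok_start < end and tok_end > start:
                prefix = "B-" if tok_start <= start else "I-"
                tag = prefix + label
                break
        tokens.append(match.group())
        tags.append(tag)
    return tokens, tags

# Illustrative example only, not taken from the study's data.
synthetic_note = "Patient reports chest pain and fever."
spans = [
    find_span(synthetic_note, "chest pain", "SYMPTOM"),
    find_span(synthetic_note, "fever", "SYMPTOM"),
]
tokens, tags = to_bio(synthetic_note, spans)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Tagged examples like these could then be mixed with human-labeled examples when training the smaller extraction model.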

Synthetic labeled data improved clinical concept extraction

Researchers tested whether synthetic clinical and biomedical text, generated by LLMs and paired with structured labels, could improve models that identify clinical concepts in notes.

They evaluated the approach on two datasets: a public benchmark of 793 biomedical abstracts and a proprietary dataset of 2,306 de-identified clinical notes from Truveta Data. In both datasets, adding synthetic labeled examples helped BERT-based models identify clinical concepts more accurately than training on human-labeled data alone.

The strongest results came from synthetic examples generated with supervised fine-tuning plus reinforcement learning. On the public biomedical dataset, the F1 score improved from 88.45 to 89.54 on a 100-point scale, a gain of 1.09 points. F1 is a standard measure that combines precision (avoiding incorrect concepts) and recall (finding the relevant ones). On the Truveta clinical notes dataset, the F1 score improved from 85.52 to 86.67, a gain of 1.15 points.
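For readers less familiar with F1, the sketch below shows how the score is typically computed for entity recognition: as the harmonic mean of precision and recall over predicted entities. It assumes strict matching, where a prediction counts only if its span and label both match a gold entity exactly. The function name and the example entities are hypothetical, and the study's exact matching criteria may differ.

```python
def entity_f1(gold, predicted):
    """Return (precision, recall, f1) for two collections of entities.

    Each entity is a hashable tuple such as (start, end, label).
    Assumes strict matching: span and label must both agree.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Illustrative entities only, not from the study.
gold = [(0, 5, "DISEASE"), (10, 18, "SYMPTOM"), (25, 30, "SYMPTOM"), (40, 47, "DISEASE")]
predicted = [(0, 5, "DISEASE"), (10, 18, "SYMPTOM"), (32, 36, "SYMPTOM")]

precision, recall, f1 = entity_f1(gold, predicted)
# Multiply by 100 to report on a 100-point scale.
print(f"precision={100 * precision:.2f} recall={100 * recall:.2f} f1={100 * f1:.2f}")
```

In this toy case the model finds two of four gold entities and makes one incorrect prediction, so missed concepts and spurious ones both pull the score down.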

The findings suggest that the LLM’s most useful role was not replacing the extraction model. Instead, it generated realistic notes and labels that helped smaller, more efficient models learn to identify clinical concepts in text.

Synthetic notes appeared realistic enough to support model training

The researchers also tested whether the synthetic clinical notes appeared realistic to expert reviewers. In a blinded review of 20 real and 20 synthetic notes, experienced medical text annotators gave nearly identical average realism scores to synthetic and real notes, 3.66 vs. 3.64 on a five-point scale, with no statistically significant difference between groups.

What synthetic notes could enable for clinical AI

Clinical notes contain critical details including symptoms, disease progression, treatment rationale, timing, negation, and clinical context. Named entity recognition models can help make that information usable for research, but only if they have enough relevant examples to learn from.

This study suggests that synthetic labeled data can supplement human-labeled data when examples are scarce. That could be useful for low-resource biomedical natural language processing tasks, including rare disease research or specialized clinical concept extraction.

The paper does not show that synthetic notes can replace real-world patient data. Instead, it shows that synthetic notes and annotations may help train models that extract information from real-world notes more effectively.

Read the full study

“Leveraging SFT and RL for Fine-Tuning LLMs to Generate Supplementary Annotated Data for Entity Recognition,” Proceedings of the 19th International Joint Conference on Biomedical Engineering Systems and Technologies.