Predicting early-onset colorectal cancer with large language models

by Truveta staff | Apr 21, 2026

Authors: Wilson Lau (Truveta, Inc, Bellevue, WA); Youngwon Kim (Truveta, Inc, Bellevue, WA); Sravanthi Parasa (Swedish Medical Center, Seattle, WA); Md Enamul Haque (Truveta, Inc, Bellevue, WA); Anand Oka (Truveta, Inc, Bellevue, WA); Jay Nanduri (Truveta, Inc, Bellevue, WA)

Key points

In this study, researchers from Truveta applied machine learning and large language models to predict early-onset colorectal cancer (ages 18–44) using Truveta Data.
A fine-tuned GPT-4o model achieved 73% sensitivity and 91% specificity, outperforming traditional machine learning approaches in balancing detection and false positives.
The model used conditions, labs, and observations from up to six months prior to diagnosis to identify patients at elevated risk.
Findings suggest LLM-based approaches may help identify higher-risk patients earlier and support more targeted screening strategies in younger populations.

Abstract

The incidence rate of early-onset colorectal cancer (EoCRC, age < 45) has increased every year, yet this population is younger than the screening age recommended by national guidelines.

In this paper, researchers from Truveta applied 10 different machine learning models to predict EoCRC and compared their performance with advanced large language models (LLMs), using patient conditions, lab results, and observations from the 6 months of the patient journey prior to the CRC diagnosis. They retrospectively identified 1,953 CRC patients from multiple health systems across the United States.
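
To make the 6-month window concrete, here is a minimal sketch of selecting a patient's records that fall before a diagnosis date. It is an illustration only, not the study's pipeline: the tuple layout, the function name `events_in_lookback`, the placeholder event labels, and the approximation of 6 months as 183 days are all assumptions.

```python
from datetime import date, timedelta

# Assumption: "6 months" is approximated as 183 days for illustration.
LOOKBACK_DAYS = 183

def events_in_lookback(events, index_date, lookback_days=LOOKBACK_DAYS):
    """Keep events dated within the lookback window strictly before index_date.

    Each event is a hypothetical (event_date, kind, label) tuple.
    """
    start = index_date - timedelta(days=lookback_days)
    return [e for e in events if start <= e[0] < index_date]

# Hypothetical records for one patient (placeholder labels, not study data).
events = [
    (date(2023, 1, 10), "condition", "condition_A"),     # too early, excluded
    (date(2023, 5, 2), "lab", "lab_B"),                  # inside the window
    (date(2023, 8, 20), "observation", "observation_C"), # inside the window
    (date(2023, 9, 1), "condition", "condition_D"),      # on the index date, excluded
]
index_date = date(2023, 9, 1)  # hypothetical diagnosis date

window = events_in_lookback(events, index_date)
for event_date, kind, label in window:
    print(event_date.isoformat(), kind, label)
```

Excluding records dated on the diagnosis day itself is a design choice in this sketch, made so that the selected features precede the outcome.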

The results demonstrated that the fine-tuned LLM achieved an average of 73% sensitivity and 91% specificity.
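
For readers less familiar with these metrics, sensitivity is the share of true cases the model flags, and specificity is the share of non-cases it correctly leaves unflagged. The sketch below computes both from binary labels. The function name and the toy labels are illustrative assumptions and are not data from the study.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels (1 = case, 0 = non-case)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # cases flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # cases missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-cases left unflagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # non-cases flagged
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity

# Toy labels for illustration only (not study data).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

sens, spec = sensitivity_specificity(y_true, y_pred)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```

In a screening setting the two trade off against each other: flagging more patients tends to raise sensitivity while lowering specificity, which is why the article reports both.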

Read the full study

Predicting Early-Onset Colorectal Cancer with Large Language Models, AMIA Annual Symposium Proceedings
