Predicting early-onset colorectal cancer with large language models

Apr 21, 2026

Authors: Wilson Lau (Truveta, Inc, Bellevue, WA); Youngwon Kim (Truveta, Inc, Bellevue, WA); Sravanthi Parasa (Swedish Medical Center, Seattle, WA); Md Enamul Haque (Truveta, Inc, Bellevue, WA); Anand Oka (Truveta, Inc, Bellevue, WA); Jay Nanduri (Truveta, Inc, Bellevue, WA)

Key points

  • In this study, researchers from Truveta applied machine learning and large language models to predict early-onset colorectal cancer (ages 18–44) using Truveta Data.
  • A fine-tuned GPT-4o model achieved 73% sensitivity and 91% specificity, outperforming traditional machine learning approaches in balancing detection and false positives.
  • The model used conditions, labs, and observations from up to six months prior to diagnosis to identify patients at elevated risk.
  • Findings suggest LLM-based approaches may help identify higher-risk patients earlier and support more targeted screening strategies in younger populations.

Abstract

The incidence of early-onset colorectal cancer (EoCRC, age < 45) has been rising year over year, yet these patients are younger than the age at which national guidelines recommend starting cancer screening.

In this paper, researchers from Truveta applied 10 machine learning models to predict EoCRC and compared their performance with advanced large language models (LLMs). The inputs were patient conditions, lab results, and observations recorded in the 6 months before the CRC diagnosis. The researchers retrospectively identified 1,953 CRC patients from multiple health systems across the United States.
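To make the six-month lookback concrete, here is a minimal sketch of selecting a patient's records that fall in the window before diagnosis. It is not the study's pipeline: the function name `events_in_lookback`, the record layout, the 183-day approximation of six months, and the sample records are all assumptions made up for illustration.

```python
from datetime import date, timedelta

# Assumption: "6 months" is approximated as 183 days. The study's exact
# window definition is not stated in this summary.
LOOKBACK = timedelta(days=183)


def events_in_lookback(events, index_date, lookback=LOOKBACK):
    """Keep events dated in [index_date - lookback, index_date).

    The index date itself is excluded so the diagnosis record cannot
    leak into the model's inputs.
    """
    start = index_date - lookback
    return [e for e in events if start <= e["date"] < index_date]


# Made-up records for a hypothetical patient (illustration only).
events = [
    {"date": date(2023, 1, 10), "kind": "condition", "text": "iron deficiency anemia"},
    {"date": date(2023, 5, 2), "kind": "lab", "text": "hemoglobin low"},
    {"date": date(2023, 8, 20), "kind": "observation", "text": "rectal bleeding"},
    {"date": date(2023, 9, 1), "kind": "condition", "text": "colorectal cancer"},
]
index_date = date(2023, 9, 1)  # hypothetical diagnosis date

selected = events_in_lookback(events, index_date)
for e in selected:
    print(e["date"], e["kind"], e["text"])
```

In this toy example the January record falls outside the window and the diagnosis record is excluded, so only the May lab result and the August observation remain as inputs.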

The results demonstrated that the fine-tuned LLM achieved an average of 73% sensitivity and 91% specificity.
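For readers less familiar with these two metrics, the sketch below shows how sensitivity and specificity are computed from binary labels and predictions. The function name and the toy labels are assumptions for illustration; they are not the study's data and do not reproduce its 73% and 91% figures.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels, where 1 = CRC.

    sensitivity = TP / (TP + FN): share of true cases that were flagged
    specificity = TN / (TN + FP): share of non-cases correctly not flagged
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative label")
    return tp / (tp + fn), tn / (tn + fp)


# Toy labels (made up): 4 patients with CRC, 6 without.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

sens, spec = sensitivity_specificity(y_true, y_pred)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```

High sensitivity means few cancers are missed, while high specificity means few healthy patients are flagged unnecessarily; a screening aid generally needs both to be useful.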

Figure: "Explanation from LLM on predicting the outcome of a CRC patient." A text-heavy example of an LLM-generated explanation for colorectal cancer risk. The slide lists patient conditions, lab results, and observations, then compares them with guideline-based symptoms and high-risk factors. It concludes that the patient shows multiple colorectal cancer risk signals, gives an answer of "Yes," and assigns a 75% probability score.

Read the full study

Predicting Early-Onset Colorectal Cancer with Large Language Models, AMIA Annual Symposium Proceedings
