
ISPOR 2026: Zero-shot chain-of-agents framework predicts one-year lung cancer risk directly from longitudinal EHR

May 18, 2026

Authors: Ehsan Alipour, MD, PhD; Wilson Lau, PhD; Youngwon Kim, PhD; Sihang Zeng; Anand Oka, PhD; Jay Nanduri, MBA, MS (all Truveta, Inc, Bellevue, WA)

  • TrajOnco, a zero-shot chain-of-agents (CoA) framework, predicted one-year lung cancer risk directly from raw longitudinal electronic health records (EHR), achieving an AUROC of 0.871 (95% CI: 0.855–0.885) in a real-world cohort of 500 cases and 125,000 controls.
  • Performance was comparable to or modestly lower than trained machine learning models such as XGBoost and logistic regression, without requiring feature engineering, data cleaning, or task-specific model training.
  • The framework produced temporally coherent, evidence-linked clinical reasoning at both the patient and population level, supporting interpretable early detection of lung cancer.

This report builds on our abstract presented at ISPOR 2026, titled "Zero-shot lung cancer risk prediction from longitudinal electronic health records with chain-of-agents framework," as part of the TrajOnco project.

Lung cancer remains the leading cause of cancer-related death in the United States, and early identification of high-risk individuals is critical to improving outcomes through timely screening. Traditional machine learning (ML) models for risk prediction typically depend on substantial data preprocessing, manual feature engineering, and task-specific training. These steps are resource intensive, limit scalability across health systems, and often produce predictions that are difficult to interpret in a clinical context.

Large language models (LLMs) offer a complementary approach. By reasoning over heterogeneous clinical text and structured events directly, an LLM-based agent system can in principle estimate risk from raw longitudinal EHR without any model training. In this work, we evaluated whether TrajOnco, a zero-shot chain-of-agents framework with long-term memory, could predict one-year lung cancer risk directly from raw longitudinal EHR and produce clinically meaningful reasoning.

Methods

Using a subset of Truveta Data (de-identified EHR including more than 130 million patient journeys across leading US health systems), we identified lung cancer cases using clinician-curated diagnostic codes. We sampled 500 lung cancer cases and 125,000 randomly selected controls, a ratio that reflects the approximate cumulative incidence of lung cancer in the source population. For each patient, all EHR history prior to one year before diagnosis (or before the matched index date for controls) was included, with a two-month wash-out window applied to limit contamination from diagnostic work-up.
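The history window described above can be illustrated with a minimal sketch. It assumes one reading of the design, in which the two-month wash-out extends the exclusion period beyond the one-year prediction horizon; the text above does not spell out exactly how the wash-out was applied. The day counts, function names, and event fields are illustrative assumptions, not the study's actual code.

```python
from datetime import date, timedelta

# Assumed durations: the text says "one year" and "two-month" without day counts.
PREDICTION_HORIZON = timedelta(days=365)
WASHOUT = timedelta(days=60)


def history_cutoff(index_date: date) -> date:
    """Latest date admitted into the model input (assumed interpretation)."""
    return index_date - PREDICTION_HORIZON - WASHOUT


def select_history(events, index_date):
    """Keep events strictly before the cutoff, in chronological order."""
    cutoff = history_cutoff(index_date)
    kept = [e for e in events if e["date"] < cutoff]
    return sorted(kept, key=lambda e: e["date"])


# Hypothetical patient: diagnosis (or matched index date) on 2020-06-01.
events = [
    {"date": date(2020, 1, 15), "text": "Chest CT"},            # inside horizon
    {"date": date(2019, 5, 1), "text": "Cough"},                # inside wash-out
    {"date": date(2018, 1, 10), "text": "Tobacco dependence"},  # retained
]
history = select_history(events, date(2020, 6, 1))
```

Only the 2018 event survives the filter in this example, because the other two fall within the one-year horizon or the assumed wash-out period.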

Each patient trajectory was serialized into chronological XML chunks containing conditions, medications, procedures, laboratory results, and observations. The TrajOnco framework processed these chunks sequentially using a series of worker agents that communicated through a long-term memory module. A manager agent then synthesized the worker outputs into a patient-level summary, a 1-to-10 risk score, and an evidence-linked rationale that referenced specific events in the patient record.
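The flow of data through such a pipeline can be sketched as follows. This is only a structural illustration: the XML tag names, the chunk size, and the `Memory` class are hypothetical, and the `worker` and `manager` functions are deterministic stand-ins for LLM calls (a toy keyword match and a toy scoring rule), not the TrajOnco prompts or logic.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class Memory:
    """Long-term memory passed from worker to worker (hypothetical structure)."""
    notes: list = field(default_factory=list)


def to_xml_chunks(events, chunk_size=2):
    """Serialize events chronologically into XML strings of chunk_size events."""
    ordered = sorted(events, key=lambda e: e["date"])  # ISO dates sort as text
    chunks = []
    for start in range(0, len(ordered), chunk_size):
        root = ET.Element("chunk")
        for e in ordered[start:start + chunk_size]:
            node = ET.SubElement(root, e["category"], date=e["date"])
            node.text = e["text"]
        chunks.append(ET.tostring(root, encoding="unicode"))
    return chunks


def worker(chunk_xml, memory):
    """Stand-in for an LLM worker: retains events matching toy keywords."""
    for node in ET.fromstring(chunk_xml):
        if any(k in node.text.lower() for k in ("tobacco", "copd")):
            memory.notes.append((node.get("date"), node.text))
    return memory


def manager(memory):
    """Stand-in for the LLM manager: toy 1-to-10 score plus dated rationale."""
    score = min(10, 1 + 3 * len(memory.notes))  # illustrative rule only
    rationale = [f"{d}: {t}" for d, t in memory.notes]
    return {"risk_score": score, "rationale": rationale}


# Hypothetical patient events, deliberately out of order.
events = [
    {"date": "2013-05-02", "category": "condition", "text": "COPD"},
    {"date": "2011-03-10", "category": "observation", "text": "Tobacco dependence"},
    {"date": "2012-07-21", "category": "medication", "text": "Albuterol"},
]

memory = Memory()
for chunk in to_xml_chunks(events):  # workers run sequentially, sharing memory
    memory = worker(chunk, memory)
result = manager(memory)
```

The point of the sketch is the control flow: each worker sees only one chunk plus the accumulated memory, and the manager sees only the memory, so the final rationale can cite dated events without any single call reading the whole record.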

We compared TrajOnco (built on GPT-4.1-mini) to three trained ML baselines (logistic regression, XGBoost, k-nearest neighbors) and to a single-agent LLM baseline. Discrimination was evaluated using AUROC and AUPRC. Operating characteristics (sensitivity, specificity, PPV, NPV) were assessed at a threshold chosen to balance sensitivity and specificity.
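As a rough illustration of these metrics, the sketch below computes AUROC from its pairwise definition and picks a threshold by maximizing Youden's J (sensitivity + specificity - 1). Youden's J is one common way to balance sensitivity and specificity and is an assumption here; the text above does not state which criterion was used. The scores and labels are made-up toy values, and AUPRC is omitted for brevity.

```python
def auroc(scores, labels):
    """Probability that a random case outscores a random control (ties = 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def sens_spec(scores, labels, threshold):
    """Sensitivity and specificity when score >= threshold is called positive."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if y == 1 and s >= threshold)
    tn = sum(1 for s, y in pairs if y == 0 and s < threshold)
    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = len(pairs) - n_pos
    return tp / n_pos, tn / n_neg


def balanced_threshold(scores, labels):
    """Threshold maximizing sensitivity + specificity (assumed criterion)."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: sum(sens_spec(scores, labels, t)))


# Toy 1-to-10 risk scores with outcome labels (1 = case, 0 = control).
scores = [1, 2, 3, 4, 5, 6, 7, 8]
labels = [0, 0, 0, 1, 0, 0, 1, 1]

auc = auroc(scores, labels)
threshold = balanced_threshold(scores, labels)
sensitivity, specificity = sens_spec(scores, labels, threshold)
```

The pairwise AUROC is quadratic in cohort size, which is fine for a toy example; a rank-based formulation would be preferable at the scale of the cohort described here.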

Results

Discrimination and operating characteristics

In the unmatched cohort, TrajOnco achieved an AUROC of 0.871 (95% CI: 0.855–0.885) and AUPRC of 0.071 (95% CI: 0.053–0.092) for 1-year lung cancer risk. At the selected threshold, the framework achieved a sensitivity of 0.772, specificity of 0.825, PPV of 0.017, and NPV of 0.999, consistent with strong rule-out performance in a low-incidence screening setting. Performance was comparable to, or only modestly below, trained ML baselines, despite requiring no model training or feature engineering. The full performance comparison is shown in Table 1.
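The low PPV and high NPV follow directly from the reported sensitivity, specificity, and cohort composition. The short calculation below reproduces them from those figures; the function name is illustrative.

```python
def predictive_values(sensitivity, specificity, n_cases, n_controls):
    """Derive PPV and NPV from operating characteristics and cohort counts."""
    tp = sensitivity * n_cases      # cases flagged as high risk
    fn = n_cases - tp               # cases missed
    tn = specificity * n_controls   # controls correctly not flagged
    fp = n_controls - tn            # controls flagged as high risk
    return tp / (tp + fp), tn / (tn + fn)


# Figures reported above: 500 cases, 125,000 controls.
ppv, npv = predictive_values(0.772, 0.825, 500, 125_000)
print(f"PPV={ppv:.3f}, NPV={npv:.3f}")  # PPV=0.017, NPV=0.999
```

With roughly 386 true positives against about 21,875 false positives, even a specificity above 0.8 yields a PPV under 2% at this incidence, while the small number of missed cases keeps the NPV near 1.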

Table 1. Performance metrics for five models predicting 1-year lung cancer risk from longitudinal electronic health record data. Models include logistic regression, XGBoost, KNN, a single-agent GPT-4.1-mini model, and the TrajOnco chain-of-agents framework based on GPT-4.1-mini. Metrics reported include AUROC, AUPRC, negative predictive value (NPV), positive predictive value (PPV), sensitivity, and specificity with confidence intervals. XGBoost achieved the highest AUROC at 0.925 and the highest AUPRC at 0.195. The TrajOnco framework achieved an AUROC of 0.871 and sensitivity of 0.772, outperforming the single-agent GPT-4.1-mini model on most metrics while maintaining high specificity and near-perfect NPV.

Case study: temporally coherent reasoning

Beyond aggregate metrics, an LLM-based framework should produce reasoning that a clinician can interpret. In one representative case, a patient with a longstanding history of chronic obstructive pulmonary disease (COPD) and tobacco use was followed across several years of EHR. Early in the trajectory, TrajOnco assigned a low-risk score, citing tobacco exposure documented in the social history. As the record progressed, the framework integrated new evidence, including COPD and imaging findings, and elevated the risk score in a stepwise fashion. The manager agent’s final rationale linked each component of the score to specific dated events in the record, allowing the reasoning to be traced back to the underlying EHR rather than treated as a black-box prediction.

By contrast, the single-agent LLM baseline often exhibited the lost-in-the-middle phenomenon when summarizing long trajectories, overlooking information in the middle of the record and missing key longitudinal patterns. An LLM-as-a-judge evaluation rated TrajOnco's reasoning as more complete, temporally coherent, and clinically grounded than that of the single-agent baseline.

Timeline visualization showing longitudinal electronic health record data and changing lung cancer risk predictions generated by the TrajOnco chain-of-agents framework from 2011 through 2020. The upper panel displays raw EHR events across four categories: conditions, laboratory results, medications, and observations. Events include tobacco dependence, COPD, depressive disorder, respiratory rate measurements, laboratory tests, and medications such as dexamethasone, rituximab, and albuterol.

The lower panel shows how the model’s predicted lung cancer risk evolves over time, progressing from low risk around 2012 to moderate risk between 2013 and 2016, and then to sustained high risk beginning around 2016. Annotated clinical events associated with increasing risk include tobacco dependence, COPD, abnormal blood counts, neutrophilia, cough, and upper respiratory symptoms. Orange diamond markers represent clinical events retained in model memory during risk prediction.

Discussion

This work demonstrates that a zero-shot chain-of-agents framework can predict 1-year lung cancer risk directly from raw longitudinal EHR with performance comparable to, or modestly below, trained machine learning models. Two findings are particularly relevant for real-world deployment. First, the framework operates without feature engineering or task-specific training, which substantially lowers the implementation burden for health systems and may allow rapid adaptation to new use cases. Second, the patient-level rationales link risk estimates to specific dated events in the record, supporting clinical interpretability.

Limitations include the dependence on a proprietary base LLM, the use of EHR data that may underrepresent care delivered outside contributing health systems, and the need for prospective evaluation before clinical use. Future work will explore integration with screening eligibility workflows and external validation across additional populations.

These findings are consistent with data accessed in December 2025.
