Hierarchical algorithm to identify pregnancy start and duration using structured EHR data

by Truveta Research | Nov 9, 2025

Authors: Katherine Brown, PhD, MSN, RN; Amy Sullivan, MS; Esther Kim, PhD; Nadia Tabatabaeepour, MPH; Katherine Kendrick, MPH; Jordan Swartz, MD; Sarah Platt, MS; Sunny Guin, PhD; Emily Webber, PhD (all Truveta, Inc, Bellevue, WA)

Truveta adapted a hierarchical algorithm to estimate pregnancy start and duration using structured EHR data for regulatory-grade research.
Among 6.4 million women with pregnancy-related codes, 3.1 million pregnancy start dates were estimated using biologically plausible durations.
The methodology enables scalable, reproducible cohort creation for post-approval pregnancy safety and drug safety studies.

This blog extends findings from our poster (RWD2) presented at ISPOR EU 2025, “Hierarchical algorithm to identify pregnancy start and duration using structured EHR data.”

Pregnant individuals are often excluded from early-phase clinical trials, leaving major gaps in evidence about how medications affect pregnancy. Real-world data (RWD) can help close those gaps by providing insights into treatment exposure, healthcare use, and outcomes in large, diverse populations.

To conduct these studies, it’s essential to accurately identify when a pregnancy began and how long it lasted. Yet, estimating pregnancy start dates from structured EHR data is challenging because these data often lack clear indicators for conception or last menstrual period.

To address this, Truveta adapted a multi-step algorithm that uses structured clinical records and mother-infant linkages to estimate pregnancy start and duration. This approach supports regulatory-grade research, including post-approval safety studies of medications used during pregnancy.

Methods

Using a subset of Truveta Data, we identified women aged 12–55 with pregnancy-related diagnosis or procedure codes and deterministic mother-infant linkages.

Two complementary methods were applied to estimate gestational age and last menstrual period (LMP):

Gestational age codes: ICD-10-CM Z3A.xx and equivalent SNOMED CT codes record weeks of gestation at the time of an encounter. Pregnancy start was estimated by counting backward that number of weeks from the date of the record.
Outcome-based codes: Delivery and pregnancy outcome codes (e.g., live birth, miscarriage) were used to estimate pregnancy start when gestational age codes were unavailable.
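
The backward count from a gestational age code can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the implementation used in the study: the function names are hypothetical, and the handling of non-specific codes (skipping Z3A.00, Z3A.01, and Z3A.49, which do not encode an exact week) is one possible choice rather than a documented rule of the algorithm.

```python
from datetime import date, timedelta
from typing import Optional


def weeks_from_z3a(code: str) -> Optional[int]:
    """Return the gestational week encoded by a Z3A.xx code, or None.

    Z3A.08 through Z3A.42 encode an exact week. Z3A.00 (unspecified),
    Z3A.01 (less than 8 weeks), and Z3A.49 (greater than 42 weeks) do not,
    so this sketch returns None for them (an assumption, not a study rule).
    """
    prefix, _, suffix = code.strip().upper().partition(".")
    if prefix != "Z3A" or len(suffix) != 2 or not suffix.isdigit():
        return None
    weeks = int(suffix)
    return weeks if 8 <= weeks <= 42 else None


def estimate_start_from_ga(code: str, record_date: date) -> Optional[date]:
    """Count backward from the record date by the coded weeks of gestation."""
    weeks = weeks_from_z3a(code)
    if weeks is None:
        return None
    return record_date - timedelta(weeks=weeks)


# Hypothetical record: a Z3A.14 code dated 14 November 2023
print(estimate_start_from_ga("Z3A.14", date(2023, 11, 14)))
```

Returning None for non-specific codes leaves the decision about a fallback to the caller, which fits the outcome-based method described above.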

The algorithm integrated these data sources in a hierarchical framework, selecting the most reliable estimate available for each pregnancy. Sequential gestational age and outcome codes were then used to calculate pregnancy duration and define delivery dates.
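
A minimal Python sketch of such a hierarchy follows, assuming just two tiers: a gestational age code when one exists, otherwise an outcome code. All names are hypothetical, and the fallback durations in `ASSUMED_WEEKS_BY_OUTCOME` are placeholder values chosen for illustration. The post does not state the durations or the number of tiers the algorithm actually uses.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

# Placeholder durations in weeks by outcome type. These are illustrative
# assumptions only, not values reported by the study.
ASSUMED_WEEKS_BY_OUTCOME = {"live_birth": 39, "miscarriage": 10}


@dataclass
class StartEstimate:
    start: date
    source: str  # which tier of the hierarchy produced the estimate


def choose_start(
    ga_weeks: Optional[int],
    ga_record_date: Optional[date],
    outcome: Optional[str],
    outcome_date: Optional[date],
) -> Optional[StartEstimate]:
    """Pick the most reliable available estimate of pregnancy start."""
    # Tier 1: a gestational age code dates the pregnancy directly.
    if ga_weeks is not None and ga_record_date is not None:
        start = ga_record_date - timedelta(weeks=ga_weeks)
        return StartEstimate(start, "gestational_age_code")
    # Tier 2: fall back to an assumed duration for the recorded outcome.
    if outcome in ASSUMED_WEEKS_BY_OUTCOME and outcome_date is not None:
        weeks = ASSUMED_WEEKS_BY_OUTCOME[outcome]
        return StartEstimate(outcome_date - timedelta(weeks=weeks), "outcome_code")
    # No usable information: leave the start date unestimated.
    return None


# Both sources present: the gestational age code takes priority.
print(choose_start(29, date(2024, 2, 27), "live_birth", date(2024, 3, 5)))
# Only an outcome code present: the assumed duration is used instead.
print(choose_start(None, None, "miscarriage", date(2024, 1, 10)))
```

Recording the `source` with each estimate keeps the tier visible downstream, so analyses can be restricted to estimates from the more reliable tier if needed.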

Mother-infant linkages further strengthened data completeness and enabled analyses that connect maternal characteristics, treatments, and infant outcomes.

Pregnancy duration was estimated using sequential Z3A codes and outcome codes to define delivery. Maternal comorbidities are captured across gestation, and deterministic linkage connects mother and infant records, enabling analysis of birth characteristics and infant outcomes.

Results

Among 6.4 million women with pregnancy-related codes, 3.1 million pregnancies had start dates estimated using biologically plausible durations.

Overall, we identified 3.5 million pregnancy episodes (across 2.9 million women), including:

2.85 million live births (81.6%)
640,000 pregnancy losses (18.4%)
1.26 million live births linked to infant records (43%)

SNOMED codes contributed an additional 500,000 pregnancy episodes.

The resulting dataset includes timing, outcome type, and infant linkages—creating a foundation for longitudinal studies that explore care patterns, treatment effects, and maternal-infant outcomes across the course of pregnancy.

Flowchart showing Truveta’s hierarchical cohort selection for pregnancy research using structured EHR data. Begins with 6.4 million women meeting pregnancy definitions, narrowed to 3.7 million with Z3A codes and 3.98 million with outcome codes. Estimated start dates identified for 3.1 and 3.5 million patients, leading to 2.7 million pregnancies with outcomes and 1.26 million with linked children.

Pregnancy episode cohort attrition: Stepwise attrition of women with pregnancy-related codes to final linked pregnancy episodes with outcomes and infants.

Example patient timeline

Gestational age codes estimate pregnancy start and progression, while delivery is defined by the baby’s date of birth or outcome code. This linkage enables comprehensive analyses connecting maternal factors—such as comorbidities or treatments—to newborn and infant outcomes like preterm birth or neonatal conditions.

Timeline graphic showing an example pregnancy episode with sequential ICD-10-CM Z3A codes from November 2023 to March 2024. Highlights 14-, 16-, 24-, and 29-week gestation milestones, maternal conditions like pre-eclampsia and hypertension, and birth outcomes including preterm newborn (ICD-10 P07.33). Illustrates linkage between maternal and infant records.
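
To make the timeline concrete, the sketch below works through an episode with the same gestational milestones (14, 16, 24, and 29 weeks). The exact dates are hypothetical, since the graphic gives only the months, and so is the delivery date. Reconciling several codes by taking a median of the implied start dates is an assumption made for illustration, not a description of the study's method.

```python
from datetime import date, timedelta
from statistics import median_low

# Hypothetical (record date, coded gestational weeks) pairs for one episode
observations = [
    (date(2023, 11, 14), 14),
    (date(2023, 11, 28), 16),
    (date(2024, 1, 23), 24),
    (date(2024, 2, 27), 29),
]


def implied_starts(obs):
    """Each code implies a start date: record date minus coded weeks."""
    return [record_date - timedelta(weeks=weeks) for record_date, weeks in obs]


def median_start(obs):
    """Reconcile several implied starts by taking the lower median date."""
    return median_low(implied_starts(obs))


def spread_days(obs):
    """Days between the earliest and latest implied start (0 = full agreement)."""
    starts = implied_starts(obs)
    return (max(starts) - min(starts)).days


def gestational_age_at(event_date, start):
    """Return (completed weeks, extra days) between start and an event date."""
    return divmod((event_date - start).days, 7)


start = median_start(observations)
delivery = date(2024, 3, 5)  # hypothetical delivery date taken from the infant record
print(start, spread_days(observations), gestational_age_at(delivery, start))
```

A large spread between implied start dates would flag an episode whose codes disagree, for example when records from two different pregnancies have been grouped together.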

Discussion

This multi-step algorithm supports robust pregnancy episode construction using structured EHR data. Its hierarchical design improves completeness and precision, particularly when combined with mother-infant linkage.

Future work includes clinician-reviewed validation of estimated pregnancy start dates and durations using note-based gold standards.

This approach enables scalable, reproducible cohort creation for regulatory-grade research, including pregnancy post-authorization safety studies (PASS) and drug safety evaluations.

Analyses that have already used this algorithm include a study exploring real-world patterns of glucose tolerance testing and gestational diabetes in pregnancy and a follow-on study that evaluates the associations of nutrition counseling, insulin, and metformin therapy with post-glucose tolerance testing weight gain in a large pregnancy cohort.

These findings are consistent with data accessed on May 22, 2025. They are preliminary research findings and not peer reviewed; data are constantly changing and updating.

Citations

Moll K, Wong HL, Fingar K, Hobbi S, Sheng M, Burrell TA, Eckert LO, Munoz FM, Baer B, Shoaibi A, Anderson S. Validating claims-based algorithms determining pregnancy outcomes and gestational age using a linked claims-electronic medical record database. Drug Saf. 2021;44(11):1151-1164. doi:10.1007/s40264-021-01113-8
Bertoia ML, Phiri K, Clifford CR, Doherty M, Zhou L, Wang LT, Bertoia NA, Wang FT, Seeger JD. Identification of pregnancies and infants within a US commercial healthcare administrative claims database. Pharmacoepidemiol Drug Saf. 2022;31(8):863-874. doi:10.1002/pds.5483
