Authors: Katherine Brown, PhD, MSN, RN Truveta, Inc, Bellevue, WA, Amy Sullivan, MS Truveta, Inc, Bellevue, WA, Esther Kim, PhD Truveta, Inc, Bellevue, WA, Nadia Tabatabaeepour, MPH Truveta, Inc, Bellevue, WA, Katherine Kendrick, MPH Truveta, Inc, Bellevue, WA, Jordan Swartz, MD Truveta, Inc, Bellevue, WA, Sarah Platt, MS Truveta, Inc, Bellevue, WA, Sunny Guin, PhD Truveta, Inc, Bellevue, WA, Emily Webber, PhD Truveta, Inc, Bellevue, WA

Banner image titled “Estimating pregnancy start and duration” with Truveta branding and gradient background. Represents Truveta’s real-world data research on pregnancy episode estimation using structured EHR data.
  • Truveta adapted a hierarchical algorithm to estimate pregnancy start and duration using structured EHR data for regulatory-grade research.
  • Among 6.4 million women with pregnancy-related codes, 3.1 million pregnancy start dates were estimated using biologically plausible durations.
  • The methodology enables scalable, reproducible cohort creation for post-approval pregnancy safety and drug safety studies.

This blog extends findings from our poster (RWD2) presented at ISPOR EU 2025, “Hierarchical algorithm to identify pregnancy start and duration using structured EHR data.”

Pregnant individuals are often excluded from early-phase clinical trials, leaving major gaps in evidence about how medications affect pregnancy. Real-world data (RWD) can help close those gaps by providing insights into treatment exposure, healthcare use, and outcomes in large, diverse populations.

To conduct these studies, it’s essential to accurately identify when a pregnancy began and how long it lasted. Yet, estimating pregnancy start dates from structured EHR data is challenging because these data often lack clear indicators for conception or last menstrual period.

To address this, Truveta adapted a multi-step algorithm that uses structured clinical records and mother-infant linkages to estimate pregnancy start and duration. This approach supports regulatory-grade research, including post-approval safety studies of medications used during pregnancy.

Methods

Using a subset of Truveta Data, we identified women aged 12–55 with pregnancy-related diagnosis or procedure codes and deterministic mother-infant linkages.

Two complementary methods were applied to estimate gestational age and last menstrual period (LMP):

  1. Gestational age codes: ICD-10-CM Z3A.xx and equivalent SNOMED CT codes record weeks of gestation on a given date. The pregnancy start (LMP) was estimated by counting backward that number of weeks from the date of the record.
  2. Outcome-based codes: Delivery and pregnancy outcome codes (e.g., live birth, miscarriage) were used to estimate pregnancy start when gestational age codes were unavailable.
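
As a rough illustration of the first method, the backward count from a gestational age code can be sketched in a few lines of Python. This is a minimal sketch, not Truveta's implementation: the function names are hypothetical, and it only handles codes Z3A.08 through Z3A.42, where the two digits give completed weeks. Codes such as Z3A.00 (weeks not specified), Z3A.01 (less than 8 weeks), and Z3A.49 (greater than 42 weeks) do not give an exact week and are skipped here as a simplifying assumption.

```python
import re
from datetime import date, timedelta
from typing import Optional

# Matches ICD-10-CM gestational age codes such as "Z3A.20".
_Z3A_PATTERN = re.compile(r"^Z3A\.(\d{2})$")


def weeks_from_z3a(code: str) -> Optional[int]:
    """Return weeks of gestation for Z3A.08-Z3A.42, else None.

    Assumption: codes without an exact week (Z3A.00, Z3A.01, Z3A.49)
    are ignored rather than mapped to an approximate value.
    """
    match = _Z3A_PATTERN.match(code.strip().upper())
    if match is None:
        return None
    weeks = int(match.group(1))
    return weeks if 8 <= weeks <= 42 else None


def estimate_lmp(record_date: date, code: str) -> Optional[date]:
    """Count backward from the record date by the coded weeks of gestation."""
    weeks = weeks_from_z3a(code)
    if weeks is None:
        return None
    return record_date - timedelta(weeks=weeks)


if __name__ == "__main__":
    # A 20-week code recorded on 1 March 2024 implies a start 140 days earlier.
    print(estimate_lmp(date(2024, 3, 1), "Z3A.20"))
```

The SNOMED CT equivalents would need their own code-to-weeks mapping, which is not shown.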

The algorithm integrated these data sources in a hierarchical framework, selecting the most reliable estimate available for each pregnancy. Sequential gestational age and outcome codes were then used to calculate pregnancy duration and define delivery dates.
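
The hierarchy described above can be pictured as a simple fallback rule: use a start date derived from a gestational age code when one exists, and fall back to an outcome-based estimate otherwise. The sketch below is illustrative only. The names are hypothetical, and the assumed weeks of gestation per outcome type are placeholder values chosen for the example, not figures from this analysis.

```python
from datetime import date, timedelta
from typing import NamedTuple, Optional

# Placeholder assumptions for illustration only; the source does not
# state which gestational durations were assigned to each outcome type.
ASSUMED_WEEKS_AT_OUTCOME = {"live_birth": 39, "pregnancy_loss": 10}


class StartEstimate(NamedTuple):
    start: date
    source: str  # which tier of the hierarchy produced the estimate


def estimate_start(
    z3a_lmp: Optional[date],
    outcome_date: Optional[date],
    outcome_type: Optional[str],
) -> Optional[StartEstimate]:
    """Pick the most reliable available estimate of pregnancy start."""
    # Tier 1: a start date back-calculated from a gestational age code.
    if z3a_lmp is not None:
        return StartEstimate(z3a_lmp, "gestational_age_code")
    # Tier 2: count backward from the outcome using an assumed duration.
    weeks = ASSUMED_WEEKS_AT_OUTCOME.get(outcome_type or "")
    if outcome_date is not None and weeks is not None:
        start = outcome_date - timedelta(weeks=weeks)
        return StartEstimate(start, "outcome_code")
    # No usable information for this pregnancy.
    return None
```

Recording the `source` alongside each date keeps the tier visible downstream, which can help when a study wants to run sensitivity analyses restricted to the more reliable estimates.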

Mother-infant linkages further strengthened data completeness and enabled analyses that connect maternal characteristics, treatments, and infant outcomes.
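
To make the linkage step concrete, here is a minimal sketch of attaching linked infants to pregnancy episodes. It assumes a link table of (mother, infant, birth date) rows already exists and that an infant belongs to the episode whose delivery date falls within a few days of the birth date. The `Episode` class, the `attach_infants` function, and the 7-day window are all hypothetical choices for illustration, not details from the source.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Iterable, List, Tuple


@dataclass
class Episode:
    mother_id: str
    delivery_date: date
    infant_ids: List[str] = field(default_factory=list)


def attach_infants(
    episodes: List[Episode],
    links: Iterable[Tuple[str, str, date]],
    max_gap_days: int = 7,  # assumed tolerance between delivery and birth date
) -> List[Episode]:
    """Attach each linked infant to the first matching episode of its mother."""
    for mother_id, infant_id, birth_date in links:
        for episode in episodes:
            gap = abs((episode.delivery_date - birth_date).days)
            if episode.mother_id == mother_id and gap <= max_gap_days:
                episode.infant_ids.append(infant_id)
                break  # one episode per infant
    return episodes
```

Matching on the date as well as the mother identifier matters when one woman has several pregnancy episodes in the data.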

Illustrated timeline showing Truveta’s hierarchical algorithm to estimate pregnancy start and duration using structured EHR data. The timeline begins with last menstrual period (LMP) and a positive pregnancy test, followed by clinical milestones at 8, 12, 16, 20, 24, 28, 32, 36–40 weeks with corresponding ICD-10-CM and SNOMED codes (e.g., Z3A.08, Z3A.12, Z3A.16). The pregnancy duration section highlights maternal comorbidities such as gestational diabetes and preeclampsia, leading to outcome codes for single live birth or cesarean delivery. A lower track represents baby follow-up, showing infant outcomes like hypoglycemia, preterm birth, and respiratory distress. Branded with Truveta logo and caption “Hierarchical algorithm to identify pregnancy start and duration using structured EHR data.”

Pregnancy duration was estimated using sequential Z3A codes and outcome codes to define delivery. Maternal comorbidities are captured across gestation, and deterministic linkage connects mother and infant records, enabling analysis of birth characteristics and infant outcomes.
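
Once a start date and a delivery or outcome date are both available, duration is a date difference, and a plausibility check can screen out impossible episodes. The sketch below shows one way this might look. The 4-week and 44-week bounds are assumed defaults for the example only, since the specific thresholds are not stated here.

```python
from datetime import date


def gestation_weeks(start: date, end: date) -> float:
    """Pregnancy duration in weeks between estimated start and outcome date."""
    return (end - start).days / 7


def is_plausible(
    start: date,
    end: date,
    min_weeks: int = 4,   # assumed lower bound, for illustration
    max_weeks: int = 44,  # assumed upper bound, for illustration
) -> bool:
    """Flag whether a duration falls inside the assumed plausible range."""
    return min_weeks <= gestation_weeks(start, end) <= max_weeks


if __name__ == "__main__":
    start, delivery = date(2023, 8, 14), date(2024, 3, 4)
    print(gestation_weeks(start, delivery), is_plausible(start, delivery))
```

In practice the bounds would likely differ by outcome type, for example a narrower range for live births than for pregnancy losses.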

Results

Among 6.4 million women with pregnancy-related codes, 3.1 million pregnancies had start dates estimated using biologically plausible durations.

Overall, we identified 3.5 million pregnancy episodes (across 2.9 million women), including:

  • 2.85 million live births (81.6%)
  • 640,000 pregnancy losses (18.4%)
  • 1.26 million live births linked to infant records (43%)

SNOMED codes contributed an additional 500,000 pregnancy episodes.

The resulting dataset includes timing, outcome type, and infant linkages—creating a foundation for longitudinal studies that explore care patterns, treatment effects, and maternal-infant outcomes across the course of pregnancy.

Flowchart showing Truveta’s hierarchical cohort selection for pregnancy research using structured EHR data. Begins with 6.4 million women meeting pregnancy definitions, narrowed to 3.7 million with Z3A codes and 3.98 million with outcome codes. Estimated start dates identified for 3.1 and 3.5 million patients, leading to 2.7 million pregnancies with outcomes and 1.26 million with linked children.

Pregnancy episode cohort attrition: Stepwise attrition of women with pregnancy-related codes to final linked pregnancy episodes with outcomes and infants.

Table comparing pregnancy outcomes identified in Truveta Data: 2.85 million live births (81.6%), 1.53 million linked births (43.7%), and 641,524 pregnancy losses (18.4%) across 3.49 million records and 2.91 million patients. Includes Truveta logo and study title.

Example patient timeline

Gestational age codes estimate pregnancy start and progression, while delivery is defined by the baby’s date of birth or outcome code. This linkage enables comprehensive analyses connecting maternal factors—such as comorbidities or treatments—to newborn and infant outcomes like preterm birth or neonatal conditions.

Timeline graphic showing an example pregnancy episode with sequential ICD-10-CM Z3A codes from November 2023 to March 2024. Highlights 14-, 16-, 24-, and 29-week gestation milestones, maternal conditions like pre-eclampsia and hypertension, and birth outcomes including preterm newborn (ICD-10 P07.33). Illustrates linkage between maternal and infant records.
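
A timeline like this one carries several gestational age codes, and each implies its own start date. One way to reconcile them, sketched below, is to back-calculate a start from every code and take the median. The visit dates are invented to match the 14-, 16-, 24-, and 29-week milestones in the example, and the median rule is an assumption for illustration rather than the documented method.

```python
from datetime import date, timedelta
from statistics import median_low
from typing import List, Tuple

# Hypothetical (record date, coded weeks) pairs for one pregnancy episode.
observations = [
    (date(2023, 11, 20), 14),
    (date(2023, 12, 6), 16),   # recorded two days "late" relative to the others
    (date(2024, 1, 29), 24),
    (date(2024, 3, 4), 29),
]


def implied_starts(obs: List[Tuple[date, int]]) -> List[date]:
    """Back-calculate one candidate start date per gestational age code."""
    return [record_date - timedelta(weeks=weeks) for record_date, weeks in obs]


def consensus_start(obs: List[Tuple[date, int]]) -> date:
    """Median of the candidate start dates (lower median keeps a real date)."""
    ordinals = [d.toordinal() for d in implied_starts(obs)]
    return date.fromordinal(median_low(ordinals))


def spread_days(obs: List[Tuple[date, int]]) -> int:
    """Disagreement between the earliest and latest candidate start dates."""
    starts = implied_starts(obs)
    return (max(starts) - min(starts)).days


if __name__ == "__main__":
    print(consensus_start(observations), spread_days(observations))
```

A large spread between candidates could be used as a flag that codes from two different pregnancies were grouped into one episode.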

Discussion

This multi-step algorithm supports robust pregnancy episode construction using structured EHR data. Its hierarchical design improves completeness and precision, particularly when combined with mother-infant linkage.

Future work includes clinician-reviewed validation of estimated pregnancy start dates and durations using note-based gold standards.

This approach enables scalable, reproducible cohort creation for regulatory-grade research, including post-authorization safety studies (PASS) in pregnancy and drug safety evaluations.

Analyses that have already used this algorithm include a study of real-world patterns of glucose tolerance testing and gestational diabetes in pregnancy, and a follow-on study evaluating the associations of nutrition counseling, insulin, and metformin therapy with weight gain after glucose tolerance testing in a large pregnancy cohort.

These findings are consistent with data accessed on May 22, 2025. They are preliminary research findings and not peer reviewed; data are constantly changing and updating.

Citations

  1. Moll K, Wong HL, Fingar K, Hobbi S, Sheng M, Burrell TA, Eckert LO, Munoz FM, Baer B, Shoaibi A, Anderson S. Validating claims-based algorithms determining pregnancy outcomes and gestational age using a linked claims-electronic medical record database. Drug Saf. 2021;44(11):1151-1164. doi:10.1007/s40264-021-01113-8
  2. Bertoia ML, Phiri K, Clifford CR, Doherty M, Zhou L, Wang LT, Bertoia NA, Wang FT, Seeger JD. Identification of pregnancies and infants within a US commercial healthcare administrative claims database. Pharmacoepidemiol Drug Saf. 2022;31(8):863-874. doi:10.1002/pds.5483