Authors: Katherine Brown, PhD, MSN, RN; Amy Sullivan, MS; Esther Kim, PhD; Nadia Tabatabaeepour, MPH; Katherine Kendrick, MPH; Jordan Swartz, MD; Sarah Platt, MS; Sunny Guin, PhD; Emily Webber, PhD (all Truveta, Inc, Bellevue, WA)
- Truveta adapted a hierarchical algorithm to estimate pregnancy start and duration using structured EHR data for regulatory-grade research.
- Among 6.4 million women with pregnancy-related codes, 3.1 million pregnancy start dates were estimated using biologically plausible durations.
- The methodology enables scalable, reproducible cohort creation for post-approval pregnancy safety and drug safety studies.
This blog extends findings from our poster (RWD2) presented at ISPOR EU 2025, “Hierarchical algorithm to identify pregnancy start and duration using structured EHR data.”
Pregnant individuals are often excluded from early-phase clinical trials, leaving major gaps in evidence about how medications affect pregnancy. Real-world data (RWD) can help close those gaps by providing insights into treatment exposure, healthcare use, and outcomes in large, diverse populations.
To conduct these studies, it’s essential to accurately identify when a pregnancy began and how long it lasted. Yet, estimating pregnancy start dates from structured EHR data is challenging because these data often lack clear indicators for conception or last menstrual period.
To address this, Truveta adapted a multi-step algorithm that uses structured clinical records and mother-infant linkages to estimate pregnancy start and duration. This approach supports regulatory-grade research, including post-approval safety studies of medications used during pregnancy.
Methods
Using a subset of Truveta Data, we identified women aged 12–55 with pregnancy-related diagnosis or procedure codes and deterministic mother-infant linkages.
Two complementary methods were applied to estimate gestational age and last menstrual period (LMP):
- Gestational age codes: ICD-10-CM Z3A.xx and equivalent SNOMED CT codes record weeks of gestation on a given date; pregnancy start was estimated by counting backward that many weeks from the date of the record.
- Outcome-based codes: Delivery and pregnancy outcome codes (e.g., live birth, miscarriage) were used to estimate pregnancy start when gestational age codes were unavailable.
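The backward count from a gestational age code can be sketched in a few lines. This is a minimal illustration, not Truveta's implementation: the function names are hypothetical, and it assumes only codes Z3A.08 through Z3A.42 are used, since Z3A.00 (unspecified), Z3A.01 (less than 8 weeks), and Z3A.49 (greater than 42 weeks) do not give an exact week.

```python
from datetime import date, timedelta
from typing import Optional


def weeks_from_z3a(code: str) -> Optional[int]:
    """Return weeks of gestation for Z3A.08 through Z3A.42, else None.

    Hypothetical helper for illustration only.
    """
    prefix, _, suffix = code.strip().upper().partition(".")
    if prefix != "Z3A" or len(suffix) != 2 or not suffix.isdecimal():
        return None
    weeks = int(suffix)
    # Z3A.00, Z3A.01 and Z3A.49 do not encode an exact week, so skip them.
    return weeks if 8 <= weeks <= 42 else None


def estimate_lmp(record_date: date, code: str) -> Optional[date]:
    """Count backward from the record date by the coded weeks of gestation."""
    weeks = weeks_from_z3a(code)
    if weeks is None:
        return None
    return record_date - timedelta(weeks=weeks)


# A Z3A.20 code recorded on 2024-06-03 implies an LMP 20 weeks earlier.
print(estimate_lmp(date(2024, 6, 3), "Z3A.20"))  # 2024-01-15
```

Returning None for imprecise codes leaves room for a later step to fall back to another source of evidence.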
The algorithm integrated these data sources in a hierarchical framework, selecting the most reliable estimate available for each pregnancy. Sequential gestational age and outcome codes were then used to calculate pregnancy duration and define delivery dates.
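As a rough sketch of the hierarchical idea, the function below prefers a gestational age record and falls back to an outcome code only when none is available. The names, the two-tier ordering, and especially the default durations per outcome type are assumptions made for illustration; the post does not state the values the algorithm actually uses.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional, Tuple

# Hypothetical default durations in weeks for outcome-based estimates.
# These values are placeholders, not taken from the source.
ASSUMED_WEEKS_BY_OUTCOME = {"live_birth": 39, "miscarriage": 10}


@dataclass
class StartEstimate:
    start: date
    method: str  # which tier of the hierarchy produced the estimate


def estimate_start(
    gestational_age: Optional[Tuple[date, int]],
    outcome: Optional[Tuple[date, str]],
) -> Optional[StartEstimate]:
    """Pick the most reliable available estimate of pregnancy start."""
    # Tier 1: a dated gestational age (record date, weeks of gestation).
    if gestational_age is not None:
        record_date, weeks = gestational_age
        return StartEstimate(record_date - timedelta(weeks=weeks), "gestational_age_code")
    # Tier 2: an outcome code (outcome date, outcome type) with an assumed duration.
    if outcome is not None:
        outcome_date, outcome_type = outcome
        weeks = ASSUMED_WEEKS_BY_OUTCOME.get(outcome_type)
        if weeks is not None:
            return StartEstimate(outcome_date - timedelta(weeks=weeks), "outcome_code")
    return None


best = estimate_start((date(2024, 6, 3), 20), (date(2024, 10, 14), "live_birth"))
print(best.method)  # gestational_age_code
```

Recording the method alongside the date makes it possible to report how many episodes each tier contributed.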
Mother-infant linkages further strengthened data completeness and enabled analyses that connect maternal characteristics, treatments, and infant outcomes.
Pregnancy duration was estimated using sequential Z3A codes and outcome codes to define delivery. Maternal comorbidities are captured across gestation, and deterministic linkage connects mother and infant records, enabling analysis of birth characteristics and infant outcomes.
Results
Among 6.4 million women with pregnancy-related codes, 3.1 million pregnancies had start dates estimated using biologically plausible durations.
Overall, we identified 3.5 million pregnancy episodes (across 2.9 million women), including:
- 2.85 million live births (81.6%)
- 640,000 pregnancy losses (18.4%)
- 1.26 million live births linked to infant records (43%)
SNOMED codes contributed an additional 500,000 pregnancy episodes.
The resulting dataset includes timing, outcome type, and infant linkages—creating a foundation for longitudinal studies that explore care patterns, treatment effects, and maternal-infant outcomes across the course of pregnancy.
Pregnancy episode cohort attrition: Stepwise attrition of women with pregnancy-related codes to final linked pregnancy episodes with outcomes and infants.
Example patient timeline
Gestational age codes estimate pregnancy start and progression, while delivery is defined by the baby’s date of birth or outcome code. This linkage enables comprehensive analyses connecting maternal factors—such as comorbidities or treatments—to newborn and infant outcomes like preterm birth or neonatal conditions.
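To make the timeline concrete, the sketch below computes gestational age at delivery from an estimated pregnancy start and a linked infant's date of birth, then flags preterm birth using the standard threshold of fewer than 37 weeks. The function names and the example dates are hypothetical and do not describe a real patient.

```python
from datetime import date


def gestational_weeks_at_delivery(pregnancy_start: date, delivery_date: date) -> float:
    """Weeks elapsed between estimated pregnancy start and delivery."""
    return (delivery_date - pregnancy_start).days / 7


def is_preterm(weeks: float) -> bool:
    """Preterm birth is conventionally defined as delivery before 37 weeks."""
    return weeks < 37


# Hypothetical episode: start estimated from a gestational age code,
# delivery taken from the linked infant's date of birth.
start = date(2024, 1, 15)
infant_dob = date(2024, 9, 16)

weeks = gestational_weeks_at_delivery(start, infant_dob)
print(weeks, is_preterm(weeks))  # 35.0 True
```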
Discussion
This multi-step algorithm supports robust pregnancy episode construction using structured EHR data. Its hierarchical design improves completeness and precision, particularly when combined with mother-infant linkage.
Future work includes clinician-reviewed validation of estimated pregnancy start dates and durations using note-based gold standards.
This approach enables scalable, reproducible cohort creation for regulatory-grade research, including pregnancy post-authorization safety studies (PASS) and drug safety evaluations.
Analyses that have already used this algorithm include a study exploring real-world patterns of glucose tolerance testing and gestational diabetes in pregnancy and a follow-on study that evaluates the associations of nutrition counseling, insulin, and metformin therapy with post-glucose tolerance testing weight gain in a large pregnancy cohort.
These findings are consistent with data accessed on May 22, 2025. They are preliminary research findings and not peer reviewed; data are constantly changing and updating.
Citations
- Moll K, Wong HL, Fingar K, Hobbi S, Sheng M, Burrell TA, Eckert LO, Munoz FM, Baer B, Shoaibi A, Anderson S. Validating claims-based algorithms determining pregnancy outcomes and gestational age using a linked claims-electronic medical record database. Drug Saf. 2021 Nov;44(11):1151-1164. doi:10.1007/s40264-021-01113-8
- Bertoia ML, Phiri K, Clifford CR, Doherty M, Zhou L, Wang LT, Bertoia NA, Wang FT, Seeger JD. Identification of pregnancies and infants within a US commercial healthcare administrative claims database. Pharmacoepidemiol Drug Saf. 2022;31(8):863-874. doi:10.1002/pds.5483

