Why is this study needed?
Sepsis is a leading cause of paediatric death internationally, with in-hospital mortality estimates varying from 5% to 20%. Delays to diagnosis and treatment initiation are associated with poorer outcomes. Existing tools, such as the recently published Phoenix Sepsis Criteria (PSC), identify children with sepsis and/or predict mortality. They can tell us who has sepsis, but not who is likely to develop it.
What is the problem?
Few models exist to predict which children are likely to develop sepsis, when early diagnosis and treatment may be most useful. Currently, no multicentre models in the emergency setting predict organ dysfunction before it is present.
Without an accurate tool for predicting paediatric sepsis, children are at risk of preventable harm from delays to identification and treatment.
Alpern ER, Scott HF, Balamuth F, et al. Derivation and Validation of Predictive Models for Early Pediatric Sepsis. JAMA Pediatr. 2025;179(12):1318–1325. doi:10.1001/jamapediatrics.2025.3892
What were the aims of the study?
To develop machine learning models, using data from the electronic health record (EHR) obtained in the first four hours after an emergency department (ED) presentation, to predict the likelihood of developing sepsis with organ dysfunction within 48 hours.
What they did
This was a multisite, registry study involving EHR data from five health systems in the Paediatric Emergency Care Applied Research Network (PECARN). Study sites included five academic quaternary-care EDs and three affiliated community EDs and hospitals.
Study data included all ED visits for children aged 2 months to <18 years, excluding visits with:
- death in the ED or transfer to a non-registry hospital;
- a trauma diagnosis;
- sepsis present during the predictive features window (the first four hours).
The primary outcome was sepsis, defined as a PSC score of two or more, or death, in an ED visit with suspected infection.
The models used data obtained in the first four hours of ED presentation, including:
(1) acuity measures (Emergency Severity Index, arrival mode, prior notification or referral);
(2) clinical observations (weight, age-adjusted vital signs, oxygen saturation, pain score);
(3) medical complexity markers (ED utilisation in prior year, complex chronic conditions, presence of an indwelling central venous line or tracheostomy);
(4) biological variables (age, sex assigned at birth).
The performance of machine learning algorithms (ridge-penalised logistic regression and gradient tree boosting) was compared for predicting sepsis and septic shock at 90% sensitivity.
What is gradient tree boosting?
Machine learning models can take many forms. One approach used in this study was gradient tree boosting, a technique that combines many simple decision trees to improve predictive accuracy. A decision tree works by splitting data into smaller groups based on variables that best separate outcomes—for example, vital signs or laboratory results that differ between patients with and without sepsis.
Rather than relying on a single tree, gradient boosting builds many trees sequentially. Each new tree focuses on correcting the errors made by the previous ones. By gradually refining the model in this way, the algorithm can capture complex relationships between clinical variables that might be difficult to detect using traditional statistical approaches.
In practice, this means gradient tree boosting can integrate multiple clinical data points, such as heart rate, blood pressure, laboratory markers, and other patient data, to estimate the probability of sepsis or septic shock. This ability to model non-linear relationships and interactions between variables is one reason these models often perform well in clinical prediction tasks.
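To make this concrete, the following is a minimal illustrative sketch of gradient tree boosting using scikit-learn's `GradientBoostingClassifier`. It is not the study's model: the features (heart rate, systolic blood pressure, oxygen saturation), the synthetic data, and all coefficients are invented for demonstration only.

```python
# Illustrative sketch only: gradient tree boosting on synthetic vital-sign
# data. Feature names, coefficients, and prevalence are hypothetical and
# are NOT taken from the study's models.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 5000

# Synthetic "first four hours" observations
heart_rate = rng.normal(110, 20, n)
systolic_bp = rng.normal(100, 15, n)
spo2 = rng.normal(97, 2, n)

# Synthetic rare outcome: risk rises with tachycardia, hypotension, hypoxia
logit = (-4.5 + 0.06 * (heart_rate - 110)
         - 0.08 * (systolic_bp - 100)
         - 0.6 * (spo2 - 97))
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

X = np.column_stack([heart_rate, systolic_bp, spo2])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# 200 shallow trees, each fitted to the residual errors of its predecessors
model = GradientBoostingClassifier(
    n_estimators=200, max_depth=3, learning_rate=0.1, random_state=0)
model.fit(X_train, y_train)

# The ensemble outputs a probability of the outcome for each visit
auroc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
```

Each tree on its own is a weak learner; the sequential correction of residuals is what lets the ensemble capture the non-linear relationships and interactions described above.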
What were the results?
The study included 2.3 million ED visits, with over 1.6 million in the training set used to derive the models and 719,298 in the validation cohort used to test them. The cohort was 48.6% female, with a median age of 4.7 years.
Rates of PSC sepsis (0.35% in the training cohort, 0.37% in the validation cohort) and PSC shock (0.15% in both cohorts) were comparable across cohorts. Overall mortality was 0.015%; mortality was 2.164% in visits meeting PSC sepsis criteria and 3.307% in visits meeting PSC shock criteria.
Models were developed and validated to predict Phoenix Sepsis Criteria scores. Assessment of performance characteristics showed that the models had high AUROCs (0.92 [95% CI, 0.92-0.93] for logistic regression and 0.94 [95% CI, 0.93-0.94] for gradient tree boosting) and meaningful positive likelihood ratios. The gradient tree boosting models had positive likelihood ratios ranging from 4.67 (95% CI, 4.61-4.74) to 6.18 (95% CI, 6.08-6.28) for sepsis and from 4.16 (95% CI, 4.07-4.24) to 5.83 (95% CI, 5.67-5.99) for septic shock.
This indicates the successful development of models capable of predicting a rare outcome (paediatric sepsis) before the onset of organ dysfunction. The top 20 features for the gradient tree boosting model were presented (with and without the Emergency Severity Index).
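The likelihood ratios above can be translated into a rough post-test probability using Bayes' rule in odds form. The sketch below is an illustrative calculation, not the authors' analysis: it combines the lowest reported LR+ for sepsis (4.67) with the training-cohort prevalence (0.35%), and "number needed to alert" is our shorthand for 1/PPV, not a term from the paper.

```python
# Illustrative arithmetic only: how a positive likelihood ratio shifts a
# very low pre-test probability. Inputs are taken from the results above.
def post_test_probability(pre_test_prob: float, lr_positive: float) -> float:
    """Convert a pre-test probability to a post-test probability via odds."""
    pre_odds = pre_test_prob / (1 - pre_test_prob)
    post_odds = pre_odds * lr_positive
    return post_odds / (1 + post_odds)

prevalence = 0.0035   # 0.35% PSC sepsis in the training cohort
lr_pos = 4.67         # lowest reported LR+ for the sepsis models

ppv = post_test_probability(prevalence, lr_pos)

# A positive prediction raises risk from ~0.35% to ~1.6%,
# i.e. roughly one true case per ~60 alerts
number_needed_to_alert = 1 / ppv
```

This back-of-the-envelope figure of roughly 60 alerts per true case is consistent with the number needed to treat of 59 reported for the gradient tree boosting sepsis model, and illustrates why even a well-performing model for a rare outcome generates many false positives.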
Similar model performance was demonstrated in fairness tests, with comparable results across demographic characteristics, except for payer: performance was better for patients with Medicaid insurance than for those with commercial payers.

What were the limitations of the study?
Although the study aimed to avoid markers of clinical judgement, outcomes related to organ dysfunction may be influenced by clinical processes. For example, the absence of a documented oxygen saturation (an important feature in the model) may serve as a proxy for the absence of respiratory distress.
The most successful model, gradient tree boosting, is complex. This may present challenges for implementation into clinical practice, particularly regarding explainability.
False positive predictions in the setting of a rare outcome present a risk of “alarm fatigue”.
The study was conducted in quaternary and affiliated community EDs. Further work is required to assess generalisability to general ED and pre-hospital settings.
The model training dataset (January 2016 – February 2020) was obtained before the COVID-19 pandemic, while the temporal holdout validation cohort (January 2021 – December 2022) was obtained after its onset; the intervening months (March – December 2020) were excluded.
CASP checklist for Clinical Prediction Rule (CPR) studies
Is the CPR clearly defined?
Yes. The study uses EHR data from the first 4 hours following ED presentation to predict sepsis development over the subsequent 48 hours. Sepsis was defined by suspected infection with a PSC score of two or more or death.
Did the population from which the rule was derived include an appropriate spectrum of patients?
Yes. The study used a multicentre cohort of all undifferentiated ED presentations in children aged 2 months to <18 years. Children with a trauma diagnosis, death in the ED, existing sepsis, or transfer to a non-participating centre were excluded.
Was the rule validated in a different group of patients?
Yes. The study used temporal holdout validation. The model was trained on a dataset from January 1, 2016, to February 29, 2020. The model was validated on a dataset from January 1, 2021, to December 31, 2022.
Were the predictor variables and the outcome evaluated in a blinded fashion?
Not explicitly stated.
Were the predictor variables and the outcome evaluated in the whole initial sample?
No. Rates of data missingness for the Phoenix Sepsis Criteria in the training and temporal validation cohorts were provided in the supplement. Missing data for outcomes (primary and secondary) were not described.
Are the statistical methods used to construct and validate the rule clearly described?
Yes.
Can the performance of the rule be calculated?
Yes. For each model, AUROC, sensitivity, specificity, likelihood ratio, positive predictive value, and number needed to treat were provided.
How precise was the estimate of treatment effect?
The reported performance estimates were precise, with narrow 95% confidence intervals (e.g., AUROC 0.94; 95% CI, 0.93-0.94 for gradient tree boosting).
Did they try to refine the rule with other variables to see whether the precision could be improved or the rule simplified?
Yes. Models were trialled with and without the Emergency Severity Index, with similar performance characteristics. However, the presented models are complex, and the authors highlight a role for future work evaluating simpler models.
Would the prediction rule be reliable and the results interpretable if used for your patient?
Potentially. The authors highlight the model’s complexity as a limitation to implementation, including interpretability and explainability, and note a role for integrating risk outputs into clinical workflows.
Would the rule’s results modify your decision about how to manage the patient or the information you can give them?
Not yet. Prospective validation is required, potentially using simpler models or integrating provider judgement into the process.
What did the authors conclude, and what does it mean for current practice?
The authors developed and validated models with high AUROC and positive likelihood ratios to predict the development of paediatric sepsis using EHR data collected in the emergency setting. The gradient tree boosting model for sepsis (90% sensitivity) achieved a number needed to treat of 59.
Limited positive predictive values highlight the challenges of predicting paediatric sepsis and introduce the potential for alarm fatigue. The authors highlight the role of future studies incorporating clinician judgement to improve predictive models.
A note from the author, Elizabeth Alpern
This work derived and validated predictive models using routinely collected electronic health record data to predict sepsis before it occurs. We think it helps show the power of electronic health record data linked to machine learning methods.
The strengths of this work include the successful development of models capable of early prediction with excellent performance characteristics. However, it also indicates the complexity of predicting rare outcomes.
Next steps are to focus on refining risk thresholds, improving predictive value, and evaluating the model’s performance when embedded in ED workflows, which will require thoughtful implementation.