Early Risk Stratification of Dosing Errors in Clinical Trials Using Machine Learning
#Machine Learning#Clinical Trials#Dosing Errors#Risk Stratification#XGBoost#ClinicalModernBERT#Probability Calibration#Clinical Research
📌 Key Takeaways
Researchers developed a machine learning framework to predict dosing error risks in clinical trials before they begin
The study used data from 42,112 clinical trials from ClinicalTrials.gov
The late-fusion model combining structured and textual data achieved the best performance with AUC-ROC of 0.862
The framework enables proactive risk management in clinical research through early identification of high-risk trials
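The late-fusion and risk-grouping ideas in the takeaways above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the equal fusion weight, the `cutoffs` values, the category names, and the toy probabilities are all hypothetical, and the two input lists are assumed to be already-calibrated probabilities from a structured-feature model and a text model.

```python
def late_fusion(p_structured, p_text, weight=0.5):
    """Weighted average of two models' per-trial probabilities (assumed fusion rule)."""
    return [weight * a + (1 - weight) * b for a, b in zip(p_structured, p_text)]


def auc_roc(labels, scores):
    """AUC-ROC as the chance that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def risk_category(prob, cutoffs=(0.1, 0.3)):
    """Map a calibrated probability to a risk group (hypothetical cutoffs)."""
    if prob < cutoffs[0]:
        return "low"
    if prob < cutoffs[1]:
        return "medium"
    return "high"


# Toy data: 1 = trial labeled as having an elevated dosing error rate.
labels = [0, 0, 1, 0, 1, 1]
p_structured = [0.05, 0.20, 0.40, 0.50, 0.70, 0.90]  # e.g. from a tabular model
p_text = [0.10, 0.05, 0.35, 0.40, 0.60, 0.80]        # e.g. from a text model

fused = late_fusion(p_structured, p_text)
categories = [risk_category(p) for p in fused]
fused_auc = auc_roc(labels, fused)
```

Grouping trials by `categories` and comparing the share of positive labels in each group is one simple way to check whether the observed rate rises with the predicted risk group, which is the behavior the study reports.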
📖 Full Retelling
Researchers Félicien Hêche, Sohrab Ferdowsi, Anthony Yazdani, Sara Sansaloni-Pastor, and Douglas Teodoro developed a machine learning framework for early risk stratification of clinical trials according to their likelihood of exhibiting high rates of dosing errors. They published their findings on arXiv on February 25, 2026, with the aim of identifying potential safety issues before trials begin. The team constructed a dataset from ClinicalTrials.gov comprising 42,112 clinical trials, extracting structured and semi-structured trial data as well as unstructured protocol-related free text. Each trial was assigned a binary label indicating an elevated dosing error rate, derived from adverse event reports, MedDRA terminology, and Wilson confidence intervals. Three approaches were evaluated: an XGBoost model trained on structured features, a ClinicalModernBERT model using textual data, and a late-fusion model combining both modalities. The late-fusion model achieved the highest AUC-ROC (0.862). Beyond discrimination, post-hoc probability calibration was essential for translating model outputs into reliable and interpretable risk categories: the proportion of trials labeled as having excessively high dosing error rates increased monotonically across higher predicted risk groups. The research introduces a reproducible and scalable machine learning framework for early, trial-level risk stratification of clinical trials at risk of high dosing error rates, supporting proactive, risk-based quality management in clinical research.
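To make the labeling step more concrete, here is a minimal sketch of a Wilson score interval for a binomial proportion. The summary does not give the exact labeling rule, so the decision function below (flag a trial when the interval's lower bound exceeds a reference rate) is an assumption for illustration only; `has_elevated_rate`, `reference_rate`, and the example counts are hypothetical.

```python
import math


def wilson_interval(events: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for the proportion events / n (z = 1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= events <= n:
        raise ValueError("events must be between 0 and n")
    p_hat = events / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


def has_elevated_rate(events: int, n: int, reference_rate: float) -> bool:
    """Assumed rule: label a trial positive only if the Wilson lower bound
    is above the reference rate, so small trials are not flagged by chance."""
    lower, _ = wilson_interval(events, n)
    return lower > reference_rate


# Hypothetical example: 5 dosing-error reports among 10 participants.
lower, upper = wilson_interval(5, 10)
```

Compared with the raw proportion, the interval widens for small trials, so a rule based on its lower bound is more conservative when few participants are enrolled.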
🏷️ Themes
Machine Learning, Clinical Trials, Risk Stratification, Medical Safety
arXiv:2602.22285 (Computer Science > Machine Learning), submitted on 25 Feb 2026
Title: Early Risk Stratification of Dosing Errors in Clinical Trials Using Machine Learning
Authors: Félicien Hêche, Sohrab Ferdowsi, Anthony Yazdani, Sara Sansaloni-Pastor, Douglas Teodoro
Abstract:
Objective: The objective of this study is to develop a machine learning-based framework for early risk stratification of clinical trials according to their likelihood of exhibiting a high rate of dosing errors, using information available prior to trial initiation.
Materials and Methods: We constructed a dataset from ClinicalTrials.gov comprising 42,112 CTs. Structured and semi-structured trial data and unstructured protocol-related free-text data were extracted. CTs were assigned binary labels indicating elevated dosing error rate, derived from adverse event reports, MedDRA terminology, and Wilson confidence intervals. We evaluated an XGBoost model trained on structured features, a ClinicalModernBERT model using textual data, and a simple late-fusion model combining both modalities. Post-hoc probability calibration was applied to enable interpretable, trial-level risk stratification.
Results: The late-fusion model achieved the highest AUC-ROC (0.862). Beyond discrimination, calibrated outputs enabled robust stratification of CTs into predefined risk categories. The proportion of trials labeled as having an excessively high dosing error rate increased monotonically across higher predicted risk groups and aligned with the corresponding predicted probability ranges.
Discussion: These findings indicate that dosing error risk can be anticipated at the trial level using pre-initiation information. Probability calibration was essential for translating model outputs into reliable and interpretable risk categories, while simple...