Imputation of Unknown Missingness in Sparse Electronic Health Records
#Electronic Health Records #Missing Data #Machine Learning #Data Imputation #Healthcare AI #Unknown Missingness #Transformer Network #Medical Data Processing
📌 Key Takeaways
Researchers developed a transformer-based neural network to handle unknown missingness in EHRs
The algorithm targets cases where an absent entry may mean either that an event never occurred or that it occurred but was not recorded
The method denoised medical codes in a real EHR dataset more accurately than existing imputation approaches
It achieved a statistically significant improvement over all existing baselines in predicting hospital readmission
📖 Full Retelling
Researchers Jun Han, Josue Nassar, Sanjit Singh Batra, Aldo Cordova-Palomera, Vijay Nori, and Robert E. Tillman introduced a machine learning algorithm for handling unknown missingness in electronic health records (EHRs) in a paper submitted to arXiv on February 24, 2026. Most existing imputation techniques target 'known unknowns,' such as a lab test result that is visibly missing or unavailable. They do not explicitly address cases where it is unclear whether anything is missing at all, which the paper places under the paradigm of 'unknown unknowns.' For instance, a missing diagnosis code could mean either that the patient was never diagnosed with the condition or that a diagnosis was made but not shared by a provider. To address this, the team developed a general-purpose, transformer-based denoising neural network for binary EHR data whose output is thresholded adaptively to recover values where the model predicts data are missing. On a real EHR dataset, the approach denoised medical codes more accurately than existing imputation approaches and improved performance on downstream tasks that used the denoised data. In predicting hospital readmission from EHRs, it achieved a statistically significant improvement over all existing baselines.
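The idea of thresholding a denoiser's output adaptively can be illustrated with a small sketch. This summary does not state the paper's actual thresholding rule, so the rule below is purely an assumption for illustration: each record's threshold is set to the lowest score the model assigns to a code that is already recorded, bounded below by a hypothetical `floor`. The function names, the `floor` parameter, and the example scores are all invented here; the scores stand in for the output of some trained denoising model.

```python
def adaptive_threshold(observed, probs, floor=0.5):
    """Pick a per-record threshold (hypothetical rule, not the paper's).

    observed: list of 0/1 flags, 1 = code is recorded in the EHR.
    probs:    model scores in [0, 1] for each code being truly present.
    The threshold is the lowest score among recorded codes, never below `floor`.
    """
    recorded_scores = [p for x, p in zip(observed, probs) if x == 1]
    if not recorded_scores:
        return floor
    return max(floor, min(recorded_scores))


def recover(observed, probs, floor=0.5):
    """Keep every recorded code and add unrecorded codes scoring at or above the threshold."""
    threshold = adaptive_threshold(observed, probs, floor)
    return [1 if x == 1 or p >= threshold else 0 for x, p in zip(observed, probs)]


# Made-up record with five medical codes and made-up model scores.
observed = [1, 0, 1, 0, 0]
probs = [0.92, 0.81, 0.74, 0.10, 0.55]
recovered = recover(observed, probs)
print(recovered)  # the second code is flagged as likely unrecorded
```

Under this rule a recorded code is never removed; only zeros can be flipped to ones, which matches the stated goal of recovering values that are predicted to be missing.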
🏷️ Themes
Machine Learning, Healthcare Technology, Data Imputation
arXiv:2602.20442 [Submitted on 24 Feb 2026]
Title: Imputation of Unknown Missingness in Sparse Electronic Health Records
Authors: Jun Han, Josue Nassar, Sanjit Singh Batra, Aldo Cordova-Palomera, Vijay Nori, Robert E. Tillman
Abstract: Machine learning holds great promise for advancing the field of medicine, with electronic health records serving as a primary data source. However, EHRs are often sparse and contain missing data due to various challenges and limitations in data collection and sharing between healthcare providers. Existing techniques for imputing missing values predominantly focus on known unknowns, such as missing or unavailable values of lab test results; most do not explicitly address situations where it is difficult to distinguish what is missing. For instance, a missing diagnosis code in an EHR could signify either that the patient has not been diagnosed with the condition or that a diagnosis was made, but not shared by a provider. Such situations fall into the paradigm of unknown unknowns. To address this challenge, we develop a general purpose algorithm for denoising data to recover unknown missing values in binary EHRs. We design a transformer-based denoising neural network where the output is thresholded adaptively to recover values in cases where we predict data are missing. Our results demonstrate improved accuracy in denoising medical codes within a real EHR dataset compared to existing imputation approaches and leads to increased performance on downstream tasks using the denoised data. In particular, when applying our method to a real world application, predicting hospital readmission from EHRs, our method achieves statistically significant improvement over all existing baselines.
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
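The abstract reports accuracy in denoising medical codes, but this summary does not describe how that accuracy was measured. One common way to evaluate recovery of unrecorded binary values, offered here only as an assumed setup and not as the authors' protocol, is to hide some recorded codes at random and check how many a method restores. The helper names `drop_codes` and `recovery_scores`, the drop rate, and the toy records are all hypothetical.

```python
import random


def drop_codes(record, drop_rate, rng):
    """Simulate unrecorded codes: each 1 becomes 0 with probability `drop_rate`.

    Zeros are left untouched, so the corruption only ever hides information.
    """
    return [0 if x == 1 and rng.random() < drop_rate else x for x in record]


def recovery_scores(clean, noisy, recovered):
    """Precision and recall, counted only over positions that are 0 in `noisy`."""
    tp = fp = fn = 0
    for c, n, r in zip(clean, noisy, recovered):
        if n == 1:
            continue  # already recorded, nothing to recover here
        if r == 1 and c == 1:
            tp += 1
        elif r == 1 and c == 0:
            fp += 1
        elif r == 0 and c == 1:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy example: `clean` is the full record, `noisy` has two codes hidden,
# and `recovered` is what some imputation method returned.
clean = [1, 1, 0, 1, 0]
noisy = [1, 0, 0, 0, 0]
recovered = [1, 1, 1, 0, 0]
precision, recall = recovery_scores(clean, noisy, recovered)

# Random corruption with a seeded generator for repeatability.
rng = random.Random(0)
corrupted = drop_codes(clean, drop_rate=0.3, rng=rng)
```

Scoring only the positions that are zero in the corrupted record keeps already-recorded codes from inflating the result, since any method can trivially copy them through.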