Correcting Human Labels for Rater Effects in AI Evaluation: An Item Response Theory Approach
#AI Evaluation #Human Ratings #Rater Effects #Item Response Theory #Psychometric Modeling #Rasch Model #OpenAI Dataset #Systematic Error
📌 Key Takeaways
- Human evaluations in AI often contain systematic errors that affect model reliability
- The paper introduces psychometric rater models to correct for rating biases
- Researchers demonstrated the approach using the OpenAI summarization dataset
- The method provides more accurate assessments and insights into rater performance
- This approach enables more transparent and principled use of human data in AI development
📖 Full Retelling
Researchers Jodi M. Casabianca and Maggie Beiting-Parrish introduce an approach to correcting human label biases in AI evaluation by integrating psychometric rater models into the AI pipeline, addressing the systematic errors in human judgments that undermine the reliability of AI model training and evaluation. Their paper was posted to arXiv on February 26, 2026, and is scheduled for presentation at the 16th Annual Learning Analytics and Knowledge Conference Workshop on LLM Psychometrics in Bergen, Norway, on April 27, 2026.

The research focuses on two common rater effects, severity and centrality, which distort observed ratings when humans evaluate AI outputs. Severity occurs when a rater consistently scores outputs higher or lower than their true quality; centrality occurs when a rater gravitates toward mid-scale ratings regardless of actual quality. Both lead to unreliable conclusions about AI model performance.

The authors demonstrate how item response theory rater models, particularly the multi-faceted Rasch model, can separate true output quality from rater behavior, yielding more accurate assessments of AI capabilities. Using the OpenAI summarization dataset as an empirical example, they show how adjusting for rater severity produces corrected estimates of summary quality while also providing diagnostic insight into individual raters' performance patterns. Incorporating psychometric modeling into human-in-the-loop evaluation in this way supports more robust, interpretable, and construct-aligned practices for AI development, letting developers base decisions on statistically adjusted scores rather than raw, error-prone ratings.
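The severity-adjustment idea can be illustrated with a minimal facet-style Rasch model for dichotomous ratings: each output has a latent quality, each rater a severity offset, and the probability of a positive rating is a logistic function of quality minus severity. This is a toy sketch under those assumptions, with invented function names and simulated data; it is not the authors' implementation or the dataset from the paper.

```python
import math
import random

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def fit_rasch_with_raters(ratings, n_items, n_raters, lr=0.05, epochs=2000):
    """Joint gradient ascent on the Bernoulli log-likelihood of
    P(rating = 1) = sigmoid(beta_item - alpha_rater).

    ratings: iterable of (item_index, rater_index, score in {0, 1}).
    Returns (quality_estimates, severity_estimates) in logits, with
    severities centred at zero to fix the latent scale.
    """
    beta = [0.0] * n_items    # output quality
    alpha = [0.0] * n_raters  # rater severity (higher = harsher)
    for _ in range(epochs):
        g_beta = [0.0] * n_items
        g_alpha = [0.0] * n_raters
        for i, r, y in ratings:
            resid = y - sigmoid(beta[i] - alpha[r])
            g_beta[i] += resid   # dLL / dbeta_i
            g_alpha[r] -= resid  # dLL / dalpha_r
        beta = [b + lr * g for b, g in zip(beta, g_beta)]
        alpha = [a + lr * g for a, g in zip(alpha, g_alpha)]
        # Identification constraint: shift both facets so severities sum to 0,
        # leaving every beta - alpha difference (and all predictions) unchanged.
        mean_a = sum(alpha) / n_raters
        alpha = [a - mean_a for a in alpha]
        beta = [b - mean_a for b in beta]
    return beta, alpha

# Toy demonstration: two raters score 30 outputs; rater 1 is harsher.
random.seed(0)
true_quality = [random.gauss(0.0, 1.0) for _ in range(30)]
true_severity = [-1.0, 1.0]
data = [(i, r, int(random.random() < sigmoid(true_quality[i] - true_severity[r])))
        for i in range(30) for r in range(2)]

quality_hat, severity_hat = fit_rasch_with_raters(data, 30, 2)
print("estimated rater severities:", severity_hat)
```

Once the severity ordering is recovered, each output's quality estimate is read off `quality_hat` already adjusted for which rater happened to score it, which is the essence of replacing raw ratings with statistically corrected ones.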
🏷️ Themes
AI Evaluation, Human-Rated Data, Psychometric Modeling, Systematic Error Correction
📚 Related People & Topics
Item response theory
Paradigm for the design, analysis, and scoring of tests
In psychometrics, item response theory (IRT, also known as latent trait theory, strong true score theory, or modern mental test theory) is a paradigm for the design, analysis, and scoring of tests, questionnaires, and similar instruments measuring abilities, attitudes, or other variables.
Original Source
Computer Science > Artificial Intelligence

arXiv:2602.22585 [Submitted on 26 Feb 2026]

Title: Correcting Human Labels for Rater Effects in AI Evaluation: An Item Response Theory Approach

Authors: Jodi M. Casabianca, Maggie Beiting-Parrish

Abstract: Human evaluations play a central role in training and assessing AI models, yet these data are rarely treated as measurements subject to systematic error. This paper integrates psychometric rater models into the AI pipeline to improve the reliability and validity of conclusions drawn from human judgments. The paper reviews common rater effects, severity and centrality, that distort observed ratings, and demonstrates how item response theory rater models, particularly the multi-faceted Rasch model, can separate true output quality from rater behavior. Using the OpenAI summarization dataset as an empirical example, we show how adjusting for rater severity produces corrected estimates of summary quality and provides diagnostic insight into rater performance. Incorporating psychometric modeling into human-in-the-loop evaluation offers more principled and transparent use of human data, enabling developers to make decisions based on adjusted scores rather than raw, error-prone ratings. This perspective highlights a path toward more robust, interpretable, and construct-aligned practices for AI development and evaluation.

Comments: 16 pages, 5 figures, 1 table; The 16th Annual Learning Analytics and Knowledge Conference Workshop on LLM Psychometrics, April 27, 2026, Bergen, Norway
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as: arXiv:2602.22585 [cs.AI]
DOI: https://doi.org/10.48550/arXiv.2602.22585