ClinDet-Bench: Beyond Abstention, Evaluating Judgment Determinability of LLMs in Clinical Decision-Making


#Large Language Models #Clinical Decision-Making #Incomplete Information #Judgment Determinability #AI Safety #Medical AI #Benchmark Evaluation

📌 Key Takeaways

  • Researchers developed ClinDet-Bench to evaluate LLMs' judgment determinability in clinical decision-making
  • Current LLMs fail to properly handle incomplete information, making both premature judgments and excessive abstention
  • Existing benchmarks are insufficient for evaluating AI safety in clinical settings
  • The benchmark has applications beyond medicine to other high-stakes domains

📖 Full Retelling

A research team led by Yusuke Watanabe developed ClinDet-Bench, a new benchmark for evaluating large language models in clinical decision-making, submitted to arXiv on February 26, 2026. The work addresses a critical question in AI-assisted healthcare: how LLMs handle incomplete information when making clinical judgments. The benchmark evaluates whether a model can determine when the available information is sufficient for a medical decision, rather than jumping to a premature conclusion or abstaining unnecessarily.

Clinical decision-making often requires judgment under incomplete information: medical experts must assess whether the available data are adequate for reaching a conclusion, since both premature conclusions and excessive abstention can compromise patient safety. The researchers built ClinDet-Bench on clinical scoring systems, decomposing incomplete-information scenarios into determinable and undeterminable conditions. Identifying determinability requires considering every hypothesis about the missing information, including unlikely ones, and verifying whether the conclusion holds across all of them.

The evaluation shows that despite recent advances, current LLMs struggle to identify determinability under incomplete information, producing both premature judgments and excessive abstention. Notably, the same models can correctly explain the underlying clinical scoring knowledge and perform well when complete information is available. This gap suggests that existing benchmarks are insufficient for evaluating the safety of LLMs in clinical settings. ClinDet-Bench provides a framework for evaluating determinability recognition, which can lead to more appropriate abstention, with potential applicability beyond medicine to other high-stakes domains.
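The determinability check described above can be sketched in a few lines of Python. This is an illustrative toy, not the paper's code: the five binary criteria and the cutoff of 3 are hypothetical stand-ins for a real clinical scoring system. A partial record is determinable only if every completion of the missing criteria yields the same conclusion.

```python
from itertools import product

# Hypothetical additive clinical score over binary criteria
# (names and threshold are illustrative, not from the paper).
CRITERIA = ["confusion", "high_urea", "high_resp_rate", "low_bp", "age_over_65"]
THRESHOLD = 3  # score >= 3 -> "high risk"

def conclusion(record):
    """Clinical conclusion for a fully observed record."""
    score = sum(1 for c in CRITERIA if record[c])
    return "high risk" if score >= THRESHOLD else "low risk"

def assess(partial):
    """Return ('determinable', conclusion) if every hypothesis about the
    missing criteria yields the same conclusion, else ('undeterminable', None)."""
    missing = [c for c in CRITERIA if c not in partial]
    outcomes = set()
    # Enumerate all hypotheses about missing values, including unlikely ones.
    for values in product([False, True], repeat=len(missing)):
        record = dict(partial, **dict(zip(missing, values)))
        outcomes.add(conclusion(record))
    if len(outcomes) == 1:
        return "determinable", outcomes.pop()
    return "undeterminable", None

# Three observed positives already meet the cutoff, so the conclusion
# holds no matter what the two unobserved criteria turn out to be.
print(assess({"confusion": True, "high_urea": True, "high_resp_rate": True}))
# → ('determinable', 'high risk')

# One positive, two negatives, two unknown: the score can land on either
# side of the cutoff, so the safe behavior is to abstain.
print(assess({"confusion": True, "high_urea": False, "low_bp": False}))
# → ('undeterminable', None)
```

The paper's point maps onto the two failure modes this sketch exposes: answering in the second case is a premature judgment, while abstaining in the first is excessive abstention.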

🏷️ Themes

Artificial Intelligence, Medical Technology, Decision Science

📚 Related People & Topics

Large language model

Type of machine learning model

A large language model (LLM) is a language model trained with self-supervised machine learning on a vast amount of text, designed for natural language processing tasks, especially language generation.


Original Source
Computer Science > Artificial Intelligence
arXiv:2602.22771 [Submitted on 26 Feb 2026]

Title: ClinDet-Bench: Beyond Abstention, Evaluating Judgment Determinability of LLMs in Clinical Decision-Making
Authors: Yusuke Watanabe, Yohei Kobashi, Takeshi Kojima, Yusuke Iwasawa, Yasushi Okuno, Yutaka Matsuo

Abstract: Clinical decisions are often required under incomplete information. Clinical experts must identify whether available information is sufficient for judgment, as both premature conclusion and unnecessary abstention can compromise patient safety. To evaluate this capability of large language models, we developed ClinDet-Bench, a benchmark based on clinical scoring systems that decomposes incomplete-information scenarios into determinable and undeterminable conditions. Identifying determinability requires considering all hypotheses about missing information, including unlikely ones, and verifying whether the conclusion holds across them. We find that recent LLMs fail to identify determinability under incomplete information, producing both premature judgments and excessive abstention, despite correctly explaining the underlying scoring knowledge and performing well under complete information. These findings suggest that existing benchmarks are insufficient to evaluate the safety of LLMs in clinical settings. ClinDet-Bench provides a framework for evaluating determinability recognition, leading to appropriate abstention, with potential applicability to medicine and other high-stakes domains, and is publicly available.
Comments: 17 pages, 3 figures, 10 tables
Subjects: Artificial Intelligence (cs.AI); Databases (cs.DB)
Cite as: arXiv:2602.22771 [cs.AI] (or arXiv:2602.22771v1 [cs.AI] for this version)
DOI: https://doi.org/10.48550/arXiv.2602.22771

Source

arxiv.org
