X-RAY: Mapping LLM Reasoning Capability via Formalized and Calibrated Probes
#X-RAY #LLM #reasoning #evaluation #probes #calibration #formalized #capability-mapping
Key Takeaways
- X-RAY is an explainable reasoning analysis system that maps LLM reasoning capability using formally verified, calibrated probes.
- Probes are generated with formal tools under controlled structural variations, isolating the contribution of incremental structural information.
- Evaluating state-of-the-art LLMs on mathematics, physics, and chemistry problems reveals a systematic asymmetry: models tolerate constraint refinement but degrade sharply under solution-space restructuring.
- Calibrated probes differentiate models that appear indistinguishable on standard benchmarks, and the framework is contamination-free.
Full Retelling
arXiv:2603.05290v1
Abstract: Large language models (LLMs) achieve promising performance, yet their ability to reason remains poorly understood. Existing evaluations largely emphasize task-level accuracy, often conflating pattern matching with reasoning capability. We present X-RAY, an explainable reasoning analysis system that maps the LLM reasoning capability using calibrated, formally verified probes. We model reasoning capability as a function of extractable structural information, operationalized through formal properties such as constraint interaction, reasoning depth, and solution-space geometry. X-RAY generates probes via formal tools with controlled structural variations, enabling precise isolation of incremental structural information through formal calibration and verification. We evaluate state-of-the-art LLMs on problems ranging from junior-level to advanced in mathematics, physics, and chemistry. Our analysis reveals a systematic asymmetry in LLM reasoning: models are relatively robust to constraint refinement, where additional conditions shrink an existing solution space, but degrade sharply under solution-space restructuring, where modifications alter the underlying structural form of the solution manifold. Moreover, calibrated formal probes differentiate models that appear indistinguishable on standard benchmarks and reveal failure modes that are structurally interpretable rather than opaque. Beyond evaluation, our framework is contamination-free and supports the training and testing of reasoning models.
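The distinction the abstract draws between constraint refinement and solution-space restructuring can be made concrete with a toy constraint-satisfaction probe. The following sketch is a hypothetical illustration for intuition only, not the paper's probe generator or formal tooling: the base problem, the refinement, and the restructuring variant are all invented here.

```python
from itertools import product

def solutions(pred, lo=0, hi=10):
    """Enumerate integer pairs (x, y) in [lo, hi]^2 satisfying pred."""
    return {(x, y) for x, y in product(range(lo, hi + 1), repeat=2) if pred(x, y)}

# Base probe: all integer solutions of x + y == 10 on a bounded grid.
base = solutions(lambda x, y: x + y == 10)

# Constraint refinement: an extra condition shrinks the existing solution
# space, so the refined solution set is a subset of the base set.
refined = solutions(lambda x, y: x + y == 10 and x <= 4)

# Solution-space restructuring: changing the structural form of the problem
# (sum -> product) yields a solution set that is not a subset of the base.
restructured = solutions(lambda x, y: x * y == 10)

print(refined <= base)        # refinement preserves membership in the base set
print(restructured <= base)   # restructuring does not
```

Under this reading, the paper's finding is that models handle the first kind of edit (shrinking a known solution set) far better than the second (changing the shape of the set itself), even when both edits look like small modifications to the problem text.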
π·οΈ Themes
AI Evaluation, Reasoning Capabilities
Related People & Topics
Large language model
Type of machine learning model
A large language model (LLM) is a language model trained with self-supervised machine learning on a vast amount of text, designed for natural language processing tasks, especially language generation. The largest and most capable LLMs are generative pre-trained transformers (GPTs) that provide the c...
Entity Intersection Graph
Connections for Large language model:
- Artificial intelligence (3 shared)
- Reinforcement learning (3 shared)
- Educational technology (2 shared)
- Benchmark (2 shared)
- OpenAI (2 shared)
Original Source
Computer Science > Artificial Intelligence
arXiv:2603.05290 [Submitted on 5 Mar 2026]
Title: X-RAY: Mapping LLM Reasoning Capability via Formalized and Calibrated Probes
Authors: Gao Tianxi, Cai Yufan, Yuan Yusi, Dong Jin Song
Abstract: Large language models achieve promising performance, yet their ability to reason remains poorly understood. Existing evaluations largely emphasize task-level accuracy, often conflating pattern matching with reasoning capability. We present X-RAY, an explainable reasoning analysis system that maps the LLM reasoning capability using calibrated, formally verified probes. We model reasoning capability as a function of extractable structural information, operationalized through formal properties such as constraint interaction, reasoning depth, and solution-space geometry. X-RAY generates probes via formal tools with controlled structural variations, enabling precise isolation of incremental structural information through formal calibration and verification. We evaluate state-of-the-art LLMs on problems ranging from junior-level to advanced in mathematics, physics, and chemistry. Our analysis reveals a systematic asymmetry in LLM reasoning: models are relatively robust to constraint refinement, where additional conditions shrink an existing solution space, but degrade sharply under solution-space restructuring, where modifications alter the underlying structural form of the solution manifold. Moreover, calibrated formal probes differentiate models that appear indistinguishable on standard benchmarks and reveal failure modes that are structurally interpretable rather than opaque. Beyond evaluation, our framework is contamination-free and supports the training and testing of reasoning models.
Subjects: Artificial Intelligence (cs.AI)
Cite as: arXiv:2603.05290 [cs.AI] (or arXiv:2603.05290v1 [cs.AI] for this version)