2/20/2026 | USA | technology | ✓ Verified - arxiv.org

RFEval: Benchmarking Reasoning Faithfulness under Counterfactual Reasoning Intervention in Large Reasoning Models

#RFEval #reasoning faithfulness #stance consistency #causal influence #large reasoning models #counterfactual intervention #benchmark #accuracy vs faithfulness #RL fine‑tuning #AI trustworthiness #ICLR 2026 #arXiv

📌 Key Takeaways

Introduced a formal framework for reasoning faithfulness based on stance consistency and causal influence.
Created RFEval, a benchmark with 7,186 instances over seven tasks using counterfactual output‑level interventions.
Evaluated twelve open‑source large reasoning models, finding 49.7% unfaithful outputs, mostly from stance inconsistency.
Discovered that unfaithfulness clusters in brittle, convergent domains such as math and code and correlates more with post‑training regimes than with model scale.
Showed that adding RL‐style objectives after supervised fine‑tuning can reduce faithfulness even while maintaining accuracy.
Established that accuracy and faithfulness are weakly correlated and statistically insignificant once model and task are controlled for.
Provided a rigorous methodology for auditing LRM reliability, emphasizing the optimization of reasoning structure alongside outcome correctness.

📖 Full Retelling

On 19 Feb 2026, researchers Yunseok Han, Yejoon Lee, and Jaeyoung Do published a paper on arXiv (cs.AI) titled *RFEval: Benchmarking Reasoning Faithfulness under Counterfactual Reasoning Intervention in Large Reasoning Models*. They introduced a formal framework for assessing reasoning faithfulness in large reasoning models, presenting a new benchmark of 7,186 instances across seven tasks that probes faithfulness through counterfactual output‑level interventions. The study evaluates twelve open‑source large reasoning models, finding that nearly half of their outputs are unfaithful, primarily due to stance inconsistency, and that faithfulness is more closely linked to post‑training regimes than to model size. The paper demonstrates that accuracy is neither a reliable proxy for faithfulness nor necessarily correlated with it, underscoring the need to optimize models for both correct outcomes and the structural integrity of their reasoning processes.

🏷️ Themes

AI reliability, reasoning faithfulness, model auditing, benchmarking, counterfactual reasoning interventions, large reasoning models, post‑training effects

Entity Intersection Graph

No entity connections available yet for this article.

Deep Analysis

Why It Matters

RFEval introduces a new way to measure how faithfully large reasoning models link their reasoning to answers, revealing that many models produce plausible but unfaithful rationales. This matters because it shows that accuracy alone cannot guarantee trustworthy AI and highlights the need for new training objectives.

Context & Background

Large reasoning models often produce rationales that appear correct but may not reflect true decision process.
Existing benchmarks focus on accuracy, not faithfulness.
RFEval uses counterfactual interventions to test stance consistency and causal influence.

What Happens Next

Researchers will likely adopt RFEval to audit and improve model training, potentially integrating faithfulness metrics into fine tuning pipelines. The community may also develop new objectives that preserve accuracy while enhancing reasoning integrity.

Frequently Asked Questions

What is reasoning faithfulness?

It is the degree to which a model's stated reasoning actually drives its final answer, measured by stance consistency and causal influence.

How does RFEval test faithfulness?

By applying output level counterfactual interventions that alter the reasoning and observing whether the answer changes accordingly.

What did the study find about faithfulness vs accuracy?

The study found that nearly half of model outputs were unfaithful, and the link between accuracy and faithfulness was weak, meaning accurate answers can still come from flawed reasoning.

}

Original Source

              --> Computer Science > Artificial Intelligence arXiv:2602.17053 [Submitted on 19 Feb 2026] Title: RFEval: Benchmarking Reasoning Faithfulness under Counterfactual Reasoning Intervention in Large Reasoning Models Authors: Yunseok Han , Yejoon Lee , Jaeyoung Do View a PDF of the paper titled RFEval: Benchmarking Reasoning Faithfulness under Counterfactual Reasoning Intervention in Large Reasoning Models, by Yunseok Han and 2 other authors View PDF HTML Abstract: Large Reasoning Models exhibit strong performance, yet often produce rationales that sound plausible but fail to reflect their true decision process, undermining reliability and trust. We introduce a formal framework for reasoning faithfulness, defined by two testable conditions: stance consistency (a coherent stance linking reasoning to answer) and causal influence (the stated reasoning causally drives the answer under output-level interventions), explicitly decoupled from accuracy. To operationalize this, we present RFEval, a benchmark of 7,186 instances across seven tasks that probes faithfulness via controlled, output-level counterfactual interventions. Evaluating twelve open-source LRMs, we find unfaithfulness in 49.7% of outputs, predominantly from stance inconsistency. Failures are concentrated in brittle, convergent domains such as math and code, and correlate more with post-training regimes than with scale: within-family ablations indicate that adding current RL-style objectives on top of supervised fine-tuning can reduce reasoning faithfulness, even when accuracy is maintained. Crucially, accuracy is neither a sufficient nor a reliable proxy for faithfulness: once controlling for model and task, the accuracy-faithfulness link is weak and statistically insignificant. Our work establishes a rigorous methodology for auditing LRM reliability and shows that trustworthy AI requires optimizing not only for correct outcomes but also for the structural integrity of the reasoning process. Our code and dataset ...
            

Read full article at source

Source

arxiv.org