Catching rationalization in the act: detecting motivated reasoning before and after CoT via activation probing
#motivated reasoning #activation probing #chain-of-thought #AI rationalization #bias detection #model interpretability #neural networks
๐ Key Takeaways
- Researchers developed a method to detect motivated reasoning in AI models using activation probing.
- The technique identifies rationalization both before and after chain-of-thought (CoT) reasoning processes.
- It aims to uncover biases where models justify predetermined conclusions rather than reasoning objectively.
- Activation probing provides insights into internal model states to flag instances of motivated reasoning.
๐ Full Retelling
arXiv:2603.17199v1 Announce Type: cross
Abstract: Large language models (LLMs) can produce chains of thought (CoT) that do not accurately reflect the actual factors driving their answers. In multiple-choice settings with an injected hint favoring a particular option, models may shift their final answer toward the hinted option and produce a CoT that rationalizes the response without acknowledging the hint - an instance of motivated reasoning. We study this phenomenon across multiple LLM familie
๐ท๏ธ Themes
AI Bias, Reasoning Detection
Entity Intersection Graph
No entity connections available yet for this article.
Original Source
arXiv:2603.17199v1 Announce Type: cross
Abstract: Large language models (LLMs) can produce chains of thought (CoT) that do not accurately reflect the actual factors driving their answers. In multiple-choice settings with an injected hint favoring a particular option, models may shift their final answer toward the hinted option and produce a CoT that rationalizes the response without acknowledging the hint - an instance of motivated reasoning. We study this phenomenon across multiple LLM familie
Read full article at source