Robustness and Reasoning Fidelity of Large Language Models in Long-Context Code Question Answering
#Large language models#LLMs#Code QA#Long‑context#Robustness#Reasoning fidelity#Benchmark#LongCodeBench#COBOL#Java#Multiple-choice#Open-ended#Distractors#Needle‑in‑a‑haystack
📌 Key Takeaways
Large language models (LLMs) are increasingly employed for software engineering tasks that require reasoning over extensive code contexts, yet their resilience to varying inputs remains poorly understood.
The authors conduct a controlled ablation study, systematically varying answer format, distractors, and context scale to probe LLM sensitivity.
LongCodeBench is expanded to include new COBOL and Java question–answer sets, providing a more diverse evaluation corpus.
Models are evaluated across three experimental settings: shuffled multiple‑choice options, open‑ended queries, and needle‑in‑a‑haystack contexts containing both relevant and irrelevant information.
Results demonstrate significant performance degradation in shuffled multiple‑choice and open‑ended scenarios, and brittle behavior when irrelevant cues are present.
The study highlights the shortcomings of current long‑context evaluation protocols and offers an extended benchmark for assessing code‑reasoning capabilities in both legacy and modern programming languages.
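The shuffled multiple-choice ablation summarized above can be sketched as a simple consistency probe: reshuffle the answer options several times and measure how often a model still picks the correct one. This is a hypothetical illustration, not the paper's released code; the helper names (`shuffle_options`, `consistency`) and the `predict` callback standing in for a model call are assumptions.

```python
import random

def shuffle_options(options, answer_index, seed=0):
    """Return a shuffled copy of the options and the new index of the
    correct answer. Hypothetical helper illustrating a shuffled
    multiple-choice robustness ablation."""
    rng = random.Random(seed)
    order = list(range(len(options)))
    rng.shuffle(order)
    shuffled = [options[i] for i in order]
    return shuffled, order.index(answer_index)

def consistency(predict, question, options, answer_index, n_trials=5):
    """Fraction of shuffles on which the model stays correct.
    `predict(question, options)` is a stand-in for a model call that
    returns the index of the chosen option."""
    correct = 0
    for seed in range(n_trials):
        shuffled, gold = shuffle_options(options, answer_index, seed)
        if predict(question, shuffled) == gold:
            correct += 1
    return correct / n_trials
```

A robust model would score near 1.0 under this probe; the paper's finding is that real models degrade noticeably when option order changes.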
📖 Full Retelling
Kishan Maharaj, Nandakishore Menon, Ashita Saxena, and Srikanth Tamilselvam published a systematic study titled "Robustness and Reasoning Fidelity of Large Language Models in Long-Context Code Question Answering" on arXiv on 19 Feb 2026. The authors evaluate how well state‑of‑the‑art large language models answer questions about long code contexts, and why current benchmarks may not reflect real‑world robustness. They extend the LongCodeBench Python dataset with new COBOL and Java question–answer pairs, then test models under shuffled multiple‑choice, open‑ended, and needle‑in‑a‑haystack settings, ultimately revealing substantial performance drops and brittleness in the presence of irrelevant information.
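A needle‑in‑a‑haystack context of the kind described above can be assembled by embedding one relevant code snippet (the "needle") among irrelevant distractor snippets at a controlled depth. The `build_haystack` helper and its parameters below are hypothetical, shown only to make the setting concrete; the paper's actual construction may differ.

```python
import random

def build_haystack(needle, distractors, position=0.5, seed=0):
    """Embed a relevant snippet (`needle`) among irrelevant ones.

    `position` is the relative depth (0.0 = start, 1.0 = end) at which
    the needle is inserted into the shuffled distractor pool.
    Hypothetical sketch of a needle-in-a-haystack context builder.
    """
    rng = random.Random(seed)
    pool = list(distractors)
    rng.shuffle(pool)
    idx = int(position * len(pool))
    pool.insert(idx, needle)
    return "\n\n".join(pool)
```

Sweeping `position` and the number of distractors lets an evaluator measure how accuracy varies with needle depth and context scale, which is the kind of sensitivity the study probes.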
🏷️ Themes
Software engineering, Artificial intelligence, Code question answering, Large language model evaluation, Robustness and reliability, Benchmark creation, Long‑context reasoning
Deep Analysis
Why It Matters
The study reveals that large language models struggle with long-code reasoning when faced with shuffled options, open-ended queries, or irrelevant context, highlighting gaps in current benchmarks.
Context & Background
Large language models are increasingly used for software engineering tasks that require code reasoning.
Existing evaluations often ignore the impact of distractors and context scale on model performance.
The paper extends the LongCodeBench dataset to include COBOL and Java, providing a more comprehensive benchmark.
What Happens Next
Future research may focus on developing more robust evaluation protocols and improving model architectures to handle distractors and large contexts. The benchmark could be adopted by the community to guide model development.
Frequently Asked Questions
What is the main finding of the paper?
Models show significant performance drops when multiple-choice options are shuffled or when presented with open-ended questions, indicating brittleness to input format changes.
Which programming languages are covered in the extended benchmark?
The extended benchmark includes Python, COBOL, and Java question-answer sets.
How can developers use these findings?
By incorporating the new benchmark into their testing pipelines to evaluate and improve code reasoning capabilities of their models.
Original Source
Computer Science > Software Engineering
arXiv:2602.17183 [Submitted on 19 Feb 2026]
Title: Robustness and Reasoning Fidelity of Large Language Models in Long-Context Code Question Answering
Authors: Kishan Maharaj, Nandakishore Menon, Ashita Saxena, Srikanth Tamilselvam
Abstract: Large language models increasingly assist software engineering tasks that require reasoning over long code contexts, yet their robustness under varying input conditions remains unclear. We conduct a systematic study of long-context code question answering using controlled ablations that test sensitivity to answer format, distractors, and context scale. Extending LongCodeBench Python dataset with new COBOL and Java question-answer sets, we evaluate state-of-the-art models under three settings: shuffled multiple-choice options, open-ended questions and needle-in-a-haystack contexts containing relevant and adversarially irrelevant information. Results show substantial performance drops in both shuffled multiple-choice options and open-ended questions, and brittle behavior in the presence of irrelevant cues. Our findings highlight limitations of current long-context evaluations and provide a broader benchmark for assessing code reasoning in both legacy and modern systems.
Comments: 11 pages, 4 Figures, 5 Tables, Work in Progress
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Cite as: arXiv:2602.17183 [cs.SE] (or arXiv:2602.17183v1 [cs.SE] for this version), https://doi.org/10.48550/arXiv.2602.17183 (arXiv-issued DOI via DataCite, pending registration)
Submission history: [v1] from Kishan Maharaj, Thu, 19 Feb 2026 09:05:03 UTC (2,066 KB)