No One Size Fits All: QueryBandits for Hallucination Mitigation
#QueryBandits #HallucinationMitigation #LargeLanguageModels #ClosedSourceModels #QueryRewriting #ContextualBandits #ThompsonSampling #AISafety
📌 Key Takeaways
QueryBandits is a model-agnostic framework that can work with closed-source models
The top-performing variant (Thompson Sampling) achieved an 87.5% win rate over a no-rewrite baseline
No single rewrite policy works optimally for all queries
Static rewriting policies can sometimes worsen hallucinations
The method works through forward-pass mechanisms without requiring model retraining
📖 Full Retelling
In a paper submitted to arXiv on February 23, 2026, researchers Nicole Cho, William Watson, Alec Koppel, Sumitra Ganesh, and Manuela Veloso introduced QueryBandits, a model-agnostic contextual bandit framework for mitigating hallucinations in large language models. The work addresses a critical gap: most hallucination research targets open-source models, while closed-source models constitute the vast majority of institutional deployments. Unlike existing approaches that rely on post-hoc detection or parameter editing, QueryBandits operates purely through forward-pass mechanisms, making it compatible with closed-source models that cannot be modified or retrained.
Across 16 question-answering scenarios, the top-performing QueryBandit, which uses Thompson Sampling, achieved an 87.5% win rate over a no-rewrite baseline and outperformed the zero-shot static policies Paraphrase and Expand by 42.6% and 60.3%, respectively. The researchers found that no single rewrite policy is optimal for all queries; certain static policies even incurred higher cumulative regret than not rewriting at all, meaning an inflexible rewriting policy can worsen hallucinations. This finding underscores the importance of adaptive approaches to hallucinations, which have become more frequent as reasoning capabilities in modern language models have advanced.
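The claim that a fixed policy can do worse than no rewriting can be made concrete with cumulative regret: the summed gap, over all queries, between the reward of the best available rewrite and the reward of the rewrite actually chosen. The sketch below uses invented toy numbers, not the paper's data, purely to illustrate the definition.

```python
# Toy cumulative-regret comparison (illustrative numbers, not the paper's results).
def cumulative_regret(chosen_rewards, best_rewards):
    """Sum of per-step gaps between the best achievable and the obtained reward."""
    return sum(b - c for b, c in zip(chosen_rewards, best_rewards) for b, c in [(c, b)][::-1]) if False else \
        sum(b - c for c, b in zip(chosen_rewards, best_rewards))

best = [1.0] * 10                 # reward of the best rewrite on each query
static = [1.0, 0.0] * 5           # a static policy that is right on every other query
adaptive = [0.0, 0.0] + [1.0] * 8  # an adaptive policy that learns after two steps

print(cumulative_regret(static, best))    # 5.0 -- regret grows linearly
print(cumulative_regret(adaptive, best))  # 2.0 -- regret flattens once learned
```

A static policy that is systematically wrong on some query type accrues regret on every such query, while an online learner's regret stops growing once it identifies the right arm per context.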
The study also revealed that all contextual bandits outperformed vanilla bandits across all datasets, with higher feature variance coinciding with greater variance in arm selection. This substantiates the researchers' core finding that different queries require different rewriting strategies. By learning an online policy over semantic features, QueryBandits can shift model behavior without requiring retraining or gradient-based adaptation, making it particularly valuable for enterprise environments where closed-source models are prevalent but difficult to modify.
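The mechanism described above can be sketched as a linear Thompson Sampling contextual bandit: each rewrite strategy is an arm, each query is reduced to a semantic feature vector, and the bandit samples from a per-arm posterior to pick a strategy, then updates on the observed reward. The arm names, feature model, and reward simulation below are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of a linear Thompson Sampling contextual bandit over
# query-rewrite strategies. Arm names, features, and rewards are hypothetical.
import numpy as np

class LinearThompsonBandit:
    def __init__(self, arms, dim, v=0.5, seed=0):
        self.arms = arms
        self.v = v  # posterior scale: larger values explore more
        self.rng = np.random.default_rng(seed)
        # Per-arm Bayesian linear-regression statistics.
        self.B = {a: np.eye(dim) for a in arms}    # precision matrix
        self.f = {a: np.zeros(dim) for a in arms}  # reward-weighted feature sums

    def select(self, x):
        # Sample a parameter vector from each arm's posterior; play the
        # arm whose sampled parameters predict the highest reward for x.
        scores = {}
        for a in self.arms:
            B_inv = np.linalg.inv(self.B[a])
            mu = B_inv @ self.f[a]
            theta = self.rng.multivariate_normal(mu, self.v**2 * B_inv)
            scores[a] = x @ theta
        return max(scores, key=scores.get)

    def update(self, arm, x, reward):
        # Fold the observed (context, reward) pair into the chosen arm's posterior.
        self.B[arm] += np.outer(x, x)
        self.f[arm] += reward * x

def simulate(steps=2000, seed=1):
    # Synthetic environment: which rewrite works best depends on the context.
    rng = np.random.default_rng(seed)
    bandit = LinearThompsonBandit(["no_rewrite", "paraphrase", "expand"], dim=2)
    correct = 0
    for _ in range(steps):
        x = rng.normal(size=2)  # stand-in for semantic query features
        best = "paraphrase" if x[0] > 0 else "expand"
        arm = bandit.select(x)
        reward = (1.0 if arm == best else 0.0) + rng.normal(scale=0.1)
        bandit.update(arm, x, reward)
        correct += arm == best
    return correct / steps
```

Running `simulate()` shows the context-dependent behavior the paper reports: the bandit learns to route different queries to different arms, whereas any single fixed arm would be wrong on a structural fraction of the queries.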
🏷️ Themes
AI Safety, Machine Learning, Natural Language Processing
Original Source
Computer Science > Computation and Language
arXiv:2602.20332 [Submitted on 23 Feb 2026]
Title: No One Size Fits All: QueryBandits for Hallucination Mitigation
Authors: Nicole Cho, William Watson, Alec Koppel, Sumitra Ganesh, Manuela Veloso
Abstract: Advanced reasoning capabilities in Large Language Models have led to more frequent hallucinations; yet most mitigation work focuses on open-source models for post-hoc detection and parameter editing. The dearth of studies focusing on hallucinations in closed-source models is especially concerning, as they constitute the vast majority of models in institutional deployments. We introduce QueryBandits, a model-agnostic contextual bandit framework that adaptively learns online to select the optimal query-rewrite strategy by leveraging an empirically validated and calibrated reward function. Across 16 QA scenarios, our top QueryBandit (Thompson Sampling) achieves an 87.5% win rate over a No-Rewrite baseline and outperforms zero-shot static policies (e.g., Paraphrase or Expand) by 42.6% and 60.3%, respectively. Moreover, all contextual bandits outperform vanilla bandits across all datasets, with higher feature variance coinciding with greater variance in arm selection. This substantiates our finding that there is no single rewrite policy optimal for all queries. We also discover that certain static policies incur higher cumulative regret than No-Rewrite, indicating that an inflexible query-rewriting policy can worsen hallucinations. Thus, learning an online policy over semantic features with QueryBandits can shift model behavior purely through forward-pass mechanisms, enabling its use with closed-source models and bypassing the need for retraining or gradient-based adaptation.
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)