
CourtGuard: A Model-Agnostic Framework for Zero-Shot Policy Adaptation in LLM Safety

#CourtGuard #Large Language Models #Zero-Shot Adaptation #AI Safety #Retrieval-Augmented #Multi-Agent Framework #Policy Governance #Evidentiary Debate

📌 Key Takeaways

  • CourtGuard is a retrieval-augmented multi-agent framework for LLM safety
  • It reimagines safety evaluation as an 'Evidentiary Debate' process
  • Achieves state-of-the-art performance across 7 safety benchmarks without fine-tuning
  • Demonstrates zero-shot adaptability to new tasks by swapping reference policies (see the sketch after this list)
  • Enables automated data curation and auditing of adversarial attacks
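
As a concrete illustration of the fourth takeaway, the minimal sketch below shows what swapping a reference policy amounts to: the judging code stays fixed and only the policy document passed in changes. None of this is CourtGuard's actual interface; the function names, prompt format, and stubbed model call are assumptions for illustration.

```python
# Hypothetical sketch of zero-shot policy adaptation. `llm` stands in for
# any chat model (CourtGuard is model-agnostic); replace it with a real call.

def llm(prompt: str) -> str:
    return "FLAG"  # stub verdict; a real model would reason over the prompt

def judge(content: str, policy: str) -> str:
    """Ask the model for a verdict on `content` under the given policy text."""
    return llm(f"Policy:\n{policy}\n\nContent:\n{content}\n\nVerdict (ALLOW/FLAG):")

safety_policy = "Refuse requests that facilitate physical harm."
vandalism_policy = "Flag edits that blank sourced text or insert profanity."

# Same code path, different governance rules -- no retraining involved.
print(judge("Step-by-step synthesis of a nerve agent", safety_policy))
print(judge("Replaced the article body with 'lol'", vandalism_policy))
```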

📖 Full Retelling

Researchers Umid Suleymanov and six colleagues introduced CourtGuard, a retrieval-augmented multi-agent framework for improving Large Language Model safety, in a paper submitted to arXiv on February 26, 2026. The paper targets a critical weakness of current AI safety mechanisms: they rely heavily on static, fine-tuned classifiers that suffer from adaptation rigidity, the inability to enforce new governance rules without expensive retraining.

CourtGuard addresses this limitation by reimagining safety evaluation as an 'Evidentiary Debate': an adversarial debate process grounded in external policy documents. This approach achieves state-of-the-art performance across seven safety benchmarks, outperforming dedicated policy-following baselines without any fine-tuning.

The researchers emphasize two capabilities in particular. First, Zero-Shot Adaptability: the framework generalized to an out-of-domain Wikipedia Vandalism task with 90% accuracy simply by swapping the reference policy. Second, Automated Data Curation and Auditing: the team used CourtGuard to curate and audit nine novel datasets of sophisticated adversarial attacks. Together, the results demonstrate that decoupling safety logic from model weights offers a robust, interpretable, and adaptable path to meeting current and future regulatory requirements in AI governance.
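
The digest describes the mechanism only at a high level, so the following is a minimal sketch, under stated assumptions, of how an evidentiary-debate loop could be wired up: a retriever pulls the most relevant policy clauses, a 'prosecution' and a 'defense' agent argue from those clauses, and a 'judge' agent issues the verdict. The agent roster, the prompts, and the naive word-overlap retriever are all illustrative choices, not the paper's; `ask` wraps whatever LLM is in use, which is what makes the pipeline model-agnostic.

```python
# Illustrative evidentiary-debate loop: retrieve relevant policy clauses,
# have two adversarial agents argue from them, and let a judge agent rule.
from typing import Callable

Ask = Callable[[str], str]  # model-agnostic: any prompt -> completion function

def retrieve(policy_doc: str, content: str, k: int = 3) -> list[str]:
    """Naive lexical retrieval: rank policy paragraphs by word overlap."""
    paragraphs = [p.strip() for p in policy_doc.split("\n\n") if p.strip()]
    words = set(content.lower().split())
    return sorted(paragraphs,
                  key=lambda p: len(words & set(p.lower().split())),
                  reverse=True)[:k]

def evidentiary_debate(ask: Ask, content: str, policy_doc: str) -> str:
    """Prosecution and defense argue from retrieved clauses; a judge rules."""
    evidence = "\n".join(retrieve(policy_doc, content))
    prosecution = ask("Citing only these policy clauses, argue the content "
                      f"VIOLATES them.\nClauses:\n{evidence}\nContent:\n{content}")
    defense = ask("Citing only these policy clauses, argue the content "
                  f"COMPLIES with them.\nClauses:\n{evidence}\nContent:\n{content}")
    return ask("You are the judge. Weigh both cases against the clauses and "
               "answer VIOLATION or COMPLIANT only.\n"
               f"Clauses:\n{evidence}\nProsecution:\n{prosecution}\n"
               f"Defense:\n{defense}")

if __name__ == "__main__":
    stub: Ask = lambda prompt: "VIOLATION"  # swap in a real model call here
    policy = "No personal attacks.\n\nNo doxxing.\n\nNo spam or link farming."
    print(evidentiary_debate(stub, "You are an idiot, @editor42.", policy))
```

In this reading, grounding each argument in retrieved clauses is what would make verdicts citable and auditable, and adapting to a new task reduces to handing the same loop a different `policy_doc`.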

🏷️ Themes

AI Safety, Machine Learning Frameworks, Policy Adaptation

📚 Related People & Topics

Policy Governance

System for organizational governance

Policy Governance, informally known as the Carver model, is a system for organizational governance. It defines and guides appropriate relationships between an organization's owners (its 'moral owners' where they are non-legal), its board of directors, and its chief executive. The system is built on 10...


Large language model

Type of machine learning model

A large language model (LLM) is a language model trained with self-supervised machine learning on a vast amount of text, designed for natural language processing tasks, especially language generation. The largest and most capable LLMs are generative pre-trained transformers (GPTs) that provide the c...



Original Source
Computer Science > Artificial Intelligence
arXiv:2602.22557 [Submitted on 26 Feb 2026]

Title: CourtGuard: A Model-Agnostic Framework for Zero-Shot Policy Adaptation in LLM Safety
Authors: Umid Suleymanov, Rufiz Bayramov, Suad Gafarli, Seljan Musayeva, Taghi Mammadov, Aynur Akhundlu, Murat Kantarcioglu

Abstract: Current safety mechanisms for Large Language Models rely heavily on static, fine-tuned classifiers that suffer from adaptation rigidity, the inability to enforce new governance rules without expensive retraining. To address this, we introduce CourtGuard, a retrieval-augmented multi-agent framework that reimagines safety evaluation as Evidentiary Debate. By orchestrating an adversarial debate grounded in external policy documents, CourtGuard achieves state-of-the-art performance across 7 safety benchmarks, outperforming dedicated policy-following baselines without fine-tuning. Beyond standard metrics, we highlight two critical capabilities: (1) Zero-Shot Adaptability, where our framework successfully generalized to an out-of-domain Wikipedia Vandalism task (achieving 90% accuracy) by swapping the reference policy; (2) Automated Data Curation and Auditing, where we leveraged CourtGuard to curate and audit nine novel datasets of sophisticated adversarial attacks. Our results demonstrate that decoupling safety logic from model weights offers a robust, interpretable, and adaptable path for meeting current and future regulatory requirements in AI governance.

Comments: Under Review
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as: arXiv:2602.22557 [cs.AI] (arXiv:2602.22557v1 for this version)
DOI: https://doi.org/10.48550/arXiv.2602.22557 (arXiv-issued DOI via DataCite, pending registration)

Source

arxiv.org
