
The Defense Trilemma: Why Prompt Injection Defense Wrappers Fail?

#prompt injection #large language model #AI security #wrapper defense #impossibility theorem #adversarial attack #arXiv

📌 Key Takeaways

  • A mathematical proof shows perfect, continuous "wrapper" defenses against prompt injection for LLMs are impossible.
  • The defense must fail at specific points, leaving some adversarial inputs unchanged ("boundary fixation").
  • Defenders face a trilemma: they cannot have perfect safety, full utility, and a continuous defense simultaneously.
  • The finding challenges the security of many current LLM deployments that rely on input preprocessing.

📖 Full Retelling

A team of computer science researchers has published a formal mathematical proof that perfect, continuous "wrapper" defenses against prompt injection attacks on large language models (LLMs) are impossible, in a paper posted to the arXiv preprint server on April 24, 2026. The research establishes that any defense that preprocesses user inputs to filter malicious prompts before they reach the model is mathematically guaranteed to fail in specific, predictable ways, provided it aims to preserve the model's core utility and operates continuously over a connected space of possible prompts.

The paper, titled "The Defense Trilemma: Why Prompt Injection Defense Wrappers Fail?", presents three impossibility theorems under progressively stronger assumptions. The first result, termed "boundary fixation," shows that any such continuous defense must leave certain "threshold-level" adversarial inputs completely unchanged, allowing them to pass through to the model unaltered. The later results characterize the precise failure points and establish an unavoidable trade-off: a defender cannot simultaneously achieve perfect safety, preserve the model's full functionality on benign tasks, and maintain a seamless, continuous defense mechanism.

The result has significant implications for the rapidly evolving field of AI security. Prompt injection, in which an attacker crafts an input that hijacks a model's instructions and makes it perform unauthorized actions, is a critical vulnerability for LLMs deployed in chatbots, coding assistants, and automated agents. The findings suggest that relying solely on input-filtering "wrappers" is a flawed strategy; instead, the authors suggest, robust security will require a multi-layered approach, potentially combining output filtering, runtime monitoring, architectural changes to the models themselves, or formally verified, non-continuous defense systems that explicitly manage their failure modes.
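To see why continuity and utility preservation collide at the safety threshold, here is a minimal toy sketch. It is not taken from the paper: it assumes a one-dimensional "prompt space" in which each prompt is reduced to an injection-strength score in [0, 1], and the names TAU, SAFE_TARGET, and wrapper_defense are hypothetical. Any continuous defense that leaves every benign score untouched is forced, by continuity, to leave the threshold-level score untouched as well.

```python
# Toy illustration of "boundary fixation" (my construction, not the paper's code).
# Prompt space: a single score s in [0, 1]; scores with s >= TAU count as unsafe.

TAU = 0.5          # hypothetical safety threshold
SAFE_TARGET = 0.25 # where the defense tries to push unsafe prompts

def wrapper_defense(s: float) -> float:
    """A continuous wrapper D: [0, 1] -> [0, 1] that preserves utility.

    Benign prompts (s < TAU) pass through unchanged; unsafe prompts are
    reflected below the threshold and floored at SAFE_TARGET. The map is
    continuous, but that continuity forces D(TAU) == TAU.
    """
    if s < TAU:
        return s                              # utility: benign inputs untouched
    return max(SAFE_TARGET, TAU - (s - TAU))  # continuous sanitization branch

for s in [0.30, 0.49, 0.50, 0.51, 0.70, 1.00]:
    out = wrapper_defense(s)
    print(f"s = {s:.2f} -> D(s) = {out:.3f}  (moved by {abs(out - s):.3f})")

# D(0.50) == 0.50: the threshold-level prompt is a fixed point and reaches the
# model unaltered, and inputs just above the threshold are barely moved. Making
# D(TAU) strictly safe would require either a discontinuity at TAU or altering
# some benign prompts below it: the trilemma in miniature.
```

The particular reflection-style sanitizer is arbitrary; the squeeze comes from continuity pinning the defense to the identity at the edge of the benign region, which is the pattern the theorems generalize to arbitrary connected prompt spaces.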

🏷️ Themes

AI Security, Theoretical Computer Science, LLM Vulnerabilities

Original Source
arXiv:2604.06436v1

Abstract: We prove that no continuous, utility-preserving wrapper defense (a function $D: X \to X$ that preprocesses inputs before the model sees them) can make all outputs strictly safe for a language model with connected prompt space, and we characterize exactly where every such defense must fail. We establish three results under successively stronger hypotheses: boundary fixation (the defense must leave some threshold-level inputs unchanged); an $\epsilon$-
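The abstract is cut off above. As a rough reading of the boundary-fixation claim (a reconstruction under assumed definitions, not the paper's exact statement), suppose the connected prompt space $X$ carries a continuous harm score $h$ with threshold $\tau$, and take "utility-preserving" to mean $D(x) = x$ whenever $h(x) < \tau$. Then:

$$\exists\, x^{*} \in X : \quad h(x^{*}) = \tau \ \text{ and } \ D(x^{*}) = x^{*}, \qquad \text{so } h\big(D(x^{*})\big) = \tau \not< \tau .$$

Sketch of the reasoning: a threshold-level $x^{*}$ lying in the closure of the benign region is a limit of points that $D$ fixes, so continuity forces $D(x^{*}) = x^{*}$, and that output cannot be strictly safe.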
