BravenNow
Introducing EVMbench
USA | Technology | openai.com


#EVMbench #AI agents #smart contracts #vulnerability detection #OpenAI #Paradigm #blockchain security #GPT-5.3-Codex

📌 Key Takeaways

  • OpenAI and Paradigm launched EVMbench, a benchmark for evaluating AI agents' smart contract security capabilities
  • EVMbench tests AI agents across three modes: Detect, Patch, and Exploit vulnerabilities
  • GPT-5.3-Codex achieved 72.2% in exploit mode, showing significant improvement over previous models
  • The benchmark includes 120 curated vulnerabilities from 40 audits and real blockchain scenarios
  • AI performance varies significantly across modes: agents perform best when the objective is explicit (exploit) and worse on exhaustive detection and functionality-preserving patching

📖 Full Retelling

OpenAI and Paradigm jointly introduced EVMbench on February 18, 2026, a comprehensive benchmark designed to evaluate AI agents' capabilities in detecting, patching, and exploiting high-severity smart contract vulnerabilities in blockchain environments. The initiative comes as smart contracts routinely secure over $100 billion in open-source crypto assets, making it increasingly critical to measure AI performance in economically significant settings while promoting defensive AI applications for auditing and strengthening deployed contracts.

EVMbench incorporates 120 curated vulnerabilities sourced from 40 audits, primarily from open code audit competitions, along with additional scenarios from the security auditing process for the Tempo blockchain. The benchmark evaluates AI agents across three distinct capability modes: Detect, where agents audit smart contract repositories and are scored on recall of vulnerabilities; Patch, where agents modify vulnerable contracts while preserving intended functionality; and Exploit, where agents execute end-to-end fund-draining attacks against deployed contracts in a sandboxed blockchain environment. The evaluation is supported by a Rust-based harness that deploys contracts, replays agent transactions deterministically, and restricts unsafe RPC methods.

In initial testing, GPT-5.3-Codex achieved a score of 72.2% in exploit mode, a significant improvement over GPT-5, which scored only 31.9% when released six months prior. Performance in detect and patch modes remains lower, revealing clear differences in model behavior across tasks: agents excel in the exploit setting, where objectives are explicit, but struggle with exhaustive auditing in detect mode and with removing subtle vulnerabilities while preserving full functionality in patch mode.
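Detect mode is scored on recall of known vulnerabilities. As an illustrative sketch only (the benchmark's actual grader is not described in detail here, and the identifiers below are invented), recall over a curated ground-truth set can be computed like this:

```python
def detect_recall(ground_truth: set[str], reported: set[str]) -> float:
    """Fraction of known vulnerabilities that the agent surfaced.

    `ground_truth` holds identifiers for the curated vulnerabilities in a
    repository; `reported` holds the identifiers the agent flagged.
    """
    if not ground_truth:
        return 0.0
    return len(ground_truth & reported) / len(ground_truth)

# Example: the agent finds one of two seeded vulnerabilities -> recall 0.5
score = detect_recall({"reentrancy-01", "oracle-02"}, {"reentrancy-01", "gas-03"})
```

Note that a pure-recall metric does not penalize false positives, which is consistent with the limitation noted later: findings outside the known ground truth cannot be verified by the grader.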
The release of EVMbench represents not just a measurement tool but also a call to action for developers and security researchers to incorporate AI-assisted auditing into their workflows as AI capabilities continue to advance.
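The Rust-based harness is said to restrict unsafe RPC methods while replaying agent transactions deterministically. A minimal sketch of such a filter, written in Python for brevity and with the blocked method names assumed rather than taken from the paper, might look like:

```python
# Hypothetical blocklist of JSON-RPC methods that would let an agent
# tamper with chain state outside of ordinary transactions (e.g. the
# state-manipulation methods exposed by local dev nodes).
UNSAFE_RPC_METHODS = {
    "anvil_setStorageAt",
    "anvil_setBalance",
    "hardhat_setCode",
    "evm_setAccountNonce",
}

def filter_rpc(request: dict) -> dict:
    """Pass a JSON-RPC request through, rejecting unsafe methods."""
    method = request.get("method", "")
    if method in UNSAFE_RPC_METHODS:
        raise PermissionError(f"blocked unsafe RPC method: {method}")
    return request
```

Blocking these methods matters because an agent with access to them could simply rewrite contract storage or balances instead of constructing a genuine exploit, cheating the grader.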

🏷️ Themes

AI Security, Blockchain Technology, Cybersecurity

📚 Related People & Topics

OpenAI


Artificial intelligence research organization

**OpenAI** is an American artificial intelligence (AI) research organization headquartered in San Francisco, California. The organization operates under a unique hybrid structure, comprising the non-profit **OpenAI, Inc.** and its controlled for-profit subsidiary, **OpenAI Global, LLC** (a...


Paradigm

Crypto-focused investment firm

Paradigm is an investment firm focused on crypto and related frontier technologies. It co-developed EVMbench with OpenAI, contributing domain expertise for smart contract security auditing.


AI agent

Systems that perform tasks without human intervention

In the context of generative artificial intelligence, AI agents (also referred to as compound AI systems or agentic AI) are a class of intelligent agents distinguished by their ability to operate autonomously in complex environments. Agentic AI tools prioritize decision-making over content creation ...


Entity Intersection Graph

Connections for OpenAI:

🌐 Artificial intelligence 10 shared
🌐 ChatGPT 8 shared
👤 Wall Street 4 shared
🏢 Nvidia 4 shared
🏢 Anthropic 3 shared

Deep Analysis

Why It Matters

Smart contracts secure over $100 billion in crypto assets, and AI agents are becoming increasingly capable of interacting with code. EVMbench provides a crucial way to measure AI's ability to find and fix vulnerabilities, which is essential for tracking cyber risks and promoting the defensive use of AI to strengthen security.

Context & Background

  • Smart contracts manage vast sums of value and are a frequent target for exploits
  • AI models are rapidly improving at understanding and writing code
  • The benchmark was developed in collaboration with Paradigm and uses 120 curated vulnerabilities from real audits

What Happens Next

The EVMbench framework is being released to support ongoing research into AI cyber capabilities. The team is also expanding safety measures, including a $10 million API credit program for defensive security research and partnerships to provide free code scanning for open-source projects.

Frequently Asked Questions

What is EVMbench?

EVMbench is a benchmark that evaluates AI agents' abilities to detect, patch, and exploit vulnerabilities in Ethereum smart contracts.

How well do current AI models perform?

In exploit tasks, GPT-5.3-Codex scored 72.2%, a significant improvement over previous models, but performance on detection and patching tasks remains lower.

What are the limitations of EVMbench?

It uses historical vulnerabilities from competitions, not live mainnet contracts, and its grading cannot verify new vulnerabilities found by AI beyond the known ground truth.

Original Source
February 18, 2026 · Research Publication

Introducing EVMbench: Making smart contracts safer by evaluating AI agents' ability to detect, patch, and exploit vulnerabilities in blockchain environments.

Smart contracts routinely secure $100B+ in open-source crypto assets. As AI agents improve at reading, writing, and executing code, it becomes increasingly important to measure their capabilities in economically meaningful environments, and to encourage the use of AI systems defensively to audit and strengthen deployed contracts. Together with Paradigm, we're introducing EVMbench, a benchmark evaluating the ability of AI agents to detect, patch, and exploit high-severity smart contract vulnerabilities.

EVMbench draws on 120 curated vulnerabilities from 40 audits, with most sourced from open code audit competitions. EVMbench additionally includes several vulnerability scenarios drawn from the security auditing process for the Tempo blockchain, a purpose-built L1 designed to enable high-throughput, low-cost payments via stablecoins. These scenarios extend the benchmark into payment-oriented smart contract code, where we expect agentic stablecoin payments to grow, and help ground it in a domain of emerging practical importance.

To create our task environments, we adapted existing proof-of-concept exploit tests and deployment scripts when they existed, and otherwise wrote them manually. For the patch mode, we ensured that the vulnerabilities are exploitable and can be mitigated without introducing compilation-breaking changes, which would compromise our setup. For the exploit mode, we wrote custom graders and red-teamed the environments in an attempt to find and patch methods by which an agent might cheat the grader.
In addition to task quality control via domain expertise provided by Paradigm, we used automated task auditing agents to help increase the soundness of ...

Source

openai.com
