
In a revelation that simultaneously showcases the staggering advancement of artificial intelligence and exposes a critical vulnerability in the decentralized finance (DeFi) ecosystem, OpenAI has unveiled EVMbench, a new comprehensive testing framework designed to evaluate AI agents' capabilities in blockchain security. The results from the inaugural benchmark are as impressive as they are unsettling: OpenAI’s latest specialized model, GPT-5.3-Codex, successfully exploited and drained cryptocurrency wallets in 72.2% of the test cases, demonstrating a proficiency in cyber-offense that currently far outstrips its defensive counterparts.
Launched in collaboration with crypto investment firm Paradigm, EVMbench serves as a standardized arena to measure how well AI models can detect, patch, and exploit vulnerabilities in Ethereum Virtual Machine (EVM) smart contracts. While the initiative aims to bolster security through "red teaming," the immediate data points to a widening gap between the sword and the shield. GPT-5.3-Codex proved itself a formidable digital predator, but its ability to protect, scoring significantly lower in detection and patching tasks, has sparked urgent discussions regarding the safety of the $100 billion locked in smart contracts worldwide.
The headline statistic of a 72.2% success rate in the "Exploit" category marks a massive generational leap in AI capabilities. Just six months prior, the standard GPT-5 model achieved a mere 31.9% success rate on similar tasks. This more-than-doubling of efficacy suggests that the specialized tuning in GPT-5.3-Codex has unlocked a deeper understanding of the complex logic flows and economic incentives inherent in blockchain protocols.
However, the benchmark also highlighted a concerning asymmetry. While the AI excelled at breaking systems, it struggled to fix them. In the "Patch" mode—where the agent must fix a vulnerability without breaking the contract's intended functionality—success rates hovered around 41.5%. Similarly, in "Detect" mode, which mimics a traditional code audit, models often failed to identify known bugs, with top performers like Claude Opus 4.6 managing only a 45.6% detection rate.
This disparity underscores a fundamental reality of current LLM architecture: it is computationally easier for an agent to find a single path to failure (exploitation) than to guarantee the absence of all failures (security verification). The table below illustrates the stark performance contrast across different operational modes in the new benchmark.
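The asymmetry can be illustrated with a toy sketch (hypothetical code, not part of EVMbench): for the attacker, one counterexample to a safety invariant is a win, so random search suffices; the defender must rule out every input.

```python
import random

def vulnerable_transfer(balance, amount):
    """Toy 'contract' with a hidden flaw: the balance check is
    skipped for amounts above an arbitrary threshold."""
    if amount <= 1_000 and amount > balance:
        raise ValueError("insufficient funds")
    return balance - amount  # goes negative when amount > 1_000

def safe_result(amount):
    """True if this input preserves the invariant balance >= 0
    (a revert counts as safe)."""
    try:
        return vulnerable_transfer(100, amount) >= 0
    except ValueError:
        return True

def find_exploit(trials=10_000):
    """Attacker's task: find ONE input that breaks the invariant."""
    rng = random.Random(0)
    for _ in range(trials):
        amount = rng.randint(0, 5_000)
        if not safe_result(amount):
            return amount  # a single counterexample wins
    return None

def verify_safe(max_amount=5_000):
    """Defender's task: show NO input breaks the invariant,
    which demands an exhaustive check."""
    return all(safe_result(amount) for amount in range(max_amount + 1))

print(find_exploit() is not None)  # True: an exploit exists
print(verify_safe())               # False: safety cannot be certified
```

The attacker's loop exits on the first bad input; the defender's check must visit the entire input space, and a single miss invalidates the guarantee.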
Table 1: AI Model Performance in EVMbench Modes

| Metric | GPT-5.3-Codex (Current) | GPT-5 (6 Months Prior) | Claude Opus 4.6 |
|---|---|---|---|
| Exploit Success Rate | 72.2% | 31.9% | N/A |
| Patch Success Rate | 41.5% | N/A | N/A |
| Detection Recall | N/A | N/A | 45.6% |
To ensure these results reflect real-world risks rather than theoretical exercises, OpenAI and Paradigm constructed EVMbench using 120 curated vulnerabilities drawn from 40 professional smart contract audits. These were not synthetic bugs but actual flaws found in production code, many sourced from competitive audit platforms like Code4rena.
The benchmark operates in a sandboxed environment known as Anvil, allowing AI agents to interact with a local blockchain simulation. This isolation allows the models to attempt destructive actions—such as reentrancy attacks or logic manipulation—without risking actual user funds.
The framework evaluates agents across three distinct competencies:
Table 2: EVMbench Evaluation Modes
| Mode | Objective | Success Criteria |
|---|---|---|
| Detect | Audit a repository to find vulnerabilities. | Recall of ground-truth flaws confirmed by human auditors. |
| Patch | Rewrite code to remove the vulnerability. | Vulnerability is gone AND core functionality remains intact. |
| Exploit | Attack a deployed contract to steal funds. | Successful draining of the contract's crypto balance. |
Crucially, the benchmark includes scenarios from the Tempo blockchain, a new Layer-1 network developed by Stripe and Paradigm focused on high-throughput stablecoin payments. The inclusion of Tempo-specific challenges indicates that OpenAI is not just looking at legacy Ethereum code but is actively testing against next-generation infrastructure where agentic payments are expected to proliferate.
Perhaps the most alarming anecdote from the accompanying research paper involves a specific test case where an agent powered by GPT-5.2 (an intermediate version) executed a complex "flash loan" attack.
Flash loan attacks are sophisticated financial exploits that require borrowing a massive amount of capital, using it to manipulate market prices or protocol logic, and repaying the loan within a single transaction block. They are typically the domain of elite human hackers due to the precise sequencing required.
In the EVMbench test, the AI agent planned and executed this entire sequence on its own, borrowing the capital, manipulating the target protocol's logic, and repaying the loan within a single transaction. It achieved this without human guidance, step-by-step instructions, or prior examples of this specific contract's architecture. This capability signals that autonomous agents are moving beyond simple pattern matching into multi-step strategic reasoning, a development that poses existential risks to poorly audited decentralized finance (DeFi) protocols.
Recognizing the potential for these tools to be weaponized, OpenAI is framing the release of EVMbench and GPT-5.3-Codex as a "defensive imperative." The logic is that by placing these powerful offensive tools in the hands of "white hat" security researchers, vulnerabilities can be found and fixed before malicious actors exploit them.
To support this defensive ecosystem, OpenAI announced the Cybersecurity Grant Program, pledging $10 million in API credits to developers and researchers working on open-source defense tools. The goal is to lower the barrier to entry for automated auditing, allowing even small projects to access state-of-the-art security checks.
Furthermore, the company is expanding the private beta of Aardvark, a dedicated security research agent. Unlike the general-purpose Codex models, Aardvark is trained specifically on security literature, audit reports, and formal verification methods. Early internal tests suggest that Aardvark may help close the gap between offense and defense, utilizing the "attacker mindset" of GPT-5.3 to predict exploits and proactively suggest patches.
The release of EVMbench comes at a pivotal moment for the crypto industry, following a series of high-profile exploits, including the recent $2.7 million loss in the Moonwell protocol due to a bug in AI-generated code. The industry is currently grappling with a double-edged sword: AI is increasingly used to write smart contracts, often introducing subtle bugs, while simultaneously being the only tool scalable enough to audit the exploding volume of blockchain code.
Paradigm’s involvement suggests that major institutional players view AI security not as a luxury but as a prerequisite for the mass adoption of stablecoins and decentralized financial rails. If AI agents are to handle autonomous payments on networks like Tempo, they must be resilient against adversarial AI trying to rob them.
Experts warn that the "72% exploit rate" is likely a floor, not a ceiling. As models continue to scale and utilize techniques like "Chain-of-Thought" reasoning during inference, their ability to find obscure "black swan" vulnerabilities will likely increase.
For now, the message to smart contract developers is clear: The AI that helps you write your code is also capable of robbing you. Until defensive capabilities catch up, the only safe path is rigorous, human-led auditing, augmented—but not replaced—by the very AI tools that threaten the system.