A New Era for Blockchain Security: OpenAI and Paradigm Unveil EVMbench

In a decisive move to fortify the intersection of artificial intelligence and decentralized finance, OpenAI has announced a strategic partnership with crypto investment firm Paradigm. The collaboration introduces EVMbench, a comprehensive benchmark designed to evaluate the capabilities of AI agents in detecting, patching, and exploiting smart contract vulnerabilities.

As of February 2026, the crypto ecosystem secures over $100 billion in assets managed by open-source smart contracts, making it a lucrative target for malicious actors. The release of EVMbench represents a critical shift from theoretical AI application to practical, rigorous testing in economically meaningful environments. By providing a standardized framework, OpenAI and Paradigm aim to accelerate the development of defensive AI systems capable of auditing and strengthening code before it reaches the mainnet.

This initiative underscores a growing recognition that as AI agents become proficient at reading and writing code, they must be rigorously tested against the specific, high-stakes constraints of the Ethereum Virtual Machine (EVM).

Deconstructing EVMbench: The Trinity of Security Tasks

EVMbench is not merely a dataset but a dynamic evaluation environment. It moves beyond static code analysis by immersing AI agents in a sandboxed blockchain environment where they must interact with live bytecode. The benchmark evaluates agents across three distinct but interconnected capability modes, each mimicking a critical phase in the lifecycle of smart contract security.

1. Detect: The Digital Auditor

In the detection mode, agents are tasked with auditing a smart contract repository. The objective is to identify ground-truth vulnerabilities—those that have been confirmed by human auditors—and flag them accurately. Agents are scored based on their "recall," or the percentage of known vulnerabilities they successfully identify. This mode challenges the AI's ability to understand complex logic flows and recognize patterns indicative of security flaws, such as reentrancy attacks or integer overflows.
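The recall metric described above can be sketched in a few lines. This is an illustrative reconstruction, not EVMbench's actual scoring code; the vulnerability identifiers and set-based flag format are assumptions made for the example.

```python
# Illustrative sketch of recall-based scoring for the detection mode.
# Identifiers like "reentrancy-withdraw" are hypothetical labels, not
# EVMbench's real schema.

def recall(ground_truth: set[str], flagged: set[str]) -> float:
    """Fraction of confirmed vulnerabilities the agent flagged."""
    if not ground_truth:
        return 1.0  # nothing to find, vacuously perfect
    return len(ground_truth & flagged) / len(ground_truth)

# An agent that finds one of three known issues (plus a spurious
# finding, which recall does not penalize) scores ~0.33.
score = recall(
    {"reentrancy-withdraw", "overflow-mint", "auth-bypass"},
    {"reentrancy-withdraw", "spurious-finding"},
)
```

Note that recall alone does not punish false positives, which is consistent with the benchmark's focus on exhaustively surfacing known flaws rather than on precision.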

2. Patch: The Surgical Fix

Perhaps the most complex of the three, the patch mode requires agents not only to find a vulnerability but also to fix it. The constraints here are significant: the agent must modify the vulnerable contract to eliminate the exploit while preserving the original intended functionality. This is verified through a suite of automated tests. If an agent "fixes" a bug but inadvertently breaks the contract's core logic or introduces compilation errors, the attempt is marked as a failure. This mimics the real-world pressure on developers to apply hotfixes without disrupting protocol operations.
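The pass/fail rule implied above reduces to a conjunction of three checks. The sketch below is a hedged simplification; the actual verification runs a real compiler and test suite, and the boolean inputs here stand in for those results.

```python
# Simplified grading rule for the patch mode: a patch passes only if the
# contract still compiles, the functional test suite passes, AND the
# original exploit no longer succeeds. Inputs stand in for the real
# compiler/test-runner outcomes.

def grade_patch(compiles: bool,
                functional_tests_pass: bool,
                exploit_still_works: bool) -> bool:
    return compiles and functional_tests_pass and not exploit_still_works

# A patch that removes the bug but breaks core logic is still a failure.
broken_logic = grade_patch(True, False, exploit_still_works=False)
valid_patch = grade_patch(True, True, exploit_still_works=False)
```

The all-or-nothing conjunction is what makes patching harder than exploiting: an attacker needs one path to succeed, while a defender must satisfy every constraint simultaneously.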

3. Exploit: The Red Teamer

In this mode, agents act as attackers. They are given a deployed contract in a sandboxed environment and must execute an end-to-end attack to drain funds. Grading is performed programmatically via transaction replay and on-chain verification. This mode is critical for "Red Teaming"—using AI to simulate attacks so that defenses can be battle-tested against the most creative adversarial strategies.
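A minimal version of the "did the attack drain funds?" check can be expressed as a comparison of account balances before and after replaying the agent's transactions. This is a toy illustration, not the Rust harness's actual on-chain verification; the balance maps and account names are assumptions.

```python
# Toy grading check for the exploit mode: after replaying the agent's
# transactions in the sandbox, the exploit counts as successful only if
# the victim contract lost funds and the attacker gained them.
# Account names and balance units are illustrative.

def grade_exploit(pre: dict[str, int], post: dict[str, int],
                  attacker: str, victim: str) -> bool:
    drained = post[victim] < pre[victim]
    profited = post[attacker] > pre[attacker]
    return drained and profited

pre = {"attacker": 1, "vault": 100}
post = {"attacker": 95, "vault": 6}  # state after replaying the attack
success = grade_exploit(pre, post, "attacker", "vault")
```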

The Dataset: Rooted in Reality

To ensure the benchmark reflects real-world risks, OpenAI and Paradigm curated 120 high-severity vulnerabilities from 40 different audits. The majority of these were sourced from open code audit competitions, such as Code4rena, which are known for surfacing subtle and high-impact bugs.

A notable addition to the dataset includes vulnerability scenarios drawn from the security auditing process for the Tempo blockchain. Tempo is a purpose-built Layer 1 blockchain designed for high-throughput, low-cost stablecoin payments. By including scenarios from Tempo, EVMbench extends its reach into payment-oriented smart contract code, a domain expected to see massive growth as agentic stablecoin payments become commonplace.

The technical infrastructure powering EVMbench is equally robust. It utilizes a Rust-based harness that deploys contracts and replays agent transactions deterministically. To prevent accidental harm, exploit tasks run in an isolated local Anvil environment rather than on live networks, ensuring that the testing ground is safe, reproducible, and contained.
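The key property of that harness is determinism: replaying the same transaction sequence against the same genesis state must always produce the same final state, which is what makes grading reproducible. The toy state machine below illustrates the idea; the real harness is written in Rust and executes actual EVM bytecode against a local Anvil node.

```python
# Toy illustration of deterministic replay (not the Rust harness): the
# same transactions applied to the same genesis state always yield the
# same final state, so a grade computed once can be reproduced exactly.

def replay(genesis: dict[str, int],
           txs: list[tuple[str, str, int]]) -> dict[str, int]:
    state = dict(genesis)  # never mutate the genesis snapshot
    for sender, receiver, amount in txs:
        if state.get(sender, 0) >= amount:  # skip underfunded transfers
            state[sender] -= amount
            state[receiver] = state.get(receiver, 0) + amount
    return state

genesis = {"alice": 100, "bob": 0}
txs = [("alice", "bob", 30), ("bob", "alice", 10)]
first = replay(genesis, txs)
second = replay(genesis, txs)  # identical by construction
```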

Benchmarking the Frontier: GPT-5.3 Takes the Lead

The launch of EVMbench has provided the first public insights into how the latest generation of AI models performs in the crypto-security domain. OpenAI utilized the benchmark to test its frontier agents, revealing a significant leap in capabilities over the last six months.

The performance metrics highlight a dramatic improvement in "offensive" capabilities, specifically in the exploit mode. The data shows that the latest iteration of OpenAI's coding model, GPT-5.3-Codex, vastly outperforms its predecessor.

Table 1: Comparative Performance in Exploit Mode

Model               Execution Environment   Exploit Success Rate
GPT-5.3-Codex       Codex CLI               72.2%
GPT-5               Standard                31.9%
GPT-4o (Reference)  Standard                < 15.0%

The jump from a 31.9% success rate with GPT-5 to 72.2% with GPT-5.3-Codex indicates that AI agents are becoming exceptionally proficient at identifying and executing exploit paths when given a clear, explicit objective (e.g., "drain funds").

The Offensive-Defensive Gap

However, the benchmark also revealed a persistent gap between offensive and defensive capabilities. While agents excelled at the Exploit task, their performance on Detect and Patch tasks remained lower.

  • Detection Challenges: Agents often stopped auditing after finding a single issue, failing to perform the exhaustive review required to certify a codebase as safe.
  • Patching Complexities: The requirement to maintain full functionality while removing subtle bugs proved difficult. Agents frequently generated patches that fixed the security flaw but broke the contract's intended utility—a "cure is worse than the disease" scenario that is unacceptable in production environments.

Strategic Implications for the Crypto Industry

The collaboration between OpenAI and Paradigm signals a maturing of the "AI x Crypto" narrative. Paradigm, known for its deep technical expertise and research-first approach to crypto investing, provided the domain knowledge necessary to ensure the benchmark's tasks were not just syntactically correct, but semantically meaningful to blockchain developers.

By releasing EVMbench's tasks, tooling, and evaluation framework as open source, the partners are effectively issuing a "call to arms" for the developer community. The goal is to democratize access to high-level security tools, allowing individual developers and small teams to audit their smart contracts with the same rigor as top-tier security firms.

Expanding the Defensive Toolkit: Project Aardvark

In conjunction with the benchmark release, OpenAI announced the expansion of the private beta for Aardvark, their dedicated security research agent. Aardvark represents the practical application of the insights gained from EVMbench—an AI agent specifically fine-tuned for defensive security tasks.

Furthermore, OpenAI is committing $10 million in API credits to accelerate cyber defense research. This grant program focuses on applying the company's most capable models to protect open-source software and critical infrastructure systems, ensuring that the benefits of AI security are distributed widely across the ecosystem.

The Road Ahead

The introduction of EVMbench serves as both a measurement tool and a warning. The rapid improvement in AI's ability to exploit contracts (evidenced by the 72.2% success rate of GPT-5.3-Codex) suggests that the window for "security by obscurity" is closing fast. As AI agents become more capable attackers, defensive tools must evolve at an equal or greater velocity.

For the blockchain industry, this means that AI-assisted auditing will soon graduate from a luxury to a necessity. Future iterations of EVMbench may expand to include multi-chain environments, cross-bridge vulnerabilities, and more complex social engineering attacks, mirroring the evolving threat landscape of Web3.

As we move deeper into 2026, the synergy between OpenAI's reasoning engines and Paradigm's crypto-native insights sets a new standard for how we approach digital trust. The question is no longer if AI will be used to secure smart contracts, but how quickly the industry can adopt these benchmarks to stay ahead of the next generation of automated threats.