We hid backdoors in ~40MB binaries and asked AI + Ghidra to find them
Quesma Labs put top AI models like Claude and Gemini to the test, challenging them to find cleverly hidden backdoors in ~40MB compiled binaries using tools like Ghidra. While the best model achieved a 49% detection rate, the high false positive rate and instances of 'rationalizing away' obvious threats highlight AI's current limitations in practical security applications. Despite not being production-ready for end-to-end malware detection, the experiment suggests AI can serve as a powerful first-pass analysis tool, making binary analysis more accessible to a broader range of engineers.
The Lowdown
Quesma Labs recently published results from their BinaryAudit benchmark, an experiment designed to evaluate how well leading AI agents, including Claude Opus, Gemini 3 Pro, and various GPT models, can detect backdoors embedded in stripped binary executables. The technical deep dive probes AI's usefulness in reverse engineering, a field traditionally dominated by highly specialized human experts.
- The Challenge: Researchers injected subtle yet detectable backdoors into binaries of popular open-source projects, including lighttpd, dnsmasq, and Dropbear. AI agents were given access to standard open-source reverse engineering tools such as Ghidra, Radare2, and binutils, but no source code or debug symbols.
- Mixed Results: Claude Opus 4.6 emerged as the top performer, identifying 49% of the implanted backdoors. However, a significant drawback across all models was a high false positive rate: models incorrectly flagged clean binaries 28% of the time.
- AI's Strengths: The models demonstrated a surprising ability to operate reverse engineering tools, navigate decompiled code, and trace function calls, successfully identifying suspicious patterns such as popen() calls. Claude, for instance, correctly found a backdoor in lighttpd by tracing an X-Forwarded-Debug header to a popen() execution (the first sketch after this list illustrates the pattern).
- AI's Weaknesses: Critically, AI agents struggled with 'rationalizing away' obvious malicious code. One notable example involved Claude Opus 4.6 dismissing an execl("/bin/sh") call in dnsmasq as legitimate 'DHCP script execution,' failing to investigate where the command string came from (the second sketch below contrasts the two patterns). The agents also suffered from a 'needle-in-a-haystack' problem, struggling to prioritize relevant areas in large binaries and often getting sidetracked by benign code.
- Tooling Impact: The AI's effectiveness was also constrained by the limitations of open-source tooling, which performed poorly on certain languages (such as Go) and on very large binaries, forcing the benchmark to focus predominantly on C executables.
While AI agents are not yet capable of reliable, production-grade, end-to-end malware detection, given current detection rates and high false positive rates, the experiment indicates a significant leap in their ability to perform genuine reverse engineering tasks. This advancement suggests a future where AI democratizes access to low-level binary analysis, acting as a powerful assistant for initial security audits, debugging, and general reverse engineering for a wider audience of software engineers.
The Gossip
AI's Current Performance and Perplexity
Commenters generally agreed that while a 49% detection rate is not entirely 'useless,' it falls short for production-grade security, especially given the 28% false positive rate. The models' tendency to 'rationalize away' obvious backdoors and miss critical context (like a command's origin) was a key point of concern, with many noting that a security tool that drowns users in false alarms is impractical. Some also questioned whether the benchmarking process itself might be underestimating certain models, pointing to potential 'harness issues' affecting results.
Human-AI Synergy in Security
Despite current limitations, a significant theme was the potential for AI to act as a powerful 'force multiplier' or 'adjunct' to human reverse engineers. Many envision AI performing first-pass analyses, generating hypotheses, mapping attack surfaces, or handling 'insanely boring tasks,' thereby making the complex field of binary analysis accessible to a wider range of software engineers. The consensus was that AI isn't replacing humans but enhancing their capabilities, allowing them to focus on validation and deeper investigation of AI-flagged areas.
The Evolving Adversarial Landscape
Discussions touched on the sophistication of real-world threats and how current AI might fare against them. Commenters highlighted that actual attackers would employ obfuscation, hide imports/symbols, or create multi-stage backdoors that are not individually suspicious, suggesting the benchmark's 'unobfuscated' backdoors were 'entry-level.' There was curiosity about whether AI could detect these more advanced, distributed threats or if it could be prompted with 'strategy guides' to improve its focus and tactical thinking.