We hid backdoors in ~40MB binaries and asked AI + Ghidra to find them
Quesma Labs put top AI models like Claude and Gemini to the test, challenging them to find cleverly hidden backdoors in ~40MB compiled binaries using tools like Ghidra. While the best model achieved a 49% detection rate, the high false positive rate and instances of 'rationalizing away' obvious threats highlight AI's current limitations in practical security applications. Despite not being production-ready for end-to-end malware detection, the experiment suggests AI can serve as a powerful first-pass analysis tool, making binary analysis more accessible to a broader range of engineers.
The Lowdown
Quesma Labs recently published results from their BinaryAudit benchmark, an experiment designed to evaluate how well leading AI agents, including Claude Opus, Gemini 3 Pro, and various GPT models, can detect backdoors embedded in stripped binary executables. The technical deep dive probes AI's usefulness in reverse engineering, a field traditionally dominated by highly specialized human experts.
- The Challenge: Researchers injected subtle yet detectable backdoors into binaries of popular open-source projects, including lighttpd, dnsmasq, and Dropbear. AI agents were given access to standard open-source reverse engineering tools such as Ghidra, Radare2, and binutils, but no source code or debug symbols.
- Mixed Results: Claude Opus 4.6 emerged as the top performer, identifying 49% of the implanted backdoors. However, a significant drawback across all models was a high false positive rate: models incorrectly flagged clean binaries 28% of the time.
- AI's Strengths: The models demonstrated a surprising ability to operate reverse engineering tools, navigate decompiled code, and trace function calls, successfully identifying suspicious patterns such as popen() calls. Claude, for instance, correctly found a backdoor in lighttpd by tracing an X-Forwarded-Debug header to a popen() execution (the first sketch after this list illustrates the pattern).
- AI's Weaknesses: Critically, AI agents struggled with 'rationalizing away' obvious malicious code. One notable example involved Claude Opus 4.6 dismissing an execl("/bin/sh") call in dnsmasq as legitimate 'DHCP script execution,' failing to investigate where the command string came from (the second sketch below contrasts the two patterns). The agents also suffered from a 'needle-in-a-haystack' problem, struggling to prioritize relevant areas in large binaries and often getting sidetracked by benign code.
- Tooling Impact: The AI's effectiveness was also constrained by the limitations of open-source tooling, which performed poorly on certain languages (such as Go) and on very large binaries, forcing the benchmark to focus predominantly on C executables.
While AI agents are not yet capable of reliable, production-grade, end-to-end malware detection, given current detection rates and high false positive rates, the experiment indicates a significant leap in their ability to perform genuine reverse engineering tasks. This advancement suggests a future where AI democratizes access to low-level binary analysis, acting as a powerful assistant for initial security audits, debugging, and general reverse engineering for a wider audience of software engineers.
The Gossip
AI's Current Performance and Perplexity
Commenters generally agreed that while a 49% detection rate is not entirely 'useless,' it falls short for production-grade security, especially given the 28% false positive rate. The models' tendency to 'rationalize away' obvious backdoors and miss critical context (like a command's origin) was a key point of concern, with many noting that a security tool that drowns users in false alarms is impractical. Some also questioned whether the benchmarking process itself might be underestimating certain models, pointing to potential 'harness issues' affecting results.
Human-AI Synergy in Security
Despite current limitations, a significant theme was the potential for AI to act as a powerful 'force multiplier' or 'adjunct' to human reverse engineers. Many envision AI performing first-pass analyses, generating hypotheses, mapping attack surfaces, or handling 'insanely boring tasks,' thereby making the complex field of binary analysis accessible to a wider range of software engineers. The consensus was that AI isn't replacing humans but enhancing their capabilities, allowing them to focus on validation and deeper investigation of AI-flagged areas.
The Evolving Adversarial Landscape
Discussions touched on the sophistication of real-world threats and how current AI might fare against them. Commenters highlighted that actual attackers would employ obfuscation, hide imports/symbols, or create multi-stage backdoors that are not individually suspicious, suggesting the benchmark's 'unobfuscated' backdoors were 'entry-level.' There was curiosity about whether AI could detect these more advanced, distributed threats or if it could be prompted with 'strategy guides' to improve its focus and tactical thinking.