I built a vulnerable app and spent $1,500 seeing if LLMs could hack it
An engineer built a deliberately vulnerable React Native and FastAPI application to test whether leading LLMs could exploit a common Firebase security flaw. The experiment cost $1,500 and revealed large differences in the models' hacking capabilities, with GPT 5.5 achieving the highest success rate. The test offers a practical look at AI's offensive security potential and at the real-world cost of evaluating cutting-edge models.
The Lowdown
The author deliberately built a vulnerable application to assess how well various large language models can find and exploit a security flaw. The app, a mock book review platform built with React Native (Expo) and FastAPI, used Firebase as its data layer and contained a common 'Broken Access Control' vulnerability that allowed direct database access.
- The core exploit involved LLMs unzipping the provided APK, identifying the Firebase configuration, and then directly signing up as a user to read private Firestore database entries.
- A total of $1,500 was spent testing numerous LLMs, including GPT, Deepseek, Claude, Gemini, GLM, Qwen, Grok, Minimax, Kimi, and Owl Alpha.
- GPT 5.5 emerged as the top performer, successfully exploiting the vulnerability in 7 out of 10 runs, typically focusing on Firebase immediately.
- Deepseek V4 Pro showed promise with 3 solves at a remarkably low cost ($0.62 per solve), while Claude Sonnet and Claude Opus achieved 2 solves each, with runs often stopped by security guardrails or budget limits.
- Many models, including all Gemini variants, Minimax, and Deepseek V4 Flash, failed to find the exploit, frequently giving immediate refusals or fixating on the API instead of the direct Firebase access.
- The author observed that 'Chinese models' (like Deepseek V4 Pro and, in some runs, GLM) appeared more comfortable attacking the database directly, whereas some Western models showed temporary 'blips' of refusal on ethical grounds.
- Lessons learned included frustration with the API instability and high costs of certain models (GLM, Qwen), the difficulty of building a robust testing harness, and the realization that the money could have been better spent on other projects.
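The first step of the exploit described above (unzipping the APK and locating the Firebase configuration) can be sketched in a few lines of Python. This is a minimal illustration, not the author's harness or any model's actual approach. The two regular expressions are assumptions based on how Google API keys and default Firebase domains commonly look, the function name `find_firebase_config` is hypothetical, and every value in the demo is made up. A real APK may store strings in compiled or encoded forms that a plain byte search would miss.

```python
import io
import re
import zipfile

# Assumption: Google API keys commonly look like "AIza" plus 35 URL-safe characters.
API_KEY_RE = re.compile(rb"AIza[0-9A-Za-z_\-]{35}")
# Assumption: default Firebase domains embed the project ID as a subdomain.
PROJECT_RE = re.compile(
    rb"([a-z0-9\-]+)\.(?:firebaseapp\.com|appspot\.com|firebaseio\.com)"
)


def find_firebase_config(apk_file):
    """Scan every entry of an APK (a ZIP archive) for Firebase-looking strings.

    `apk_file` may be a path or a binary file-like object.
    Returns a dict of sets: candidate API keys and candidate project IDs.
    """
    hits = {"api_keys": set(), "project_ids": set()}
    with zipfile.ZipFile(apk_file) as apk:
        for name in apk.namelist():
            data = apk.read(name)  # raw bytes of this archive entry
            hits["api_keys"].update(m.decode() for m in API_KEY_RE.findall(data))
            hits["project_ids"].update(m.decode() for m in PROJECT_RE.findall(data))
    return hits


# Demo on a fake in-memory "APK"; the key and project ID below are invented.
fake_apk = io.BytesIO()
with zipfile.ZipFile(fake_apk, "w") as z:
    z.writestr(
        "assets/index.android.bundle",
        'apiKey:"AIza' + "A" * 35 + '",authDomain:"demo-books.firebaseapp.com"',
    )
result = find_firebase_config(fake_apk)
print(result)
```

The client configuration found this way is not secret by design, which is why the flaw is classed as broken access control: protection has to come from the database's access rules rather than from hiding the config inside the app.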
This hands-on experiment provides useful, albeit non-scientific, insight into how well current LLMs perform at offensive security tasks. It highlights how effectiveness, cost, and ethical guardrails vary across AI models when they face a targeted hacking challenge.