HN
Today

Research-Driven Agents: What Happens When Your Agent Reads Before It Codes

This post introduces "Research-Driven Agents": AI coding agents that markedly improve code optimization by adding a research phase before coding. By studying papers and competing projects, one agent optimized llama.cpp, achieving a +15% speedup in Flash Attention text generation for about $29. It showcases a cost-effective paradigm for AI to tackle complex engineering problems that go beyond mere code-tweaking.

Score: 15 · Comments: 2 · Highest Rank: #3 · 2h on Front Page
First Seen: Apr 9, 6:00 PM · Last Seen: Apr 9, 7:00 PM

The Lowdown

Alex Kim's blog post introduces "Research-Driven Agents," an advanced iteration of AI coding agents that drastically improve code optimization by integrating a dedicated research phase into their workflow. This approach allows agents to gather external knowledge from academic papers and competing projects before attempting to modify code.

Key takeaways from the research:

  • Knowledge-Augmented Optimization: Unlike "code-only" agents that struggle with complex, memory-bound bottlenecks, research-driven agents read papers and study competing projects to identify non-obvious optimization opportunities.
  • Real-World Application: Pointed at llama.cpp's CPU inference path for TinyLlama 1.1B, an agent running on 4 AWS VMs sped up Flash Attention text generation by 15% on x86 and 5% on ARM in roughly 3 hours.
  • Cost-Efficient Exploration: The entire optimization process incurred a minimal cost of around $29 (primarily for CPU VMs and API calls), demonstrating remarkable efficiency.
  • Research Yields Deeper Insights: Initial code-only attempts produced negligible gains. After a research phase, the agent pivoted from micro-optimizations to memory-access pattern improvements and operator fusions, recognizing that text generation was memory-bandwidth bound.
  • Successful Fusions: The agent successfully implemented five key optimizations, including single-pass softmax and RMS norm fusions, adaptive from_float parallelization, graph-level RMS_NORM + MUL fusion (inspired by CUDA/Metal backends), and Flash Attention KQ fusion.
  • Practical Challenges: The process encountered issues like compiler auto-vectorization negating some "optimizations," a benchmark parsing bug leading to incorrect baselines, and variance due to noisy cloud VMs, all of which the agent or its operators learned from.
  • Broader Implications: This research extends the autoresearch paradigm to problems where crucial optimization insights reside outside the immediate codebase, suggesting AI agents can now address higher-level engineering challenges that traditionally require senior human expertise and domain knowledge.

This work marks a notable step for AI coding agents: equipping them with a "research" capability unlocks a new level of efficacy in complex performance optimization, moving them closer to autonomously tackling nuanced engineering problems that demand deep external context.