Scaling Karpathy's Autoresearch: What Happens When the Agent Gets a GPU Cluster
This post details how scaling Andrej Karpathy's autoresearch agent with a 16-GPU cluster drastically accelerated neural network optimization. By enabling parallel experimentation, the agent moved beyond greedy hill-climbing to factorial grids, discovering more effective architecture changes and even developing a sophisticated strategy for heterogeneous hardware. It's a compelling demonstration of AI agent autonomy and the power of parallel computing in deep learning research.
The Lowdown
The article describes an experiment where a coding agent, Claude Code, was given access to 16 GPUs on a Kubernetes cluster to scale Andrej Karpathy's autoresearch project. This setup allowed the agent to autonomously improve a neural network training script (train.py) by running experiments in parallel, significantly accelerating the optimization process and leading to notable performance improvements.
Key takeaways from the experiment include:
- The agent submitted approximately 910 experiments over 8 hours, achieving a 9x throughput increase compared to a single GPU setup.
- Parallelism fundamentally changed the agent's search strategy from sequential hill-climbing to running factorial grids, enabling it to catch interaction effects between parameters.
- It discovered that scaling model width (aspect ratio) yielded greater improvements than individual hyperparameter tweaks, a finding made efficient by parallel testing.
- The agent autonomously developed a sophisticated strategy to exploit heterogeneous hardware, screening ideas on H100s and promoting promising ones to H200s for validation, without explicit programming.
- The optimization progressed through distinct phases: initial hyperparameter sweeps, architecture discovery, fine-tuning the wider model, optimizer tuning, and finally, diminishing returns.
- The final optimized configuration achieved a 2.87% improvement in val_bpb (from 1.003 to 0.974) over the baseline.
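The shift from sequential hill-climbing to factorial grids can be sketched in a few lines. The hyperparameter names and values below are illustrative assumptions, not the agent's actual search space; the point is that a grid enumerates every combination at once, which is what exposes interaction effects.

```python
from itertools import product

# Hypothetical search space -- parameter names and ranges are
# illustrative, not the agent's actual configuration.
grid = {
    "lr": [1e-3, 3e-3],
    "aspect_ratio": [32, 64],
    "weight_decay": [0.0, 0.1],
}

# A greedy hill-climber varies one axis at a time and can miss
# parameters that only help in combination; a factorial grid
# launches every combination as an independent parallel run.
configs = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(configs))  # 2 * 2 * 2 = 8 parallel runs
```

With 16 GPUs, all eight configurations here would fit in a single wave, which is why parallelism changes the shape of the search rather than just its speed.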
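The screen-then-promote strategy the agent arrived at can be sketched as a two-stage filter. Everything here is a hypothetical reconstruction: the function names, the `score(config, pool)` stub, and the pool labels are assumptions standing in for real training runs on the two GPU types.

```python
def screen_and_promote(candidates, score, k=3):
    """Screen every candidate cheaply, then re-validate only the top k.

    `score(config, pool)` is a hypothetical stand-in for launching a
    training run on the named GPU pool and returning its validation
    loss (lower is better).
    """
    # Stage 1: cheap screening runs on the slower/cheaper pool.
    screened = sorted(candidates, key=lambda c: score(c, "h100"))
    # Stage 2: promote only the best k to the faster pool for validation.
    return [(c, score(c, "h200")) for c in screened[:k]]

# Toy usage with a fake scorer; real scores would come from training.
fake_score = lambda config, pool: config["lr"]
best = screen_and_promote([{"lr": x} for x in (0.3, 0.1, 0.2)], fake_score, k=2)
print([c["lr"] for c, _ in best])  # [0.1, 0.2]
```

The design keeps the expensive H200s busy only with ideas that have already survived a cheap H100 screen, which is the essence of the emergent behavior described above.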
This experiment highlights the transformative potential of combining AI agents with scalable cloud infrastructure, allowing for rapid, autonomous research and optimization that can uncover non-obvious improvements and emergent strategies in complex systems.