Can LLMs Beat Classical Hyperparameter Optimization Algorithms?
This paper investigates whether large language models (LLMs) can surpass classical algorithms in hyperparameter optimization (HPO), using the autoresearch framework as a testbed. It finds that classical methods generally outperform pure LLM-based approaches, but that Centaur, a hybrid method combining the two, achieves the best results. For ML practitioners considering LLMs for HPO, the takeaway is that LLMs work better as complements to established techniques than as direct replacements.
The Lowdown
The paper compares LLMs with traditional algorithms for hyperparameter optimization (HPO). Using the autoresearch repository, which lets an LLM agent directly modify training code, the study pits LLM-based methods against classical HPO techniques under a fixed compute budget for tuning a small language model.
- Methodology: The study uses autoresearch to evaluate HPO performance. It sets up experiments where both classical algorithms and LLM agents attempt to optimize hyperparameters of a small language model. The LLMs used include frontier models like Claude Opus 4.6 and Gemini 3.1 Pro Preview.
- Classical vs. LLM-only: When operating within a fixed search space, classical methods such as CMA-ES and TPE consistently outperformed pure LLM-based agents. A significant challenge for LLMs was avoiding out-of-memory failures, which classical methods handled better.
- Code Editing Power: Allowing LLMs to directly edit source code narrowed the performance gap with classical methods but did not fully close it. LLMs were found to struggle with tracking optimization state across multiple trials.
- Domain Knowledge Gap: Conversely, classical methods lack the inherent domain knowledge that LLMs possess, suggesting a potential synergy.
- Introducing Centaur: To bridge these gaps, the researchers developed Centaur, a hybrid HPO approach. Centaur shares the interpretable internal state (like mean vector, step-size, and covariance matrix) of CMA-ES with an LLM, combining the strengths of both.
- Centaur's Performance: Centaur achieved the best results in the experiments; even a small 0.8B LLM was enough for it to outperform all classical and pure LLM methods tested.
- Scaling and Diversity: The study also analyzed search diversity, the impact of model scaling (from 0.8B to frontier models), and the optimal fraction of LLM-proposed trials within Centaur.
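To make the hybrid idea concrete, here is a minimal Python sketch of a loop in which an optimizer's interpretable state (mean, step size, covariance) is serialized and handed to a proposer for a fraction of each generation's trials. It is an illustration under stated assumptions, not the paper's implementation: the optimizer is a simplified diagonal-covariance evolution strategy rather than real CMA-ES, the "LLM" is a stub function, and every name (`OptimizerState`, `stub_llm_propose`, `run_hybrid`, `llm_fraction`, `toy_loss`) is hypothetical.

```python
import json
import math
import random
from dataclasses import dataclass
from typing import List


@dataclass
class OptimizerState:
    """Interpretable state of a simplified, diagonal-covariance evolution strategy."""
    mean: List[float]
    step_size: float
    cov_diag: List[float]

    def to_prompt(self) -> str:
        # Serialize the state as JSON text that could be placed in an LLM prompt.
        return json.dumps({"mean": self.mean, "step_size": self.step_size,
                           "cov_diag": self.cov_diag})


def sample_classical(state, rng):
    # Draw one candidate from the optimizer's current Gaussian.
    return [m + state.step_size * math.sqrt(c) * rng.gauss(0.0, 1.0)
            for m, c in zip(state.mean, state.cov_diag)]


def stub_llm_propose(prompt, rng):
    # Hypothetical stand-in for an LLM call: it reads the shared state and
    # proposes a point close to the current mean. A real system would send
    # `prompt` to a model and parse its reply instead.
    shared = json.loads(prompt)
    return [m + 0.1 * shared["step_size"] * rng.gauss(0.0, 1.0) for m in shared["mean"]]


def update(state, population, scores):
    # Simplified update (assumption, not CMA-ES): recenter on the best half,
    # smooth the per-dimension variance, and shrink the step size.
    ranked = sorted(zip(scores, population), key=lambda pair: pair[0])
    elite = [x for _, x in ranked[: max(1, len(ranked) // 2)]]
    dim = len(state.mean)
    new_mean = [sum(x[i] for x in elite) / len(elite) for i in range(dim)]
    for i in range(dim):
        spread = sum((x[i] - state.mean[i]) ** 2 for x in elite) / len(elite)
        state.cov_diag[i] = max(1e-8, 0.8 * state.cov_diag[i]
                                + 0.2 * spread / state.step_size ** 2)
    state.mean = new_mean
    state.step_size *= 0.95


def run_hybrid(objective, state, generations=20, pop_size=8,
               llm_fraction=0.25, propose=stub_llm_propose, seed=0):
    rng = random.Random(seed)
    n_llm = round(llm_fraction * pop_size)  # trials per generation given to the proposer
    best_x, best_score, llm_trials = list(state.mean), float("inf"), 0
    for _ in range(generations):
        prompt = state.to_prompt()
        population = [propose(prompt, rng) for _ in range(n_llm)]
        population += [sample_classical(state, rng) for _ in range(pop_size - n_llm)]
        llm_trials += n_llm
        scores = [objective(x) for x in population]
        for x, s in zip(population, scores):
            if s < best_score:
                best_x, best_score = x, s
        update(state, population, scores)  # all trials feed the optimizer state
    return best_x, best_score, llm_trials


def toy_loss(x):
    # Toy objective (lower is better) standing in for a validation loss.
    return (x[0] - 0.5) ** 2 + (x[1] + 1.0) ** 2


if __name__ == "__main__":
    start = OptimizerState(mean=[0.0, 0.0], step_size=1.0, cov_diag=[1.0, 1.0])
    print(run_hybrid(toy_loss, start))
```

The design point the sketch tries to capture is that both kinds of proposals are scored by the same objective and fed back into one shared state, so the classical optimizer keeps track of the search while the proposer only needs to read that state. Varying `llm_fraction` in this toy setup mirrors, loosely, the study's question about what share of trials the LLM should propose.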
In summary, the findings indicate that while LLMs alone may not surpass classical HPO algorithms, their true potential lies in augmenting existing methods. They are most effective when used as a complement to classical optimizers, offering domain knowledge that traditional algorithms lack, rather than serving as a complete replacement.