Attention Residuals
Researchers introduce Attention Residuals (AttnRes), a modification to Transformer architectures that replaces traditional residual connections with a learned attention mechanism over the outputs of previous layers. This lets models selectively aggregate earlier representations, yielding consistent performance gains and improved training stability across benchmarks. As a drop-in replacement, AttnRes offers a promising path toward more efficient and capable large language models.
The Lowdown
Attention Residuals (AttnRes) proposes a fundamental architectural change to Transformers: rethinking how information flows through residual connections. Instead of uniformly summing prior layer outputs, AttnRes lets each layer selectively weigh and aggregate relevant earlier representations, addressing key limitations of standard residual connections in deep networks.
- The Problem with Standard Residuals: Traditional residual connections accumulate all prior layer outputs with fixed, uniform weights. This approach leads to dilution of individual layer contributions as networks deepen and can cause hidden-state magnitudes to grow unboundedly, particularly with PreNorm architectures.
- Full AttnRes: This method replaces fixed accumulation with a softmax attention mechanism over all preceding layer outputs. Each layer learns a pseudo-query to selectively weigh and aggregate relevant past representations, granting content-aware access to all earlier information.
- Block AttnRes: To address the O(Ld) memory requirements of Full AttnRes at scale, Block AttnRes partitions layers into blocks. It uses standard residuals within blocks and applies attention only over these block-level representations. This variant significantly reduces memory overhead while retaining most of Full AttnRes's performance benefits.
- Performance and Efficiency: AttnRes consistently outperforms baselines across various compute budgets. Block AttnRes, for instance, achieves comparable loss to a baseline model trained with 1.25x more computational resources. The most notable improvements are seen in multi-step reasoning tasks (e.g., +7.5 on GPQA-Diamond) and code generation (e.g., +3.1 on HumanEval).
- Improved Training Dynamics: AttnRes effectively mitigates the "PreNorm dilution" problem by keeping output magnitudes bounded across depth and promoting a more uniform distribution of gradient norms across layers, leading to more stable training.
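The selective aggregation described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: it assumes each layer holds a single learned pseudo-query vector, scores it against every prior layer's hidden states with a scaled dot product, and takes a softmax over the layer axis (the paper's exact parameterization, e.g. how the pseudo-query is produced or whether it is per-head, may differ).

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attn_residual(layer_outputs, pseudo_query):
    """Aggregate prior layer outputs with a learned pseudo-query (sketch).

    layer_outputs: list of L arrays, each (seq, d) -- outputs of layers 0..L-1
    pseudo_query:  (d,) learned vector for the current layer (assumed shape)
    Returns a (seq, d) aggregated residual stream.
    """
    H = np.stack(layer_outputs, axis=0)                 # (L, seq, d)
    # Score every prior layer at every position against the pseudo-query.
    scores = np.einsum("d,lsd->ls", pseudo_query, H)    # (L, seq)
    scores = scores / np.sqrt(H.shape[-1])              # scaled, as in standard attention
    weights = softmax(scores, axis=0)                   # softmax over LAYERS, not tokens
    # Weighted sum over layers replaces the plain residual sum.
    return np.einsum("ls,lsd->sd", weights, H)          # (seq, d)

rng = np.random.default_rng(0)
outs = [rng.normal(size=(4, 8)) for _ in range(3)]      # 3 prior layers, seq=4, d=8
y = attn_residual(outs, rng.normal(size=8))
```

Note that a zero pseudo-query makes the layer weights uniform, so the output reduces to a plain average over prior layers; this is one way to see why the mechanism can behave as a drop-in generalization of the standard residual stream.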
In essence, AttnRes offers a clever and effective way to enhance Transformer performance and stability by allowing layers to dynamically choose which past information to leverage, providing a practical upgrade for the next generation of deep learning models.