
Do Transformers Need Three Projections? Systematic Study of QKV Variants

A new paper systematically tests the long-held assumption that Transformers require three distinct Query, Key, and Value projections. By sharing projections, the researchers cut KV cache memory by up to 96.9% with little reported loss in quality (for example, a 3.1% perplexity degradation at a 50% cache reduction). The results matter most for memory-constrained, on-device AI inference.

Score: 19
Comments: 2
Highest Rank: #1
Time on Front Page: 18h
First Seen: Jun 4, 11:00 PM
Last Seen: Jun 5, 4:00 PM
[Chart: Rank Over Time]

The Lowdown

Transformers have become ubiquitous in AI, with the Query, Key, and Value (QKV) attention mechanism at their core. Despite its centrality, the precise individual contributions of these three projections and the effects of their omission have remained largely unexplored. This paper presents a systematic evaluation of various projection sharing constraints to understand their impact and potential benefits.

Here's a breakdown of the key findings:

  • The study evaluates three projection sharing constraints: Q-K=V (shared key-value), Q=K-V (shared query-key), and Q=K=V (single projection). In this notation, "=" joins projections that share weights and "-" separates independent ones.
  • The variants that tie Q and K (Q=K-V, Q=K=V) produce symmetric attention maps; for these, the authors explore introducing asymmetry through 2D positional encodings.
  • Experiments span synthetic tasks, vision datasets (MNIST, CIFAR, TinyImageNet, anomaly detection), and large-scale language modeling (300M and 1.2B parameter models on 10B tokens).
  • Crucially, shared projection transformers perform on par with, or occasionally even better than, traditional QKV transformers across these diverse tasks.
  • In language modeling, the Q-K=V sharing variant reduced KV cache by 50% with only a 3.1% perplexity degradation.
  • Projection sharing is complementary to existing head sharing techniques like Grouped Query Attention (GQA) and Multi-Query Attention (MQA). Combining Q-K=V with GQA-4 yields an 87.5% cache reduction, and Q-K=V + MQA achieves a 96.9% reduction, which makes on-device inference more practical.
  • The paper explains that Q-K=V preserves quality because keys and values can occupy similar representational spaces, and attention operates in a low-rank regime. Conversely, Q=K-V fails because it breaks attention directionality.
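
The symmetry mentioned in the list above can be checked numerically. The sketch below is not the paper's code: it uses random matrices, hypothetical helper names (`attention_scores`, `is_symmetric`), and looks only at pre-softmax scores. When Q and K come from the same projection, the score of token i attending to token j equals the score of j attending to i, which is one way to see why the Q=K variants lose directionality.

```python
import random

def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def attention_scores(x, w_q, w_k):
    """Pre-softmax attention scores Q K^T for token matrix x."""
    q = matmul(x, w_q)
    k = matmul(x, w_k)
    return matmul(q, transpose(k))

def is_symmetric(m, tol=1e-9):
    n = len(m)
    return all(abs(m[i][j] - m[j][i]) <= tol for i in range(n) for j in range(n))

rng = random.Random(0)
x = random_matrix(4, 8, rng)    # 4 tokens, model width 8 (illustrative sizes)
w_q = random_matrix(8, 8, rng)  # query projection
w_k = random_matrix(8, 8, rng)  # separate key projection

separate = attention_scores(x, w_q, w_k)  # standard attention: distinct Q and K
tied = attention_scores(x, w_q, w_q)      # Q=K: one projection used for both

print("separate Q, K symmetric:", is_symmetric(separate))
print("tied Q=K symmetric:", is_symmetric(tied))
```

With tied weights the score matrix is (XW)(XW)^T, which is symmetric by construction; with separate weights it is symmetric only by coincidence.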

This research highlights projection sharing as a powerful, underexplored form of weight tying in attention mechanisms. It offers direct, quantifiable inference memory benefits that are particularly valuable for deploying AI models to edge devices, opening new avenues for efficient AI applications.
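
The quoted cache reductions follow from simple counting, which the sketch below reproduces. It makes two assumptions that the summary does not state: a model with 16 attention heads (the value for which Q-K=V + MQA leaves 1/32 of the cache, matching 96.9%) and GQA-4 leaving 4 key-value heads. The function name `kv_cache_fraction` is illustrative.

```python
from fractions import Fraction

def kv_cache_fraction(num_heads: int, num_kv_heads: int, shared_kv: bool) -> Fraction:
    """Fraction of the baseline per-token KV cache that remains.

    Baseline: each of num_heads heads caches one key and one value tensor.
    """
    tensors_per_kv_head = 1 if shared_kv else 2  # K=V stores one tensor instead of two
    baseline = 2 * num_heads
    return Fraction(tensors_per_kv_head * num_kv_heads, baseline)

NUM_HEADS = 16  # assumption inferred from the quoted percentages, not stated in the summary

configs = {
    "Q-K=V": NUM_HEADS,               # every head keeps its own shared K=V tensor
    "Q-K=V + GQA-4": NUM_HEADS // 4,  # assumed: 4 key-value heads remain
    "Q-K=V + MQA": 1,                 # a single key-value head for all query heads
}

for name, kv_heads in configs.items():
    remaining = kv_cache_fraction(NUM_HEADS, kv_heads, shared_kv=True)
    print(f"{name}: {float(1 - remaining):.1%} reduction")
```

Under these assumptions the three configurations leave 1/2, 1/8, and 1/32 of the baseline cache, which corresponds to the 50%, 87.5%, and 96.9% reductions cited above. The factor of two from sharing K and V multiplies with the head-sharing factor, which is why the two techniques are described as complementary.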