From 300KB to 69KB per Token: How LLM Architectures Solve the KV Cache Problem
This technical dive demystifies the KV cache, the working memory of large language models, explaining its physical and economic implications. It traces the evolution of memory architectures, from GPT-2's wasteful total recall to Gemma 3's selective attention, highlighting how these engineering choices shape AI's 'mind.' Ultimately, the piece prompts reflection on how AI memory is designed today versus how it might evolve toward greater autonomy.
The Lowdown
This article explores the concept of the KV (Key-Value) cache in Large Language Models (LLMs), a critical but often overlooked component that dictates how these models 'remember' conversational context. It delves into the technical mechanisms, the practical implications for users and providers, and the philosophical underpinnings of memory evolution in AI, ultimately questioning the future of intelligent systems' ability to manage their own cognition.
- KV Cache Explained: The KV cache stores the key-value pairs computed for each processed token, so past conversational context never has to be re-encoded. This reduces the per-token cost of generation from quadratic to linear in context length, but the storage itself has tangible costs in GPU memory, power, and dollars.
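The mechanism is easy to see in a toy decode loop. This is an illustrative sketch, not any model's actual code: projections are collapsed to the identity for brevity, and a single attention head is shown.

```python
import numpy as np

def attend(q, K, V):
    """Attention for one query vector over all cached keys/values."""
    scores = K @ q / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

d = 8
rng = np.random.default_rng(0)
K_cache, V_cache = [], []  # the KV cache: one (k, v) entry per past token

for step in range(5):
    x = rng.standard_normal(d)            # current token's hidden state
    q, k, v = x, x, x                     # toy projections (identity here)
    K_cache.append(k)                     # store k and v once...
    V_cache.append(v)
    out = attend(q, np.stack(K_cache), np.stack(V_cache))  # ...reuse every step
    # With the cache, step t attends over t stored entries (linear work).
    # Without it, K and V for all past tokens would be recomputed at every
    # step, making each step quadratic-cost in context length.
```

The cache trades memory for compute: every token generated leaves a permanent per-layer, per-head footprint in GPU memory for the rest of the conversation.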
- Evolution of LLM Memory: The article traces memory architecture improvements across models:
- GPT-2 (300 KiB/token): Simple multi-head attention; every head stored its own full set of keys and values, remembering everything independently.
- Llama 3 (128 KiB/token): Grouped-query attention (GQA) allowed multiple query heads to share key-value pairs, reducing memory without significant quality loss.
- DeepSeek V3 (68.6 KiB/token): Multi-head latent attention (MLA) compressed KV tensors into a lower-dimensional latent space.
- Gemma 3: Introduced a sliding window with local and global attention layers, prioritizing recent context.
- Mamba: State Space Models (SSMs) like Mamba offer an alternative without a KV cache, maintaining a fixed-size hidden state by filtering information in real-time.
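The headline numbers above follow from simple arithmetic over each model's published configuration. A back-of-envelope sketch, assuming fp16 (2 bytes per element) and the commonly cited configs (48-layer GPT-2 XL with 25 heads of dimension 64; Llama 3 8B with 32 layers and 8 GQA key-value heads of dimension 128; DeepSeek V3 with 61 layers storing one 576-dim MLA latent per layer):

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_elem=2):
    # Factor of 2: one key vector AND one value vector per head per layer.
    return 2 * layers * kv_heads * head_dim * bytes_per_elem

# GPT-2 XL, plain multi-head attention: every head keeps its own K/V.
gpt2 = kv_bytes_per_token(layers=48, kv_heads=25, head_dim=64)

# Llama 3 8B, grouped-query attention: 32 query heads share 8 KV heads.
llama3 = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128)

# DeepSeek V3, multi-head latent attention: a single compressed latent
# per layer (512-dim KV latent + 64-dim decoupled RoPE key) replaces
# separate per-head K and V tensors.
deepseek = 61 * 576 * 2

print(gpt2 / 1024, llama3 / 1024, deepseek / 1024)
# 300.0, 128.0, and ~68.6 KiB per token respectively
```

Multiply by context length to see why this matters: at 128K tokens, GPT-2-style caching would need ~37 GiB per sequence, while MLA needs ~8.4 GiB.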
- User Experience and Economics: The ephemeral nature of the KV cache leads to noticeable delays when resuming old conversations, as the cache must be rebuilt. This 'remembering' has clear economic implications, reflected in prompt caching discounts offered by API providers like OpenAI and Anthropic. Long conversations also suffer from 'context rot' due to attention spreading thin.
- The Medium-Term Memory Void: Current LLM architectures lack native medium-term memory, relying on external heuristics like RAG (Retrieval-Augmented Generation), databases, and system prompts to bridge the gap between volatile working memory and permanent model weights. These are functional but externally bolted-on solutions.
- The Compaction Problem: When the KV cache grows too large, models can summarize their own context, a process called 'prompted compaction.' This is lossy and can lead to models forgetting critical details. While 'learned compaction' (e.g., Cursor's approach) shows promise for specific domains like coding, its effectiveness in general conversational contexts where 'important details' are less clear remains a challenge.
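The shape of prompted compaction can be sketched in a few lines. This is a hypothetical illustration, not Cursor's or any provider's implementation: the `summarize` callback stands in for a real LLM call, and word count stands in for a real tokenizer.

```python
def compact(messages, budget, summarize):
    """Prompted compaction: if the context exceeds `budget` tokens, replace
    the oldest turns with a model-written summary. Lossy by design: any
    detail the summary omits is gone for good."""
    total = sum(len(m.split()) for m in messages)  # crude word-count proxy for tokens
    if total <= budget:
        return messages
    recent = messages[-2:]                 # the most recent turns survive verbatim
    summary = summarize(messages[:-2])     # everything older gets compressed
    return [summary] + recent

# Stub summarizer standing in for a real LLM call:
stub = lambda turns: f"[summary of {len(turns)} earlier turns]"
history = ["alpha beta gamma", "delta epsilon", "zeta eta theta", "iota kappa"]
compacted = compact(history, budget=5, summarize=stub)
# -> ["[summary of 2 earlier turns]", "zeta eta theta", "iota kappa"]
```

The hard part is entirely inside `summarize`: deciding which details are 'important' is well-defined for a coding session (open files, failing tests) but much murkier for open-ended conversation, which is the gap the article identifies.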
- External Memory Parallels: The reliance on external files, databases, and searchable systems mirrors how humans augment their own biological memory with tools like spreadsheets and bookmarks. This externalization offers transparency and auditability that internal model mechanisms lack.
- Reshaping the AI Mind: The architectural shifts in memory reflect decisions about how an AI structures its experience—trading raw detail for greater scale. The article draws a parallel to Greg Egan's Diaspora, where digital citizens reshape their cognition. Currently, humans design these memory systems, but the emergence of learned compaction hints at a future where AI might have more agency over its own memory management.
The KV cache is more than a technical detail; it's the physical foundation of AI's ephemeral consciousness. The ongoing evolution of its design is a fundamental choice about what AI remembers, what it discards, and whether it will eventually gain a say in defining its own cognitive architecture.