Accelerating Gemma 4: faster inference with multi-token prediction drafters
Google supercharges its Gemma 4 models with Multi-Token Prediction (MTP) drafters, enabling up to 3x faster inference without compromising output quality. Built on speculative decoding, the technique has developers excited about bringing frontier-class speeds to personal devices and edge hardware. Hacker News users are buzzing about the practical implications for local LLMs, comparing Gemma 4 to other open-source models, and debating Google's broader strategy.
The Lowdown
Google has rolled out a significant performance enhancement for its Gemma 4 large language models: Multi-Token Prediction (MTP) drafters. Building on speculative decoding, the feature aims to drastically reduce inference latency and improve responsiveness, making Gemma 4 even more practical for developers across platforms. The core idea is to attack the memory-bandwidth bottleneck of traditional autoregressive inference, where each forward pass reads the full model weights yet produces only a single token.
- MTP uses a lightweight "drafter" model to propose several future tokens at once, while the heavier target model (e.g., Gemma 4 31B) verifies the whole drafted block in a single parallel forward pass (see the sketch after this list). Because verification checks many tokens per pass, the wall-clock time to generate a sequence drops sharply.
- The technique promises up to a 3x speedup and, critically, no degradation in output quality or reasoning, because the primary Gemma model still performs the final verification and only accepts drafted tokens it agrees with.
- Key benefits include improved responsiveness for real-time applications, supercharged local development on consumer GPUs, and enhanced on-device performance for edge models, all while preserving battery life.
- Under the hood, architectural enhancements let the drafter share the target model's activations and KV cache rather than recomputing them, and edge models get additional optimizations such as efficient clustering.
- The MTP drafters are available today under an Apache 2.0 license, with weights on Hugging Face and Kaggle and integrations for popular inference engines including MLX, vLLM, SGLang, and Ollama (a vLLM-style usage sketch also follows this list).
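To make the draft-and-verify loop in the first bullet concrete, here is a minimal greedy speculative-decoding sketch in Python using Hugging Face transformers. The checkpoint names are placeholders (the release's actual drafter IDs aren't reproduced here), and a real implementation runs inside the serving engine, shares the KV cache, and uses probabilistic acceptance rather than this simple exact-match rule.

```python
# A minimal sketch of greedy draft-and-verify speculative decoding.
# Checkpoint names are placeholders, not the actual Gemma 4 / drafter IDs,
# and this toy loop recomputes full prefixes instead of reusing a KV cache.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

TARGET_ID = "google/gemma-target"    # placeholder for the large target model
DRAFTER_ID = "google/gemma-drafter"  # placeholder for the lightweight drafter

tok = AutoTokenizer.from_pretrained(TARGET_ID)
target = AutoModelForCausalLM.from_pretrained(TARGET_ID, torch_dtype=torch.bfloat16)
drafter = AutoModelForCausalLM.from_pretrained(DRAFTER_ID, torch_dtype=torch.bfloat16)

@torch.no_grad()
def speculative_generate(prompt: str, max_new_tokens: int = 128, k: int = 4) -> str:
    ids = tok(prompt, return_tensors="pt").input_ids
    prompt_len = ids.shape[1]
    while ids.shape[1] - prompt_len < max_new_tokens:
        # 1. The cheap drafter proposes k tokens autoregressively.
        draft = ids
        for _ in range(k):
            nxt = drafter(draft).logits[:, -1].argmax(dim=-1, keepdim=True)
            draft = torch.cat([draft, nxt], dim=-1)

        # 2. The target scores the entire drafted block in ONE forward pass.
        logits = target(draft).logits
        # Target's greedy choice at each drafted position (shifted by one).
        target_preds = logits[:, ids.shape[1] - 1 : -1].argmax(dim=-1)
        drafted = draft[:, ids.shape[1]:]

        # 3. Accept the longest prefix where drafter and target agree, then
        #    append one token chosen by the target so progress is guaranteed.
        agree = (target_preds == drafted)[0].long()
        n_accept = int(agree.cumprod(dim=0).sum())
        if n_accept == k:
            # Every drafted token accepted: bonus token from the final position.
            bonus = logits[:, -1].argmax(dim=-1, keepdim=True)
        else:
            # First mismatch: take the target's own token at that position.
            bonus = target_preds[:, n_accept : n_accept + 1]
        ids = torch.cat([ids, drafted[:, :n_accept], bonus], dim=-1)
    return tok.decode(ids[0], skip_special_tokens=True)

print(speculative_generate("Explain speculative decoding in one sentence."))
```

The latency win comes from step 2: the target model checks k drafted tokens in a single forward pass, so its weights are read from memory once per block rather than once per token.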
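For the engine integrations in the last bullet, the sketch below shows roughly how a drafter is wired up in vLLM. The model IDs are placeholders and vLLM's speculative-decoding arguments have changed between releases, so treat the parameter names as assumptions and check the docs for your version.

```python
# Illustrative only: placeholder model IDs, and vLLM's speculative-decoding
# arguments have varied between releases -- consult the docs for your version.
from vllm import LLM, SamplingParams

llm = LLM(
    model="google/gemma-target",                # placeholder target checkpoint
    speculative_model="google/gemma-drafter",   # placeholder drafter checkpoint
    num_speculative_tokens=4,                   # tokens drafted per verification step
)

outputs = llm.generate(
    ["Summarize speculative decoding in two sentences."],
    SamplingParams(temperature=0.0, max_tokens=128),
)
print(outputs[0].outputs[0].text)
```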
This release represents a notable advancement in making powerful AI models more efficient and accessible, empowering developers to deploy high-performance LLMs in resource-constrained environments.
The Gossip
Performance & Peer Comparisons
Users are excitedly sharing benchmark numbers and comparing Gemma 4 + MTP with other popular open-source models like Qwen 3.6. Some find Gemma faster and more efficient (especially in terms of tokens per output), while others still prefer Qwen for specific tasks like coding or for overall accuracy, even at the cost of some speed. There's general enthusiasm for the rapid progress in local inference.
Implementation & Integration Insights
Many comments focus on the practicalities of getting MTP working. Discussions revolve around its integration into llama.cpp (with ongoing PRs), LM Studio, and Ollama, with users sharing tips and reporting success or challenges with specific hardware and model configurations (e.g., 24GB VRAM, Apple Silicon). Questions also arise about the exact nature of the MTP models (new releases vs. existing model heads).
Google's Strategic Stance
Commenters speculate on Google's strategy behind open-sourcing Gemma and prioritizing performance-to-compute efficiency over raw performance, contrasting it with other "frontier" model providers. Some suggest it's a play for broader adoption, distribution to billions of users, or even a move to counter large AI cloud labs, rather than solely promoting their own cloud offerings.
The "Dial-Up" Analogy & Future Speeds
The discussion often veers into the broader future of LLM inference speed, drawing parallels to the early internet's transition from dial-up to broadband. Users muse about a future where LLM responses are instantaneous (thousands of tokens per second), envisioning what applications might become possible with such speed, and highlighting companies working on dedicated hardware solutions.