How fast is N tokens per second really?
How fast is 'N tokens per second' really? This interactive web tool lets you visualize and feel the speed of LLM output, making abstract benchmarks tangible. It's popular on HN because it clarifies a frequently cited yet often misunderstood performance metric, especially by demonstrating how content type dramatically alters the perceived speed. Understanding token speed is crucial for anyone evaluating or building with large language models.
The Lowdown
The `tokenspeed` web tool addresses a common challenge in the LLM space: internalizing what tokens-per-second (tok/s) benchmarks actually feel like in practice. Everyone quotes numbers like '47 tok/s on an M3,' but these figures often lack a real-world frame of reference. The visualization bridges that gap by simulating output at various speeds and for different content types.
- Visualization Modes: The tool offers four distinct modes: `code` (syntax-highlighted pseudocode), `text` (lorem ipsum prose), `think` (alternating reasoning and code for 'thinking' models), and `agent` (simulating an AI coding agent with tool calls and pauses).
- Speed Benchmarks: Users can interactively test speeds ranging from 1 tok/s (Raspberry Pi class) to 800 tok/s (Cerebras class), including typical hosted Claude/GPT speeds (60 tok/s) and Groq territory (200 tok/s).
- Perceptual Difference: A key insight is that perceived speed varies significantly by content. Code is more token-dense than prose, so it can feel slower to stream at the same raw tok/s rate, even though the underlying benchmark is honest.
- Tokenization: The tool approximates BPE-style tokenization, noting that short words are often one token, while longer identifiers, punctuation, and operators can split into multiple tokens.
- Prose Conversion: It provides a useful conversion: English prose averages approximately 1.3 tokens per word, meaning 30 tok/s roughly translates to 23 words per second.
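The prose conversion above is simple division, and a minimal sketch makes it easy to try other speeds. The 1.3 tokens-per-word figure comes from the summary; the function names and the 500-word example are illustrative assumptions, not part of the tool.

```python
# Rough average for English prose, as quoted in the summary above.
TOKENS_PER_WORD = 1.3


def words_per_second(tokens_per_second: float,
                     tokens_per_word: float = TOKENS_PER_WORD) -> float:
    """Convert a tok/s rate into an approximate words-per-second rate."""
    return tokens_per_second / tokens_per_word


def seconds_to_stream(word_count: int,
                      tokens_per_second: float,
                      tokens_per_word: float = TOKENS_PER_WORD) -> float:
    """Estimate how long a prose answer of `word_count` words takes to stream."""
    return word_count * tokens_per_word / tokens_per_second


if __name__ == "__main__":
    # Speeds taken from the tool's presets mentioned above.
    for rate in (1, 30, 60, 200, 800):
        wps = words_per_second(rate)
        secs = seconds_to_stream(500, rate)
        print(f"{rate:>4} tok/s ~ {wps:6.1f} words/s, 500 words in {secs:7.1f} s")
```

The ratio is only an average, so results for short or unusual texts can differ noticeably.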
Ultimately, tokenspeed provides an invaluable resource for developers and enthusiasts to intuitively grasp the impact of LLM generation speed, highlighting that the user experience is as much about content type as it is about raw numbers.
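To see why code can feel slower than prose at the same tok/s, here is a deliberately crude token estimator. It is a hypothetical heuristic, not the tool's tokenizer and not real BPE: it assumes each alphanumeric run costs one token per 6 characters (rounded up) and each punctuation or operator character costs one token. The sample strings and the 6-character threshold are illustrative assumptions.

```python
import re

# Group 1: a run of ASCII letters/digits. Group 2: one punctuation/operator char.
# Whitespace is skipped. This is a stand-in for BPE, not an implementation of it.
PIECE_RE = re.compile(r"([A-Za-z0-9]+)|([^A-Za-z0-9\s])")


def estimate_tokens(text: str, max_chars_per_token: int = 6) -> int:
    """Estimate a token count with the hypothetical heuristic described above."""
    total = 0
    for match in PIECE_RE.finditer(text):
        word = match.group(1)
        if word is not None:
            # Ceiling division: short words cost 1 token, long ones split.
            total += -(-len(word) // max_chars_per_token)
        else:
            total += 1  # each punctuation/operator character is its own token
    return total


def tokens_per_char(text: str) -> float:
    """Token density: estimated tokens per visible character of the text."""
    return estimate_tokens(text) / len(text)


PROSE_SAMPLE = "The quick brown fox jumps over the lazy dog"
CODE_SAMPLE = "result = compute_total(items[i], tax_rate=0.2)"

if __name__ == "__main__":
    for label, sample in (("prose", PROSE_SAMPLE), ("code", CODE_SAMPLE)):
        print(f"{label}: {estimate_tokens(sample)} tokens, "
              f"{tokens_per_char(sample):.2f} tokens/char")
```

Under these assumptions the code sample needs roughly twice as many tokens per character as the prose sample, so the same tok/s rate fills the screen about half as fast. A real tokenizer would give different counts, but the direction of the effect matches what the tool demonstrates.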
The Gossip
Visualizing Velocity: Gratitude for a Gripping Glimpse
Many commenters expressed immediate gratitude and appreciation for the `tokenspeed` tool, calling it 'great,' 'cool,' and 'neat.' They lauded its effectiveness in making the abstract concept of tokens-per-second tangible and easy to internalize, providing a much-needed 'gut feel' for LLM output speeds that benchmarks alone often fail to convey.
Performance Perception vs. Production Prowess
The discussion contrasted the subjective perception of LLM speed with its practical utility. Some found even low rates like 5 tok/s surprisingly fast, while others argued that raw token speed matters less than output quality, or than the 'thinking' time and context-window overhead that can slow down real-world interactions. The consensus was that high speeds are transformative but must be weighed alongside other performance factors.
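The point about 'thinking' time can be made concrete with a little arithmetic. The sketch below assumes a simple model in which hidden reasoning tokens are generated at the same rate as visible ones, and any prompt-processing delay is a fixed `time_to_first_token`; both the model and all numbers are illustrative assumptions, not measurements from the thread.

```python
def time_to_answer(visible_tokens: int,
                   thinking_tokens: int,
                   tokens_per_second: float,
                   time_to_first_token: float = 0.0) -> float:
    """Seconds until the full visible answer has streamed.

    Assumes thinking tokens are generated before the answer, at the same rate.
    `time_to_first_token` is a hypothetical fixed delay for prompt processing.
    """
    generated = thinking_tokens + visible_tokens
    return time_to_first_token + generated / tokens_per_second


def effective_visible_rate(visible_tokens: int,
                           thinking_tokens: int,
                           tokens_per_second: float,
                           time_to_first_token: float = 0.0) -> float:
    """Visible tokens per second of wall-clock time, as the user experiences it."""
    total = time_to_answer(visible_tokens, thinking_tokens,
                           tokens_per_second, time_to_first_token)
    return visible_tokens / total


if __name__ == "__main__":
    # Illustrative case: 300 visible tokens after 1200 thinking tokens at 60 tok/s.
    secs = time_to_answer(300, 1200, 60)
    rate = effective_visible_rate(300, 1200, 60)
    print(f"answer complete after {secs:.1f} s, effective {rate:.1f} visible tok/s")
```

In this example a nominal 60 tok/s model delivers its visible answer at an effective 12 tok/s, which is one way a quoted benchmark and the felt experience can diverge.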
The Race for Rapid Response: Agentic Acceleration
A prominent theme revolved around the exciting implications of extremely high token generation speeds for advanced AI applications, particularly 'agentic' coding. Commenters noted that achieving 600+ tok/s on platforms like Groq or thousands of tok/s on Cerebras with specific models fundamentally changes the nature of AI agents, making sophisticated, rapid-fire interactions and multi-step reasoning a present reality for certain constrained tasks.