GLM5.2 on AMD MI355X at 2626 tok/s/node at over 2x lower cost than Blackwell

Wafer.ai showcases how optimizing GLM5.2 on AMD MI355X can deliver superior performance-per-dollar for AI inference, presenting detailed technical fixes for quantizing and running models effectively. This report challenges NVIDIA's market dominance by demonstrating that AMD's hardware, despite software friction, is a viable and cheaper alternative. The Hacker News community is captivated by the potential erosion of the CUDA moat and the prospects of more affordable AI compute.

Score: 100

On Front Page: 19h

First Seen: Jul 3, 11:00 PM

Last Seen: Jul 4, 5:00 PM


The Lowdown

Wafer.ai's blog post details their success in achieving high-performance, cost-effective AI inference using AMD's MI355X GPUs, directly comparing it to NVIDIA's Blackwell. They argue that while NVIDIA benefits from day-0 software support, AMD hardware offers comparable specs at a significantly lower cost, and the software gap is closing.

The Problem: Skyrocketing demand for inference, high NVIDIA GPU prices, and limited Blackwell supply make tokens expensive.
AMD's Advantage: MI355X GPUs are about 2.75x cheaper than NVIDIA B300s with comparable hardware.
The Challenge: AMD's ROCm stack lacks NVIDIA's day-0 software support, requiring significant engineering for optimal performance on new models.
Wafer.ai's Achievement: They achieved 2626 tok/s/node on GLM5.2 (20k in / 1k out, 60% cache hit) on MI355X, reaching 80% of B200 performance at over 2x lower cost.
Methodology: They quantized GLM5.2 to MXFP4 using AMD Quark, reporting "lossless" accuracy, and chose sglang as the inference framework. Fixing two bugs in sglang's ROCm image enabled speculative decoding, which gave a nearly 3x single-stream throughput gain; they also tuned MoE kernel selection for aggregate throughput.
Significance: The "CUDA moat" is eroding; achieving SOTA on AMD is becoming more about support than custom kernel development, making single-node deployments highly attractive.
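The figures in the list above can be checked with a little arithmetic. The sketch below is illustrative only: the function names are hypothetical, it assumes cost per token scales with hardware price alone (ignoring power, hosting, and utilization), and it combines the 80% throughput figure (quoted against B200) with the 2.75x price gap (quoted against B300), which the post may not intend to be mixed this way.

```python
def uncached_prefill_tokens(input_tokens, cache_hit_rate):
    """Input tokens that still need prefill compute after prefix-cache hits."""
    return input_tokens * (1.0 - cache_hit_rate)


def cost_per_token_advantage(relative_throughput, price_ratio):
    """How many times cheaper each token is, assuming cost tracks hardware price.

    relative_throughput: AMD throughput / NVIDIA throughput (0.80 in the summary)
    price_ratio: NVIDIA price / AMD price (2.75 in the summary, quoted vs B300)
    """
    # Cheaper hardware helps by price_ratio; lower throughput gives some of it back.
    return relative_throughput * price_ratio


# Workload from the summary: 20k input tokens per request, 60% cache hit.
print(uncached_prefill_tokens(20_000, 0.60))

# 80% of the throughput on hardware that is 2.75x cheaper.
print(cost_per_token_advantage(0.80, 2.75))
```

Under these assumptions, roughly 8,000 of the 20,000 input tokens per request still need prefill, and the per-token cost advantage comes out at about 2.2x, consistent with the "over 2x" headline.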

Wafer.ai's work underscores the growing viability of AMD hardware for AI inference: with focused optimization, it can offer a compelling performance-per-dollar advantage and challenge a status quo dominated by NVIDIA.

The Gossip

Performance Claim Complications

Commenters scrutinized the reported performance figures, questioning whether the "lossless" claim for MXFP4 quantization holds given observed accuracy drops. Many pointed out that the headline 2626 tok/s is an "aggregate" metric, not raw throughput, and asked how the 60% cache hit rate affects the overall results; both factors complicate direct comparisons.
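For context on why a "lossless" claim invites scrutiny, here is a minimal sketch of MXFP4-style block quantization. It follows the general shape of the OCP Microscaling format (blocks of 32 values sharing one power-of-two scale, with each element stored as a 4-bit E2M1 float), but it is not AMD Quark's algorithm: the scale choice, the tie-breaking, and the function name `quantize_block` are assumptions made for illustration.

```python
import math

# Magnitudes representable by FP4 E2M1 (the sign bit is handled separately).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 32  # MX formats share one scale across 32 elements
E2M1_EMAX = 2    # exponent of the largest E2M1 value: 6 = 1.5 * 2**2


def quantize_block(block):
    """Round one block to an MXFP4-style grid and return the dequantized values."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return [0.0] * len(block)
    # Shared power-of-two scale (stored as an 8-bit exponent in the real format).
    scale = 2.0 ** (math.floor(math.log2(amax)) - E2M1_EMAX)
    out = []
    for x in block:
        # Clamp to the largest representable magnitude, then snap to the grid.
        mag = min(abs(x) / scale, E2M1_VALUES[-1])
        # Ties go to the smaller magnitude here; real kernels may round differently.
        nearest = min(E2M1_VALUES, key=lambda v: abs(v - mag))
        out.append(math.copysign(nearest * scale, x))
    return out


# A hypothetical block of weights, padded with zeros to the block size.
weights = [1.0, 0.3, -0.7, 6.0] + [0.0] * (BLOCK_SIZE - 4)
restored = quantize_block(weights)
worst = max(abs(a - b) for a, b in zip(weights, restored))
print("max abs error in block:", worst)
```

Each element can land on only eight magnitudes per block, so individual weights are not preserved exactly. A "lossless" result therefore has to mean that benchmark accuracy is unchanged within noise, which is the part commenters questioned.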

AMD's Ascent Against NVIDIA

A central theme was the long-anticipated challenge to NVIDIA's CUDA dominance. While some commenters remained skeptical given AMD's history of software support issues, others highlighted the increasing viability of AMD hardware, noting that large companies like Meta and OpenAI are beginning to use it. The discussion underscores the ongoing hope for genuine competition in the AI hardware space.

Power, Price, and Practicality

Many commenters emphasized the importance of "performance per watt" as a critical metric, especially for data centers outside the US where electricity costs are higher and power supply is limited. There was debate about the actual power consumption differences between AMD and NVIDIA cards and how it impacts datacenter density and operational costs, beyond just the initial hardware price.

Future Frontiers of AI Compute

The discussion touched on the broader implications for AI hardware and optimization. Commenters expressed excitement about the potential of "agentic coding" to unlock underutilized compute resources across different architectures. There was also forward-looking speculation about NVIDIA's next-gen Rubin platform and the eventual trickle-down of powerful AI acceleration to consumer-level devices.