
Flash-KMeans: Fast and Memory-Efficient Exact K-Means

Flash-KMeans re-engineers the classic K-means algorithm into a high-performance online primitive for modern AI systems. The work identifies low-level system constraints (GPU memory I/O and atomic write contention), not algorithmic complexity, as the true bottleneck, and achieves multi-fold speedups over current industry standards by restructuring how the algorithm interacts with the hardware.

Score: 5
Comments: 0
Highest Rank: #3
Time on Front Page: 10h
First Seen: Mar 20, 10:00 AM
Last Seen: Mar 20, 7:00 PM
Rank Over Time: [chart]

The Lowdown

The paper introduces Flash-KMeans, a highly optimized implementation of the K-means clustering algorithm for modern GPU workloads. K-means has historically been confined to offline processing by performance constraints; this work aims to make it a first-class online primitive within AI systems by tackling fundamental limitations of existing GPU-based approaches.

  • The authors identify that current GPU K-means implementations are primarily bottlenecked by low-level system constraints, not theoretical algorithmic complexity.
  • Specifically, the assignment stage suffers from a severe I/O bottleneck due to the massive N×K distance matrix materialization in High Bandwidth Memory (HBM).
  • The centroid update stage is heavily penalized by hardware-level atomic write contention caused by irregular, scatter-style token aggregations.
  • Flash-KMeans introduces two core kernel-level innovations to address these issues:
    • FlashAssign: Fuses distance computation with an online argmin function to completely bypass intermediate memory materialization, eliminating the I/O bottleneck.
    • Sort-inverse update: Explicitly constructs an inverse mapping to transform high-contention atomic scatters into high-bandwidth, segment-level localized reductions, mitigating write contention.
  • The system also integrates algorithm-system co-designs, such as chunked-stream overlap and cache-aware compile heuristics, to ensure practical deployability.
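The FlashAssign idea described above, fusing distance computation with an online argmin so the full N×K distance matrix is never materialized, can be illustrated with a NumPy sketch. The paper's actual kernels are GPU code; the function and parameter names below are our own, not the paper's API:

```python
import numpy as np

def flashassign_sketch(X, C, tile=128):
    """Assign each of N points to its nearest of K centroids, processing
    centroids in tiles so only an N x tile distance block exists at any
    time, never the full N x K matrix (illustrative sketch)."""
    N = X.shape[0]
    best_d = np.full(N, np.inf)            # running minimum distance
    best_i = np.zeros(N, dtype=np.int64)   # running argmin index
    x_sq = (X * X).sum(axis=1)
    for start in range(0, C.shape[0], tile):
        Ct = C[start:start + tile]
        # squared distances for this tile only: ||x||^2 - 2 x.c + ||c||^2
        d = x_sq[:, None] - 2.0 * (X @ Ct.T) + (Ct * Ct).sum(axis=1)[None, :]
        t_min = d.argmin(axis=1)
        t_d = d[np.arange(N), t_min]
        upd = t_d < best_d                 # online argmin update
        best_d[upd] = t_d[upd]
        best_i[upd] = start + t_min[upd]
    return best_i, best_d
```

On a GPU the running minimum and index live in registers while centroid tiles stream through on-chip memory, which is what removes the HBM round-trip for the intermediate distances.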
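The sort-inverse update can likewise be sketched in NumPy: sorting the assignments constructs an inverse mapping that makes each cluster's points contiguous, so the centroid update becomes a segment-level reduction instead of a contention-heavy scatter-add. Again, the names here are illustrative, not the paper's:

```python
import numpy as np

def sort_inverse_update(X, assign, K):
    """Recompute centroids via a sort-based inverse mapping: points are
    grouped by cluster so each centroid is a contiguous segment mean,
    replacing atomic scatter-adds (illustrative sketch)."""
    order = np.argsort(assign, kind="stable")  # inverse mapping
    Xs = X[order]                              # points grouped by cluster
    counts = np.bincount(assign, minlength=K)
    ends = np.cumsum(counts)
    starts = ends - counts
    new_C = np.zeros((K, X.shape[1]))
    for k in range(K):                         # segment-level reduction
        if counts[k] > 0:
            new_C[k] = Xs[starts[k]:ends[k]].sum(axis=0) / counts[k]
    return new_C
```

Each segment sum reads a contiguous, high-bandwidth slice of memory, which is why this layout avoids the hardware atomic contention of scattering tokens directly into K accumulators.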

In evaluations on NVIDIA H200 GPUs, Flash-KMeans achieves end-to-end speedups of up to 17.9x over the best existing baselines, and outperforms industry-standard libraries by wider margins: 33x over cuML and more than 200x over FAISS, positioning it well for K-means workloads in high-performance AI systems.