
Flash-KMeans: Fast and Memory-Efficient Exact K-Means

Flash-KMeans re-engineers the classic K-means algorithm into a high-performance online primitive for modern AI systems. The work identifies low-level system constraints (GPU memory I/O and atomic write contention), not algorithmic complexity, as the true bottleneck, and achieves multi-fold speedups over current industry standards by restructuring how the algorithm interacts with the hardware.

Score: 5
Comments: 0
Highest Rank: #3
Time on Front Page: 10h
First Seen: Mar 20, 10:00 AM
Last Seen: Mar 20, 7:00 PM
Rank Over Time: [chart]

The Lowdown

The paper introduces Flash-KMeans, a highly optimized implementation of the K-means clustering algorithm for modern GPU workloads. K-means has historically been confined to offline processing by performance constraints; this work aims to make it a first-class online primitive within AI systems by tackling fundamental limitations of existing GPU-based approaches.

  • The authors identify that current GPU K-means implementations are primarily bottlenecked by low-level system constraints, not theoretical algorithmic complexity.
  • Specifically, the assignment stage suffers from a severe I/O bottleneck due to the massive N×K distance matrix materialization in High Bandwidth Memory (HBM).
  • The centroid update stage is heavily penalized by hardware-level atomic write contention caused by irregular, scatter-style token aggregations.
  • Flash-KMeans introduces two core kernel-level innovations to address these issues:
    • FlashAssign: Fuses distance computation with an online argmin function to completely bypass intermediate memory materialization, eliminating the I/O bottleneck.
    • Sort-inverse update: Explicitly constructs an inverse mapping to transform high-contention atomic scatters into high-bandwidth, segment-level localized reductions, mitigating write contention.
  • The system also integrates algorithm-system co-designs, such as chunked-stream overlap and cache-aware compile heuristics, to ensure practical deployability.
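The FlashAssign idea described above, fusing distance computation with an online argmin so the full N×K distance matrix is never materialized, can be illustrated with a NumPy sketch. The paper's actual kernels are GPU code; the function and parameter names below are our own, not the paper's API:

```python
import numpy as np

def flashassign_sketch(X, C, tile=128):
    """Assign each of N points to its nearest of K centroids, processing
    centroids in tiles so only an N x tile distance block exists at any
    time, never the full N x K matrix (illustrative sketch)."""
    N = X.shape[0]
    best_d = np.full(N, np.inf)            # running minimum distance
    best_i = np.zeros(N, dtype=np.int64)   # running argmin index
    x_sq = (X * X).sum(axis=1)
    for start in range(0, C.shape[0], tile):
        Ct = C[start:start + tile]
        # squared distances for this tile only: ||x||^2 - 2 x.c + ||c||^2
        d = x_sq[:, None] - 2.0 * (X @ Ct.T) + (Ct * Ct).sum(axis=1)[None, :]
        t_min = d.argmin(axis=1)
        t_d = d[np.arange(N), t_min]
        upd = t_d < best_d                 # online argmin update
        best_d[upd] = t_d[upd]
        best_i[upd] = start + t_min[upd]
    return best_i, best_d
```

On a GPU the running minimum and index live in registers while centroid tiles stream through on-chip memory, which is what removes the HBM round-trip for the intermediate distances.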
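The sort-inverse update can likewise be sketched in NumPy: sorting the assignments constructs an inverse mapping that makes each cluster's points contiguous, so the centroid update becomes a segment-level reduction instead of a contention-heavy scatter-add. Again, the names here are illustrative, not the paper's:

```python
import numpy as np

def sort_inverse_update(X, assign, K):
    """Recompute centroids via a sort-based inverse mapping: points are
    grouped by cluster so each centroid is a contiguous segment mean,
    replacing atomic scatter-adds (illustrative sketch)."""
    order = np.argsort(assign, kind="stable")  # inverse mapping
    Xs = X[order]                              # points grouped by cluster
    counts = np.bincount(assign, minlength=K)
    ends = np.cumsum(counts)
    starts = ends - counts
    new_C = np.zeros((K, X.shape[1]))
    for k in range(K):                         # segment-level reduction
        if counts[k] > 0:
            new_C[k] = Xs[starts[k]:ends[k]].sum(axis=0) / counts[k]
    return new_C
```

Each segment sum reads a contiguous, high-bandwidth slice of memory, which is why this layout avoids the hardware atomic contention of scattering tokens directly into K accumulators.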

In evaluations on NVIDIA H200 GPUs, Flash-KMeans achieves end-to-end speedups of up to 17.9x over the best existing baselines, and outperforms industry-standard libraries by wider margins: 33x over cuML and more than 200x over FAISS, positioning it well for K-means workloads in high-performance AI systems.