HN Today

Training an LLM in Swift, Part 1: Taking matrix mult from Gflop/s to Tflop/s

This deep technical dive chronicles an ambitious quest to hand-optimize matrix multiplication in Swift for LLM training on Apple Silicon, pushing performance from Gflop/s to Tflop/s. The author meticulously re-implements Andrej Karpathy's 'llm.c' from scratch, exploring everything from 'MutableSpan' and SIMD to undocumented AMX instructions and Metal GPU shaders. It's a fascinating, low-level journey into squeezing every last flop out of Apple hardware, resonating deeply with HN's performance-obsessed and systems-minded audience.

Score: 13
Comments: 0
Highest Rank: #2
On Front Page: 21h
First Seen: May 11, 2:00 PM
Last Seen: May 12, 10:00 AM
Rank Over Time: (rank chart not reproduced)

The Lowdown

In a testament to raw optimization, Matt Gallagher embarks on an impressive technical journey, aiming to hand-implement and massively accelerate matrix multiplication for Large Language Model (LLM) training in Swift on Apple Silicon. Emulating Andrej Karpathy's minimalist 'llm.c', he forgoes high-level frameworks to explore the intricate layers of performance tuning, from basic Swift constructs to direct hardware exploitation.

  • The initial basic Swift implementation of 'matmul_forward' proved 15-20 times slower than its C counterpart, clocking in at a mere 2.8 Gflop/s (a baseline sketch of this kind of loop appears after this list).
  • Switching hot buffers from Swift's copy-on-write 'Array' to the new 'MutableSpan' type in Swift 6.2 significantly boosted training performance.
  • To achieve C-like fused-multiply-add (FMA) performance, the 'Relaxed.multiplyAdd' function from Swift-Numerics was adopted, leading to a 10x speedup in inference by enabling SIMD vectorized instructions.
  • By mirroring C's loop unrolling strategies with Swift's new 'InlineArray', the author achieved performance parity between Swift and C for core matrix operations.
  • Multi-threading was implemented using 'DispatchQueue.concurrentPerform', resulting in a 5-6.6x improvement, though it introduced complexity and reduced code readability (a parallel sketch also follows this list).
  • Further optimization on the CPU involved leveraging undocumented AMX (Apple Matrix Coprocessor) instructions via reverse engineering, yielding an additional 1.67x speedup.
  • Finally, the transition to Metal GPU shaders, starting with a naive kernel and progressing to a tiled implementation, pushed the total performance past 1 Tflop/s.
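
To make the starting point concrete, here is a minimal sketch of a naive 'matmul_forward' in plain Swift, following the layout llm.c uses (out = inp · weightᵀ + bias over a B×T batch of C-dimensional activations). It illustrates the shape of the problem, not the post's exact code:

```swift
// A sketch of the naive starting point, in the llm.c layout; not the post's exact code.
// inp:    (B, T, C)  input activations, row-major
// weight: (OC, C)    projection matrix, row-major
// bias:   (OC)
// out:    (B, T, OC) result: out = inp · weightᵀ + bias
func matmulForward(out: inout [Float], inp: [Float], weight: [Float], bias: [Float],
                   B: Int, T: Int, C: Int, OC: Int) {
    for b in 0..<B {
        for t in 0..<T {
            let inpBase = (b * T + t) * C
            let outBase = (b * T + t) * OC
            for o in 0..<OC {
                var val = bias[o]
                let wBase = o * C
                for c in 0..<C {
                    // Hot loop. The later steps (MutableSpan, Relaxed.multiplyAdd,
                    // InlineArray unrolling) all target this accumulation.
                    val += inp[inpBase + c] * weight[wBase + c]
                }
                out[outBase + o] = val
            }
        }
    }
}
```

The 'MutableSpan', 'Relaxed.multiplyAdd', and 'InlineArray' steps in the list above all operate on the innermost accumulation of a loop nest like this one.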

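The multi-threading bullet maps onto 'DispatchQueue.concurrentPerform'. The following sketch splits the B*T output rows into chunks, one per iteration; the pointer-based parameters and fixed chunk count are illustrative assumptions rather than the post's actual partitioning:

```swift
import Dispatch

// A sketch of the concurrentPerform step: split the B*T output rows into chunks,
// one chunk per iteration. Pointer parameters and the chunk count are illustrative
// choices, not necessarily the post's.
func matmulForwardParallel(out: UnsafeMutablePointer<Float>, inp: UnsafePointer<Float>,
                           weight: UnsafePointer<Float>, bias: UnsafePointer<Float>,
                           B: Int, T: Int, C: Int, OC: Int, chunks: Int = 8) {
    let rows = B * T
    let rowsPerChunk = (rows + chunks - 1) / chunks
    DispatchQueue.concurrentPerform(iterations: chunks) { chunk in
        let start = min(chunk * rowsPerChunk, rows)
        let end = min(start + rowsPerChunk, rows)
        for row in start..<end {
            for o in 0..<OC {
                var val = bias[o]
                for c in 0..<C {
                    val += inp[row * C + c] * weight[o * C + c]
                }
                // Each chunk writes a disjoint range of rows, so no locking is needed.
                out[row * OC + o] = val
            }
        }
    }
}
```
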
Gallagher successfully transforms a sluggish 2.8 Gflop/s Swift implementation into a formidable 1.1 Tflop/s powerhouse, a 382-fold improvement. This odyssey underscores that while Swift can indeed match or even surpass C in raw speed, it often comes at the cost of code elegance and readability, particularly when delving into multi-threading and direct hardware interfaces. Ultimately, even with such gains, the performance for practical LLM training remains insufficient, setting the stage for future exploration of Apple's optimized machine learning frameworks.
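
For context on the throughput figures quoted above: the conventional flop count for this multiply is 2·B·T·C·OC (one multiply and one add per inner-loop step), and a Gflop/s rate is simply that count divided by wall-clock seconds. A hypothetical measurement helper, with names and shapes chosen purely for illustration:

```swift
import Dispatch

// Hypothetical timing helper; the 2 * B * T * C * OC flop count is the standard
// convention for this matmul (one multiply plus one add per inner-loop step).
func measuredGflops(B: Int, T: Int, C: Int, OC: Int, run: () -> Void) -> Double {
    let start = DispatchTime.now().uptimeNanoseconds
    run()
    let end = DispatchTime.now().uptimeNanoseconds
    let seconds = Double(end - start) / 1e9
    let flops = 2.0 * Double(B) * Double(T) * Double(C) * Double(OC)
    return flops / seconds / 1e9  // Gflop/s
}

// Illustrative use with GPT-2-small-like shapes (C = 768, OC = 4 * 768):
// let rate = measuredGflops(B: 4, T: 64, C: 768, OC: 3072) {
//     matmulForward(out: &out, inp: inp, weight: weight, bias: bias,
//                   B: 4, T: 64, C: 768, OC: 3072)
// }
```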