HN Today

Training an LLM in Swift, Part 1: Taking matrix mult from Gflop/s to Tflop/s

This deep technical dive chronicles an ambitious quest to hand-optimize matrix multiplication in Swift for LLM training on Apple Silicon, pushing performance from Gflop/s to Tflop/s. The author meticulously re-implements Andrej Karpathy's 'llm.c' from scratch, exploring everything from 'MutableSpan' and SIMD to undocumented AMX instructions and Metal GPU shaders. It's a fascinating, low-level journey into squeezing every last flop out of Apple hardware, resonating deeply with HN's performance-obsessed and systems-minded audience.

Score: 13
Comments: 0
Highest Rank: #2
On Front Page: 21h
First Seen: May 11, 2:00 PM
Last Seen: May 12, 10:00 AM
Rank Over Time: (rank chart not reproduced)

The Lowdown

In a testament to raw optimization, Matt Gallagher embarks on an impressive technical journey, aiming to hand-implement and massively accelerate matrix multiplication for Large Language Model (LLM) training in Swift on Apple Silicon. Emulating Andrej Karpathy's minimalist 'llm.c', he forgoes high-level frameworks to explore the intricate layers of performance tuning, from basic Swift constructs to direct hardware exploitation.

  • The initial basic Swift implementation of 'matmul_forward' proved 15-20 times slower than its C counterpart, clocking in at a mere 2.8 Gflop/s (a baseline sketch of this kind of loop appears after this list).
  • Switching hot buffers from Swift's copy-on-write 'Array' to the new 'MutableSpan' type in Swift 6.2 significantly boosted training performance.
  • To achieve C-like fused-multiply-add (FMA) performance, the 'Relaxed.multiplyAdd' function from Swift-Numerics was adopted, leading to a 10x speedup in inference by enabling SIMD vectorized instructions.
  • By mirroring C's loop unrolling strategies with Swift's new 'InlineArray', the author achieved performance parity between Swift and C for core matrix operations.
  • Multi-threading was implemented using 'DispatchQueue.concurrentPerform', resulting in a 5-6.6x improvement, though it introduced complexity and reduced code readability (a parallel sketch also follows this list).
  • Further optimization on the CPU involved leveraging undocumented AMX (Apple Matrix Coprocessor) instructions via reverse engineering, yielding an additional 1.67x speedup.
  • Finally, the transition to Metal GPU shaders, starting with a naive kernel and progressing to a tiled implementation, pushed the total performance past 1 Tflop/s.
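
To make the starting point concrete, here is a minimal sketch of a naive 'matmul_forward' in plain Swift, following the layout llm.c uses (out = inp · weightᵀ + bias over a B×T batch of C-dimensional activations). It illustrates the shape of the problem, not the post's exact code:

```swift
// A sketch of the naive starting point, in the llm.c layout; not the post's exact code.
// inp:    (B, T, C)  input activations, row-major
// weight: (OC, C)    projection matrix, row-major
// bias:   (OC)
// out:    (B, T, OC) result: out = inp · weightᵀ + bias
func matmulForward(out: inout [Float], inp: [Float], weight: [Float], bias: [Float],
                   B: Int, T: Int, C: Int, OC: Int) {
    for b in 0..<B {
        for t in 0..<T {
            let inpBase = (b * T + t) * C
            let outBase = (b * T + t) * OC
            for o in 0..<OC {
                var val = bias[o]
                let wBase = o * C
                for c in 0..<C {
                    // Hot loop. The later steps (MutableSpan, Relaxed.multiplyAdd,
                    // InlineArray unrolling) all target this accumulation.
                    val += inp[inpBase + c] * weight[wBase + c]
                }
                out[outBase + o] = val
            }
        }
    }
}
```

The 'MutableSpan', 'Relaxed.multiplyAdd', and 'InlineArray' steps in the list above all operate on the innermost accumulation of a loop nest like this one.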

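The multi-threading bullet maps onto 'DispatchQueue.concurrentPerform'. The following sketch splits the B*T output rows into chunks, one per iteration; the pointer-based parameters and fixed chunk count are illustrative assumptions rather than the post's actual partitioning:

```swift
import Dispatch

// A sketch of the concurrentPerform step: split the B*T output rows into chunks,
// one chunk per iteration. Pointer parameters and the chunk count are illustrative
// choices, not necessarily the post's.
func matmulForwardParallel(out: UnsafeMutablePointer<Float>, inp: UnsafePointer<Float>,
                           weight: UnsafePointer<Float>, bias: UnsafePointer<Float>,
                           B: Int, T: Int, C: Int, OC: Int, chunks: Int = 8) {
    let rows = B * T
    let rowsPerChunk = (rows + chunks - 1) / chunks
    DispatchQueue.concurrentPerform(iterations: chunks) { chunk in
        let start = min(chunk * rowsPerChunk, rows)
        let end = min(start + rowsPerChunk, rows)
        for row in start..<end {
            for o in 0..<OC {
                var val = bias[o]
                for c in 0..<C {
                    val += inp[row * C + c] * weight[o * C + c]
                }
                // Each chunk writes a disjoint range of rows, so no locking is needed.
                out[row * OC + o] = val
            }
        }
    }
}
```
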
Gallagher successfully transforms a sluggish 2.8 Gflop/s Swift implementation into a formidable 1.1 Tflop/s powerhouse, a 382-fold improvement. This odyssey underscores that while Swift can indeed match or even surpass C in raw speed, it often comes at the cost of code elegance and readability, particularly when delving into multi-threading and direct hardware interfaces. Ultimately, even with such gains, the performance for practical LLM training remains insufficient, setting the stage for future exploration of Apple's optimized machine learning frameworks.
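
For context on the throughput figures quoted above: the conventional flop count for this multiply is 2·B·T·C·OC (one multiply and one add per inner-loop step), and a Gflop/s rate is simply that count divided by wall-clock seconds. A hypothetical measurement helper, with names and shapes chosen purely for illustration:

```swift
import Dispatch

// Hypothetical timing helper; the 2 * B * T * C * OC flop count is the standard
// convention for this matmul (one multiply plus one add per inner-loop step).
func measuredGflops(B: Int, T: Int, C: Int, OC: Int, run: () -> Void) -> Double {
    let start = DispatchTime.now().uptimeNanoseconds
    run()
    let end = DispatchTime.now().uptimeNanoseconds
    let seconds = Double(end - start) / 1e9
    let flops = 2.0 * Double(B) * Double(T) * Double(C) * Double(OC)
    return flops / seconds / 1e9  // Gflop/s
}

// Illustrative use with GPT-2-small-like shapes (C = 768, OC = 4 * 768):
// let rate = measuredGflops(B: 4, T: 64, C: 768, OC: 3072) {
//     matmulForward(out: &out, inp: inp, weight: weight, bias: bias,
//                   B: 4, T: 64, C: 768, OC: 3072)
// }
```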