
Flash-MoE: Running a 397B Parameter Model on a Mac with 48GB RAM

Flash-MoE is a C/Metal inference engine that runs a 397B parameter Mixture-of-Experts model on a MacBook Pro with just 48GB RAM. It does so by streaming the 209GB model directly from SSD, reaching 4.4+ tokens/second through hand-optimized, framework-free code. Hacker News commenters were impressed by the technical achievement and by the exploration of Apple Silicon's unusual suitability for large language model inference.

38
Score
9
Comments
#1
Highest Rank
8h
on Front Page
First Seen
Mar 22, 12:00 PM
Last Seen
Mar 22, 7:00 PM
Rank Over Time
2, 1, 2, 2, 2, 2, 2, 4

The Lowdown

Flash-MoE demonstrates how to run Qwen3.5-397B-A17B, a 397 billion parameter Mixture-of-Experts (MoE) model, on an Apple MacBook Pro with only 48GB of unified memory. The core innovation is a pure C/Metal implementation: no Python and no ML frameworks, just hand-tuned Metal shaders and C code that stream the entire 209GB model from the SSD. The setup achieves 4.4+ tokens/second and produces production-quality output, including tool calling.

  • SSD Expert Streaming: Experts are loaded on demand from the NVMe SSD using parallel pread(), with the OS page cache managing data efficiently without custom caching.
  • FMA-Optimized Dequant Kernel: A custom Metal kernel rearranges calculations to leverage the GPU's fused multiply-add unit, boosting performance by 12%.
  • Metal Compute Shaders: Hand-written shaders handle matrix operations, activations, normalization, and attention for maximum efficiency.
  • Deferred GPU Expert Compute: Expert forward passes are submitted to the GPU without blocking, overlapping GPU and CPU work in the pipeline.
  • Accelerate BLAS for Linear Attention: Utilizes Apple's Accelerate framework for significant speedups in the GatedDeltaNet recurrence.
  • Trust the OS Principle: The project explicitly avoids custom caching, relying on the macOS page cache for expert data management, which proved superior to bespoke solutions.

The project highlights the M3 Max chip's capabilities, including its 16-core CPU, 40-core GPU, 48GB unified memory, and a 17.5 GB/s sequential read SSD. The codebase is entirely C, Objective-C, and Metal, showcasing a highly optimized approach to large model inference within tight memory constraints. Many alternative approaches, such as LZ4 compression or speculative decoding, were explored and discarded due to performance penalties.

In essence, Flash-MoE pushes the boundaries of local LLM inference, proving that with deep architectural understanding and low-level optimization, even massive models can become accessible on consumer-grade hardware.

The Gossip

Performance Ponderings

Commenters lauded the technical achievement, calling it "very impressive" and "promising." However, there was a mixed reaction to the reported 4.4+ tokens/second (TPS), with some finding it acceptable for local inference, while others noted it's "not that good" for 300B+ models, especially given 4-bit quantization. One user reported achieving a higher 6.55 TPS on an M5 Pro with 64GB RAM, sparking discussion about hardware variations and potential for better performance.

SSD Strain Speculations

A major concern was the continuous streaming of the 209GB model from the SSD. Several commenters worried that constant reads would shorten the drive's lifespan, prompting a brief debate: some initially feared 24/7 usage would wear out the SSD, while others clarified that read-only workloads cause essentially no wear on modern SSDs, which degrade primarily through write and erase cycles.

Apple Architecture Appraisals

The unique architecture of Apple Silicon, particularly its unified memory and high-speed SSDs, was a key talking point. Commenters discussed how the project leverages these features, such as the impressive measured 17.5 GB/s sequential read speed from the Apple Fabric SSD, which some found surprising given typical bandwidth expectations. There were also questions about whether similar approaches could be viable on Linux systems, highlighting the Mac-specific optimizations.