Soul Player C64 – A real transformer running on a 1 MHz Commodore 64
An incredible technical feat, this project ports a 25,000-parameter transformer model to an unmodified 1 MHz Commodore 64, implemented in hand-written 6502 assembly. It shows modern AI components such as multi-head attention and RMSNorm running on vintage hardware, with the entire model fitting on a standard floppy disk. This blend of retro computing and cutting-edge AI optimization is catnip for Hacker News, highlighting ingenuity over raw computational power.
The Lowdown
The "Soul Player C64" project presents an extraordinary engineering feat: a fully functional transformer model, akin to the architecture powering modern LLMs, meticulously implemented and running on a 1 MHz Commodore 64. This ambitious endeavor brings sophisticated AI to vintage hardware, demonstrating what's possible with extreme optimization and a deep understanding of low-level programming.
- Core Architecture: It's a 2-layer, decoder-only transformer with ~25,000 int8 parameters, featuring multi-head causal self-attention, softmax, and RMSNorm, all coded in hand-written 6502/6510 assembly.
- Hardware Constraints & Optimizations: Designed for an unmodified C64, the entire model and code fit on a standard floppy disk. A key breakthrough was a 14-bit shift used to normalize softmax scores so that the 128-entry exponent lookup table works correctly under integer-only arithmetic.
- Performance: While functional, the model is inherently slow, processing approximately one token every 60 seconds on real hardware, meaning a full response takes several minutes.
- Customization & Training: The project provides Python tooling to train custom "souls" from user-defined corpora, including BPE tokenizer training and Quantization-Aware Training (QAT). The training prioritizes int8 quality over float loss.
- Technical Specifications: The model uses a 128-token vocabulary, 32-dimensional embeddings, 4 attention heads, and 64 FFN hidden units. All activations are Q8.8 fixed-point, and weights are int8, with all operations relying on shift-and-add given the 6502's lack of a hardware multiplier.
- Limitations: Due to its small parameter count, the model is not "smart" and will produce broken sentences. It also has a small context window (20 tokens) and requires lowercase input.

This project stands as a remarkable demonstration of how modern AI concepts can be adapted to and executed on extremely constrained, historical computing platforms. It's a compelling proof-of-concept that blurs the line between retrocomputing and contemporary machine learning, highlighting the power of fundamental algorithmic understanding and assembly-level optimization.
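Since the 6502 has no hardware multiplier, every matmul element ultimately reduces to shift-and-add, as noted above. A minimal Python sketch of that classic routine (not the project's actual assembly, and ignoring the 8-bit register juggling a real 6510 implementation needs):

```python
def shift_add_mul(w: int, a: int) -> int:
    """Multiply two integers (e.g. an int8 weight and a Q8.8 activation)
    using only shifts and adds, the way a CPU without a hardware
    multiplier must. Sign is handled separately, as an 8-bit unsigned
    multiply routine typically would."""
    neg = (w < 0) != (a < 0)
    w, a = abs(w), abs(a)
    prod = 0
    while w:
        if w & 1:        # if the low bit of the multiplier is set,
            prod += a    # add the (shifted) multiplicand
        a <<= 1          # shift multiplicand left
        w >>= 1          # consume one multiplier bit
    return -prod if neg else prod
```

Because a Q8.8 activation is just an integer scaled by 256, multiplying it by an integer int8 weight with this routine preserves the Q8.8 scale of the result.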
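The softmax trick described above (normalizing scores so a 128-entry exponent lookup table works in integer arithmetic) can be sketched in Python. The table layout and the shift amount here are assumptions for illustration; the project itself uses a 14-bit shift inside its 8-bit fixed-point pipeline:

```python
import math

# Hypothetical 128-entry table: exp(x) for x in [-8, 0], stored as Q8.8.
# The real table's range and encoding are assumptions.
EXP_TABLE = [int(round(math.exp(-8.0 * (127 - i) / 127) * 256))
             for i in range(128)]

def softmax_q88(scores):
    """Integer-only softmax sketch over fixed-point attention scores.
    Subtract the max score (so all deltas are <= 0), map each delta to a
    table index, look up exp, then normalize with integer division.
    The >> 4 stands in for the project's score-normalization shift."""
    m = max(scores)
    idx = [max(0, 127 + ((s - m) >> 4)) for s in scores]
    exps = [EXP_TABLE[i] for i in idx]
    total = sum(exps)
    # Attention weights in Q8.8 (they sum to roughly 256, i.e. ~1.0).
    return [(e * 256) // total for e in exps]
```

Subtracting the max before the lookup is the standard numerically-safe softmax trick; here it also guarantees every table index stays in range.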
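RMSNorm also has to run in Q8.8 fixed point. A sketch of the integer math (the real 6502 routine would use tables and shifts rather than a general integer square root; the Q8.8 gain encoding is an assumption):

```python
import math

def rmsnorm_q88(x, g):
    """RMSNorm over Q8.8 activations x with Q8.8 gains g, in integers.
    Squaring two Q8.8 values yields Q16.16, so isqrt of the mean square
    lands back in Q8.8, and (Q8.8 * Q8.8) // Q8.8 stays Q8.8."""
    n = len(x)
    ms = sum(v * v for v in x) // n   # mean square, Q16.16
    rms = max(math.isqrt(ms), 1)      # root mean square, Q8.8
    return [(v * gi) // rms for v, gi in zip(x, g)]
```

For example, a constant vector of 1.0 (256 in Q8.8) with unit gains normalizes to itself, since its RMS is exactly 1.0.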
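"Prioritizing int8 quality over float loss" is the essence of quantization-aware training: the forward pass sees weights as they will look after int8 rounding, so the loss being minimized is the quantized model's loss. A hypothetical one-weight illustration (not the project's training code):

```python
def fake_quant_int8(w: float, scale: float) -> float:
    """Fake-quantize one weight: round to the nearest int8 step,
    clamp to [-128, 127], and return to float. Using this value in
    the forward pass makes the float training loss track the quality
    of the eventual int8 deployment."""
    q = max(-128, min(127, round(w / scale)))
    return q * scale
```

Values beyond the representable range saturate: with scale 0.01, a weight of 2.0 clamps to 1.27, which the optimizer then learns to work around.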