Paper Tape Is All You Need – Training a Transformer on a 1976 Minicomputer
This project boldly reimagines modern AI on vintage hardware, implementing a transformer on a 1976 PDP-11 minicomputer using hand-optimized assembly. It demonstrates remarkable ingenuity in overcoming severe resource constraints, highlighting the core principles of neural networks and their adaptability. Hacker News enthusiasts will appreciate the blend of historical computing, deep technical optimization, and the sheer audacity of the undertaking.
The Lowdown
This fascinating project, "Paper Tape Is All You Need," explores the ambitious endeavor of implementing and training a single-layer, single-head transformer neural network on a DEC PDP-11/34A minicomputer from 1976. Building upon previous work with neural networks on vintage machines, the author set out to prove that even complex modern AI architectures could be adapted to the severe computational and memory limitations of 1970s hardware. The core challenge involved meticulous optimization, fixed-point arithmetic, and custom assembly programming to achieve practical training times for a task like sequence reversal.
- Transformer Architecture: The implemented transformer is a simplified encoder-only model with one layer and one attention head. It processes sequences of 8 digits, aiming to reverse them, a task specifically chosen to leverage self-attention mechanisms. The model has 1,216 parameters, utilizing a d_model of 16 and a 10-digit vocabulary.
- Hardware Optimization: An initial Fortran IV implementation required 6.5 hours of training, which the author deemed unacceptable. Key optimizations included hand-tuning per-layer learning rates, switching from floating-point to custom fixed-point arithmetic, and developing NN11, a minimal assembly-level neural network stack. NN11 carefully manages precision (Q8 forward, Q15 backward, Q16 accumulators) and leverages the PDP-11 instruction set for efficiency.
- Performance Breakthrough: These optimizations dramatically reduced training time: the initial 1,500 training steps (6.5 hours) dropped to 600 steps (2.5 hours) with learning-rate tuning, and finally to a mere 350 steps (5.5 minutes) on a real PDP-11/34A with the NN11 assembly implementation.
- Prototyping and Validation: Before committing to assembly, the floating-point and fixed-point arithmetic were prototyped and validated in Sheaf, the author's functional ML framework, which provided crucial tooling for correctness checks and numerical range analysis.
- Implementation Cleverness: To circumvent the lack of a floating-point unit, transcendental functions like exponentiation and logarithm were replaced with precomputed lookup tables stored directly in memory. The backpropagation gradient for softmax and cross-entropy was simplified to avoid computationally expensive logarithm operations.
- Resource Footprint: The entire transformer model and its training code, including data and accumulators, occupies just 19.2 KB of memory, well within the PDP-11's 32 KB core memory limit.
- Execution: The project provides build instructions for the MACRO-11 assembler and can be run on physical PDP-11 hardware, a cycle-accurate emulator (ll-34), or even a WebAssembly demo for quick exploration.
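The 1,216-parameter figure is consistent with one plausible accounting for the stated shape (d_model=16, 10-digit vocabulary, 8-token sequences, one head, one layer) if there is no attention output projection or feed-forward block. The write-up does not spell out the breakdown, so everything below except the total is an assumption:

```python
# Hypothetical parameter accounting for the stated model shape.
# Only the 1,216 total comes from the project; the split is assumed.
d_model, vocab, seq_len = 16, 10, 8

token_embedding = vocab * d_model         # 160
positional_embedding = seq_len * d_model  # 128
attention_qkv = 3 * d_model * d_model     # 768 (W_Q, W_K, W_V; no W_O assumed)
output_projection = d_model * vocab       # 160 (logits over the 10 digits)

total = token_embedding + positional_embedding + attention_qkv + output_projection
print(total)  # 1216
```

Other splits (e.g. with a W_O but tied embeddings) could also reach the same total; this one is simply arithmetic that fits.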
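The mixed Q8/Q15/Q16 scheme is standard Qm.n fixed-point arithmetic: values are stored as scaled integers, and a multiply yields a double-width product that is shifted back down to the target format (on the PDP-11, a MUL followed by an arithmetic shift). NN11's actual MACRO-11 routines are not reproduced in the write-up; this is a Python sketch of the technique:

```python
# Sketch of Qm.n fixed-point arithmetic, as in NN11's Q8 forward /
# Q15 backward / Q16 accumulator split. Illustrative only.

def to_q(x, frac_bits):
    """Encode a float as a signed integer with `frac_bits` fractional bits."""
    return int(round(x * (1 << frac_bits)))

def from_q(q, frac_bits):
    """Decode a fixed-point integer back to a float."""
    return q / (1 << frac_bits)

def q_mul(a, b, frac_bits):
    """Multiply two fixed-point values: the double-width product is shifted
    back down, much like a PDP-11 MUL + ASH sequence would do."""
    return (a * b) >> frac_bits

# Q8 example: 1.5 * 0.25 = 0.375
a = to_q(1.5, 8)   # 384
b = to_q(0.25, 8)  # 64
print(from_q(q_mul(a, b, 8), 8))  # 0.375
```

The trade-off the bullet hints at: Q8 gives range for activations in the forward pass, Q15 gives resolution for small gradients, and Q16 accumulators absorb rounding during summation.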
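Replacing exponentiation with a precomputed table works because softmax inputs can be shifted so the maximum logit is zero, bounding the table's input range. The range and resolution below are illustrative assumptions, not the values used on the PDP-11:

```python
import math

# Sketch of a precomputed exp() table, as the project uses to avoid
# transcendental functions. LO/HI/N are assumed, not taken from NN11.
LO, HI, N = -8.0, 0.0, 256          # softmax inputs shifted so max is 0
STEP = (HI - LO) / (N - 1)
EXP_TABLE = [math.exp(LO + i * STEP) for i in range(N)]

def exp_lut(x):
    """Nearest-entry lookup; inputs outside [LO, HI] are clamped."""
    i = int(round((min(max(x, LO), HI) - LO) / STEP))
    return EXP_TABLE[i]

def softmax(logits):
    m = max(logits)                  # shift arguments into [LO, 0]
    e = [exp_lut(v - m) for v in logits]
    s = sum(e)
    return [v / s for v in e]
```

On the real machine the table entries would themselves be fixed-point values in memory, with the index computed by shift-and-mask rather than division.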
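The backpropagation simplification mentioned for softmax plus cross-entropy is the classic identity: for the loss L = -log(softmax(z)[t]), the gradient with respect to the logits collapses to softmax(z) - one_hot(t), so the backward pass never evaluates a logarithm. A minimal sketch (using floating point here purely for clarity):

```python
import math

def softmax(z):
    """Numerically stable softmax."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def ce_grad(z, target):
    """Gradient of -log(softmax(z)[target]) w.r.t. the logits z:
    softmax(z) - one_hot(target). No log required in the backward pass."""
    p = softmax(z)
    return [p[i] - (1.0 if i == target else 0.0) for i in range(len(z))]
```

This is why the log table can be dropped entirely from training: the logarithm only appears in the loss value itself, which is not needed to compute updates.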
This project stands as a testament to the enduring principles of computation and the remarkable creativity possible when modern problems are tackled with historical constraints. It showcases how fundamental AI concepts can be distilled and meticulously engineered to run on hardware far removed from today's powerful GPUs, bridging half a century of computing advancements.