Tiny hackable CUDA language model implementation

This GitHub repository showcases a concise and hackable implementation of a generative pretrained transformer. Designed to process sequences of bytes, this model learns to predict the next byte based on previous context, making it versatile enough to handle diverse data types beyond typical text, such as DNA sequences, compressed data, images, or even executable binaries.
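To make the byte-level setup concrete, here is a minimal sketch of how raw bytes could be turned into next-byte training examples. The function name `next_byte_pairs` and the `context_len` parameter are hypothetical, chosen for illustration; the repository may batch and window its data differently.

```python
def next_byte_pairs(data: bytes, context_len: int):
    """Yield (context, target) pairs for next-byte prediction.

    Hypothetical helper for illustration only. Every byte is already a
    token id in the range 0..255, so no tokenizer is needed.
    """
    ids = list(data)
    for i in range(1, len(ids)):
        start = max(0, i - context_len)  # keep at most context_len bytes
        yield ids[start:i], ids[i]


pairs = list(next_byte_pairs("hi!".encode("utf-8"), context_len=2))
# pairs == [([104], 105), ([104, 105], 33)]
```

Because the input is just bytes, the same function works unchanged on any file contents, not only UTF-8 text.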

Architectural Core: The model features a multi-layer transformer, starting with a token embedding layer that converts each byte into a continuous vector. Each transformer layer incorporates a causal self-attention mechanism and a feed-forward network, both enhanced with residual connections.
Causal Attention: The causal attention mechanism ensures that the prediction at any position uses only information from that position and earlier ones, which is what makes autoregressive generation possible. It applies rotary positional embeddings (RoPE) to the queries and keys and uses a causal mask during scaled dot-product attention.
Feed-Forward Network: This part of each layer applies two linear transformations separated by a Swish activation function, a smooth, non-monotonic activation known for its performance in deep networks.
Training & Optimization: The final hidden states are projected to logits over the 256 possible byte values, converted to probabilities via softmax, and trained using cross-entropy loss. Optimization is handled by AdamW, which applies weight decay directly to the weights instead of folding an L2 penalty into the gradient.
Efficiency: The implementation leverages Basic Linear Algebra Subprograms (BLAS) for efficient matrix operations, allowing it to be effectively trained on modern hardware.
Sample Output: The repository includes example make infer invocations in which the prompted model generates fairy-tale-style text, showing that it can produce coherent text despite operating purely on bytes.
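The components listed above can be sketched in a few lines of plain Python. This is a reading aid under stated assumptions, not the repository's CUDA code: it uses a single attention head, omits the positional encoding and the learned projections, and the AdamW hyperparameter defaults are common choices picked for illustration.

```python
import math


def swish(x: float) -> float:
    # Swish (also called SiLU): x * sigmoid(x)
    return x / (1.0 + math.exp(-x))


def softmax(scores):
    # Subtract the maximum first for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask.

    q, k, v are lists of T vectors (lists of floats). Position t only
    sees positions 0..t; iterating over range(t + 1) plays the role of
    the causal mask.
    """
    d = len(q[0])
    out = []
    for t in range(len(q)):
        scores = [
            sum(a * b for a, b in zip(q[t], k[s])) / math.sqrt(d)
            for s in range(t + 1)
        ]
        w = softmax(scores)
        out.append([
            sum(w[s] * v[s][j] for s in range(t + 1))
            for j in range(len(v[0]))
        ])
    return out


def cross_entropy(logits, target: int) -> float:
    # Negative log-probability assigned to the correct next byte.
    return -math.log(softmax(logits)[target])


def adamw_step(p, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.01):
    """One AdamW update for a scalar parameter p with gradient g.

    Hyperparameter defaults are assumptions. The weight decay term
    wd * p is applied to the parameter directly and never enters the
    moment estimates m and v, which is the "decoupled" part.
    """
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)  # bias correction, t starts at 1
    v_hat = v / (1 - b2 ** t)
    p = p - lr * (m_hat / (math.sqrt(v_hat) + eps) + wd * p)
    return p, m, v


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = causal_attention(x, x, x)
# out[0] equals x[0]: the first position can only attend to itself.

uniform_loss = cross_entropy([0.0] * 256, 65)
# about 5.545 (= ln 256), the loss of a model that guesses uniformly
```

The uniform-guess loss of ln 256 is a useful baseline: a byte-level model that is learning anything should drop below it early in training.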

Overall, this project is a useful educational resource for anyone who wants to understand both the foundational principles and the practical implementation details of transformer-based language models.

The Lowdown