HN Today

Smallest transformer that can add two 10-digit numbers

The 'AdderBoard' challenge seeks the smallest transformer capable of accurately adding two 10-digit numbers, pitting hand-coded architectures against trained models. The competition shows how efficiently basic arithmetic can be implemented within transformer constraints, offering insight into the architecture's fundamental capabilities. It drew attention on HN for its deep dive into transformer mechanics and the ongoing quest for minimal yet powerful models.

Score: 64 · Comments: 12 · Highest Rank: #3 · 20h on Front Page
First Seen: Feb 28, 12:00 AM · Last Seen: Feb 28, 9:00 PM

The Lowdown

The "AdderBoard" challenge, hosted on GitHub, pushes the boundaries of transformer efficiency by seeking the smallest model that can accurately add two 10-digit numbers. Originating from an exploration of LLMs like Claude Code and Codex performing addition, the community has dramatically reduced the parameter count for this task.

  • The Challenge: Build an autoregressive transformer achieving >= 99% accuracy on 10,000 held-out test pairs for 10-digit addition.
  • Two Categories: Submissions are divided into "Trained Weights" (learned from data using generic algorithms) and "Hand-Coded Weights" (analytically set weights, proving architectural capability).
  • Strict Rules: Models must be genuine autoregressive transformers with self-attention, generate tokens one at a time, and rely on a standard forward pass without problem-specific control flow in the inference code. The model, not the Python code, must perform the work.
  • Key Findings: Community efforts reveal a "parameter cliff" around 800 parameters for trained models, with single-layer models often outperforming two-layer ones. Hand-coded models consistently achieve far smaller sizes (e.g., 36 parameters at 100% accuracy) than the best trained models (e.g., 311 parameters), suggesting the architecture's capacity is far smaller than what current training methods readily discover. Tricks such as rank-3 factorization and ALiBi positional encoding have been crucial in cutting parameter counts.
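The rank-3 factorization mentioned above can be sketched in a few lines: instead of storing a full d×d projection, a submission stores two thin factors whose product has rank at most 3, trading d² parameters for 6d. The dimensions below are illustrative, not taken from any actual leaderboard entry.

```python
import numpy as np

# Illustrative sketch of rank-3 factorization (dimensions are made up;
# real AdderBoard entries use their own shapes).
d = 16  # hypothetical model width

rng = np.random.default_rng(0)
U = rng.normal(size=(d, 3))   # thin left factor
V = rng.normal(size=(3, d))   # thin right factor
W = U @ V                     # stands in for a full d x d weight matrix

full_params = d * d                 # storing W directly: 256 parameters
factored_params = U.size + V.size   # storing U and V: 96 parameters

# The product has rank <= 3 by construction.
assert np.linalg.matrix_rank(W) <= 3
print(full_params, factored_params)
```

The saving grows with width: the factored form wins whenever 6d < d², i.e., for any d > 6.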

This initiative serves as a fascinating case study into the minimal representational power of transformers, specifically how their core mechanisms—attention, MLPs, and autoregression—can be distilled to solve seemingly simple, yet structurally complex, problems like multi-digit addition.
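As a rough illustration of the scoring setup described above, a held-out evaluation over 10,000 pairs of 10-digit numbers might look like the following sketch. The function names and sampling scheme are assumptions; the actual harness lives in the AdderBoard repository, and `generate_fn` stands in for a full autoregressive decode.

```python
import random

def evaluate(generate_fn, n_pairs=10_000, digits=10, seed=0):
    """Hypothetical scoring loop; the real AdderBoard harness may differ."""
    rng = random.Random(seed)
    lo, hi = 10 ** (digits - 1), 10 ** digits - 1
    correct = 0
    for _ in range(n_pairs):
        a, b = rng.randint(lo, hi), rng.randint(lo, hi)
        if generate_fn(a, b) == a + b:
            correct += 1
    return correct / n_pairs

# An oracle "model" trivially clears the >= 99% bar:
assert evaluate(lambda a, b: a + b, n_pairs=1000) == 1.0
```

A passing submission would need `evaluate(model_decode) >= 0.99` while performing all of the arithmetic inside the transformer's forward pass, not in the surrounding Python.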

The Gossip

Miniature Model Mysteries

Commenters expressed both awe and skepticism regarding the remarkably low parameter counts achieved, particularly by hand-coded models. Some questioned the >99% accuracy threshold, suggesting it might hide deeper issues or imply that these highly optimized solutions are not truly 'discoverable' through standard training. The discrepancy between hand-coded and trained parameter counts sparked discussion on the limits of current training algorithms versus theoretical architectural efficiency.

Definitional Debates

A core theme revolved around the strict rules defining a 'genuine autoregressive transformer' and whether some highly optimized solutions stretched the spirit of the challenge. Discussion centered on the distinction between the model doing the work versus problem-specific logic in the inference code. Commenters explored how transformers, composed of matrix multiplications, process numerical inputs versus a direct mathematical operation, reinforcing the challenge's constraint that carry propagation must emerge from the autoregressive process.
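For readers unfamiliar with why carry propagation maps naturally onto autoregression, the plain-Python sketch below mimics the digit-at-a-time process a transformer would have to reproduce internally. The least-significant-first digit order is an assumption about the tokenization, not a rule of the challenge.

```python
# Emitting sum digits least-significant-first suits autoregression: each
# new digit depends only on the current digit pair and a single carry
# bit, which the model can recover from its previously generated tokens.
def add_autoregressively(a: str, b: str) -> str:
    a, b = a[::-1], b[::-1]          # least-significant digit first
    out, carry = [], 0
    for da, db in zip(a, b):         # one "generation step" per digit
        s = int(da) + int(db) + carry
        out.append(str(s % 10))      # token emitted this step
        carry = s // 10              # state threaded to the next step
    if carry:
        out.append("1")
    return "".join(out)[::-1]        # most-significant digit first

assert add_autoregressively("9999999999", "0000000001") == "10000000000"
```

Generating most-significant digit first would instead require resolving the entire carry chain before emitting the first token, which is exactly the kind of work the rules insist must happen inside the model.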

Future Feature Fabrications

Users speculated on the broader implications and potential applications of such specialized, minimalist transformers. Ideas ranged from embedding these single-purpose networks with fixed weights directly into larger Language Models (LLMs) before pre-training, to the theoretical possibility of distilling an arbitrary transformer into compact, fast hardware gates. There was also a humorous jab at the general tendency to over-engineer solutions, with a suggestion to 'wrap it all in an Electron app'.