Making a vintage LLM from scratch

A developer chronicles building a 340M-parameter "vintage LLM" nearly from scratch, training it solely on pre-1900 English texts. The technical deep dive shows how much effort went into data curation and the iterative learning process, and it resonated with the HN crowd's appreciation for low-level experimentation and DIY spirit. The author shares both successes and humorous challenges, including the model's struggle with basic math and the discovery that some "bad OCR" was actually Welsh.

Highest Rank: #15
On Front Page: 11h
First Seen: Jun 12, 12:00 AM
Last Seen: Jun 12, 4:00 PM

The Lowdown

The author, croqaz, details his extensive project to construct a "vintage LLM" from the ground up, a model intentionally time-locked with a knowledge cutoff of 1900. Driven by a desire for hands-on learning and the ambition of creating a Victorian-era chatbot, he embarked on a three-month journey that involved custom data pipelines, training scripts, and careful model development.

Conceptualization & Architecture: Inspired by other historical LLM projects, the author set out to build a 340M parameter Llama-based model, aiming for English-only knowledge from before 1900.
Data Sourcing & Cleaning: Recognizing that "garbage in means garbage out," he eschewed modern datasets and painstakingly assembled his own corpus from sources like Project Gutenberg and the Oxford Text Archive. This meant de-duplicating texts, filtering out bad OCR artifacts, and verifying publication years. He experimented with several databases (Qdrant, Zvec, Lance, Valkey) before settling on LevelDB for its reliability. He also developed custom quality filters based on zlib compression ratio, Shannon entropy, and a bespoke character quality score to screen out low-quality text.
Custom Tokenization: A new tokenizer was trained on clean pre-1900 English texts to exclude modern vocabulary and programming terms, preserving the vintage linguistic context.
Base Training Stages: The training involved two main stages. Initially experimenting with litGPT and Pythia models, he eventually developed his own training script due to framework limitations, drawing inspiration from nanoGPT and nanoChat. Training a 340M Llama model across cloud providers like RunPod, ThunderCompute, and Vast.ai cost approximately $80, processing around 9 billion tokens.
Fine-tuning & "Vibe Checks": The fine-tuning phase is ongoing, starting with a custom "CommonSense" dataset for basic question-answering. Early "vibe checks" during training illustrated the model's evolution from random noise to coherent, consistent text. An interesting experiment revealed the model's limited ability to memorize basic math operations, achieving only 59.1% accuracy for simple arithmetic.
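
To make the quality filters mentioned under Data Sourcing & Cleaning more concrete, here is a minimal Python sketch of the three signals: zlib compression ratio, character-level Shannon entropy, and a character quality score. It is not the author's code. The function names, the set of allowed punctuation, and every threshold in `looks_clean` are assumptions chosen for illustration; the post's actual scoring rules and cutoffs are not given here.

```python
import math
import zlib
from collections import Counter


def compression_ratio(text: str) -> float:
    """Raw byte size divided by zlib-compressed size.

    Very high values suggest highly repetitive text (e.g. scanner noise).
    """
    raw = text.encode("utf-8")
    if not raw:
        return 0.0
    return len(raw) / len(zlib.compress(raw))


def shannon_entropy(text: str) -> float:
    """Character-level Shannon entropy in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def char_quality(text: str) -> float:
    """Hypothetical score: share of characters that are letters,
    whitespace, or common punctuation. The allowed set is an assumption."""
    if not text:
        return 0.0
    allowed = set(" \n.,;:'\"!?-()")
    good = sum(1 for ch in text if ch.isalpha() or ch in allowed)
    return good / len(text)


def looks_clean(
    text: str,
    max_ratio: float = 5.0,     # assumed cutoff, not from the post
    min_entropy: float = 3.0,   # assumed cutoff, not from the post
    max_entropy: float = 5.0,   # assumed cutoff, not from the post
    min_quality: float = 0.9,   # assumed cutoff, not from the post
) -> bool:
    """Combine the three signals into a single keep/drop decision."""
    entropy = shannon_entropy(text)
    return (
        compression_ratio(text) <= max_ratio
        and min_entropy <= entropy <= max_entropy
        and char_quality(text) >= min_quality
    )


if __name__ == "__main__":
    sample = "It was the best of times, it was the worst of times, it was the age of wisdom."
    print(compression_ratio(sample), shannon_entropy(sample), char_quality(sample))
    print(looks_clean(sample), looks_clean("a" * 1000))
```

A filter like this cannot tell OCR noise from a language it was not tuned for: well-formed Welsh would likely pass all three checks, so a separate language-identification step would be needed to sort such text correctly.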

Croqaz concludes by inviting feedback on his learning project, emphasizing that the journey itself, peeking behind the curtain of LLM development, was the primary reward, even before achieving a fully capable instruct model. He encourages others to undertake similar projects to demystify AI and invites community engagement around future development and potential sponsorships.

The Gossip

Vibe-Coded Veracity

Some commenters expressed mixed feelings about the author's admission of using LLMs for "vibe-coding" parts of his project. While one commenter felt it detracted from the "from scratch" narrative and their interest in the journey, another defended the practice, describing it as akin to working on legacy code or a practical way to learn by reviewing and tweaking AI-generated structures.

Learning by Doing's Lasting Lessons

Many commenters resonated with the author's focus on hands-on learning, praising the depth of understanding gained from building an LLM from the ground up. This sentiment was likened to completing "Linux From Scratch," where the process, despite potential difficulties, creates a deep and lasting mental model of how complex systems function.

The Welsh Whammy

A humorous and insightful discussion arose around the "bad OCR" text samples that the author had discarded. Commenters quickly identified these samples as actual Welsh text, not gibberish. One user highlighted the irony by translating a line from the supposedly "useless" text, which coincidentally advised that it would be easy for the knowledgeable to fix remaining errors: a perfect encapsulation of the author's own project.

Niche LLMs' New Horizons

Several commenters discussed the potential value of specialized, historically constrained LLMs like the one developed in the post. They suggested that focusing on deliberately narrow models, rather than constantly pursuing more current and general ones, could be a "true frontier" for making smaller, more efficient models that perform well on more modest hardware.