
From zero to a RAG system: successes and failures

This post chronicles an engineer's real-world odyssey building a RAG system from scratch for a massive internal document corpus, detailing the many practical challenges and hard-won solutions. It's a pragmatic, step-by-step account of navigating technical hurdles from document chaos to GPU bottlenecks. Hacker News readers appreciate this type of hands-on, problem-solving narrative that demystifies complex AI implementations.

Score: 19
Comments: 5
Highest Rank: #3
Time on Front Page: 9h
First Seen: Mar 26, 11:00 AM
Last Seen: Mar 26, 7:00 PM

The Lowdown

The author shares a detailed, candid account of their journey to develop an internal Retrieval-Augmented Generation (RAG) system backed by a locally hosted LLM. Tasked with creating a chat tool to answer questions drawn from a decade's worth of company projects and 1TB of unstructured data, the author hit numerous unexpected challenges that turned a seemingly simple task into an "emotional roller coaster."

  • Technology Stack Selection: Initially, the author chose Ollama for local LLaMA models, nomic-embed-text for embeddings, LlamaIndex for RAG orchestration, and Python for development. Early tests were promising, leading to initial overconfidence.
  • Document Chaos Mitigation: Faced with hundreds of gigabytes of unorganized documents, the system crashed due to processing irrelevant large files (videos, images, backups). The solution involved implementing robust filtering by file extension and name patterns, reducing the data by 54% and resolving memory issues.
  • Scaling Indexing: The default LlamaIndex JSON storage proved unmanageable for 451GB of data, leading to slow processing and data corruption. Migrating to ChromaDB as a dedicated vector database enabled batch processing, checkpointing, and reliable storage, fundamentally changing the indexing approach.
  • Hardware Bottleneck: Local CPU processing proved far too slow, so the author rented a virtual machine with an NVIDIA RTX 4000 SFF Ada GPU. This significantly accelerated indexing (which still took several weeks) at a total cost of 184 euros.
  • User Experience & Deployment: A Flask API and Streamlit UI were built for the frontend. A key challenge was serving document references without storing the entire 451GB corpus on the production VM's limited disk space. This was solved by using Azure Blob Storage to serve original documents via SAS tokens on demand.
  • Final Architecture: The robust solution integrated Ollama, nomic-embed-text, ChromaDB (HNSW), LlamaIndex, Flask, Streamlit, Docker Compose, NVIDIA Container Toolkit, and Azure Blob Storage.
  • Lessons Learned: Key takeaways included managing memory with batch processing and explicit garbage collection, implementing error tolerance for problematic files, using checkpoints for long-running processes, and thorough monitoring.
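The file-filtering step described above can be sketched in Python. The allow-list of extensions and the skip patterns below are illustrative assumptions; the article does not publish the author's exact rules:

```python
from pathlib import Path

# Illustrative allow-list and skip patterns (assumptions, not the
# author's actual configuration).
ALLOWED_EXTENSIONS = {".pdf", ".docx", ".txt", ".md", ".pptx", ".xlsx"}
SKIP_NAME_PATTERNS = ("backup", "~$", ".tmp")

def is_indexable(path: Path) -> bool:
    """Keep only document types worth embedding; drop videos,
    images, and backup copies like those that crashed the pipeline."""
    if path.suffix.lower() not in ALLOWED_EXTENSIONS:
        return False
    name = path.name.lower()
    return not any(pattern in name for pattern in SKIP_NAME_PATTERNS)
```

Applied with something like `[p for p in corpus_root.rglob("*") if p.is_file() and is_indexable(p)]`, a filter of this shape is what cut the author's corpus by 54% before indexing.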
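The batch processing, per-file error tolerance, checkpointing, and explicit garbage collection from the lessons above could look roughly like this minimal sketch. Here `embed_and_store` is a hypothetical placeholder for the real LlamaIndex/ChromaDB pipeline, and the batch size is arbitrary:

```python
import gc
import json
from pathlib import Path

CHECKPOINT = Path("checkpoint.json")

def load_done() -> set:
    """Resume from the last checkpoint if a long run was interrupted."""
    if CHECKPOINT.exists():
        return set(json.loads(CHECKPOINT.read_text()))
    return set()

def index_corpus(files, embed_and_store, batch_size=64):
    """Index files in small batches, skipping problem files and
    checkpointing progress so weeks-long runs can be resumed."""
    done = load_done()
    todo = [f for f in files if f not in done]
    for i in range(0, len(todo), batch_size):
        for f in todo[i:i + batch_size]:
            try:
                embed_and_store(f)        # placeholder for the real pipeline
                done.add(f)
            except Exception as exc:      # error tolerance: log and move on
                print(f"skipping {f}: {exc}")
        CHECKPOINT.write_text(json.dumps(sorted(done)))
        gc.collect()                      # keep memory flat between batches
```

Writing the checkpoint after every batch rather than every file keeps I/O overhead low while still bounding how much work a crash can lose.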

While acknowledging it's not a perfect system, the author expresses satisfaction with the fast, reliable, and useful tool delivered to colleagues. The primary advice for others building similar systems is to invest heavily in building and curating high-quality source data.

The Gossip

ChromaDB Clarification Conundrum

Commenters quickly jumped to correct the author's initial, slightly confusing mention of ChromaDB. While the author later clarified it's an open-source database (Apache-2.0), an early phrasing implying a connection to 'Google's database' or 'Chrome/Chromium' caused some to question the article's credibility, despite the overall quality of the technical detail.

Literary RAG's Local Longings

One significant discussion thread revolved around applying RAG principles to personal or academic literature review. A commenter shared their struggle to find an 'out-of-the-box' solution for running RAG on local bibliographic collections (like Zotero PDFs), reflecting the practical challenges the article itself highlighted and seeking alternatives to building a system from scratch.