Rolling your own serverless OCR in 40 lines of code
This story details how to build a serverless OCR pipeline in just 40 lines of Python, leveraging Modal for GPU access and DeepSeek OCR for robust mathematical parsing. It gained traction on HN for showcasing a practical, cost-effective way to digitize thousands of pages, highlighting the power of modern serverless platforms and open-source AI models. The approach demonstrates how minimal code can orchestrate complex, resource-intensive tasks, democratizing access to powerful AI capabilities for personal projects.
The Lowdown
The author sought a cost-effective way to make a large textbook searchable for an AI agent, finding existing commercial OCR solutions too expensive or limited for thousands of pages. This led to a DIY approach utilizing a serverless platform.
- Modal as the Backbone: The core solution hinges on Modal, a serverless compute platform that allows running Python code on cloud infrastructure with GPU access, charging only for active compute time. Its decorator-based syntax simplifies deploying complex machine learning workflows.
- DeepSeek OCR for Accuracy: The author selected DeepSeek's open OCR model, specifically for its strong performance in parsing mathematical notation, a crucial feature for technical textbooks.
- Elegant Implementation:
- A custom container image is built with all necessary dependencies (PyTorch, transformers, image processing libraries).
- A FastAPI server is deployed on Modal, wrapped by `@modal.asgi_app()`, which handles GPU provisioning and HTTP request routing.
- The OCR model loads once per container and is reused across subsequent requests, avoiding repeated load times.
- Batched inference is employed to process multiple pages simultaneously, optimizing GPU usage.
- A local client, marked by `@app.local_entrypoint()`, handles sending PDF pages to the Modal-deployed server.
- High-resolution rendering (2x zoom) of PDF pages improves OCR accuracy for small text and symbols.
- Post-processing cleans up the DeepSeek OCR output by removing grounding tags, leaving clean markdown.
- Impressive Results: A 600-page textbook was processed in about 45 minutes on an A100 GPU for approximately $2, yielding high-quality, searchable markdown with equations largely intact. This enables applications like grep-ing through the text, feeding sections into LLMs, or building search indexes.
This setup offers a powerful and reusable template for anyone looking to digitize large collections of PDFs, demonstrating that advanced OCR capabilities are accessible without significant infrastructure management or high costs.
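On the client side, the two preprocessing steps mentioned above, rendering pages at 2x zoom and stripping grounding tags from the output, might look something like this. The `<|...|>` tag pattern is an assumption about DeepSeek OCR's output format, and the PDF rendering uses PyMuPDF rather than whatever library the author chose.

```python
import re

# Hypothetical grounding-tag pattern; the model's actual markers may differ.
_TAG = re.compile(r"<\|[^|]*\|>")


def clean_markdown(raw: str) -> str:
    """Strip grounding tags from OCR output, leaving plain markdown."""
    return _TAG.sub("", raw).strip()


def render_pages(pdf_path: str, zoom: float = 2.0) -> list[bytes]:
    """Render each PDF page at `zoom`x resolution as PNG bytes."""
    import fitz  # PyMuPDF; imported lazily so clean_markdown works without it

    doc = fitz.open(pdf_path)
    mat = fitz.Matrix(zoom, zoom)
    return [page.get_pixmap(matrix=mat).tobytes("png") for page in doc]
```

The 2x zoom trades larger payloads for noticeably better recognition of subscripts and small symbols, which matters most for the mathematical notation the author cared about.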
The Gossip
Semantic Serverless Scrutiny
Users debate the article's title, questioning if "serverless" truly applies when utilizing powerful cloud GPUs and if the claim of "40 lines of code" is misleading given the heavy reliance on extensive external models and frameworks. Others clarify that "serverless" refers to abstracting away infrastructure management, regardless of the underlying computational resources, and acknowledge that modern development often involves leveraging substantial existing libraries.
Comparing Character Converters
The discussion delves into the choice of DeepSeek OCR, with several commenters pointing out that it might no longer be the state-of-the-art. Alternatives like `dots`, `olmOCR`, and the newer DeepSeek-OCR2 are suggested, with the author acknowledging the value of this new information. There's also a robust debate comparing it to Tesseract, with some users highlighting Tesseract's surprising effectiveness for specific use cases and the utility of `pdftotext` for PDFs with embedded text.
Deployment Deliberations
Commenters raise practical questions regarding deploying such a solution in a serverless cloud environment. Specific inquiries include optimizing model loading (e.g., pre-baking models into the container image versus downloading on demand) and security concerns around publicly accessible OCR endpoints. One user also speculates on the legality of DeepSeek OCR's reported use for generating training data for other large language models at scale.