GLM-OCR: Accurate × Fast × Comprehensive
GLM-OCR is a multimodal OCR model for complex document understanding, built on an encoder-decoder architecture with novel training techniques. It achieves state-of-the-art results on major benchmarks while remaining efficient and practical to deploy. Its open-source release, flexible deployment options, and robust SDK make it a compelling choice for developers evaluating cutting-edge OCR solutions.
The Lowdown
GLM-OCR is a novel multimodal Optical Character Recognition (OCR) model designed for understanding complex documents. It leverages a GLM-V encoder-decoder architecture, incorporating Multi-Token Prediction (MTP) loss and stable full-task reinforcement learning to boost training efficiency, recognition accuracy, and generalization. The model integrates a CogViT visual encoder, a lightweight cross-modal connector, and a GLM-0.5B language decoder, all within a two-stage pipeline for robust OCR performance.
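The two-stage flow described above can be sketched in a few lines of Python. This is an illustrative stub, not the actual GLM-OCR code: the function and class names (`detect_layout`, `recognize`, `parse_page`, `Region`) are assumptions, and the real stage 1 would call a layout model such as the integrated PP-DocLayoutV3 while stage 2 runs the encoder-decoder recognizer.

```python
# Hypothetical sketch of a two-stage document OCR pipeline:
# stage 1 detects typed layout regions, stage 2 recognizes content
# per region, and results are assembled in reading order.
from dataclasses import dataclass

@dataclass
class Region:
    bbox: tuple  # (x0, y0, x1, y1) in page coordinates
    kind: str    # e.g. "title", "text", "table", "formula"

def detect_layout(page):
    # Stage 1: a layout-detection model would return regions here;
    # we fabricate two regions purely for illustration.
    return [Region((0, 0, 100, 20), "title"),
            Region((0, 30, 100, 200), "text")]

def recognize(page, region):
    # Stage 2: the encoder-decoder model would decode the crop's content.
    return f"<{region.kind} content>"

def parse_page(page):
    # Sort by vertical position as a crude stand-in for reading order.
    regions = sorted(detect_layout(page), key=lambda r: r.bbox[1])
    return [{"kind": r.kind, "text": recognize(page, r)} for r in regions]
```

The split matters in practice: layout detection localizes and types each block, so the recognizer sees small, homogeneous crops rather than a whole cluttered page.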
Key features of GLM-OCR include:
- State-of-the-Art Performance: It scores 94.62 on OmniDocBench V1.5, ranking #1, and excels across document understanding benchmarks, including formula, table, and information extraction.
- Optimized for Real-World Scenarios: Designed for practical business use, it maintains strong performance on challenging layouts like complex tables, code, and seals.
- Efficient Inference: With only 0.9B parameters, it supports deployment via vLLM, SGLang, and Ollama, reducing latency and compute costs for high-concurrency or edge environments.
- Easy to Use: Fully open-sourced with a comprehensive SDK and toolchain, offering simple installation and integration.
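Since vLLM and SGLang expose an OpenAI-compatible `/v1/chat/completions` endpoint, a self-hosted instance can be queried with a plain HTTP request. The sketch below builds such a request payload; the model name `"glm-ocr"`, the prompt text, and the PNG media type are assumptions, not the official serving recipe.

```python
# Hedged sketch: build an OpenAI-compatible chat request carrying an
# image for OCR. POST this as JSON to e.g.
# http://localhost:8000/v1/chat/completions on a vLLM server.
import base64

def build_ocr_request(image_bytes: bytes, model: str = "glm-ocr") -> dict:
    b64 = base64.b64encode(image_bytes).decode()
    return {
        "model": model,  # assumed served-model name
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
                {"type": "text",
                 "text": "Extract the document content as Markdown."},
            ],
        }],
    }
```

The same payload works against any OpenAI-compatible server, which is what makes swapping between vLLM, SGLang, or a hosted API a configuration change rather than a code change.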
Users can interact with GLM-OCR through several deployment options: a hosted Zhipu MaaS API for quick starts without GPUs, self-hosting with vLLM or SGLang for full control, or specialized deployments on Apple Silicon (mlx-vlm) and Ollama. The provided SDK offers CLI, Python API, and Flask service interfaces for parsing images and documents, with results output as JSON or Markdown.

The model's modular architecture allows customization by extending core components such as PageLoader, OCRClient, and ResultFormatter. The project is open-sourced under the Apache License 2.0 (code) and the MIT License (model), with the integrated PP-DocLayoutV3 also under Apache License 2.0.
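The extension points named above (PageLoader, OCRClient, ResultFormatter) suggest a pattern like the following. The base-class interface shown here is an assumption for illustration, not the SDK's actual signature; it sketches a custom formatter that renders parsed blocks as Markdown.

```python
# Hypothetical sketch of subclassing a ResultFormatter-style extension
# point to control output rendering. The base class and block schema
# are illustrative assumptions, not the real GLM-OCR SDK API.

class ResultFormatter:
    """Assumed interface: turn a list of parsed blocks into a string."""
    def format(self, blocks: list[dict]) -> str:
        raise NotImplementedError

class MarkdownFormatter(ResultFormatter):
    def format(self, blocks: list[dict]) -> str:
        lines = []
        for b in blocks:
            if b["kind"] == "title":
                lines.append(f"# {b['text']}")
            else:
                # tables/formulas assumed to arrive already serialized
                lines.append(b["text"])
        return "\n\n".join(lines)
```

A JSON formatter would follow the same shape, which is the point of the modular design: page loading, model calls, and output rendering vary independently.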