Show HN: Llama 3.1 70B on a single RTX 3090 via NVMe-to-GPU bypassing the CPU
NTransformer, a new C++/CUDA inference engine, runs Llama 3.1 70B on a single RTX 3090 by streaming model layers directly from an NVMe SSD into GPU memory. The data path bypasses the CPU entirely and achieves a 33x speedup over an mmap baseline for large models on consumer hardware. The project pushes the limits of local LLM inference and sparked a lively discussion about optimizing resource-constrained AI workloads.
The Lowdown
The 'NTransformer' project introduces a high-efficiency C++/CUDA LLM inference engine designed to overcome the VRAM limitations of consumer GPUs for large language models. Its core innovation lies in a 3-Tier Adaptive Caching system that streams model layers directly from NVMe storage to GPU memory, bypassing the CPU altogether.
- Core Capability: Runs Llama 3.1 70B (a typically VRAM-hungry model) on an RTX 3090 with only 24GB VRAM, consuming 23.1GB VRAM and 51GB RAM.
- Performance: Achieves 0.2 tokens/second for the 70B model, representing a 33x speedup compared to an mmap baseline that struggles with page cache thrashing.
- Key Innovation - NVMe Direct Streaming: Model weights are written directly to raw NVMe blocks. A userspace NVMe driver binds the SSD to VFIO, allowing direct DMA reads into pinned, GPU-accessible memory with no CPU involvement in the data path (a rough sketch of this setup follows the list).
- 3-Tier Adaptive Caching: Layers are dynamically managed across VRAM (resident, zero I/O), pinned RAM (host-to-device DMA), and NVMe/mmap (fallback), with tier sizes automatically adjusted to the available hardware (a second sketch after the list illustrates the placement policy).
- Minimal Dependencies: Built in C++/CUDA with zero external dependencies beyond the CUDA Toolkit, promoting a lean and efficient codebase.
- Roadmap: Future plans include advanced quantization techniques (e.g., INT2 KV-cache), support for novel architectures like Mamba, and API polish.
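The NVMe-direct data path itself is not spelled out in the post, but the general shape of a VFIO-based setup can be sketched. The snippet below is a hypothetical illustration, not NTransformer's code: it assumes the SSD is already bound to vfio-pci, that `container_fd` is an open VFIO container with an IOMMU type configured, and that `iova` is an address the caller will later place in the NVMe command's data pointers. It only shows how a pinned, GPU-visible staging buffer is made DMA-addressable to the drive.

```cpp
// Hypothetical sketch: map a pinned, GPU-visible staging buffer for NVMe DMA.
// Assumes the SSD is already bound to vfio-pci and `container_fd` is an open
// VFIO container with an IOMMU type set. Not NTransformer's actual code.
#include <cuda_runtime.h>
#include <linux/vfio.h>
#include <sys/ioctl.h>
#include <cstdint>
#include <cstdio>

void* map_staging_buffer(int container_fd, size_t bytes, uint64_t iova) {
    // Must precede context creation so mapped pinned memory is usable.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    // 1. Pinned host memory that the GPU can also address (zero-copy mapping).
    void* host_buf = nullptr;
    cudaHostAlloc(&host_buf, bytes, cudaHostAllocMapped);

    // 2. Expose the same pages to the NVMe controller through the IOMMU,
    //    so read commands can target `iova` directly.
    vfio_iommu_type1_dma_map dma_map{};
    dma_map.argsz = sizeof(dma_map);
    dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
    dma_map.vaddr = reinterpret_cast<uintptr_t>(host_buf);
    dma_map.iova  = iova;
    dma_map.size  = bytes;
    if (ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &dma_map) != 0) {
        perror("VFIO_IOMMU_MAP_DMA");
        return nullptr;
    }

    // 3. Device-side alias of the same buffer: kernels can read layer weights
    //    here once the NVMe read completes, with no CPU copy in between.
    void* dev_ptr = nullptr;
    cudaHostGetDevicePointer(&dev_ptr, host_buf, 0);
    return dev_ptr;
}
```

In a real driver, the NVMe submission queue entry would point its PRP/SGL at `iova`, and completion-queue polling would provide the synchronization; all identifiers above are illustrative.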
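The adaptive tier split can be pictured as a simple budget-driven placement pass. The sketch below is a guess at the general idea, assuming per-layer byte sizes and measured VRAM/RAM budgets are known at load time; it is not the engine's actual heuristic.

```cpp
// Hypothetical 3-tier placement policy: earlier layers go to VRAM, then pinned
// RAM, and the rest stay on NVMe. Budgets are assumed to be measured at
// startup; this is an illustration, not NTransformer's actual logic.
#include <cstddef>
#include <vector>

enum class Tier { VRAM, PinnedRAM, NVMe };

std::vector<Tier> assign_tiers(const std::vector<size_t>& layer_bytes,
                               size_t vram_budget, size_t ram_budget) {
    std::vector<Tier> plan(layer_bytes.size(), Tier::NVMe);
    size_t vram_used = 0, ram_used = 0;
    for (size_t i = 0; i < layer_bytes.size(); ++i) {
        if (vram_used + layer_bytes[i] <= vram_budget) {
            plan[i] = Tier::VRAM;          // resident, zero I/O per token
            vram_used += layer_bytes[i];
        } else if (ram_used + layer_bytes[i] <= ram_budget) {
            plan[i] = Tier::PinnedRAM;     // host-to-device DMA per use
            ram_used += layer_bytes[i];
        }                                   // else: streamed from NVMe/mmap
    }
    return plan;
}
```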
This project is a significant step towards making larger LLMs accessible for local inference on commodity hardware, and a clever workaround for the memory-capacity and data-transfer bottlenecks that usually rule it out.
The Gossip
Token Throughput Troubles
While commenters applaud the technical achievement of running a 70B model on an RTX 3090, many question the practical utility of 0.2 tokens/second. At 5 seconds per token, a reply of a few hundred tokens takes tens of minutes, which rules out interactive use. The thread repeatedly compares this to the lower latency of smaller resident models (8B or 13B) or even CPU-based inference for larger models, though some argue that for specific high-quality, non-interactive tasks, the ability to run such a large model locally at all is a breakthrough worth the latency trade-off.
MoE's Memory Management Musings
A prominent theme is applying this tiered memory and streaming approach to Mixture-of-Experts (MoE) models: keep the most frequently activated experts in VRAM, less common ones in RAM, and rarely used ones on NVMe, effectively a multi-tier MoE. The debate touches on the complexity of dynamically routing and swapping experts, the difficulty of balancing expert utilization during training, and whether predictive layer/expert swapping could keep the available bandwidth saturated. Some note that this is an active area of research and that current MoE deployments already use similar keep-hot-experts-in-VRAM strategies.
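As a concrete illustration of the predictive-swapping idea, the sketch below prefetches the experts a router is expected to select for the next layer from pinned RAM into reserved VRAM slots on a separate CUDA stream, overlapping the copy with the current layer's compute. All names (`ExpertSlot`, `prefetch_experts`) are hypothetical; nothing like this is confirmed to exist in NTransformer or in the MoE runtimes discussed in the thread.

```cpp
// Hypothetical predictive prefetch for MoE experts: while layer k computes,
// copy the experts predicted for layer k+1 from pinned host RAM into
// pre-allocated VRAM slots on a side stream. Purely illustrative.
#include <cuda_runtime.h>
#include <cstddef>
#include <vector>

struct ExpertSlot {
    void*       vram_slot;  // pre-allocated scratch region in VRAM
    const void* pinned;     // expert weights resident in pinned host RAM
    size_t      bytes;
};

// Kick off async copies for the predicted experts; returns immediately.
void prefetch_experts(const std::vector<int>& predicted_ids,
                      const std::vector<ExpertSlot>& experts,
                      cudaStream_t copy_stream) {
    for (int id : predicted_ids) {
        const ExpertSlot& e = experts[id];
        // Pinned source lets the copy engine DMA without CPU staging.
        cudaMemcpyAsync(e.vram_slot, e.pinned, e.bytes,
                        cudaMemcpyHostToDevice, copy_stream);
    }
}

// Before layer k+1 runs, make its compute stream wait on the copies.
void sync_prefetch(cudaStream_t compute_stream, cudaStream_t copy_stream,
                   cudaEvent_t done) {
    cudaEventRecord(done, copy_stream);
    cudaStreamWaitEvent(compute_stream, done, 0);
}
```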
Architectural Aspirations & Alternatives
The discussion digs into implementation details and alternatives for direct GPU data access. Commenters ask how the approach relates to DirectX's DirectStorage API for loading assets directly into GPU memory, and compare the project's 'gpu-nvme-direct' backend to other existing direct-to-GPU DMA solutions. There is also speculation about extending the technique to even larger models (e.g., 1T-parameter models), the role of PCIe peer-to-peer transfers (GPUDirect), and suggestions for optimizing VRAM usage on smaller models to get more out of the NVMe streaming path.
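For context on the "other existing direct-to-GPU DMA solutions" comparison, one widely available option is NVIDIA's GPUDirect Storage (cuFile), which DMAs from a file opened with O_DIRECT straight into device memory. The minimal sketch below shows that API purely for contrast with the project's raw-block VFIO approach; the file path and sizes are placeholders, error handling is omitted, and O_DIRECT alignment requirements are glossed over.

```cpp
// Minimal GPUDirect Storage (cuFile) read into device memory, shown only for
// comparison with NTransformer's raw-block VFIO path. Error handling trimmed.
#include <cufile.h>
#include <cuda_runtime.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstring>

void* read_layer_gds(const char* path, size_t bytes, off_t file_offset) {
    cuFileDriverOpen();

    // O_DIRECT bypasses the page cache so the DMA can target VRAM directly.
    int fd = open(path, O_RDONLY | O_DIRECT);

    CUfileDescr_t descr;
    std::memset(&descr, 0, sizeof(descr));
    descr.handle.fd = fd;
    descr.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;

    CUfileHandle_t handle;
    cuFileHandleRegister(&handle, &descr);

    void* dev_buf = nullptr;
    cudaMalloc(&dev_buf, bytes);
    cuFileBufRegister(dev_buf, bytes, 0);   // register the VRAM target for DMA

    // DMA from the SSD into device memory, no host bounce buffer.
    cuFileRead(handle, dev_buf, bytes, file_offset, /*devPtr_offset=*/0);

    cuFileBufDeregister(dev_buf);
    cuFileHandleDeregister(handle);
    close(fd);
    cuFileDriverClose();
    return dev_buf;
}
```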