Jamesob's guide to running SOTA LLMs locally
Jamesob presents a hardware-focused guide to building a local AI rig capable of running state-of-the-art LLMs, with detailed configurations and parts lists for two budget levels. It appeals to Hacker News's DIY spirit by showing how careful hardware and software tuning can bring local inference close to cloud-hosted performance, and how much system-level work a custom multi-GPU build takes.
The Lowdown
This GitHub guide, "jamesob's guide to running SOTA LLMs locally," provides an in-depth blueprint for assembling powerful, custom machines capable of running large language models without reliance on cloud providers. It addresses enthusiasts and professionals looking to invest significantly in local AI compute, balancing cost-effectiveness with cutting-edge performance.
- Budget Tiers: The guide outlines two main setups: a ~$2k option with 2x RTX 3090s (48GB VRAM) for models like Qwen3.6-27B and SOTA speech-to-text, and a ~$40k+ system utilizing 4x RTX 6000 Pros (384GB VRAM) for performance approaching Claude Opus.
- Hardware Philosophy: Jamesob opted for a last-gen EPYC DDR4 base system purchased from eBay to keep costs down, focusing the primary investment on high VRAM GPUs.
- Custom PCIe Interconnect: A key innovation is the use of indie PCIe4 switches (from c-payne.com) to enable direct GPU-to-GPU communication at wire speeds, significantly reducing latency for tensor parallelism.
- Detailed Configuration: The guide meticulously details necessary BIOS settings (e.g., PCIe bifurcation, forcing Gen4 link speed, disabling ASPM, enabling Re-Size BAR, disabling SR-IOV) and kernel/GRUB parameters (iommu=off) critical for optimal performance and multi-GPU P2P communication.
- ACS Disablement: It emphasizes the crucial step of disabling Access Control Services (ACS) at runtime to ensure P2P traffic stays within the switch fabric, preventing it from being routed through the CPU root port.
- Power Management: A practical solution for running such a high-power setup on a standard 110V circuit is presented, involving power limiting the RTX 6000 Pros to 350W each via nvidia-smi to manage overall draw.
- Model Hosting: The guide details a local model weight hoarding strategy using ZFS and Docker Compose for isolating and serving models with vLLM configurations.
- Performance Metrics: The resulting system achieves Gen4 line rate P2P performance of 27.5 GB/s unidirectional / 50.4 GB/s bidirectional with sub-microsecond latency.
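The power-limiting step in the list above can be sketched as a small Python helper. This is an illustration under stated assumptions, not the guide's own script: the function names are hypothetical, the GPUs are assumed to be indexed 0 through 3, and the code only builds the `nvidia-smi -i <index> -pl <watts>` commands rather than running them (applying a limit typically needs root privileges).

```python
# Sketch only: builds (does not execute) nvidia-smi power-limit commands.
# GPU count and wattage come from the summary; the helper names and the
# 0..N-1 GPU indexing are assumptions made for illustration.

GPU_COUNT = 4        # high-end build described in the guide
POWER_LIMIT_W = 350  # per-GPU cap cited in the guide


def power_limit_commands(gpu_count: int, limit_w: int) -> list[list[str]]:
    """Return one `nvidia-smi -i <index> -pl <watts>` command per GPU."""
    return [
        ["nvidia-smi", "-i", str(index), "-pl", str(limit_w)]
        for index in range(gpu_count)
    ]


def total_gpu_draw_w(gpu_count: int, limit_w: int) -> int:
    """Upper bound on combined GPU draw once every card is capped."""
    return gpu_count * limit_w


if __name__ == "__main__":
    for cmd in power_limit_commands(GPU_COUNT, POWER_LIMIT_W):
        print(" ".join(cmd))
    print(f"Capped GPU total: {total_gpu_draw_w(GPU_COUNT, POWER_LIMIT_W)} W")
```

Keeping the commands as argument lists means they could be handed to `subprocess.run` unchanged on a machine where the driver is installed. The total covers the GPUs only; CPU, drives, and fans draw on top of it.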
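To put the "Gen4 line rate" figures in context, the measured P2P numbers can be compared with the theoretical ceiling of a PCIe Gen4 link. The sketch below assumes each GPU sits on an x16 link, which the summary does not state. The 16 GT/s per-lane rate and 128b/130b encoding are standard PCIe 4.0 parameters, and the function names are illustrative.

```python
# Rough arithmetic: how close are the reported P2P figures to the
# theoretical PCIe Gen4 ceiling? The x16 link width is an assumption.

GEN4_GT_PER_S = 16.0             # PCIe 4.0 raw signalling rate per lane
ENCODING_EFFICIENCY = 128 / 130  # 128b/130b line encoding
LANES = 16                       # assumed link width (not in the summary)

MEASURED_UNI_GB_S = 27.5   # unidirectional figure reported in the guide
MEASURED_BIDI_GB_S = 50.4  # bidirectional figure reported in the guide


def theoretical_gb_per_s(gt_per_s: float, lanes: int, efficiency: float) -> float:
    """Per-direction bandwidth in GB/s, before packet-level overhead."""
    return gt_per_s * lanes * efficiency / 8  # 8 bits per byte


def fraction_of_peak(measured_gb_s: float, peak_gb_s: float) -> float:
    """Measured throughput as a fraction of the theoretical peak."""
    return measured_gb_s / peak_gb_s


peak_uni = theoretical_gb_per_s(GEN4_GT_PER_S, LANES, ENCODING_EFFICIENCY)
peak_bidi = 2 * peak_uni  # both directions saturated at once

if __name__ == "__main__":
    print(f"Theoretical per-direction peak: {peak_uni:.1f} GB/s")
    print(f"Unidirectional: {fraction_of_peak(MEASURED_UNI_GB_S, peak_uni):.0%} of peak")
    print(f"Bidirectional:  {fraction_of_peak(MEASURED_BIDI_GB_S, peak_bidi):.0%} of peak")
```

Under these assumptions the per-direction ceiling is about 31.5 GB/s. The gap between that and the measured values is commonly attributed to packet header and flow-control overhead, which this estimate leaves out.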
Ultimately, jamesob's guide is a comprehensive, highly technical resource for anyone building a high-performance local LLM workstation. It pairs strategic hardware choices with detailed system-level configuration to deliver strong inference performance without relying on cloud providers.