Don't rent the cloud, own instead
Comma.ai shares an in-depth look at their $5M on-premise data center, advocating for owning compute infrastructure over renting cloud services. They argue this approach significantly reduces costs and fosters superior engineering, especially for consistent, compute-intensive AI workloads. This contrarian stance sparks a fiery debate among Hacker News readers about the true total cost of ownership, operational risks, and the strategic value of infrastructure choices.
The Lowdown
Comma.ai, a company at the forefront of self-driving car technology, has openly challenged the conventional wisdom of relying solely on cloud providers by successfully operating its own $5M data center. Their blog post outlines the motivations, practicalities, and benefits of this "own instead of rent" approach, particularly for their consistent, compute-heavy machine learning workloads.
- Motivation for Owning: Comma.ai highlights avoiding cloud vendor lock-in, significantly reducing operational costs (estimating $25M cloud cost vs. $5M spent on-prem), and fostering a culture of solving "real-world challenges" in hardware and infrastructure rather than cloud-specific APIs and billing systems.
- Power & Cooling: The data center consumes about 450kW at peak, with electricity being a major expense (over $500k/year in San Diego). They leverage San Diego's mild climate for pure outside air cooling, controlled by a custom PID loop, minimizing energy consumption for climate control.
- Hardware: The setup features 600 GPUs across 75 in-house built TinyBox Pro machines for compute, alongside several racks of Dell machines with 4PB of SSD storage for data. The network infrastructure includes 100Gbps Ethernet and Infiniband for GPU interconnects.
- Software Infrastructure: For management, they use `pxeboot` and `salt`. Distributed storage relies on `minikeyvalue` (mkv), including a 3PB non-redundant array for raw driving data and a redundant mkv store for models and metrics. `Slurm` manages compute jobs, and `torch.distributed` with FSDP is employed for PyTorch training.
- Custom Tools: Comma.ai developed `miniray`, an open-source lightweight task scheduler for running parallel Python code on idle machines, and a custom experiment-tracking service similar to Weights & Biases.
- Monorepo Workflow: All code resides in a small NFS monorepo. Local changes are cached on a shared drive for distributed jobs, ensuring consistent environments, with package management handled by `uv`.
- Complex Workflows: The infrastructure is specifically designed to support sophisticated tasks such as on-policy driving model training, where data generation occurs concurrently with training runs.
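The post doesn't publish the cooling controller itself, but the "custom PID loop" driving outside-air cooling follows a standard shape. A minimal single-loop sketch in Python, where every gain, setpoint, and sensor reading is a hypothetical placeholder rather than comma.ai's actual values:

```python
# Minimal PID controller sketch for outside-air cooling: modulate fan duty
# cycle to hold a target rack-inlet temperature. All gains, setpoints, and
# readings below are illustrative assumptions, not comma.ai's real tuning.

class PID:
    def __init__(self, kp, ki, kd, setpoint, out_min=0.0, out_max=1.0):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.setpoint = setpoint
        self.out_min, self.out_max = out_min, out_max
        self.integral = 0.0
        self.prev_error = None

    def update(self, measurement, dt):
        error = measurement - self.setpoint  # positive when too warm
        self.integral += error * dt
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        out = self.kp * error + self.ki * self.integral + self.kd * derivative
        # Clamp to the actuator range (fan duty 0..1); undo the integration
        # step when saturated, a simple anti-windup guard.
        clamped = min(max(out, self.out_min), self.out_max)
        if clamped != out:
            self.integral -= error * dt
        return clamped

# Usage: drive fan speed from an inlet-temperature reading once per second.
pid = PID(kp=0.08, ki=0.01, kd=0.0, setpoint=24.0)  # hypothetical 24 °C target
fan_duty = pid.update(measurement=27.5, dt=1.0)     # warmer than target → spin up
```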
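The scheduling pattern behind a tool like `miniray` can be conveyed with the standard library, fanning independent Python tasks out to workers and collecting results as they complete. This uses `ThreadPoolExecutor` as a single-machine stand-in; `miniray`'s actual API and its dispatch to idle machines over the network are not reproduced here, and `process_segment` is a made-up example task:

```python
# Task-scheduling sketch in the spirit of miniray: submit independent Python
# callables, gather results as workers finish. ThreadPoolExecutor is a local
# stand-in for miniray's cross-machine fan-out.
from concurrent.futures import ThreadPoolExecutor, as_completed

def process_segment(segment_id: int) -> int:
    # Placeholder for per-segment work, e.g. decoding or featurizing a drive log.
    return segment_id * segment_id

def run_parallel(segment_ids, workers: int = 4) -> dict:
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(process_segment, s): s for s in segment_ids}
        for fut in as_completed(futures):
            results[futures[fut]] = fut.result()
    return results

squares = run_parallel(range(8))  # {0: 0, 1: 1, 2: 4, ..., 7: 49}
```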
Comma.ai presents a compelling case for vertical integration of compute infrastructure, demonstrating that with sufficient expertise and scale, owning a data center can be a strategic and cost-effective decision, particularly for companies with specialized and consistent high-performance computing needs.
The Gossip
The Cloud vs. Cash Calculation
The central debate on HN revolves around the financial and operational trade-offs between cloud and on-premise infrastructure. Many commenters agree that for large, consistent workloads like comma.ai's, self-hosting or bare metal can be significantly cheaper, often a third to a fifth of equivalent cloud pricing. Critics counter with the hidden costs of on-prem: high upfront capital expenditure, increased operational risk (hardware failures, disaster recovery, security), and the need for specialized in-house IT/DevOps talent. Some argue that the cloud's flexibility and OPEX model are crucial for startups and variable workloads, while others respond that the operational burden of cloud-specific APIs and vendor lock-in erodes those benefits. Colocation is frequently mentioned as a pragmatic middle ground.
Competence Cultivation and Control
A significant theme discusses the type of engineering culture and expertise fostered by each infrastructure approach. The article's emphasis on solving "real-world challenges" with hardware versus mastering "company-specific APIs" resonated, with many commenters preferring the deeper technical knowledge required for on-premise infrastructure. This sparked discussions about whether internal teams are "competent" enough to run their own infrastructure, with some arguing that developers and sysadmins often underestimate their capabilities. The desire for greater control, data sovereignty (especially noted for European companies), and avoiding cloud "lock-in" were strong motivators for on-premise advocates.
Disaster Dread and Duplication
Several comments stress the critical importance of disaster recovery and redundancy, often cited as major advantages of hyperscale cloud providers. Since comma.ai operates a single data center, many question its contingency plans for catastrophic events like fires or floods, contrasting this with the built-in redundancy of cloud regions. Some propose that if a company is building one data center, building two (or strategically colocating) for resilience, even at higher upfront cost, would still be more economical than a full cloud migration in the long run, given the cost of downtime.
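The "build two" argument reduces to back-of-envelope arithmetic using the article's own rough numbers; the per-hour downtime cost here is an illustrative assumption:

```python
# Redundancy back-of-envelope: duplicating a $5M data center still undercuts
# the ~$25M estimated cloud bill. The $50k/hour outage cost is an assumed
# figure for illustration, not from the article.
onprem_dc_cost = 5_000_000
cloud_estimate = 25_000_000

two_site_cost = 2 * onprem_dc_cost          # full duplicate site for disaster recovery
headroom = cloud_estimate - two_site_cost   # budget left vs. the cloud estimate

# Pricing in downtime risk, the gap stays wide: at an assumed $50k/hour of
# outage cost, the remaining headroom covers 300 hours of downtime.
downtime_hours_covered = headroom / 50_000
```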
Hybrid Hardware Horizons
Commenters expand the discussion beyond a simple binary choice, presenting various hybrid and intermediate options: managed private cloud services, renting bare metal (e.g., from providers like Hetzner), or buying hardware and colocating it in a third-party data center. These options aim to capture some of the cost benefits of owning while offloading physical infrastructure management, power, and cooling. The consensus is that the optimal solution depends heavily on a company's scale, workload characteristics, available capital, and risk tolerance, suggesting a spectrum of viable approaches.