Launch HN: Expanse (YC P26) – Unlock Wasted GPU Capacity
Tired of your GPUs sitting idle while your budget burns? Expanse has launched to tackle the colossal waste in HPC and GPU clusters, where over-provisioning driven by fear of job failure costs millions. Its AI-powered system predicts resource needs and warns of crashes before they happen, the kind of deep technical efficiency work that is music to HN's ears.
The Lowdown
Expanse, a Y Combinator-backed startup, is tackling the pervasive and costly problem of underutilized GPU and High-Performance Computing (HPC) clusters. The team of former HPC practitioners observed that current industry practice wastes a large share of compute capacity (often 59%) because users dramatically over-request resources to avoid job-killing failures. Expanse uses machine learning to match resources to workloads, aiming to recover millions of dollars in wasted compute.
- The Core Problem: HPC and GPU datacenters suffer from severe underutilization (30-40% effective use) because users "over-request" resources by 2-3 times to prevent job failures, which can cost days of work. This leads to immense financial waste, with one cluster alone wasting an estimated $8.5M per month.
- Expanse's Approach: The platform installs on cluster nodes and integrates with schedulers like SLURM and Kubernetes. Its inputs are live hardware telemetry (DCGM, CUPTI, cgroups), job source code, and submission scripts.
- AI-Powered Predictions: Deep learning models process this multimodal data to accurately predict necessary GPU VRAM, utilization, memory, CPUs, and walltime. These models are fine-tuned for each cluster and prioritize slight over-provisioning to maintain job stability.
- Triple Threat Capabilities:
  - Resource Prediction: Provides precise resource recommendations at submission time, coupled with confidence intervals and proactive failure warnings (e.g., OOM).
  - Live Observability: Offers real-time hardware telemetry and code stack profiling through an intuitive dashboard with minimal performance overhead.
  - Failure Diagnosis: Post-failure, it correlates collected data to generate solution-oriented logs, pinpointing causes and suggesting code-level fixes.
- Outperforming the State of the Art: Expanse says it significantly outperforms traditional methods like historical averages or heuristic rules. The team also reports that its models beat frontier LLMs (Gemini, Claude, GPT, Codex) by an 8x margin on prediction tasks, and attributes the gap to LLMs' lack of native support for multimodal inputs such as code structure and hardware performance data.
- Business Model & Call to Action: Expanse is currently offering paid pilot programs, starting with a two-week capacity assessment for interested HPC/GPU cluster operators (100+ GPUs). They are actively seeking feedback on their approach and potential new prediction use cases.
Expanse offers a technically sophisticated, AI-driven solution to a pressing economic and operational challenge in high-performance computing. By precisely predicting resource needs and proactively preventing failures, they promise to transform inefficient, over-provisioned clusters into highly optimized engines, saving millions and accelerating research. Their performance claims against both traditional and cutting-edge LLM-based approaches highlight a unique and valuable market position.
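To make the over-request arithmetic from "The Core Problem" concrete, and to show what a "historical" baseline of the kind Expanse claims to beat might look like, here is a minimal Python sketch. It is not Expanse's method: the function names, the 95th-percentile choice, and the 10% headroom are all assumptions made for illustration. The first function shows that requesting 2-3x what a job needs caps the used share of the reservation at roughly 33-50%, close to the utilization range quoted above. The second adds headroom to a high percentile of past peaks, mirroring the idea of erring toward slight over-provisioning.

```python
from statistics import quantiles


def utilization_from_overrequest(factor: float) -> float:
    """Share of a reservation actually used when a job requests
    `factor` times what it needs (e.g. 2.0 means a 2x over-request)."""
    if factor < 1.0:
        raise ValueError("factor must be >= 1.0")
    return 1.0 / factor


def recommend_request(peak_usage_gb: list[float],
                      percentile: int = 95,
                      headroom: float = 0.10) -> float:
    """Hypothetical heuristic baseline (NOT Expanse's model): take a high
    percentile of past peak memory usage and add a safety margin, so the
    recommendation leans toward slight over-provisioning."""
    if len(peak_usage_gb) < 2:
        raise ValueError("need at least two historical runs")
    if not 1 <= percentile <= 99:
        raise ValueError("percentile must be between 1 and 99")
    # quantiles(..., n=100) returns 99 cut points; index p-1 is the p-th percentile.
    cuts = quantiles(peak_usage_gb, n=100, method="inclusive")
    return cuts[percentile - 1] * (1.0 + headroom)


if __name__ == "__main__":
    for factor in (2.0, 3.0):
        print(f"{factor:.0f}x over-request -> "
              f"{utilization_from_overrequest(factor):.0%} of the reservation used")

    history_gb = [10.0, 12.0, 11.0, 13.0, 12.0]  # made-up peak VRAM per past run
    print(f"recommended request: {recommend_request(history_gb):.2f} GB")
```

A baseline like this only sees past usage numbers. The approach described above also feeds in job source code and submission scripts, which is presumably where a learned model could do better on jobs that have little or no run history.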