Ask HN: Has anyone replaced Claude/GPT with a local model for daily coding?

Hacker News delves into the cutting-edge question of replacing cloud-based AI coding assistants like Claude and GPT with local models. While many report success, especially for personal projects, the consensus points to a persistent performance gap for complex professional tasks, balanced against significant hardware investment and the allure of data sovereignty. The discussion provides a deep technical dive into specific models, hardware configurations, and the evolving ecosystem of local AI tooling.

146

Score

Comments

Highest Rank

on Front Page

First Seen

Jun 15, 4:00 PM

Last Seen

Jun 15, 6:00 PM

Rank Over Time

The Lowdown

The Ask HN post queries whether developers are successfully swapping premium cloud-based AI coding assistants for local models in their daily workflow. This question sparked a vibrant discussion, with many users sharing their practical experiences, hardware setups, and the nuanced trade-offs involved.

Feasibility & Performance: Many users report being able to replace a significant portion of their coding tasks with local models, particularly for personal use or less complex projects. However, nearly all acknowledge that local models still lag behind frontier cloud models in raw intelligence and ability to handle highly complex, one-shot problems.
Hardware Requirements: Running capable local models demands substantial hardware, typically powerful GPUs with ample VRAM (e.g., RTX 3090, RTX 6000, M4 Max Macs, Strix Halo). The upfront cost of this hardware can be equivalent to several years of cloud subscriptions, leading to a cost-benefit analysis by many.
Preferred Models: Qwen 3.6 (27B/35B), Gemma 4 (26B/31B), and DeepSeek V4 Flash emerge as popular choices for coding tasks. Users often highlight that specific dense models (like Qwen 3.6 27B) can surprisingly outperform larger Mixture of Experts (MoE) versions of previous generations.
Tooling & Workflows: The effectiveness of local models is heavily dependent on the surrounding tooling and workflows. llama.cpp is a favored inference engine, and agentic harnesses like pi.dev are crucial for orchestrating tasks, chaining models, and providing necessary context and external capabilities (e.g., web search). Some express concerns about ollama's direction, recommending llama.cpp instead.
The 'Gap' Argument: A recurring theme is the perceived and actual performance gap between open-source local models and proprietary cloud giants. Some believe this gap is intentional or a natural outcome of differing business models and compute resources, while others remain optimistic that local capabilities will continue to catch up.

In essence, local AI for coding is now a viable, if demanding, option for many, particularly those prioritizing control, privacy, or minimizing recurring subscription costs. However, it requires a significant initial investment in hardware and a willingness to tinker with configurations and workflows to bridge the gap with the ever-advancing capabilities of commercial cloud services.

The Gossip

Frontier Frictions

Many users are actively experimenting with local models for coding, successfully replacing a notable portion of their daily tasks, especially for personal projects. However, a common refrain is that while local models are 'good enough' for many scenarios and have improved dramatically, they still lag behind the 'frontier' cloud models (like Claude Opus/Fable or GPT-5.5) in raw intelligence and the ability to handle complex, one-shot tasks, particularly in professional environments where the opportunity cost of inefficiency is high.

Hardware Horsepower & High Costs

The discussion is rich with details about specific hardware configurations and their resulting performance. Users frequently elaborate on their GPU setups (e.g., dual RTX 3090s, RTX 6000, M4 Max Macs, Strix Halo chips) and report tokens per second (tok/s). A significant barrier highlighted is the substantial upfront cost of powerful GPUs, which often equals years of cloud subscriptions, and the ongoing electricity consumption, making the cost-effectiveness a complex calculation.

Model Mêlée: Qwen, Gemma & DeepSeek

Specific open-source models are central to the conversation, with Qwen 3.6 (in its 27B/35B variants), Gemma 4 (26B/31B), and DeepSeek V4 Flash frequently cited as top performers for coding. Users often compare their efficacy, noting that dense models like Qwen 3.6 27B can surprisingly outperform larger Mixture of Experts (MoE) versions (e.g., Qwen 3.5 122B) for coding tasks. The importance of specific quantizations (like Q4_K_XL, A4B, A3B) and efficient context window management is also frequently emphasized.

Tooling, Tweaks, and Agent Architectures

Beyond raw model performance, users highlight the critical role of supporting software and sophisticated workflows. Tools like `llama.cpp` for inference and `pi.dev` as an agentic harness are discussed as essential for orchestrating tasks. Specific configurations, including `crush`, `headroom`, and web search integration (e.g., via Exa), are detailed. Several commenters also discuss building complex agentic workflows, chaining different models for specialized sub-tasks to improve overall effectiveness and 'ground' the models. A notable sub-theme is a growing sentiment that `ollama` is becoming less ideal, with recommendations often steering towards `llama.cpp` for local inference.