HN Today

Running Google Gemma 4 Locally with LM Studio's New Headless CLI and Claude Code

This technical deep dive demonstrates how to run Google's Gemma 4 (a Mixture-of-Experts LLM) locally on consumer hardware using LM Studio's new headless CLI. It highlights the efficiency of MoE models for local inference and provides a detailed guide to integrate Gemma 4 with Claude Code for offline coding assistance. The post resonates with HN's interest in privacy-preserving, cost-effective, and performant local AI solutions, offering practical steps and benchmarks.

Score: 30 · Comments: 7 · Highest Rank: #5 · Time on Front Page: 24h
First Seen: Apr 5, 7:00 PM · Last Seen: Apr 6, 6:00 PM
Rank Over Time: [chart omitted]

The Lowdown

This article explores the practical advantages of running large language models (LLMs) locally, circumventing the costs, latency, and privacy concerns associated with cloud AI APIs. It focuses on Google's Gemma 4, specifically the 26B-A4B Mixture-of-Experts (MoE) model, and how it can be efficiently deployed on a consumer laptop like a MacBook Pro using LM Studio's recently updated tooling.

  • Gemma 4's Local Appeal: The Gemma 4 26B-A4B MoE model is chosen for its efficiency: it activates only 4B parameters per forward pass, delivering quality comparable to a 10B dense model at a 4B inference cost. That profile makes it well suited to hardware with unified memory, such as Apple Silicon.
  • LM Studio's Headless Transformation: Version 0.4.0 of LM Studio introduces the llmster daemon and lms command-line interface (CLI), enabling headless operation. This allows for server deployments, CI/CD integration, and console-based workflows without the GUI.
  • Practical CLI Usage: The guide details installation of the lms CLI, starting the daemon, downloading Gemma 4, and initiating chat sessions, along with commands to monitor model status and performance metrics (tokens/second, time to first token).
  • Memory Management and Optimization: Critical aspects of local LLM deployment are covered, including estimating memory requirements at various context lengths (--estimate-only), configuring GPU offloading, managing concurrent requests, and setting model Time-To-Live (TTL).
  • Integration with Claude Code: A significant feature is demonstrating how to configure Claude Code to use the local LM Studio server via an Anthropic-compatible API endpoint. This enables fully offline, privacy-sensitive coding assistance by overriding environment variables to point Claude Code to the local Gemma 4 model.
  • Performance and Limitations: The author benchmarks Gemma 4 at 51 tokens/second on an M4 Pro. While noting the benefits of MoE models and the headless CLI, limitations such as slower performance compared to cloud APIs, conservative default context lengths, and memory pressure on 48GB machines are acknowledged.
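The CLI workflow described above can be sketched as a short shell session. The command names follow LM Studio's published `lms` CLI; the model identifier is a guess, and flags such as `--estimate-only` are taken from the article and may differ across versions, so treat this as an illustrative outline rather than exact syntax:

```shell
# Start the headless daemon (no GUI required)
lms server start

# Download the model from the LM Studio catalog
# (the "google/gemma-4-26b-a4b" identifier is a guess based on the article)
lms get google/gemma-4-26b-a4b

# Preview memory requirements at a given context length before loading
lms load google/gemma-4-26b-a4b --context-length 8192 --estimate-only

# Load with full GPU offload and auto-unload after an hour of inactivity
lms load google/gemma-4-26b-a4b --gpu max --context-length 8192 --ttl 3600

# Inspect loaded models, then start an interactive chat session
lms ps
lms chat
```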
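To see why an `--estimate-only` style memory check matters before loading, here is a rough back-of-the-envelope KV-cache calculation. The architecture numbers below (layer count, KV heads, head dimension) are placeholder assumptions for illustration, not Gemma 4's published specification:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, dtype_bytes: int = 2) -> int:
    """Approximate KV-cache size: 2 tensors (K and V) per layer,
    each of shape [n_kv_heads, context_len, head_dim], at fp16 by default."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * dtype_bytes

# Hypothetical architecture -- NOT Gemma 4's real configuration
layers, kv_heads, head_dim = 48, 8, 128

for ctx in (4096, 8192, 32768):
    gib = kv_cache_bytes(layers, kv_heads, head_dim, ctx) / 2**30
    print(f"context {ctx:>6}: ~{gib:.2f} GiB of KV cache")
```

Model weights dominate total memory, but the cache grows linearly with context length, which is one reason longer contexts create memory pressure on 48GB machines.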

The article concludes by emphasizing the transformative potential of MoE models for local inference and the workflow improvements brought by LM Studio's headless capabilities, offering a compelling case for migrating AI workflows from cloud to local hardware for specific use cases.

The Gossip

Tooling Talk: Simplicity and Stability

The discussion quickly turned to the practicalities of setting up local LLMs, comparing LM Studio (featured in the article) with Ollama. While the author's setup provides a detailed method, some commenters suggested Ollama offers a simpler experience, even sharing a direct command. However, one user reported that the LM Studio and Claude Code combination sometimes loses its place mid-session, causing interruptions, and preferred Ollama for its stability in local development.

Claude's Compatibility Conundrum

A key point of interest revolved around the interaction between Google's Gemma model and Anthropic's Claude Code. Commenters questioned how this integration is possible and whether Anthropic, known for its specific usage guidelines, would eventually make it harder for third-party models to use their frontend. The technical explanation clarified that LM Studio provides an Anthropic-compatible API endpoint, allowing Claude Code to be redirected to a local model.
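The redirection commenters ask about comes down to a few environment variables: Claude Code reads `ANTHROPIC_BASE_URL` and related settings to decide which server to talk to. The port below assumes LM Studio's default of `localhost:1234`, and the model identifier is a guess based on the article, so adjust both for your setup:

```shell
# Point Claude Code at the local LM Studio server instead of Anthropic's API.
# LM Studio must be serving an Anthropic-compatible endpoint (per the article).
export ANTHROPIC_BASE_URL="http://localhost:1234"

# Local servers typically ignore the key, but Claude Code expects one to be set
export ANTHROPIC_AUTH_TOKEN="lm-studio"

# Hypothetical model identifier; use the name your local server reports
export ANTHROPIC_MODEL="google/gemma-4-26b-a4b"

claude   # launches Claude Code against the local model
```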