HN Today

Running Google Gemma 4 Locally with LM Studio's New Headless CLI and Claude Code

This technical deep dive demonstrates how to run Google's Gemma 4 (a Mixture-of-Experts LLM) locally on consumer hardware using LM Studio's new headless CLI. It highlights the efficiency of MoE models for local inference and provides a detailed guide to integrate Gemma 4 with Claude Code for offline coding assistance. The post resonates with HN's interest in privacy-preserving, cost-effective, and performant local AI solutions, offering practical steps and benchmarks.

Score: 30 · Comments: 7 · Highest Rank: #5 · Time on Front Page: 24h
First Seen: Apr 5, 7:00 PM · Last Seen: Apr 6, 6:00 PM
Rank Over Time: [chart omitted]

The Lowdown

This article explores the practical advantages of running large language models (LLMs) locally, circumventing the costs, latency, and privacy concerns associated with cloud AI APIs. It focuses on Google's Gemma 4, specifically the 26B-A4B Mixture-of-Experts (MoE) model, and how it can be efficiently deployed on a consumer laptop like a MacBook Pro using LM Studio's recently updated tooling.

  • Gemma 4's Local Appeal: The Gemma 4 26B-A4B MoE model is chosen for its efficiency: it activates only 4B parameters per forward pass, delivering quality comparable to a 10B dense model at a 4B inference cost. That profile makes it well suited to hardware with unified memory, such as Apple Silicon.
  • LM Studio's Headless Transformation: Version 0.4.0 of LM Studio introduces the llmster daemon and lms command-line interface (CLI), enabling headless operation. This allows for server deployments, CI/CD integration, and console-based workflows without the GUI.
  • Practical CLI Usage: The guide details installation of the lms CLI, starting the daemon, downloading Gemma 4, and initiating chat sessions, along with commands to monitor model status and performance metrics (tokens/second, time to first token).
  • Memory Management and Optimization: Critical aspects of local LLM deployment are covered, including estimating memory requirements at various context lengths (--estimate-only), configuring GPU offloading, managing concurrent requests, and setting model Time-To-Live (TTL).
  • Integration with Claude Code: A significant feature is demonstrating how to configure Claude Code to use the local LM Studio server via an Anthropic-compatible API endpoint. This enables fully offline, privacy-sensitive coding assistance by overriding environment variables to point Claude Code to the local Gemma 4 model.
  • Performance and Limitations: The author benchmarks Gemma 4 at 51 tokens/second on an M4 Pro. While noting the benefits of MoE models and the headless CLI, limitations such as slower performance compared to cloud APIs, conservative default context lengths, and memory pressure on 48GB machines are acknowledged.
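The CLI workflow described above can be sketched as a short shell session. The command names follow LM Studio's published `lms` CLI; the model identifier is a guess, and flags such as `--estimate-only` are taken from the article and may differ across versions, so treat this as an illustrative outline rather than exact syntax:

```shell
# Start the headless daemon (no GUI required)
lms server start

# Download the model from the LM Studio catalog
# (the "google/gemma-4-26b-a4b" identifier is a guess based on the article)
lms get google/gemma-4-26b-a4b

# Preview memory requirements at a given context length before loading
lms load google/gemma-4-26b-a4b --context-length 8192 --estimate-only

# Load with full GPU offload and auto-unload after an hour of inactivity
lms load google/gemma-4-26b-a4b --gpu max --context-length 8192 --ttl 3600

# Inspect loaded models, then start an interactive chat session
lms ps
lms chat
```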
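To see why an `--estimate-only` style memory check matters before loading, here is a rough back-of-the-envelope KV-cache calculation. The architecture numbers below (layer count, KV heads, head dimension) are placeholder assumptions for illustration, not Gemma 4's published specification:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, dtype_bytes: int = 2) -> int:
    """Approximate KV-cache size: 2 tensors (K and V) per layer,
    each of shape [n_kv_heads, context_len, head_dim], at fp16 by default."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * dtype_bytes

# Hypothetical architecture -- NOT Gemma 4's real configuration
layers, kv_heads, head_dim = 48, 8, 128

for ctx in (4096, 8192, 32768):
    gib = kv_cache_bytes(layers, kv_heads, head_dim, ctx) / 2**30
    print(f"context {ctx:>6}: ~{gib:.2f} GiB of KV cache")
```

Model weights dominate total memory, but the cache grows linearly with context length, which is one reason longer contexts create memory pressure on 48GB machines.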

The article concludes by emphasizing the transformative potential of MoE models for local inference and the workflow improvements brought by LM Studio's headless capabilities, offering a compelling case for migrating AI workflows from cloud to local hardware for specific use cases.

The Gossip

Tooling Talk: Simplicity and Stability

The discussion quickly turned to the practicalities of setting up local LLMs, comparing LM Studio (featured in the article) with Ollama. While the author's setup provides a detailed method, some commenters suggested Ollama offers a simpler experience, even sharing a direct command. However, one user reported that the LM Studio and Claude Code combination sometimes loses its place mid-session, causing interruptions, and preferred Ollama for its stability in local development.

Claude's Compatibility Conundrum

A key point of interest revolved around the interaction between Google's Gemma model and Anthropic's Claude Code. Commenters questioned how this integration is possible and whether Anthropic, known for its specific usage guidelines, would eventually make it harder for third-party models to use their frontend. The technical explanation clarified that LM Studio provides an Anthropic-compatible API endpoint, allowing Claude Code to be redirected to a local model.
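The redirection commenters ask about comes down to a few environment variables: Claude Code reads `ANTHROPIC_BASE_URL` and related settings to decide which server to talk to. The port below assumes LM Studio's default of `localhost:1234`, and the model identifier is a guess based on the article, so adjust both for your setup:

```shell
# Point Claude Code at the local LM Studio server instead of Anthropic's API.
# LM Studio must be serving an Anthropic-compatible endpoint (per the article).
export ANTHROPIC_BASE_URL="http://localhost:1234"

# Local servers typically ignore the key, but Claude Code expects one to be set
export ANTHROPIC_AUTH_TOKEN="lm-studio"

# Hypothetical model identifier; use the name your local server reports
export ANTHROPIC_MODEL="google/gemma-4-26b-a4b"

claude   # launches Claude Code against the local model
```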