HN
Today

How to Setup a Local Coding Agent on macOS

Kyle Howells meticulously details his journey to establish a fast, local coding agent on macOS, leveraging Gemma 4 with llama.cpp and Multi-Token Prediction for significant performance gains. He provides comprehensive benchmarks, demonstrating how to achieve usable speeds and multimodal capabilities on Apple Silicon. This guide offers a practical, step-by-step approach for engineers seeking self-hosted, high-performance AI tooling, a topic always popular for its blend of technical depth and utility on Hacker News.

16
Score
3
Comments
#3
Highest Rank
19h
on Front Page
First Seen
Jun 12, 6:00 PM
Last Seen
Jun 13, 12:00 PM
Rank Over Time
103343465899121212111011913

The Lowdown

The author, Kyle Howells, embarked on a mission to establish a robust, offline coding agent environment on his macOS machine, motivated by internet outages and the promise of improved performance from Gemma 4's Multi-Token Prediction (MTP) update. The goal was a setup that was fast, compatible with OpenAI's API for broader tool integration, and capable of handling visual input like screenshots.

  • Core Setup: The final configuration utilizes llama.cpp built with Metal acceleration, the Gemma 4 26B-A4B main model (16GB), a Q8 MTP draft model for speculative decoding, the Gemma 4 multimodal projector, and the Pi terminal coding agent. This was tested on an Apple M1 Max with 64GB unified memory.
  • Performance Enhancement: Initial benchmarks showed the main Gemma 4 model achieving 58.2 tokens/second. Integrating the MTP draft model significantly boosted generation speed to 72.2 tokens/second (a 24% improvement) by optimizing --spec-draft-n-max to 3.
  • MLX Comparison: Surprisingly, llama.cpp with Metal and MTP proved faster than MLX-based models for this specific task, highlighting the continuous optimization within the llama.cpp project.
  • Multimodal Capability: The setup was enhanced with image input support via the Gemma 4 multimodal projector, allowing the Pi agent to process screenshots without impacting text generation speed.
  • Practical Implementation: The article provides detailed, step-by-step instructions for installing llama.cpp, downloading the necessary GGUF model files (main, draft, and projector), starting an OpenAI-compatible local llama-server, and configuring the Pi agent for both text and image input.
  • Alternative Model Consideration: While acknowledging that Qwen3.6 35B-A3B might be a "much better coding agent" in terms of quality, the author opted for Gemma 4 due to its superior speed (72 tok/s vs 55 tok/s for Qwen3.6) on his hardware, prioritizing usability.

In conclusion, the guide successfully outlines a practical and performant method for running a local, multimodal coding agent on macOS using Gemma 4. The key takeaway is the substantial benefit of incorporating the MTP draft model for speed and the simple integration of image capabilities, all while maintaining an OpenAI-compatible interface for ease of use with existing tools.