How to Setup a Local Coding Agent on macOS
Kyle Howells meticulously details his journey to establish a fast, local coding agent on macOS, leveraging Gemma 4 with llama.cpp and Multi-Token Prediction for significant performance gains. He provides comprehensive benchmarks, demonstrating how to achieve usable speeds and multimodal capabilities on Apple Silicon. This guide offers a practical, step-by-step approach for engineers seeking self-hosted, high-performance AI tooling, a topic always popular for its blend of technical depth and utility on Hacker News.
The Lowdown
The author, Kyle Howells, embarked on a mission to establish a robust, offline coding agent environment on his macOS machine, motivated by internet outages and the promise of improved performance from Gemma 4's Multi-Token Prediction (MTP) update. The goal was a setup that was fast, compatible with OpenAI's API for broader tool integration, and capable of handling visual input like screenshots.
- Core Setup: The final configuration utilizes
llama.cppbuilt with Metal acceleration, the Gemma 4 26B-A4B main model (16GB), a Q8 MTP draft model for speculative decoding, the Gemma 4 multimodal projector, and thePiterminal coding agent. This was tested on an Apple M1 Max with 64GB unified memory. - Performance Enhancement: Initial benchmarks showed the main Gemma 4 model achieving 58.2 tokens/second. Integrating the MTP draft model significantly boosted generation speed to 72.2 tokens/second (a 24% improvement) by optimizing
--spec-draft-n-maxto 3. - MLX Comparison: Surprisingly,
llama.cppwith Metal and MTP proved faster than MLX-based models for this specific task, highlighting the continuous optimization within thellama.cppproject. - Multimodal Capability: The setup was enhanced with image input support via the Gemma 4 multimodal projector, allowing the
Piagent to process screenshots without impacting text generation speed. - Practical Implementation: The article provides detailed, step-by-step instructions for installing
llama.cpp, downloading the necessary GGUF model files (main, draft, and projector), starting an OpenAI-compatible localllama-server, and configuring thePiagent for both text and image input. - Alternative Model Consideration: While acknowledging that Qwen3.6 35B-A3B might be a "much better coding agent" in terms of quality, the author opted for Gemma 4 due to its superior speed (72 tok/s vs 55 tok/s for Qwen3.6) on his hardware, prioritizing usability.
In conclusion, the guide successfully outlines a practical and performant method for running a local, multimodal coding agent on macOS using Gemma 4. The key takeaway is the substantial benefit of incorporating the MTP draft model for speed and the simple integration of image capabilities, all while maintaining an OpenAI-compatible interface for ease of use with existing tools.