Running local models on an M4 with 24GB memory
This post dives into the practicalities of running local AI models on Apple's M4 chip, providing a detailed guide on setup and configuration for developers. It highlights the benefits of local execution, like independence from big tech and cost-effectiveness, appealing to the Hacker News ethos of technical autonomy. The author shares their experience with specific models and tools, offering a realistic perspective on local AI's capabilities and limitations for everyday tasks.
The Lowdown
Johanna Larsson shares her journey and optimized setup for running local AI models on an M4 MacBook with 24GB of memory. Acknowledging that local models don't match the output of state-of-the-art cloud services, she emphasizes the excitement and benefits of having AI capabilities without an internet connection or dependence on major tech companies, despite the initial setup complexities.
- Setup Challenges: The author details the difficulty of choosing a runner (Ollama, llama.cpp, LM Studio) and of finding models that fit the memory budget while still offering a usable context window (a rough memory sketch follows this list). Models like Qwen 3.6 Q3, GPT-OSS 20B, and Devstral Small 24B were technically viable but impractical, while Gemma 4B ran but struggled with tool use.
- Optimal Configuration: The most effective model found was `qwen3.5-9b@q4_k_s` running on LM Studio, achieving about 40 tokens per second with thinking enabled and successful tool use, alongside a 128K context window. Specific settings for thinking mode and coding tasks, including temperature and top_p, are provided.
- Tool Integration: The post includes configuration snippets for integrating the local model with `pi.dev` and `OpenCode.ai`, noting `pi`'s snappier feel but also its extensive customization, which can invite over-tweaking (a hedged API sketch follows this list).
- Local vs. SOTA Models: A key distinction is drawn between local and SOTA cloud models: local models need interactive, step-by-step guidance rather than being left to solve problems independently. This hands-on approach, while less autonomous, is argued to foster greater engagement and prevent cognitive offloading.
- Practical Examples: Two examples illustrate what the model can and cannot do: it successfully suggested fixes for Elixir linter warnings (a minor task, but convenient), yet it struggled with a Git conflict, identifying the correct resolution strategy but failing to execute the changes.
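To make the "fits in memory" constraint concrete, here is the usual back-of-envelope math: quantized weights cost roughly parameters × bits-per-weight ÷ 8 bytes, and the KV cache grows linearly with context length. This is a minimal sketch; the model size, layer count, head geometry, and quantization level are illustrative assumptions, not figures from the post.

```python
# Back-of-envelope check: do quantized weights plus KV cache fit in a
# 24 GB unified-memory budget? All architecture numbers are illustrative.

def weights_gb(params_b: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB for a quantized model."""
    return params_b * bits_per_weight / 8

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context: int, bytes_per_elem: float = 2.0) -> float:
    """Approximate KV-cache memory in GB (keys + values, fp16 cache)."""
    return 2 * layers * kv_heads * head_dim * context * bytes_per_elem / 1e9

w = weights_gb(9, 4.5)                   # hypothetical ~9B model, Q4_K-style quant
for ctx in (8_192, 32_768, 131_072):
    kv = kv_cache_gb(40, 8, 128, ctx)    # hypothetical 40-layer GQA architecture
    print(f"ctx={ctx:>7}: weights {w:.1f} GB + KV {kv:.1f} GB "
          f"= {w + kv:.1f} GB of 24 GB")
# At long contexts the KV cache can rival the weights themselves; runners
# typically offer KV-cache quantization (e.g. 8-bit), which roughly halves
# the cache figures above and is often what makes 128K contexts viable.
```

This is why, on a 24GB machine, the context-window setting matters as much as the choice of model and quantization.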
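For a sense of how tools like `pi` and OpenCode wire up to a local model, here is a minimal sketch against LM Studio's OpenAI-compatible local server (served by default at `http://localhost:1234/v1`). The tool definition, sampling values, and prompt below are illustrative assumptions; the configuration snippets in the post itself are the authoritative settings.

```python
# Minimal sketch: calling a local LM Studio server through its
# OpenAI-compatible API. Tool definition, sampling values, and prompt
# are illustrative assumptions, not the post's exact configuration.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:1234/v1",  # LM Studio's default local server
    api_key="lm-studio",                  # any non-empty string works locally
)

tools = [{
    "type": "function",
    "function": {
        "name": "read_file",  # hypothetical tool a coding agent might expose
        "description": "Read a file from the working directory.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]

resp = client.chat.completions.create(
    model="qwen3.5-9b@q4_k_s",  # the model identifier from the post
    messages=[{"role": "user",
               "content": "Fix the linter warnings in this Elixir module."}],
    tools=tools,
    temperature=0.6,  # example "thinking mode" sampling values; the post
    top_p=0.95,       # documents its own recommended settings
)

msg = resp.choices[0].message
print(msg.tool_calls or msg.content)
```

Swapping the client's base URL is how most OpenAI-compatible tooling is pointed at a local runner, which is why these integrations are largely configuration rather than code.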
In conclusion, while local LLMs have significant tradeoffs compared to their cloud-based counterparts, they offer attractive benefits such as offline operation, reduced running costs (beyond initial hardware), lower individual environmental impact, and the sheer enjoyment of tinkering. The author suggests that experimenting with local models provides a more sustainable and positive interaction with AI technology, fostering engagement even when the models make mistakes.