Running local models on an M4 with 24GB memory
This post dives into the practicalities of running local AI models on Apple's M4 chip, providing a detailed guide on setup and configuration for developers. It highlights the benefits of local execution, like independence from big tech and cost-effectiveness, appealing to the Hacker News ethos of technical autonomy. The author shares their experience with specific models and tools, offering a realistic perspective on local AI's capabilities and limitations for everyday tasks.
The Lowdown
Johanna Larsson shares her journey and optimized setup for running local AI models on an M4 MacBook with 24GB of memory. Acknowledging that local models don't match the output of state-of-the-art cloud services, she emphasizes the excitement and benefits of having AI capabilities without an internet connection or dependence on major tech companies, despite the initial setup complexities.
- Setup Challenges: The author details the difficulty of choosing a runner (Ollama, llama.cpp, LM Studio) and of finding models that fit the memory budget while still offering a usable context window (a rough memory sketch follows this list). Models like Qwen 3.6 Q3, GPT-OSS 20B, and Devstral Small 24B were technically viable but impractical, while Gemma 4B ran but struggled with tool use.
- Optimal Configuration: The most effective model found was `qwen3.5-9b@q4_k_s` running on LM Studio, achieving about 40 tokens per second with thinking enabled and successful tool use, alongside a 128K context window. Specific settings for thinking mode and coding tasks, including temperature and top_p, are provided.
- Tool Integration: The post includes configuration snippets for integrating the local model with `pi.dev` and `OpenCode.ai`, noting `pi`'s snappier feel but also its extensive customization, which can invite over-tweaking (a hedged API sketch follows this list).
- Local vs. SOTA Models: A key distinction is drawn between local and SOTA cloud models: local models need interactive, step-by-step guidance rather than being left to solve problems independently. This hands-on approach, while less autonomous, is argued to foster greater engagement and prevent cognitive offloading.
- Practical Examples: Two examples illustrate what the model can and cannot do: it successfully suggested fixes for Elixir linter warnings (a minor task, but convenient), yet it struggled with a Git conflict, identifying the correct resolution strategy but failing to execute the changes.
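To make the "fits in memory" constraint concrete, here is the usual back-of-envelope math: quantized weights cost roughly parameters × bits-per-weight ÷ 8 bytes, and the KV cache grows linearly with context length. This is a minimal sketch; the model size, layer count, head geometry, and quantization level are illustrative assumptions, not figures from the post.

```python
# Back-of-envelope check: do quantized weights plus KV cache fit in a
# 24 GB unified-memory budget? All architecture numbers are illustrative.

def weights_gb(params_b: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB for a quantized model."""
    return params_b * bits_per_weight / 8

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context: int, bytes_per_elem: float = 2.0) -> float:
    """Approximate KV-cache memory in GB (keys + values, fp16 cache)."""
    return 2 * layers * kv_heads * head_dim * context * bytes_per_elem / 1e9

w = weights_gb(9, 4.5)                   # hypothetical ~9B model, Q4_K-style quant
for ctx in (8_192, 32_768, 131_072):
    kv = kv_cache_gb(40, 8, 128, ctx)    # hypothetical 40-layer GQA architecture
    print(f"ctx={ctx:>7}: weights {w:.1f} GB + KV {kv:.1f} GB "
          f"= {w + kv:.1f} GB of 24 GB")
# At long contexts the KV cache can rival the weights themselves; runners
# typically offer KV-cache quantization (e.g. 8-bit), which roughly halves
# the cache figures above and is often what makes 128K contexts viable.
```

This is why, on a 24GB machine, the context-window setting matters as much as the choice of model and quantization.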
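For a sense of how tools like `pi` and OpenCode wire up to a local model, here is a minimal sketch against LM Studio's OpenAI-compatible local server (served by default at `http://localhost:1234/v1`). The tool definition, sampling values, and prompt below are illustrative assumptions; the configuration snippets in the post itself are the authoritative settings.

```python
# Minimal sketch: calling a local LM Studio server through its
# OpenAI-compatible API. Tool definition, sampling values, and prompt
# are illustrative assumptions, not the post's exact configuration.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:1234/v1",  # LM Studio's default local server
    api_key="lm-studio",                  # any non-empty string works locally
)

tools = [{
    "type": "function",
    "function": {
        "name": "read_file",  # hypothetical tool a coding agent might expose
        "description": "Read a file from the working directory.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]

resp = client.chat.completions.create(
    model="qwen3.5-9b@q4_k_s",  # the model identifier from the post
    messages=[{"role": "user",
               "content": "Fix the linter warnings in this Elixir module."}],
    tools=tools,
    temperature=0.6,  # example "thinking mode" sampling values; the post
    top_p=0.95,       # documents its own recommended settings
)

msg = resp.choices[0].message
print(msg.tool_calls or msg.content)
```

Swapping the client's base URL is how most OpenAI-compatible tooling is pointed at a local runner, which is why these integrations are largely configuration rather than code.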
In conclusion, while local LLMs have significant tradeoffs compared to their cloud-based counterparts, they offer attractive benefits such as offline operation, reduced running costs (beyond initial hardware), lower individual environmental impact, and the sheer enjoyment of tinkering. The author suggests that experimenting with local models provides a more sustainable and positive interaction with AI technology, fostering engagement even when the models make mistakes.