HN Today

Ollama is now powered by MLX on Apple Silicon in preview

Ollama significantly boosts its Apple Silicon performance by integrating MLX, Apple's machine learning framework, promising faster local LLM operations for demanding tasks like coding agents. This technical leap leverages unified memory and new GPU Neural Accelerators, sparking keen Hacker News interest in on-device AI's privacy benefits and cost efficiency. The community debates the practicalities of hardware requirements, model quality, and the ongoing viability of local versus cloud-based LLM solutions.

Score: 68
Comments: 21
Highest Rank: #2
Time on Front Page: 16h
First Seen: Mar 31, 4:00 AM
Last Seen: Mar 31, 7:00 PM
Rank Over Time: (chart)

The Lowdown

Ollama has released a preview version for Apple Silicon that dramatically enhances performance by building on Apple's MLX machine learning framework. This update aims to push the boundaries of running large language models (LLMs) locally on macOS.

  • Accelerated Performance: The new version offers substantial speed improvements, including faster 'time to first token' (TTFT) and increased 'tokens per second' generation speed, particularly on M-series chips leveraging GPU Neural Accelerators.
  • MLX Integration: By adopting MLX, Ollama takes full advantage of Apple's unified memory architecture, leading to more efficient operations.
  • NVFP4 Support: Support for NVIDIA's NVFP4 4-bit format shrinks the memory footprint while largely preserving model accuracy and response quality.
  • Improved Caching: An upgraded caching system features lower memory utilization, intelligent checkpoints, and smarter eviction policies, making coding and agentic tasks more efficient and responsive.
  • Hardware Recommendations: Optimal performance, especially with models like Qwen3.5-35B-A3B, is achieved on Macs with 32GB or more of unified memory.
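The 32GB recommendation is easy to sanity-check with back-of-the-envelope arithmetic: weights at 4 bits per parameter, plus room for the KV cache and runtime. The sketch below is illustrative only; the 4 GB overhead figure is an assumption, not a number from the release.

```python
def estimate_memory_gb(params_billion: float, bits_per_weight: float,
                       overhead_gb: float = 4.0) -> float:
    """Rough memory estimate: quantized weights plus a fixed overhead
    (KV cache, runtime buffers). Illustrative only; real usage depends
    on context length and the inference runtime."""
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes / 1e9 + overhead_gb

# A 35B-parameter model at 4-bit quantization:
# 35e9 params * 0.5 bytes = 17.5 GB of weights alone, ~21.5 GB with
# overhead: comfortable on a 32GB Mac, a non-starter on 16GB.
print(round(estimate_memory_gb(35, 4), 1))
```

The same arithmetic explains why 16GB machines are steered toward smaller or more aggressively quantized models.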

This release positions Ollama as a powerful tool for developers and users looking to run sophisticated AI models directly on their Apple devices, highlighting a shift towards more capable on-device AI.

The Gossip

Local vs. Cloud Contention

A central debate revolves around the future of local versus cloud-based LLMs. Proponents of on-device AI highlight benefits like enhanced privacy, reduced data center demand, and potentially lower electricity consumption, arguing that 'most users don't need frontier model performance.' However, skeptics contend that cloud LLMs will always be faster and smarter, suggesting a complementary rather than a replacement relationship. The discussion also touches on the financial incentives behind open-source models and the potential for a chip manufacturing boom as hardware adapts to AI inferencing needs.

Performance and Hardware Hurdles

Commenters discuss the practical performance of Ollama's new MLX-powered version and the necessary hardware specifications. Many are excited by the speed improvements, with some already running large models (e.g., Qwen 70b) on M-series Macs with significant RAM (96GB). However, others express skepticism about achieving 'Claude Code' level performance on more common setups like 16GB RAM, noting that current local LLM quality can sometimes be a 'let down' compared to the hype. The necessity of 32GB+ unified memory is a recurring point of discussion regarding accessibility.
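The two metrics being compared, TTFT and tokens per second, compose into end-to-end latency in a simple way, which is why improvements to both matter for agentic workloads. The numbers below are hypothetical placeholders, not benchmarks from this release.

```python
def generation_seconds(ttft_s: float, tokens: int, tokens_per_s: float) -> float:
    """End-to-end latency model: time to first token, then steady decode."""
    return ttft_s + tokens / tokens_per_s

# Hypothetical numbers: 0.5s TTFT and 40 tok/s for a 400-token answer.
print(generation_seconds(0.5, 400, 40.0))
```

For long agentic sessions the decode rate dominates; for short interactive exchanges, TTFT is what users actually feel.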

Technical Deep Dive & Distinctions

Users seek clarification on the technical distinctions between Ollama, llama.cpp, GGML, and GGUF, as well as comparisons to other MLX inference engines such as Optiq. There is also discussion of quantization formats like NVFP4, with some noting that the advertised 'less than 1% degradation' in accuracy is qualified as applying only to 'some models'. The MLX integration is seen as a potentially significant improvement over Ollama's previous approach of shelling out to llama.cpp.
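As rough intuition for why 4-bit formats can stay close to full accuracy, here is a toy block-quantization sketch in the spirit of NVFP4: values are grouped into small blocks, each scaled so its largest magnitude maps to the top of the 4-bit (E2M1) range, then rounded to the nearest representable magnitude. This simplifies heavily (real NVFP4 stores FP8 block scales plus a tensor-level scale, and production quantizers are more careful), so treat it as an illustration, not the actual format.

```python
# Representable magnitudes of a 4-bit E2M1 float (sign handled separately).
FP4_LEVELS = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(block):
    """Scale a block so its max |value| maps to 6.0 (the E2M1 max),
    then round each element to the nearest FP4 level."""
    amax = max(abs(x) for x in block) or 1.0
    scale = amax / 6.0
    codes = []
    for x in block:
        mag = min(abs(x) / scale, 6.0)
        q = min(FP4_LEVELS, key=lambda lvl: abs(lvl - mag))
        codes.append(q if x >= 0 else -q)
    return scale, codes

def dequantize_block(scale, codes):
    return [scale * c for c in codes]

values = [0.12, -0.5, 0.33, 0.9, -0.07, 0.61, -0.24, 0.05,
          0.44, -0.88, 0.19, 0.72, -0.36, 0.27, -0.59, 0.81]
scale, codes = quantize_block(values)
recon = dequantize_block(scale, codes)
err = max(abs(a - b) for a, b in zip(values, recon))
# Worst-case error per element is scale * 1.0 (half the widest FP4 gap, 4 -> 6).
print(f"max abs error: {err:.3f}")
```

Because each block gets its own scale, a single outlier only degrades its own 16 or so neighbors rather than the whole tensor, which is the core reason block formats lose so little accuracy.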