Show HN: Needle: We Distilled Gemini Tool Calling into a 26M Model

Cactus Compute has unveiled Needle, a 26M parameter function-calling model engineered for ultra-efficient, on-device AI. This 'Simple Attention Network' eschews traditional feed-forward networks (FFNs), making it significantly smaller and faster than comparable models while still outperforming them on specific tasks. Its novel architecture and focus on local execution for AI agents caught Hacker News's attention, sparking discussions on practical applications and the ethics of model distillation.

Score: 151
Comments: 49
Highest Rank: #3
Time on Front Page: 17h
First Seen: May 12, 7:00 PM
Last Seen: May 13, 11:00 AM
Rank Over Time: [hourly rank chart]

The Lowdown

Needle, developed by Cactus Compute, is a groundbreaking 26-million-parameter function-calling (tool use) model designed to run efficiently on consumer devices like phones, watches, and smart glasses. Its core innovation lies in its 'Simple Attention Network' architecture, which relies solely on attention and gating mechanisms, completely omitting Feed-Forward Networks (FFNs).
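The 'attention plus gating, no FFN' idea can be sketched roughly as below. The weight names and the gated-residual layout are illustrative guesses, not Needle's published equations:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_gating_block(x, Wq, Wk, Wv, Wg):
    """One hypothetical 'attention + gating, no FFN' layer.

    x: (seq_len, d) token states. In a standard transformer block the
    attention output would be followed by an FFN; here a learned sigmoid
    gate modulates the residual update instead.
    """
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = softmax((q @ k.T) / np.sqrt(x.shape[-1]))  # attention weights
    attn_out = scores @ v
    gate = 1.0 / (1.0 + np.exp(-(x @ Wg)))  # sigmoid gate in place of an FFN
    return x + gate * attn_out              # gated residual update
```

Dropping the FFN matters for size because in a standard block the FFN typically holds roughly two-thirds of the parameters.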

  • Miniature yet Mighty: At just 26M parameters, Needle delivers impressive performance, achieving 6000 tokens/second prefill and 1200 tokens/second decode speeds on edge devices. The entire model is only 14MB in INT4 format.
  • Novel Architecture: The creators posit that for 'retrieval-and-assembly' tasks like function calling, large models are overkill, and FFN parameters are unnecessary. Needle's design validates this by using only cross-attention, which they found generalizes to other knowledge-driven tasks like RAG.
  • Training Regimen: It was pretrained on 200 billion tokens over 27 hours on 16 TPU v6e units, then post-trained for 45 minutes on 2 billion tokens of synthesized function-calling data, generated using Gemini across 15 tool categories.
  • Competitive Edge: Needle reportedly surpasses FunctionGemma-270M, Qwen-0.6B, and other larger models in single-shot function calling, though it's not intended for broader conversational tasks.
  • Open-Source & Accessible: The project is MIT licensed, with weights available on Hugging Face. It's designed for easy local testing and finetuning on Mac/PC via a provided playground and CLI tools.
  • Broader Vision: Needle is part of Cactus Compute's larger initiative to develop an inference engine specifically for mobile, wearables, and custom hardware, promoting local, personal AI.
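The 14MB INT4 figure is consistent with back-of-envelope arithmetic: 4 bits per weight gives about 13MB of raw weights, with quantization scales and other metadata plausibly accounting for the remainder.

```python
params = 26_000_000          # 26M parameters
bits_per_param = 4           # INT4 quantization
size_mb = params * bits_per_param / 8 / 1_000_000
print(f"{size_mb:.0f} MB")   # raw weights only; metadata adds the rest
```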

This innovative approach to highly efficient, task-specific AI represents a significant step towards practical, on-device agentic models that prioritize speed and resource conservation over general-purpose reasoning.

The Gossip

Pocket-Sized Power: Potential for Pervasive Personal AI

Commenters were enthusiastic about Needle's potential to enable practical, on-device AI. Many envisioned integrating it into command-line interfaces for natural-language parsing, making smart home platforms like Home Assistant more responsive, or using it as a specialized 'first pass' in more complex multi-agent systems. The creator confirmed that the model's primary goal is to act as a local, Siri-like core for edge devices, handling timers, weather requests, and other tool calls.
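Concretely, a function-calling model of this kind maps an utterance to a structured tool call that the device runtime executes locally. The tool schema and output shape below are hypothetical, not Needle's documented format:

```python
import json

# Hypothetical tool definition (names and fields are illustrative):
tools = [{
    "name": "set_timer",
    "description": "Start a countdown timer",
    "parameters": {"minutes": {"type": "integer"}},
}]

utterance = "set a timer for 10 minutes"

# The model's only job is to emit a structured call like this; the
# device runtime then executes it without any cloud round-trip:
call = {"tool": "set_timer", "arguments": {"minutes": 10}}
print(json.dumps(call))
```

This narrow contract is why commenters saw it fitting as a 'first pass' router in front of larger models.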

Deployment Dialogues: From Local Labs to Live Demos

The community actively discussed the practicalities of deploying and testing Needle. Initial issues with accessing the tokenizer repository were quickly resolved by the author. Several users requested a live demo for easier evaluation, leading to a community member successfully deploying a playground instance to a Hugging Face Space. There were also questions about CPU compatibility and running the model in containerized environments, highlighting the desire for accessible local deployment.

Distillation Disputes: ToS Tussles and Ethical Enquiries

A notable portion of the discussion revolved around the phrase 'distilled Gemini' and its implications. Some users raised concerns about violating Google's Terms of Service, which prohibit using its services to develop competing models or replicate their components. This sparked a debate about the ethics of such practices, with counter-arguments questioning Google's own data-acquisition methods. The author clarified that Needle does not compete with Gemini and that the 'distillation' involved synthesizing training data with Gemini, not accessing or replicating its model weights.
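The distinction the author draws (training on a teacher's outputs rather than its weights) can be sketched as follows; `fake_teacher` is a stand-in for a Gemini API call, and all names are illustrative:

```python
def synthesize_examples(teacher_generate, prompts):
    """Build (prompt, completion) training pairs from a teacher's outputs.

    The teacher's weights are never read; only its generated text is
    used, which is what 'distillation via synthesized data' means here.
    """
    return [(p, teacher_generate(p)) for p in prompts]

# Stand-in for a Gemini API call (illustrative only):
fake_teacher = lambda p: '{"tool": "get_weather", "arguments": {"city": "Paris"}}'

dataset = synthesize_examples(fake_teacher, ["what's the weather in Paris?"])
# `dataset` would then feed ordinary supervised finetuning of the student.
```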