
Show HN: Real-time AI (audio/video in, voice out) on an M3 Pro with Gemma E2B

Parlor shows how to run real-time, multimodal AI entirely on-device, using Google's Gemma E2B model on an Apple M3 Pro. Running locally eliminates server costs and makes features like natural voice and vision conversations possible without cloud infrastructure. The project is gaining traction on HN for demonstrating efficient, self-hostable AI and for its implications for privacy and cost-effective deployment.

Score: 4
Comments: 0
Highest Rank: #8
On Front Page: 14h
First Seen: Apr 6, 5:00 AM
Last Seen: Apr 6, 6:00 PM
Rank Over Time: [rank chart omitted]

The Lowdown

Parlor is an open-source project demonstrating real-time, on-device multimodal AI: users hold natural voice and vision conversations with a model that runs entirely on their local machine. Google's Gemma E2B handles speech and vision understanding, while Kokoro provides text-to-speech. By eliminating the need for costly server infrastructure, the project aims to make this class of AI more private and accessible.

  • On-Device AI: The core innovation is running complex multimodal AI models, including speech and vision understanding, directly on a user's machine, specifically highlighting performance on an Apple M3 Pro.
  • Cost-Efficiency and Sustainability: The project grew out of the author's desire to make a free AI-powered English learning tool sustainable by removing server costs; real-time operation previously required a high-end GPU such as an RTX 5090.
  • Key Technologies: It integrates Google's Gemma E2B for multimodal input processing and Kokoro for efficient text-to-speech generation, with LiteRT-LM and MLX/ONNX backends.
  • Real-time Interaction Features: Parlor supports hands-free operation with Voice Activity Detection (VAD), barge-in capabilities (interrupting the AI mid-sentence), and sentence-level TTS streaming for a highly responsive user experience.
  • Minimal Requirements: Python 3.12+, macOS (Apple Silicon) or Linux (supported GPU), and roughly 3 GB of RAM, so it runs on a wide range of modern hardware.
  • Performance Metrics: On an Apple M3 Pro, total end-to-end latency for a conversation turn (understanding, generation, TTS) is ~2.5 to 3.0 seconds, with a decode speed of ~83 tokens/sec.
  • Future Implications: The author sees this as a potential game-changer for applications like language learning, eventually running on mobile phones so users can converse with an AI about their surroundings in multiple languages. Billed as a research preview, it marks a step toward bringing sophisticated AI to the edge: locally hosted, private assistants free of recurring compute costs, reminiscent of early, ambitious OpenAI demonstrations.
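Parlor's actual source isn't quoted in the post, but the sentence-level TTS streaming the bullets describe can be sketched roughly: group the LLM's token stream into sentences and hand each one to TTS as soon as its boundary appears, instead of waiting for the full reply. The function name and token format here are illustrative assumptions, not Parlor's API.

```python
import re
from typing import Iterator

def sentence_chunks(tokens: Iterator[str]) -> Iterator[str]:
    """Group a stream of LLM tokens into sentences so TTS can start
    speaking before the full reply has been generated.
    (Illustrative sketch; not Parlor's actual implementation.)"""
    buffer = ""
    for tok in tokens:
        buffer += tok
        # Flush every complete sentence currently in the buffer.
        while True:
            m = re.search(r"[.!?](\s+|$)", buffer)
            if not m:
                break
            yield buffer[: m.end()].strip()
            buffer = buffer[m.end():]
    if buffer.strip():
        yield buffer.strip()  # trailing fragment without punctuation

# Each yielded sentence would be queued to the TTS engine immediately,
# cutting time-to-first-audio versus synthesizing the whole reply at once.
chunks = list(sentence_chunks(iter(["Hel", "lo. ", "How are ", "you?"])))
```

The same queue is what barge-in would flush when the user interrupts.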
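The barge-in behavior described above (interrupting the AI mid-sentence) amounts to a small state machine: if Voice Activity Detection fires while the assistant is speaking, cancel playback and return the turn to the user. The class below is a hypothetical sketch of that control flow, not Parlor's code.

```python
from dataclasses import dataclass, field

@dataclass
class BargeInController:
    """Minimal barge-in sketch: VAD activity while the assistant is
    speaking cancels TTS playback so the user can interrupt mid-sentence.
    (Class and event names are illustrative assumptions.)"""
    speaking: bool = False
    events: list = field(default_factory=list)

    def start_speaking(self) -> None:
        self.speaking = True
        self.events.append("tts_start")

    def on_vad(self, user_is_talking: bool) -> None:
        if user_is_talking and self.speaking:
            self.speaking = False
            self.events.append("tts_cancelled")  # stop audio, flush TTS queue
            self.events.append("listen")         # hand the turn back to the user
```

In a real pipeline the VAD callback would run on the microphone thread, so the cancellation path has to be cheap and non-blocking.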