
Show HN: Real-time AI (audio/video in, voice out) on an M3 Pro with Gemma E2B

Parlor shows how to run real-time, multimodal AI entirely on-device, using Google's Gemma E2B model on an Apple M3 Pro. Running locally eliminates server costs and makes features like natural voice and vision conversations possible without cloud infrastructure. The project is gaining traction on HN for demonstrating efficient, self-hostable AI and for its implications for privacy and cost-effective deployment.

Score: 4
Comments: 0
Highest Rank: #8
On Front Page: 14h
First Seen: Apr 6, 5:00 AM
Last Seen: Apr 6, 6:00 PM
Rank Over Time: [rank chart omitted]

The Lowdown

Parlor is an open-source project demonstrating real-time, on-device multimodal AI: users hold natural voice and vision conversations with a model that runs entirely on their local machine. Google's Gemma E2B handles speech and vision understanding, while Kokoro provides text-to-speech. By eliminating the need for costly server infrastructure, the project aims to make this class of AI more private and accessible.

  • On-Device AI: The core innovation is running complex multimodal AI models, including speech and vision understanding, directly on a user's machine, specifically highlighting performance on an Apple M3 Pro.
  • Cost-Efficiency and Sustainability: The project grew out of the author's desire to make a free AI-powered English learning tool sustainable by removing server costs; real-time operation previously required a high-end GPU such as an RTX 5090.
  • Key Technologies: It integrates Google's Gemma E2B for multimodal input processing and Kokoro for efficient text-to-speech generation, with LiteRT-LM and MLX/ONNX backends.
  • Real-time Interaction Features: Parlor supports hands-free operation with Voice Activity Detection (VAD), barge-in capabilities (interrupting the AI mid-sentence), and sentence-level TTS streaming for a highly responsive user experience.
  • Minimal Requirements: Python 3.12+, macOS (Apple Silicon) or Linux (supported GPU), and roughly 3 GB of RAM, so it runs on a wide range of modern hardware.
  • Performance Metrics: On an Apple M3 Pro, total end-to-end latency for a conversation turn (understanding, generation, TTS) is ~2.5 to 3.0 seconds, with a decode speed of ~83 tokens/sec.
  • Future Implications: The author sees this as a potential game-changer for applications like language learning, eventually running on mobile phones so users can converse with an AI about their surroundings in multiple languages. Billed as a research preview, it marks a step toward bringing sophisticated AI to the edge: locally hosted, private assistants free of recurring compute costs, reminiscent of early, ambitious OpenAI demonstrations.
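Parlor's actual source isn't quoted in the post, but the sentence-level TTS streaming the bullets describe can be sketched roughly: group the LLM's token stream into sentences and hand each one to TTS as soon as its boundary appears, instead of waiting for the full reply. The function name and token format here are illustrative assumptions, not Parlor's API.

```python
import re
from typing import Iterator

def sentence_chunks(tokens: Iterator[str]) -> Iterator[str]:
    """Group a stream of LLM tokens into sentences so TTS can start
    speaking before the full reply has been generated.
    (Illustrative sketch; not Parlor's actual implementation.)"""
    buffer = ""
    for tok in tokens:
        buffer += tok
        # Flush every complete sentence currently in the buffer.
        while True:
            m = re.search(r"[.!?](\s+|$)", buffer)
            if not m:
                break
            yield buffer[: m.end()].strip()
            buffer = buffer[m.end():]
    if buffer.strip():
        yield buffer.strip()  # trailing fragment without punctuation

# Each yielded sentence would be queued to the TTS engine immediately,
# cutting time-to-first-audio versus synthesizing the whole reply at once.
chunks = list(sentence_chunks(iter(["Hel", "lo. ", "How are ", "you?"])))
```

The same queue is what barge-in would flush when the user interrupts.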
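The barge-in behavior described above (interrupting the AI mid-sentence) amounts to a small state machine: if Voice Activity Detection fires while the assistant is speaking, cancel playback and return the turn to the user. The class below is a hypothetical sketch of that control flow, not Parlor's code.

```python
from dataclasses import dataclass, field

@dataclass
class BargeInController:
    """Minimal barge-in sketch: VAD activity while the assistant is
    speaking cancels TTS playback so the user can interrupt mid-sentence.
    (Class and event names are illustrative assumptions.)"""
    speaking: bool = False
    events: list = field(default_factory=list)

    def start_speaking(self) -> None:
        self.speaking = True
        self.events.append("tts_start")

    def on_vad(self, user_is_talking: bool) -> None:
        if user_is_talking and self.speaking:
            self.speaking = False
            self.events.append("tts_cancelled")  # stop audio, flush TTS queue
            self.events.append("listen")         # hand the turn back to the user
```

In a real pipeline the VAD callback would run on the microphone thread, so the cancellation path has to be cheap and non-blocking.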