
A tail-call interpreter in (nightly) Rust

A seasoned developer delves into Rust's bleeding-edge become keyword, crafting a Uxn CPU interpreter that leverages guaranteed tail calls in pursuit of hand-written-assembly performance. This technical deep dive pits Rust against hand-written assembly, uncovering surprising compiler optimizations and perplexing performance regressions across architectures. It's a masterclass in low-level systems programming and the dark arts of interpreter optimization.

Score: 11
Comments: 0
Highest Rank: #4
On Front Page: 4h
First Seen: Apr 5, 4:00 PM
Last Seen: Apr 5, 7:00 PM

The Lowdown

Matt Keeter details his latest endeavor in high-performance Uxn CPU emulation, showcasing an interpreter built in nightly Rust utilizing the recently introduced become keyword for tail-call optimization. This work is a continuation of his multi-part series exploring interpreter performance, aiming to achieve assembly-level speed within the safety of Rust.

  • The Uxn CPU Basics: Uxn is a simple 256-instruction stack machine with limited memory, often emulated with a basic loop and a match statement for opcode dispatch.
  • Performance Bottlenecks: Prior Rust implementations suffered from stack-related issues and unpredictable branch behavior, while hand-written ARM64 assembly achieved significant speedups (40-50% faster) but was unsafe and complex to maintain.
  • Tail-Call Interpreter Concept: Inspired by the Massey Meta Machine, the core idea is to store VM state in function arguments (mapping to registers) and use tail calls to jump directly to the next instruction without growing the call stack.
  • Rust's become Keyword: The new become keyword in nightly Rust provides the necessary guarantee for true tail-call optimization, replacing the caller's stack frame and preventing stack overflows.
  • Implementation & Safety: The author developed a complex macro to reduce boilerplate in defining the opcode functions, and notably achieved all of this in 100% safe Rust.
  • Performance Results on ARM64: On M1 Macs, the tail-call interpreter outperformed both the previous Rust VM and the hand-written ARM64 assembly (e.g., Mandelbrot: 76ms vs 125ms VM vs 87ms Assembly).
  • Performance Results on x86-64: Performance was mixed; while better than the old Rust VM, it still lagged behind the hand-written x86 assembly (Mandelbrot: 175ms vs 168ms Assembly). This was attributed to "real bad codegen" from the Rust compiler, involving excessive register spills and restores.
  • WebAssembly Performance Woes: The tail-call approach performed significantly worse than the traditional VM in WebAssembly environments (Firefox, Chrome, wasmtime), indicating that current WASM JITs struggle to optimize this pattern effectively.
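To make the dispatch styles above concrete, here is a minimal sketch of the classic loop-plus-match interpreter the post uses as its baseline. The three opcodes (BRK, LIT, ADD) follow Uxn's encoding, but the tiny VM itself is illustrative, not the post's actual implementation.

```rust
// Classic dispatch: one loop, one big `match` over opcode bytes.
// Every iteration goes through the same indirect branch, which is
// part of what the post identifies as a performance bottleneck.
fn run(program: &[u8]) -> Vec<u8> {
    let mut stack: Vec<u8> = Vec::new();
    let mut pc = 0usize;
    while pc < program.len() {
        let op = program[pc];
        pc += 1;
        match op {
            0x00 => break, // BRK: halt
            0x80 => {
                // LIT: push the next program byte onto the stack
                stack.push(program[pc]);
                pc += 1;
            }
            0x18 => {
                // ADD: pop two values, push their wrapping sum
                let b = stack.pop().unwrap();
                let a = stack.pop().unwrap();
                stack.push(a.wrapping_add(b));
            }
            _ => panic!("unknown opcode {op:#04x}"),
        }
    }
    stack
}

fn main() {
    // LIT 2, LIT 3, ADD, BRK
    let out = run(&[0x80, 2, 0x80, 3, 0x18, 0x00]);
    println!("{out:?}");
}
```

A real Uxn emulator handles all 256 opcodes plus a return stack and device I/O; this sketch only shows the dispatch shape.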
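The tail-call alternative restructures this so that VM state travels in function arguments and each handler ends by calling the next dispatch step. The sketch below uses plain calls, which stable Rust may or may not compile into true tail calls; on nightly, the marked calls would use `become` instead of an ordinary call, guaranteeing the caller's frame is replaced. Handler names and the reduced opcode set are illustrative assumptions, not the post's code.

```rust
// Tail-call dispatch shape: state (program, pc, stack) is threaded
// through arguments so it can live in registers, and each opcode
// handler finishes by handing control straight to the next dispatch.
fn dispatch(program: &[u8], pc: usize, stack: Vec<u8>) -> Vec<u8> {
    match program.get(pc).copied() {
        None | Some(0x00) => stack, // BRK or end of program: done
        Some(0x80) => op_lit(program, pc + 1, stack), // nightly: `become op_lit(...)`
        Some(0x18) => op_add(program, pc + 1, stack), // nightly: `become op_add(...)`
        Some(op) => panic!("unknown opcode {op:#04x}"),
    }
}

fn op_lit(program: &[u8], pc: usize, mut stack: Vec<u8>) -> Vec<u8> {
    stack.push(program[pc]);
    dispatch(program, pc + 1, stack) // nightly: `become dispatch(...)`
}

fn op_add(program: &[u8], pc: usize, mut stack: Vec<u8>) -> Vec<u8> {
    let b = stack.pop().unwrap();
    let a = stack.pop().unwrap();
    stack.push(a.wrapping_add(b));
    dispatch(program, pc, stack) // nightly: `become dispatch(...)`
}

fn main() {
    // LIT 2, LIT 3, ADD, BRK
    println!("{:?}", dispatch(&[0x80, 2, 0x80, 3, 0x18, 0x00], 0, Vec::new()));
}
```

Note that all handlers share one signature with `dispatch`, which also matches the constraint `become` currently imposes: caller and callee must have identical signatures so the stack frame can be reused in place.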

The tail-call interpreter has been integrated into the Raven project, becoming the default for ARM64 and an optional backend for x86-64. The author remains keen on further optimizing x86 and WebAssembly performance, highlighting the ongoing challenges and intricacies of pushing Rust to its performance limits.