Even Faster Asin() Was Staring Right at Me
This article delves into micro-optimizing the asin() function, revealing how Estrin's Scheme can exploit instruction-level parallelism for speedups. It's a classic Hacker News story, complete with rigorous benchmarking across various platforms that sparks discussions on low-level performance, compiler behavior, and mathematical approximations.
The Lowdown
Following up on a previous post, the author revisits the quest for a faster asin() implementation, finding further optimization by restructuring polynomial evaluation.
- The core improvement applies Estrin's Scheme to a minimax polynomial approximation of `asin()`. This reordering allows the compiler and CPU to evaluate parts of the polynomial independently, reducing dependency-chain length and enabling instruction-level parallelism.
- Extensive benchmarking was performed across Intel, AMD, and Apple M4 CPUs, using various operating systems (Linux, Windows, macOS) and compilers (GCC, Clang, MSVC).
- Results show significant speedups (up to 1.88x over `std::asin()`) on older Intel chips and some benefit on Apple M4 with Clang, but negligible gains on AMD platforms.
- Real-world testing in a ray tracer demonstrated a modest 3% improvement on Intel, while the Apple M4 showed no practical change, highlighting that micro-optimizations may not translate proportionally to application-level performance.
- The author emphasizes the critical importance of diligent benchmarking, dispels the myth of simple LUT-based speedups for modern CPUs, and reminds readers that these are approximations, suitable for graphics but not all applications.
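To illustrate the restructuring the article describes, here is a minimal sketch contrasting Horner's method (a serial dependency chain) with Estrin's Scheme for a degree-7 polynomial. The coefficients are placeholders, not the article's actual minimax coefficients for `asin()`:

```cpp
#include <cassert>
#include <cmath>

// Horner's method: each step depends on the previous result,
// so the evaluation is one long serial chain of multiply-adds.
double horner(double x, const double c[8]) {
    double r = c[7];
    for (int i = 6; i >= 0; --i) r = r * x + c[i];
    return r;
}

// Estrin's Scheme for the same polynomial: adjacent terms are paired,
// and the pairs are combined with precomputed powers of x. The four
// (c[i] + c[i+1]*x) groups are independent of each other, so the CPU
// can evaluate them in parallel, shortening the dependency chain.
double estrin(double x, const double c[8]) {
    double x2 = x * x;
    double x4 = x2 * x2;
    double p01 = c[0] + c[1] * x;
    double p23 = c[2] + c[3] * x;
    double p45 = c[4] + c[5] * x;
    double p67 = c[6] + c[7] * x;
    double p03 = p01 + p23 * x2;   // c0 + c1*x + c2*x^2 + c3*x^3
    double p47 = p45 + p67 * x2;   // c4 + c5*x + c6*x^2 + c7*x^3
    return p03 + p47 * x4;         // full degree-7 polynomial
}
```

Both functions compute the same polynomial; only the evaluation order (and thus the opportunity for instruction-level parallelism) differs.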
Ultimately, the article serves as a testament to the continuous pursuit of performance through deep understanding of both algorithms and underlying hardware, underscoring that collaboration and reevaluation are key to finding better solutions.
The Gossip
Contextual Callbacks
Commenters quickly established that this article was a direct follow-up to a previous popular post by the same author, appreciating the continuation of the technical deep dive.
Constexpr Clarifications
A discussion emerged around the C++ `constexpr` keyword, specifically its role with local variables. The author clarified its usage, while others elaborated on its benefits for type-safe compile-time constants and semantic intent, even if the compiler might optimize constants anyway.
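A small sketch of the point raised in that thread, using a hypothetical coefficient (not from the article): `constexpr` gives a type-safe compile-time constant, and on a local variable it documents intent and fails to compile if the initializer is not actually a constant expression, even though the compiler would likely fold a plain `const` anyway:

```cpp
#include <cassert>
#include <cmath>

// Namespace-scope constexpr: a type-safe compile-time constant,
// unlike a #define macro. Placeholder value, not from the article.
constexpr double kC3 = 1.0 / 6.0;

double scaled_term(double x) {
    // constexpr on a local variable: guaranteed to be evaluated at
    // compile time, and usable in further constant expressions.
    constexpr double kC3Squared = kC3 * kC3;
    return kC3Squared * x;
}
```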
Historical Hacks & Horticultural Harmony
The discussion branched into the historical aspects of mathematical approximations, with one commenter sharing an approximation from 650 AD by Bhaskara and its link to the development of calculus. This also sparked a side debate on whether such optimizations compromise 'elegance and truth' in code, to which the author sought clarification.
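The approximation the commenter referenced is presumably Bhaskara I's well-known rational approximation of sine, valid for inputs in [0, π]; a sketch:

```cpp
#include <cmath>

// Bhaskara I's 7th-century sine approximation:
//   sin(x) ~= 16x(pi - x) / (5*pi^2 - 4x(pi - x)),  for x in [0, pi].
// Maximum absolute error is on the order of 0.0016.
double bhaskara_sin(double x) {
    const double pi = 3.14159265358979323846;
    double t = x * (pi - x);
    return 16.0 * t / (5.0 * pi * pi - 4.0 * t);
}
```

Like the article's minimax polynomial, it trades a small, bounded error for a much cheaper evaluation, which is the same tension behind the 'elegance and truth' debate.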