Faster Asin() Was Hiding in Plain Sight
A ray tracing enthusiast's quest for a faster asin() function leads through custom Taylor series and Padé approximants, only to discover a decades-old, superior solution via an LLM prompt. This tale of hidden optimization, found lurking in obscure documentation, underscores the importance of thorough research and the surprising utility of AI in unearthing legacy knowledge. It's a prime example of performance gains "hiding in plain sight," a recurring theme beloved by the HN crowd.
The Lowdown
The author, a ray tracing enthusiast, embarked on a mission to optimize the asin() function, a frequently called trigonometric operation in their PSRayTracing project. Their journey began with developing custom Taylor series and Padé approximants, which, while offering modest speedups and improved accuracy over simple approximations, ultimately required complex workarounds for edge cases.
- Initial efforts involved hand-rolling Taylor series approximations, yielding a ~5% speedup but requiring a fallback to `std::asin()` for values outside a narrow range.
- Padé approximants improved accuracy but didn't significantly boost performance over the Taylor series, even with half-angle transformations for edge correction.
- The breakthrough came unexpectedly: an LLM, asked for a fast `asin()` approximation, pointed to a solution from Nvidia's Cg Toolkit, based on a 1960s mathematical formula.
- This Cg approximation proved highly accurate and delivered substantial speedups, particularly on Intel CPUs (up to ~1.9x faster), though much less so on Apple M4 systems.
- The author reflects on the irony of an AI unearthing a well-established, optimized solution that had been overlooked despite extensive personal effort and public discussion of their project.
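The Cg Toolkit approximation mentioned above traces back to a polynomial from Abramowitz & Stegun's 1960s *Handbook of Mathematical Functions* (eq. 4.4.45). A C++ rendering of the Cg standard library version looks roughly like this (coefficients as published in the Cg documentation; the exact form in the author's codebase may differ):

```cpp
#include <cmath>

// Sketch of the Cg standard library's asin() approximation, built on the
// Abramowitz & Stegun polynomial (eq. 4.4.45). Evaluates a cubic in |x|
// via Horner's method, then maps through pi/2 - sqrt(1 - x) * poly(x).
float asin_cg(float x) {
    const float negate = (x < 0.0f) ? 1.0f : 0.0f;
    x = std::fabs(x);
    float ret = -0.0187293f;
    ret = ret * x + 0.0742610f;
    ret = ret * x - 0.2121144f;
    ret = ret * x + 1.5707288f;
    ret = 1.57079632679f - std::sqrt(1.0f - x) * ret;
    return ret - 2.0f * negate * ret;  // flip sign for negative inputs
}
```

The `sqrt(1 - x)` factor is what tames the infinite derivative of `asin` at the endpoints, which is exactly where the author's Taylor series fell apart.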
The story serves as a valuable lesson on the critical need for comprehensive research into existing solutions before committing to custom development, highlighting how even optimal answers can lie dormant in old, specialized documentation.
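For contrast, the hand-rolled Taylor-series-with-fallback approach described above might look like the following sketch (truncation order and the 0.5 cutoff are illustrative assumptions, not the author's exact code):

```cpp
#include <cmath>

// Illustrative sketch of a truncated Taylor series for asin(x):
//   asin(x) = x + x^3/6 + 3x^5/40 + 15x^7/336 + ...
// The series converges poorly as |x| approaches 1, hence the fallback
// to std::asin() outside a narrow range around zero.
float asin_taylor(float x) {
    if (std::fabs(x) > 0.5f) {
        return std::asin(x);  // fallback where the series loses accuracy
    }
    const float x2 = x * x;
    // Horner-style evaluation of the first four series terms.
    return x * (1.0f + x2 * (1.0f / 6.0f
                   + x2 * (3.0f / 40.0f
                   + x2 * (15.0f / 336.0f))));
}
```

The branch on every call is part of why this route topped out at a modest speedup: inputs outside the narrow range still pay full library-call cost.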
The Gossip
Platform Performance Paradox
Commenters noted the significant performance disparity of the `asin_cg` approximation between Intel (up to 1.9x faster) and Apple M4 (barely 1.02x faster). This led to speculation that Apple's `libm` implementation likely already employs sophisticated polynomial approximations, similar to the discovered one, tuned for its hardware. The discussion also highlighted how such approximations have been standard in shader math and game development for decades, implying a knowledge silo between different programming domains.
Legacy Knowledge and Optimization Lore
Many comments resonated with the author's experience of 'hidden in plain sight' optimizations. Parallels were drawn to other famous examples, such as Quake III Arena's fast inverse square root, which also originated from obscure library code. This sparked a discussion on how valuable algorithms and mathematical tools, like the minimax approximation and Remez algorithm, can languish in specialized fields or old documentation, often rediscovered by new generations of programmers.
Approximation Alternatives and Trade-offs
The discussion branched into alternative methods for function approximation and their respective trade-offs. SIMD (Single Instruction, Multiple Data) was suggested as a potentially even faster, orthogonal improvement. The utility of Look-Up Tables (LUTs) was debated, with some arguing for their L1 cache benefits for certain precision requirements, while others cautioned against their memory access overhead potentially outweighing arithmetic computation for pure calculations.
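The LUT trade-off debated above can be made concrete with a small sketch (table size and linear interpolation are illustrative choices, not anything from the thread):

```cpp
#include <array>
#include <cmath>

// Illustrative LUT sketch: precompute asin() at N+1 evenly spaced points
// over [-1, 1], answer queries by linear interpolation. A ~4 KB table can
// stay L1-resident, but every call still pays a memory access, and linear
// interpolation degrades near +/-1 where asin's slope blows up.
constexpr int N = 1024;

static const std::array<float, N + 1> kAsinTable = [] {
    std::array<float, N + 1> t{};
    for (int i = 0; i <= N; ++i) {
        t[i] = std::asin(-1.0f + 2.0f * i / N);  // sample x_i in [-1, 1]
    }
    return t;
}();

float asin_lut(float x) {
    const float pos = (x + 1.0f) * 0.5f * N;  // map [-1, 1] -> [0, N]
    int i = static_cast<int>(pos);
    if (i >= N) i = N - 1;                    // clamp so i+1 stays in bounds
    const float frac = pos - static_cast<float>(i);
    return kAsinTable[i] + frac * (kAsinTable[i + 1] - kAsinTable[i]);
}
```

In a tight arithmetic kernel like a ray tracer's shading loop, a short polynomial such as the Cg one often wins anyway: it uses only registers and vectorizes cleanly, while a LUT lookup serializes on the load.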