Tracking down a 25% Regression on LLVM RISC-V

A blog post walks through finding and fixing a performance regression in LLVM's code generation for RISC-V targets. The regression made LLVM emit less efficient code than GCC and cost nearly 25% in additional cycles on one benchmark.

The problem was first identified when benchmarking LLVM against GCC on a SiFive P550 CPU, where LLVM showed an ~8% higher cycle count.
Assembly analysis revealed that, in a critical loop, LLVM was emitting fdiv.d (double-precision floating-point division, 33-cycle latency) where GCC and older LLVM builds emitted fdiv.s (single-precision, 19-cycle latency).
Using llvm-mca and comparing LLVM IR at different optimization stages, the author pinpointed that the middle-end optimization pipeline was failing to narrow a double-precision calculation to single-precision.
The root cause was a recent LLVM commit (190235) that improved isKnownExactCastIntToFP to fold certain fpext operations. Although an improvement on its own, the change removed an intermediate fpext instruction that the downstream visitFPTrunc logic relied on to narrow the computation to single precision.
The fix involved extending getMinimumFPType with range analysis and introducing canBeCastedExactlyIntToFP to allow visitFPTrunc to recognize and perform the necessary narrowing optimization even without the explicit fpext instruction.
The successful patch (190550) restored the optimization, eliminating the fdiv.d instruction and resulting in a 25% performance improvement for the benchmark.
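
The narrowing described above is only legal when the integer operand converts to single precision without rounding. The sketch below illustrates that condition in Python; it is not LLVM's implementation. The helper names (to_f32, int_fits_exactly_in_f32, range_fits_exactly_in_f32) are hypothetical, and the range check is a deliberately conservative stand-in for the kind of range analysis the fix is said to add.

```python
import struct

# Single precision has a 24-bit significand: 23 stored bits plus the implicit leading 1.
FLOAT32_SIGNIFICAND_BITS = 24


def to_f32(x: float) -> float:
    """Round a Python float (an IEEE double) to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def int_fits_exactly_in_f32(n: int) -> bool:
    """Return True if the integer n survives a round trip through single precision."""
    return to_f32(float(n)) == n


def range_fits_exactly_in_f32(lo: int, hi: int) -> bool:
    """Conservative range check (hypothetical helper, not an LLVM API).

    Every integer whose magnitude is at most 2**24 is exactly representable
    in single precision, so a value known to lie in [lo, hi] converts exactly
    whenever both bounds stay within that limit. Ranges beyond the limit are
    rejected even if some individual values in them would still be exact.
    """
    limit = 2 ** FLOAT32_SIGNIFICAND_BITS
    return max(abs(lo), abs(hi)) <= limit


if __name__ == "__main__":
    print(int_fits_exactly_in_f32(2 ** 24))       # largest bound of the contiguous exact range
    print(int_fits_exactly_in_f32(2 ** 24 + 1))   # needs 25 significant bits
    print(range_fits_exactly_in_f32(-1000, 1000))
```

When such a check succeeds, the integer-to-double conversion can be replaced by an integer-to-float conversion without changing the value, which is what lets the division itself be performed in single precision.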

The investigation shows how tightly compiler optimizations depend on one another: an improvement to one transform can remove a pattern that another relies on and cause a regression elsewhere.

The Lowdown