AVX2 is slower than SSE2-4.x under Windows ARM emulation
This deep technical dive benchmarks AVX2 and SSE2-4.x performance under Windows ARM emulation, revealing a surprising slowdown for AVX2. The author was "nerdsniped" by his own curiosity, meticulously comparing real hardware to emulated environments. For developers, the clear takeaway is that relying on emulation for advanced instruction sets can degrade performance, making native ARM compilation essential for speed.
The Lowdown
The author, David Millington, undertook a detailed benchmarking project after noticing unexpected performance in his math libraries on Windows ARM emulation. His goal was to compare the execution speed of AVX2-optimized code against older SSE2-4.x code when both run under emulation on Windows 11 ARM. Contrary to what many might expect, AVX2 was not faster; it was measurably slower.
- Windows 11 on ARM runs x86/x64 applications through an emulation layer called Prism, which Microsoft has marketed for its performance improvements.
- The x64 architecture has evolved through defined instruction set levels: x64 v2 covers the SSE family up to SSE4.2 (the "SSE2-4.x" of the title), while x64 v3 adds AVX, AVX2 and FMA.
- The benchmark involved running 21 different math operations, heavily vectorized and using 256-bit wide AVX2 instructions, on both native x64 hardware (Intel i7) and an Apple M2 running Windows 11 ARM via Parallels.
- On native Intel hardware, AVX2 code was, as anticipated, 2.7 times faster than SSE2-4.x code.
- However, when emulated on the ARM platform, the AVX2 code performed significantly worse, running at only 2/3 the speed of the SSE2-4.x code. In other words, the same work took about 1.5 times as long.
- Potential reasons include each 256-bit AVX2 operation having to be translated into two 128-bit NEON operations on ARM, Prism's AVX2 emulation being relatively new and possibly less optimized, or optimizations that apply only to particular data types or CPU architectures.
- The conclusion for developers is stark: if performance is a priority, applications should be compiled natively for ARM, rather than relying on x64 emulation, especially for code utilizing advanced instruction sets like AVX2.
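The first of the possible explanations above, a 256-bit AVX2 operation being split into two 128-bit NEON operations, can be pictured with a small model. The sketch below is purely illustrative: it uses plain Python lists in place of SIMD registers, the function names `add_128` and `add_256_as_two_128` are hypothetical, and it makes no claim about how Prism actually translates instructions. It only shows why an 8-lane float add needs two 4-lane adds when the widest available register is 128 bits.

```python
# Illustrative model only: lanes are plain Python floats, not real SIMD registers.
# Names and structure are assumptions for this sketch, not Prism's implementation.

LANES_128 = 4  # four 32-bit floats fit in one 128-bit register (SSE, NEON)
LANES_256 = 8  # eight 32-bit floats fit in one 256-bit register (AVX2)


def add_128(a, b):
    """Model one 128-bit vector add: four lanes added element-wise."""
    assert len(a) == len(b) == LANES_128
    return [x + y for x, y in zip(a, b)]


def add_256_as_two_128(a, b):
    """Model one 256-bit add carried out as two 128-bit adds.

    Returns the 8-lane result and the number of 128-bit operations used.
    """
    assert len(a) == len(b) == LANES_256
    low = add_128(a[:LANES_128], b[:LANES_128])    # lanes 0..3
    high = add_128(a[LANES_128:], b[LANES_128:])   # lanes 4..7
    return low + high, 2


a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
b = [10.0] * LANES_256
result, ops_128 = add_256_as_two_128(a, b)
print(result)   # eight summed lanes
print(ops_128)  # number of 128-bit adds the model needed
```

In this model the split alone would leave AVX2 no faster than 128-bit code, since the lane count per operation ends up the same. It would not by itself explain AVX2 running slower, which is consistent with the summary listing emulator maturity as a separate possible factor.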