Notes from Optimizing CPU-Bound Go Hot Paths

The author shares lessons from optimizing a Go port of Brotli and identifies recurring patterns where idiomatic Go abstractions hurt performance in CPU-bound hot paths. Five design choices stand out: each serves general development well but creates friction when every cycle counts.

Lack of Zero-Cost Abstractions: Go's generics, interfaces, and closures often prevent the compiler from inlining in hot loops, leading to significant performance penalties compared to concrete implementations. Unlike the full monomorphization of C++ and Rust, Go's GC shape stenciling shares one instantiation among types with the same GC shape and passes type information through a runtime dictionary, so method calls on type parameters are dispatched indirectly, much like interface calls. Getting full speed means duplicating code by hand.
Missing Intrinsics: Go lacks user-accessible CPU intrinsics for operations like prefetching or SIMD, which are crucial for high-performance computing. While some intrinsics exist internally, their unavailability to user code means developers must either accept slower Go code or resort to non-inlinable assembly functions.
Absence of //go:inline: Go provides //go:noinline but no explicit //go:inline directive. This asymmetry forces developers either to restructure functions until they fit within the compiler's heuristic inlining budget (a cost of 80), or to inline code by hand, which adds to the duplication.
No //go:nobounds: While Go performs bounds check elimination (BCE) where possible, there's no way to explicitly inform the compiler about known safe access patterns. This leads to performance overhead from unnecessary checks, pushing developers to use hints like _ = b[3] or unsafe operations.
Layout Tooling Deficiencies: CPU caches and branch predictors are sensitive to where code lands in memory, so an unrelated change can shift a hot loop and alter benchmark results. This makes it hard to measure and verify optimizations in Go. Unlike the C++ and Rust toolchains, Go's toolchain lacks advanced profiling and layout rearrangement tools, which introduces noise and uncertainty into performance measurements.

The author concludes that Go excels in IO-bound applications thanks to its strong standard library, package management, and concurrency support, but that CPU-bound work demands a different mindset. Optimizing Go hot paths often means giving up elegant abstractions in favor of code duplication, manual specialization, bounds-check elimination tricks, and sometimes hand-written assembly. As a result, fast Go code may not look idiomatic: it tends to feature large functions, duplicated loops, and APIs shaped around inlining and escape analysis.
