A deep dive into SmallVector:push_back
This deep dive meticulously dissects the SmallVector::push_back implementation in LLVM, revealing subtle inefficiencies in how compilers handle growth paths due to callee-saved register spills. It details a clever tail-calling optimization that dramatically streamlines the fast path by moving the slow path out-of-line, reducing instructions and frame overhead. The post resonates with HN's technical audience by exposing the low-level mechanics of container performance and illustrating the complex interplay between C++ code, compilers, and generated assembly.
The Lowdown
This article provides an in-depth analysis of an optimization for SmallVector::push_back in LLVM, focusing on the assembly generated for "approximately trivially copyable" element types. SmallVector is a critical LLVM container, and push_back is a frequently executed operation, making its performance paramount.
- Initially, both Clang and GCC generated inefficient assembly for
push_back. The fast path (no reallocation) included callee-saved register spills and stack frame setup becausethisand the elementxhad to remain live across a conditionalgrow_podcall and a shared store instruction. - Standard compiler optimizations like "shrink wrapping" couldn't resolve this, as they don't duplicate blocks, and the live range of
this/xextended across the conditional jump. - The optimization introduced involves moving the
growAndPushBacklogic into a separate,LLVM_ATTRIBUTE_NOINLINEfunction and tail-calling it when reallocation is needed. This strategy ensures the fast path has optimal assembly, free of register spills and stack frames. - The optimization resulted in significant code size reductions (e.g.,
lld's.textsection shrank by 40,512 bytes) and a 0.41–0.51% reduction ininstructions:ufor clang builds, though it caused some minor binary size increases in outliers due to inliner behavior changes. - Comparisons with
std::vector,boost::container::small_vector, andabsl::InlinedVectorrevealed similar fast-path inefficiencies in these standard library and utility containers, often due to their approaches to handling slow paths or internal data representations. - A trade-off exists: while the tail-call optimization significantly improves single
push_backcalls, it can degrade performance in loops whereSmallVector's metadata (size, capacity) might be repeatedly reloaded from memory, unlikestd::vectorwhich might keep them in registers. - The post clarifies the definition of "approximately trivially copyable" types used by
SmallVector, which is broader thanis_trivially_copyableand allowsmemcpyfor types likestd::pair<int,int>. - It details
SmallVector's five-class hierarchy, explaining how each layer specializes for different concerns, ultimately leading to a more compact header (16 bytes) compared tostd::vector(24 bytes).
In essence, the analysis underscores that fast/slow path merges can introduce unexpected overheads, and a tail-called out-of-line slow path can be a potent optimization. However, such low-level changes can have non-monotonic effects on overall code size and performance due to their interaction with compiler inliners and register allocation strategies.