A deep dive into SmallVector::push_back

This article provides an in-depth analysis of an optimization for SmallVector::push_back in LLVM, focusing on the assembly generated for "approximately trivially copyable" element types. SmallVector is a critical LLVM container, and push_back is a frequently executed operation, making its performance paramount.

Initially, both Clang and GCC generated inefficient assembly for push_back. The fast path (no reallocation) included callee-saved register spills and stack frame setup because this and the element x had to remain live across a conditional grow_pod call and a shared store instruction.
Standard compiler optimizations like "shrink wrapping" couldn't resolve this, as they don't duplicate blocks, and the live range of this/x extended across the conditional jump.
The optimization introduced involves moving the growAndPushBack logic into a separate, LLVM_ATTRIBUTE_NOINLINE function and tail-calling it when reallocation is needed. This strategy ensures the fast path has optimal assembly, free of register spills and stack frames.
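
The shape of that split can be sketched with a toy container. This is a minimal illustration, not LLVM's code: ToyVector, TOY_NOINLINE (a stand-in for LLVM_ATTRIBUTE_NOINLINE), the doubling growth policy, and the helper pushRangeAndSum are all hypothetical names and choices made for the example, and the sketch is heap-only with no inline storage.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <type_traits>

// Hypothetical stand-in for LLVM_ATTRIBUTE_NOINLINE (assumption, not LLVM's macro).
#if defined(__GNUC__) || defined(__clang__)
#define TOY_NOINLINE __attribute__((noinline))
#else
#define TOY_NOINLINE
#endif

// Toy container for illustration only: heap-only, memcpy-able elements.
template <typename T> class ToyVector {
  static_assert(std::is_trivially_copyable<T>::value,
                "this sketch only handles memcpy-able element types");
  T *Begin = nullptr;
  std::uint32_t Size = 0, Capacity = 0;

  // Slow path kept out of line: grow the buffer, then store the element.
  TOY_NOINLINE void growAndPushBack(T X) {
    std::uint32_t NewCap = Capacity ? Capacity * 2 : 4; // assumed growth policy
    void *NewBegin = std::realloc(Begin, NewCap * sizeof(T));
    if (!NewBegin)
      std::abort();
    Begin = static_cast<T *>(NewBegin);
    Capacity = NewCap;
    std::memcpy(Begin + Size, &X, sizeof(T));
    ++Size;
  }

public:
  ToyVector() = default;
  ToyVector(const ToyVector &) = delete;
  ToyVector &operator=(const ToyVector &) = delete;
  ~ToyVector() { std::free(Begin); }

  void push_back(T X) {
    // The slow-path call is in tail position, so nothing has to stay live
    // across it and the compiler may emit a plain jump.
    if (Size >= Capacity)
      return growAndPushBack(X);
    // Fast path: a store and a size increment, with no call in between.
    std::memcpy(Begin + Size, &X, sizeof(T));
    ++Size;
  }

  std::uint32_t size() const { return Size; }
  const T &operator[](std::uint32_t I) const {
    assert(I < Size && "index out of range");
    return Begin[I];
  }
};

// Hypothetical helper: pushes 0..N-1 and returns the sum of the elements.
inline long pushRangeAndSum(int N) {
  ToyVector<int> V;
  for (int I = 0; I < N; ++I)
    V.push_back(I);
  long Sum = 0;
  for (std::uint32_t I = 0; I < V.size(); ++I)
    Sum += V[I];
  return Sum;
}
```

The point of the design is that the store is duplicated into both paths rather than shared after the conditional call. Whether a given compiler actually turns the slow-path call into a jump depends on the optimization level and target.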
The optimization resulted in significant code size reductions (e.g., lld's .text section shrank by 40,512 bytes) and a 0.41–0.51% reduction in instructions:u for clang builds, though it caused some minor binary size increases in outliers due to inliner behavior changes.
Comparisons with std::vector, boost::container::small_vector, and absl::InlinedVector revealed similar fast-path inefficiencies, often due to how these containers handle their slow paths or represent their internal data.
A trade-off exists: while the tail-call optimization significantly improves single push_back calls, it can degrade performance in loops where SmallVector's metadata (size, capacity) might be repeatedly reloaded from memory, unlike std::vector which might keep them in registers.
The post clarifies the definition of "approximately trivially copyable" types used by SmallVector, which is broader than is_trivially_copyable and allows memcpy for types like std::pair<int,int>.
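
One way to express such a relaxed check is sketched below. The trait name IsApproxTriviallyCopyable is hypothetical, and the three conditions (trivial copy construction, trivial move construction, trivial destruction) reflect my understanding of the approach in LLVM's SmallVector.h rather than a quote of it.

```cpp
#include <string>
#include <type_traits>
#include <utility>

// Hypothetical trait name. The idea: a type that is trivially copy/move
// constructible and trivially destructible cannot observe being moved
// around with memcpy, even if its assignment operators are non-trivial.
template <typename T>
struct IsApproxTriviallyCopyable
    : std::integral_constant<
          bool, std::is_trivially_copy_constructible<T>::value &&
                    std::is_trivially_move_constructible<T>::value &&
                    std::is_trivially_destructible<T>::value> {};

// std::pair<int, int> passes the relaxed check: its copy and move
// constructors are defaulted and its destructor is trivial.
static_assert(IsApproxTriviallyCopyable<std::pair<int, int>>::value,
              "pair<int, int> qualifies under the relaxed check");

// std::string owns a resource, so it fails on destruction alone.
static_assert(!IsApproxTriviallyCopyable<std::string>::value,
              "std::string must take the non-memcpy path");
```

The relaxed check deliberately ignores assignment operators, which is what lets std::pair through; std::pair's assignment is not trivial in common standard library implementations, so the strict std::is_trivially_copyable typically rejects it.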
The post also details SmallVector's five-class hierarchy, explaining how each layer specializes for a different concern, which ultimately yields a more compact header (16 bytes) than std::vector's (24 bytes).
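
The size difference follows from the field layout, which can be sketched as below. Both structs are illustrative assumptions: CompactHeader models a pointer plus two 32-bit counters, and ThreePointerHeader models the three-pointer representation that common std::vector implementations use. Neither is copied from a real header, and the 16 and 24 byte figures assume a 64-bit target.

```cpp
#include <cstdint>

// Assumed layout for a compact header: one pointer and two 32-bit counters.
// On a 64-bit target this is 8 + 4 + 4 = 16 bytes.
struct CompactHeader {
  void *BeginX;
  std::uint32_t Size;
  std::uint32_t Capacity;
};

// Assumed layout modelling a typical std::vector: three pointers.
// On a 64-bit target this is 3 * 8 = 24 bytes.
struct ThreePointerHeader {
  int *Begin;
  int *End;
  int *EndOfStorage;
};

static_assert(sizeof(CompactHeader) ==
                  sizeof(void *) + 2 * sizeof(std::uint32_t),
              "pointer plus two 32-bit fields, no padding expected");
static_assert(sizeof(ThreePointerHeader) == 3 * sizeof(void *),
              "three pointers");
```

The trade-off in the compact form is that size and capacity are limited to 32 bits, in exchange for 8 fewer bytes per container object.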

In essence, the analysis underscores that fast/slow path merges can introduce unexpected overheads, and a tail-called out-of-line slow path can be a potent optimization. However, such low-level changes can have non-monotonic effects on overall code size and performance due to their interaction with compiler inliners and register allocation strategies.

The Lowdown