Microbenchmarking Chipsets for Giggles
Chipsets have evolved from performance powerhouses to peripheral I/O hubs, but what happens when you prod them for giggles? This deep technical dive meticulously benchmarks GPU-to-host memory latency across a spectrum of AMD and Intel platforms, revealing that despite their diminished performance role, chipsets still add latency penalties of several hundred nanoseconds and affect cache behavior. It's a low-level hardware exploration for the truly curious, shedding light on the often-overlooked bottlenecks in modern system architectures.
The Lowdown
In a world where CPUs increasingly integrate critical functions, motherboard chipsets have gracefully receded, largely handling less performance-sensitive I/O. Yet, the question of their precise impact on system latency persists. This article takes a deep, 'for giggles' dive into microbenchmarking various chipset architectures, quantifying their effect on GPU-to-host memory access.
- Benchmarking Approach: A modified Vulkan-based GPU benchmark was employed to measure latency for GPU access to host memory via PCIe, using an Nvidia T1000 card across diverse AMD and Intel systems.
- AMD Zen 5 Findings: Direct CPU PCIe exhibited ~650 ns latency. Routing through a single PROM21 chip (B650) added ~570 ns, while two PROM21 chips (X670E) incurred an even heftier ~921 ns penalty. Chipset paths also drastically cut GPU cache hit bandwidth to roughly 25 GB/s.
- Intel Arrow Lake Results: CPU PCIe delivered a ~785 ns baseline. The Platform Controller Hub (PCH) added around 550 ns, aligning with single-chipset penalties on AMD. Anomalous T1000 cache behavior was observed, though other GPUs maintained full cache hit bandwidth.
- Intel Skylake Z170 Insights: Showcased a superior ~535 ns baseline via CPU lanes. Its PCH added just 338 ns, notably lower than any of the modern platforms tested. PCH usage reduced cache hit bandwidth to 51 GB/s.
- Legacy AMD 990X Architecture: Northbridge PCIe provided a competitive ~770 ns latency. The southbridge added 602 ns, still outperforming modern dual-chipset setups. Cache hit bandwidth was substantial even through the southbridge, far exceeding platform I/O.
- Coherency & Probes: The use of `VK_MEMORY_PROPERTY_HOST_COHERENT_BIT` induced significant probe traffic, with observed T1000 behavior suggesting 512-byte probes instead of the expected 64-byte cacheline probes, highlighting underlying hardware mysteries.
- CPU-to-VRAM Correlation: Supplementary tests measuring CPU access to GPU VRAM largely mirrored the GPU-to-host memory results, reinforcing the conclusions about chipset latency.
This rigorous microbenchmark reveals that while chipsets are no longer performance bottlenecks in the traditional sense, they consistently introduce hundreds of nanoseconds of latency and can constrain bandwidth when bridging the CPU to other PCIe devices. The findings suggest that future chipset evolution will likely prioritize cost efficiency and I/O connectivity for less latency-sensitive peripherals, rather than optimizing for the vanishingly small number of use cases where nanosecond precision across the chipset truly matters.