What happens when you run a CUDA kernel?
This post meticulously dissects the journey of a simple CUDA vector addition kernel, revealing the intricate dance between CPU and GPU hardware and software. It covers everything from nvcc compilation into PTX and SASS, through the driver's role in packaging a QMD and ringing the GPU's doorbell, to the Streaming Multiprocessor's warp scheduling and memory access patterns. For anyone who's ever wondered what truly happens "under the hood" when cudaLaunchKernel is called, this deep dive provides a rare and fascinating glimpse into the GPU's operational mechanics.
The Lowdown
The article "What happens when you run a CUDA kernel?" provides an exhaustive, step-by-step walkthrough of the entire lifecycle of a basic CUDA vector addition program. It meticulously traces the execution path from the host CPU's call to the GPU, delving into the underlying compilation, driver interactions, hardware scheduling, and memory operations, ultimately demonstrating how a seemingly simple operation involves a complex orchestration of technologies.
- Compilation Process: The
nvcccompiler acts as a driver, orchestrating various tools likecicc(to PTX, a virtual ISA) andptxas(to SASS, the device-specific assembly). The PTX is carried along for forward compatibility, allowing JIT compilation on unsupported architectures. - Host-GPU Communication Setup: A hidden constructor registers the compiled fatbinary with the CUDA runtime. The
<<<...>>>launch syntax is replaced by a host stub that packs kernel arguments into a buffer. Thelibcuda.so.1user-mode driver andnvidia.kokernel-mode driver are dynamically loaded, creating a 'context' and channel for CPU-GPU communication. - Kernel Launch Mechanics: The CPU driver prepares the GPU by filling a 'pushbuffer' with 'methods' (GPU commands) and pointing a 'GPFIFO' entry to it. These methods include streaming the 'Queue Meta Data' (QMD), which contains launch configuration, kernel arguments, and the SASS code's address. The CPU then "rings the doorbell" (writes to a memory-mapped register) to alert the GPU's host engine to process the new work.
- GPU Execution: The host engine passes the QMD to the compute work distributor, which assigns blocks to Streaming Multiprocessors (SMs). Each SM schedules warps (groups of 32 threads) across its sub-partitions, utilizing static stall counts, yield hints, and dependency barriers (scoreboards) embedded in the SASS instructions to hide latency.
- Memory Operations: The SM's load/store unit coalesces memory requests from warps, efficiently fetching data from L1 cache, then L2 cache, and finally GDDR6X VRAM. Profiling reveals the sample kernel's execution is memory-bound due to low arithmetic intensity.
- Result Retrieval: After the kernel completes, the GPU posts a completion semaphore. The
cudaMemcpyoperation then triggers the GPU's copy engine to DMA the results from the GPU's L2 cache (whereSTG.Estores kept it) back to the host CPU, whereprintfdisplays the output. - Appendix Insights: The author details methods for inspecting this low-level process, including
LD_PRELOADshims, decoding pushbuffer commands, analyzing QMD layouts, interpretingioctlcalls, and reverse-engineering SASS control words, highlighting the "legibility transition" enabled by modern tools and persistence.
This journey through the GPU's inner workings underscores the incredible complexity abstracted away by high-level CUDA programming, illustrating the sophisticated hardware and software layers that coordinate to achieve massive parallel computation.