What happens when you run a CUDA kernel?

The article "What happens when you run a CUDA kernel?" provides an exhaustive, step-by-step walkthrough of the entire lifecycle of a basic CUDA vector addition program. It meticulously traces the execution path from the host CPU's call to the GPU, delving into the underlying compilation, driver interactions, hardware scheduling, and memory operations, ultimately demonstrating how a seemingly simple operation involves a complex orchestration of technologies.

Compilation Process: The nvcc compiler acts as a driver, orchestrating various tools like cicc (to PTX, a virtual ISA) and ptxas (to SASS, the device-specific assembly). The PTX is carried along for forward compatibility, allowing JIT compilation on unsupported architectures.
Host-GPU Communication Setup: A hidden constructor registers the compiled fatbinary with the CUDA runtime. The <<<...>>> launch syntax is replaced by a host stub that packs kernel arguments into a buffer. The libcuda.so.1 user-mode driver and nvidia.ko kernel-mode driver are dynamically loaded, creating a 'context' and channel for CPU-GPU communication.
Kernel Launch Mechanics: The CPU driver prepares the GPU by filling a 'pushbuffer' with 'methods' (GPU commands) and pointing a 'GPFIFO' entry to it. These methods include streaming the 'Queue Meta Data' (QMD), which contains launch configuration, kernel arguments, and the SASS code's address. The CPU then "rings the doorbell" (writes to a memory-mapped register) to alert the GPU's host engine to process the new work.
GPU Execution: The host engine passes the QMD to the compute work distributor, which assigns blocks to Streaming Multiprocessors (SMs). Each SM schedules warps (groups of 32 threads) across its sub-partitions, utilizing static stall counts, yield hints, and dependency barriers (scoreboards) embedded in the SASS instructions to hide latency.
Memory Operations: The SM's load/store unit coalesces memory requests from warps, efficiently fetching data from L1 cache, then L2 cache, and finally GDDR6X VRAM. Profiling reveals the sample kernel's execution is memory-bound due to low arithmetic intensity.
Result Retrieval: After the kernel completes, the GPU posts a completion semaphore. The cudaMemcpy operation then triggers the GPU's copy engine to DMA the results from the GPU's L2 cache (where STG.E stores kept it) back to the host CPU, where printf displays the output.
Appendix Insights: The author details methods for inspecting this low-level process, including LD_PRELOAD shims, decoding pushbuffer commands, analyzing QMD layouts, interpreting ioctl calls, and reverse-engineering SASS control words, highlighting the "legibility transition" enabled by modern tools and persistence.

This journey through the GPU's inner workings underscores the incredible complexity abstracted away by high-level CUDA programming, illustrating the sophisticated hardware and software layers that coordinate to achieve massive parallel computation.

What happens when you run a CUDA kernel?

The Lowdown