GPU Memory and Interconnect Latencies
Hardware accelerators such as Graphics Processing Units (GPUs) have latency profiles shaped by their design for massive parallelism. Accessing a GPU's shared memory takes about 30 ns (roughly 30 to 90 cycles; bank conflicts add latency), while accessing its much larger global memory is slower, at approximately 200 ns (200 to 800 cycles). Transferring 1 MB of data to or from the GPU takes roughly 30 μs over a dedicated high-speed interconnect such as NVLink (about 33 GB/s), compared with roughly 80 μs over a standard PCIe 3.0 x16 link (about 12 GB/s). In addition, launching a CUDA kernel from the host CPU incurs an overhead of about 10 μs.
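The transfer times follow from dividing the payload size by the link bandwidth. The sketch below illustrates that arithmetic in Python. It assumes the bandwidth figures from the D2L latency table (about 33 GB/s for NVLink and about 12 GB/s for PCIe 3.0 x16) and a decimal megabyte. The function name `transfer_time_us` is hypothetical, and the model ignores per-transfer setup latency, so the results are order-of-magnitude estimates rather than measurements.

```python
# Back-of-the-envelope estimate: transfer time = payload size / bandwidth.
# All figures are assumed, order-of-magnitude values, not measurements.

MB = 1e6  # bytes in one (decimal) megabyte


def transfer_time_us(num_bytes: float, bandwidth_gb_per_s: float) -> float:
    """Return the ideal transfer time in microseconds, ignoring setup latency."""
    bytes_per_second = bandwidth_gb_per_s * 1e9
    return num_bytes / bytes_per_second * 1e6


nvlink_us = transfer_time_us(1 * MB, 33.0)  # assumed ~33 GB/s over NVLink
pcie_us = transfer_time_us(1 * MB, 12.0)    # assumed ~12 GB/s over PCIe 3.0 x16

print(f"NVLink: {nvlink_us:.1f} us, PCIe: {pcie_us:.1f} us")
# prints "NVLink: 30.3 us, PCIe: 83.3 us"
```

Both estimates exceed the roughly 10 μs cost of a kernel launch, which suggests that for small payloads the fixed launch and transfer overheads, not the computation itself, can dominate end-to-end time.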
Updated 2026-05-18
Tags
D2L
Dive into Deep Learning @ D2L