Concept

GPU Memory and Interconnect Latencies

Hardware accelerators such as Graphics Processing Units (GPUs) have latency profiles shaped by their design for massive parallelism. Accessing a GPU's shared memory takes about 30 ns (roughly 30 to 90 cycles, more when bank conflicts occur), while accessing its much larger global memory is slower, at approximately 200 ns (200 to 800 cycles). Transferring 1 MB of data to or from the GPU takes roughly 30 μs over a dedicated high-speed interconnect such as NVLink (~33 GB/s), compared with roughly 80 μs over a standard PCIe 3.0 x16 link (~12 GB/s). In addition, launching a CUDA kernel from the host CPU incurs an overhead of about 10 μs.
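The transfer times follow from the quoted bandwidths, and a quick back-of-the-envelope calculation makes the relationship concrete. The sketch below is only illustrative: it assumes decimal units (1 MB = 10^6 bytes, 1 GB = 10^9 bytes), treats the quoted bandwidths as sustained rates, and ignores any fixed per-transfer setup cost. The helper name `transfer_time_us` is hypothetical, not part of any library.

```python
# Back-of-the-envelope check of the latency figures quoted in the text.
# Assumptions: decimal units, sustained bandwidth, no per-transfer setup cost.
MB = 1e6  # bytes
GB = 1e9  # bytes


def transfer_time_us(num_bytes, bandwidth_gb_per_s):
    """Microseconds needed to move num_bytes at the given bandwidth (GB/s)."""
    return num_bytes / (bandwidth_gb_per_s * GB) * 1e6


nvlink_us = transfer_time_us(1 * MB, 33)  # NVLink at roughly 33 GB/s
pcie_us = transfer_time_us(1 * MB, 12)    # PCIe 3.0 x16 at roughly 12 GB/s

# Express the ~10 us kernel launch overhead in units of ~200 ns
# global memory accesses, to put the two costs on the same scale.
launch_overhead_ns = 10 * 1000
global_access_ns = 200
accesses_per_launch = launch_overhead_ns / global_access_ns

print(f"NVLink, 1 MB: {nvlink_us:.1f} us")
print(f"PCIe,   1 MB: {pcie_us:.1f} us")
print(f"One kernel launch costs about {accesses_per_launch:.0f} global memory accesses")
```

Under these assumptions the computed times land close to the rounded 30 μs and 80 μs figures, and a single kernel launch costs about as much as 50 back-to-back global memory accesses, which is one reason to batch work into fewer, larger kernels.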

Updated 2026-05-18

Tags

D2L

Dive into Deep Learning @ D2L