Global Synchronization in PyTorch
In PyTorch, a synchronization barrier forces all pending backend computation to finish before control returns to the frontend. Calling torch.cuda.synchronize(device) blocks the Python frontend thread until every operation queued on the specified GPU has finished executing; if the device argument is omitted, the current device is used. This global synchronization matters for tasks such as performance benchmarking. CUDA operations are launched asynchronously, so without the barrier a timer measures only the small cost of adding tasks to the backend queue, not the time the computation actually takes.
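The effect on a benchmark can be illustrated with a minimal sketch. The helper name time_matmul, the matrix size, and the repeat count are assumptions chosen for illustration and do not come from the original text. The sketch falls back to the CPU when no GPU is present, where execution is already synchronous and the two timings should be similar.

```python
import time

import torch


def time_matmul(n=1024, repeats=10, sync=True):
    """Return wall-clock seconds for `repeats` matrix multiplications.

    Hypothetical helper for illustration only.
    """
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    a = torch.randn(n, n, device=device)
    b = torch.randn(n, n, device=device)

    # Warm up once so one-time initialization is not counted in the timing.
    _ = a @ b
    if device.type == "cuda":
        torch.cuda.synchronize(device)

    start = time.perf_counter()
    for _ in range(repeats):
        _ = a @ b  # on a GPU this only enqueues the kernel
    if sync and device.type == "cuda":
        # Barrier: wait until every queued operation on the device is done.
        torch.cuda.synchronize(device)
    return time.perf_counter() - start


queued = time_matmul(sync=False)
finished = time_matmul(sync=True)
print(f"without barrier: {queued:.4f} s, with barrier: {finished:.4f} s")
```

On a machine with a GPU, the timing without the barrier is typically much smaller, because it reflects little more than the time needed to queue the kernels.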
Tags
D2L
Dive into Deep Learning @ D2L
Related
Global Synchronization in MXNet
Variable-Specific Synchronization in MXNet
Implicit Blockers in Deep Learning Frameworks
Example of Asynchronous Benchmarking
Scheduling Overhead in Multithreaded Deep Learning Systems
Example of Synchronous vs. Asynchronous Increment Benchmark
Minibatch Synchronization to Prevent Task Queue Overflow
Chip Vendor Performance Analysis Tools for Deep Learning
Automatic Multi-GPU Parallelism via Asynchronous Execution