1Cademy - Multi-GPU Training Loop Implementation from Scratch


The train function orchestrates the full multi-GPU data-parallel training loop implemented from scratch. It accepts the number of GPUs, batch size, and learning rate as arguments. The procedure is:

Setup: Load the Fashion-MNIST dataset, build the device list by calling d2l.try_gpu(i) for each of the num_gpus devices, and copy the model parameters to every device with get_params.
Epoch loop: For each of 10 epochs, iterate over every minibatch in the training set and call train_batch to perform the data-parallel forward pass, gradient synchronization, and parameter update. Because CUDA operations are launched asynchronously, torch.cuda.synchronize() is called after every minibatch so that queued GPU work completes before the next minibatch begins. Without this barrier the timer could stop before the epoch's work has actually finished, so it is also required for accurate epoch timing.
Evaluation: After each epoch, test accuracy is computed on a single GPU (devices[0]) using d2l.evaluate_accuracy_gpu, which receives a lambda that applies the model with the first device's copy of the parameters. Evaluating on one GPU leaves the others idle during evaluation, but it keeps the code simple.
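
To make the gradient-synchronization step in the epoch loop concrete, here is a minimal pure-Python sketch that simulates data parallelism without any GPUs. It is not the train_batch function used by this node: the helpers split_batch, grad_sum, and data_parallel_step are hypothetical names introduced only for illustration, and the "model" is a single scalar weight w fitted to y = w * x. The sketch assumes each simulated device sums the loss over its own shard and that the update divides by the full batch size.

```python
# Pure-Python simulation of one data-parallel SGD step (illustrative only).
# All helper names below are hypothetical and not part of d2l or PyTorch.

def split_batch(xs, ys, num_shards):
    """Split a minibatch into equal contiguous shards, one per simulated device."""
    assert len(xs) % num_shards == 0, "batch size must be divisible by num_shards"
    size = len(xs) // num_shards
    return [(xs[i * size:(i + 1) * size], ys[i * size:(i + 1) * size])
            for i in range(num_shards)]

def grad_sum(w, xs, ys):
    """Gradient w.r.t. w of the summed loss sum((w*x - y)**2) / 2."""
    return sum((w * x - y) * x for x, y in zip(xs, ys))

def data_parallel_step(w, xs, ys, num_shards, lr):
    """Run one step on `num_shards` simulated devices; return the updated replicas."""
    replicas = [w] * num_shards                  # every device holds its own copy of w
    shards = split_batch(xs, ys, num_shards)     # every device sees a different shard
    local_grads = [grad_sum(w_dev, sx, sy)       # local backward pass per device
                   for w_dev, (sx, sy) in zip(replicas, shards)]
    total_grad = sum(local_grads)                # "all-reduce": sum, then share with all
    # Every replica applies the same averaged gradient, so the copies stay identical.
    return [w_dev - lr * total_grad / len(xs) for w_dev in replicas]

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
one_device = data_parallel_step(0.0, xs, ys, num_shards=1, lr=0.1)
two_devices = data_parallel_step(0.0, xs, ys, num_shards=2, lr=0.1)
print(one_device, two_devices)
```

Because the per-shard gradients are summed before the update, the two-device run produces the same weight as the single-device run. Under these assumptions, adding devices changes how the work is distributed but not the result of the step.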

import torch
from d2l import torch as d2l

# Assumes params, lenet, get_params, and train_batch are already defined.
def train(num_gpus, batch_size, lr):
    train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)
    devices = [d2l.try_gpu(i) for i in range(num_gpus)]
    # Copy the model parameters to each of the num_gpus devices
    device_params = [get_params(params, d) for d in devices]
    num_epochs = 10
    animator = d2l.Animator('epoch', 'test acc', xlim=[1, num_epochs])
    timer = d2l.Timer()
    for epoch in range(num_epochs):
        timer.start()
        for X, y in train_iter:
            # Perform multi-GPU training for a single minibatch
            train_batch(X, y, device_params, devices, lr)
            # Wait for queued GPU work so the timer measures completed work
            torch.cuda.synchronize()
        timer.stop()
        # Evaluate the model on the first device only
        animator.add(epoch + 1, (d2l.evaluate_accuracy_gpu(
            lambda x: lenet(x, device_params[0]), test_iter, devices[0]),))
    print(f'test acc: {animator.Y[0][-1]:.2f}, '
          f'{timer.avg():.1f} sec/epoch on {str(devices)}')
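
The role of the barrier before timer.stop() can be illustrated with a rough CPU-only analogy. The sketch below does not use CUDA at all: a worker thread stands in for the GPU, fake_kernel and timed_launch are hypothetical names, and future.result() plays the part that torch.cuda.synchronize() plays in train. It only aims to show why a timer that stops without waiting for asynchronous work reports too short a time.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_kernel(seconds):
    """Stand-in for device work that runs while the caller keeps going."""
    time.sleep(seconds)

def timed_launch(wait, work_seconds=0.2):
    """Time one asynchronous launch, optionally waiting for it to finish."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        start = time.perf_counter()
        future = pool.submit(fake_kernel, work_seconds)  # returns immediately
        if wait:
            future.result()  # analogous to the synchronization barrier
        elapsed = time.perf_counter() - start
    # Leaving the `with` block waits for the worker, but only after timing ended.
    return elapsed

unsynced = timed_launch(wait=False)
synced = timed_launch(wait=True)
print(f'without barrier: {unsynced:.3f}s, with barrier: {synced:.3f}s')
```

Without the wait, the measured time covers only the cost of submitting the work. With the wait, it covers the work itself, which is the quantity the sec/epoch figure printed by train is meant to report.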