Research has shown that applying a warmup phase during optimization limits the amount of parameter divergence in very deep neural networks. Because the network weights are randomly initialized, the parts of the network that require the most time to make progress are highly susceptible to significant divergence early in training. Gradually increasing the learning rate during a warmup period mitigates this instability, leading to better initial convergence.

Claude

To address the dilemma of choosing an initial learning rate that is either too small (causing slow progress) or too large (causing divergence), a simple strategy called optimizer warmup is used. During the warmup period, the learning rate increases gradually, typically linearly, from a small value to its maximum; afterward it decays (cools down) until the end of the optimization process.

Optimizer Warmup

Dive into Deep Learning

The fundamental intuition behind using an optimizer warmup phase is that random parameter initialization, especially in advanced or very deep neural networks, often leads to unstable optimization and significant early divergence. A warmup period mitigates this by starting with a small learning rate, which effectively limits the amount of parameter divergence in the parts of the network that take the most time to make initial progress. Once the parameters have stabilized, the learning rate can be safely increased to avoid slow training.

Intuition Behind Optimizer Warmup

Effect of Warmup on Parameter Divergence

A learning rate warmup can be applied to various learning rate schedules, such as a cosine schedule, to stabilize training and improve initial convergence. In deep learning frameworks, this is often configured through a warmup-steps parameter. For example, a cosine scheduler can be configured to increase the learning rate linearly over the first 5 steps (epochs, in the example here) and then apply standard cosine decay. Plotting the schedule over epochs shows the initial linear increase followed by the cool-down (decay) period.

```python
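import torch
from d2l import torch as d2l

num_epochs = 20
# Linear warmup over the first 5 steps, then cosine decay from 0.3 toward 0.01
scheduler = CosineScheduler(num_epochs, warmup_steps=5, base_lr=0.3, final_lr=0.01)
d2l.plot(torch.arange(num_epochs), [scheduler(t) for t in range(num_epochs)])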
```
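
The call above assumes a `CosineScheduler` class is already in scope; its definition is not shown here. As a minimal sketch of what such a scheduler might look like, the class below ramps the rate linearly over `warmup_steps` and then follows a half-cosine from `base_lr` down to `final_lr`. The constructor mirrors the keyword arguments used above, but the `warmup_begin_lr` parameter and all internals are assumptions made for illustration, not necessarily the original implementation.

```python
import math


class CosineScheduler:
    """Illustrative sketch: linear warmup followed by cosine decay."""

    def __init__(self, max_update, base_lr=0.01, final_lr=0.0,
                 warmup_steps=0, warmup_begin_lr=0.0):
        self.max_update = max_update            # step at which decay ends
        self.base_lr = base_lr                  # peak rate, reached after warmup
        self.final_lr = final_lr                # rate at the end of decay
        self.warmup_steps = warmup_steps
        self.warmup_begin_lr = warmup_begin_lr  # assumed starting rate

    def __call__(self, step):
        # Warmup phase: interpolate linearly from warmup_begin_lr to base_lr.
        if step < self.warmup_steps:
            frac = step / self.warmup_steps
            return self.warmup_begin_lr + (self.base_lr - self.warmup_begin_lr) * frac
        # Past the end of the schedule: hold the final rate.
        if step >= self.max_update:
            return self.final_lr
        # Decay phase: half-cosine from base_lr down to final_lr.
        decay_steps = self.max_update - self.warmup_steps
        progress = (step - self.warmup_steps) / decay_steps
        cosine = (1 + math.cos(math.pi * progress)) / 2
        return self.final_lr + (self.base_lr - self.final_lr) * cosine


scheduler = CosineScheduler(20, warmup_steps=5, base_lr=0.3, final_lr=0.01)
lrs = [scheduler(t) for t in range(20)]
```

With these settings the rate starts at 0, reaches 0.3 at step 5, and decays toward 0.01 as the step count approaches 20.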

Learning Rate Warmup Schedule Example

To empirically observe the benefits of a learning rate warmup, a neural network can be trained using an optimizer configured with a warmup scheduler. The training metrics typically show that the network converges better initially, particularly during the warmup epochs, than it does without warmup. The improvement reflects a more stable early optimization, which matters most for advanced networks.

```python
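# net_fn, train, train_iter, test_iter, loss, and device are assumed to be defined earlier
net = net_fn()
# The optimizer's lr matches the scheduler's base_lr (0.3)
trainer = torch.optim.SGD(net.parameters(), lr=0.3)
train(net, train_iter, test_iter, num_epochs, loss, trainer, device, scheduler)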
```
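
The `train` helper is not shown here, so how it consumes `scheduler` is not specified. One common pattern, sketched below as an assumption, is to overwrite the `lr` entry of each optimizer parameter group at the start of every epoch. To keep the sketch free of dependencies, a stand-in object replaces the optimizer; PyTorch optimizers such as `torch.optim.SGD` expose the same `param_groups` list of dictionaries. The names `set_learning_rate` and `linear_warmup` are hypothetical.

```python
from types import SimpleNamespace


def linear_warmup(epoch, warmup_steps=5, base_lr=0.3):
    """Hypothetical schedule: linear ramp up to base_lr, then constant."""
    return base_lr * min(1.0, epoch / warmup_steps)


def set_learning_rate(optimizer, lr):
    """Write lr into every parameter group of the optimizer."""
    for group in optimizer.param_groups:
        group["lr"] = lr


# Stand-in for an optimizer; only the param_groups attribute is used.
trainer = SimpleNamespace(param_groups=[{"lr": 0.3}])

history = []
for epoch in range(8):
    # Update the rate before running the epoch's training steps.
    set_learning_rate(trainer, linear_warmup(epoch))
    # ... one epoch of forward/backward passes and optimizer steps would go here ...
    history.append(trainer.param_groups[0]["lr"])
```

Updating the rate once per epoch matches the per-epoch schedule plotted earlier; a per-iteration update would follow the same pattern with a finer step counter.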
