The fundamental intuition behind using an optimizer warmup phase is that random parameter initialization, especially in advanced or very deep neural networks, often leads to unstable optimization and significant early divergence. A warmup period mitigates this by starting with a small learning rate, which effectively limits the amount of parameter divergence in the parts of the network that take the most time to make initial progress. Once the parameters have stabilized, the learning rate can be safely increased to avoid slow training.

Intuition Behind Optimizer Warmup

Research has shown that applying a warmup phase during optimization limits the amount of parameter divergence in very deep neural networks. Because the network weights are randomly initialized, the parts of the network that require the most time to make progress are highly susceptible to significant divergence early in training. Gradually increasing the learning rate during a warmup period mitigates this instability, leading to better initial convergence.

Effect of Warmup on Parameter Divergence

A learning rate warmup can be applied to various learning rate schedules, such as a cosine schedule, to stabilize training and improve initial convergence. In deep learning frameworks, this is often configured using a warmup steps parameter. For example, a cosine scheduler can be configured to linearly increase the learning rate for the first 5 steps before applying standard cosine decay. Plotting the learning rate schedule over epochs visually demonstrates this initial linear increase followed by the cooling-down (decay) period.

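```python
import torch
from d2l import torch as d2l
# CosineScheduler is assumed to be defined elsewhere; it is not a torch built-in.
num_epochs = 20
scheduler = CosineScheduler(num_epochs, warmup_steps=5, base_lr=0.3, final_lr=0.01)
d2l.plot(torch.arange(num_epochs), [scheduler(t) for t in range(num_epochs)])
```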

Learning Rate Warmup Schedule Example

To observe the benefits of a learning rate warmup empirically, a neural network can be trained using an optimizer driven by a warmup scheduler. The training metrics typically show better initial convergence, especially during the warmup epochs, than training without warmup. This early stability matters most for advanced network designs, where optimization is otherwise prone to diverging at the start.

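```python
# net_fn, train, train_iter, test_iter, loss, device, num_epochs, and scheduler
# are assumed to be defined earlier.
net = net_fn()
trainer = torch.optim.SGD(net.parameters(), lr=0.3)
train(net, train_iter, test_iter, num_epochs, loss, trainer, device, scheduler)
```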

Learning Rate Warmup Training Example

To address the dilemma of choosing an initial learning rate that is either too small (causing slow progress) or too large (causing divergence), a simple strategy called optimizer warmup is used. During a warmup period, the learning rate gradually increases—typically linearly—from a small value to its initial maximum, after which it cools down until the end of the optimization process.
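
The warmup-then-cool-down shape described above can be sketched in plain Python. This is a minimal illustration, not a library API: the function name `warmup_lr`, its parameters, and the choice of a linear cool-down are assumptions made for brevity, and the cool-down could equally follow a cosine curve.

```python
def warmup_lr(t, warmup_steps, max_update, base_lr, begin_lr=0.0, final_lr=0.0):
    """Hypothetical helper: linear warmup to base_lr, then linear cool-down."""
    if t < warmup_steps:
        # Linear ramp from begin_lr towards base_lr.
        return begin_lr + (base_lr - begin_lr) * t / warmup_steps
    if t >= max_update:
        # After the last scheduled update the rate stays at final_lr.
        return final_lr
    # Linear cool-down from base_lr (at t == warmup_steps) to final_lr.
    frac = (t - warmup_steps) / (max_update - warmup_steps)
    return base_lr + (final_lr - base_lr) * frac


# Learning rates for steps 0..20 with a 5-step warmup and a peak of 0.3.
lrs = [warmup_lr(t, warmup_steps=5, max_update=20, base_lr=0.3) for t in range(21)]
```

With these illustrative settings the rate starts at the assumed `begin_lr`, peaks at step 5, and returns to `final_lr` at step 20.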

Claude

A learning rate scheduler is a mechanism used to dynamically adjust the learning rate during the training of a model. Rather than keeping the learning rate constant, the scheduler acts as a function that takes the number of optimization updates (such as iterations or epochs) and outputs the appropriate learning rate value for the next step.

Learning Rate Scheduler

When training advanced neural network designs, careful parameter initialization alone is sometimes insufficient to guarantee stable optimization. This creates a dilemma: a sufficiently small initial learning rate prevents early divergence but makes progress extremely slow, whereas a large initial learning rate can cause the optimization to diverge right from the start.
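
The two failure modes can be reproduced on a toy problem. The sketch below is only an analogy and is not taken from the source: it runs plain gradient descent on a one-dimensional quadratic, and the function name, curvature, and learning rates are all illustrative assumptions. A deep network behaves far less predictably, but the trade-off is the same.

```python
def gradient_descent(lr, curvature=10.0, x0=1.0, steps=20):
    """Minimize f(x) = 0.5 * curvature * x**2 and return the final |x|."""
    x = x0
    for _ in range(steps):
        x -= lr * curvature * x  # the gradient of f at x is curvature * x
    return abs(x)


# Too small: each step shrinks x by only 1%, so progress is slow.
small = gradient_descent(lr=0.001)
# Too large: |1 - lr * curvature| = 1.5 > 1, so the iterates blow up.
large = gradient_descent(lr=0.25)
# In between: the iterates contract quickly towards the minimum at 0.
moderate = gradient_descent(lr=0.05)
```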

Dilemma of Initial Learning Rate

Dive into Deep Learning

Applying a learning rate scheduler to gently decrease the learning rate over the course of training can lead to improved model accuracy and less overfitting compared to using a constant learning rate. While the exact cause is debated, one theoretical explanation suggests that taking smaller step sizes forces the model parameters to remain closer to zero, resulting in a simpler model, although this does not completely explain the phenomenon.

Effect of Learning Rate Scheduling on Overfitting

A polynomial learning rate decay schedule gradually reduces the learning rate over the course of training according to a polynomial function. It is considered one of the common policy choices for dynamically adjusting the learning rate.
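
The source does not give a formula here, so the sketch below assumes one common form of polynomial decay, $$\eta_t = \eta_0 (\beta t + 1)^{-\alpha}$$. The function name and the parameters `beta` and `alpha` are illustrative assumptions; other libraries define polynomial decay differently.

```python
def polynomial_lr(t, base_lr=0.1, beta=1.0, alpha=0.5):
    """Assumed form: base_lr * (beta * t + 1) ** (-alpha)."""
    return base_lr * (beta * t + 1) ** (-alpha)


# Learning rates for the first ten updates; they decrease monotonically.
lrs = [polynomial_lr(t) for t in range(10)]
```

A larger `alpha` makes the rate fall faster, while `beta` controls how quickly the step count starts to matter.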

Polynomial Learning Rate Decay

A piecewise constant learning rate schedule, often referred to as a multi-factor scheduler, maintains a steady learning rate for predefined intervals of training and abruptly drops the rate at specific milestones. Mathematically, given a set of milestone times $$s$$ (such as $$s = \{5, 10, 20\}$$), the learning rate is updated according to the rule $$\eta_{t+1} \leftarrow \eta_t \cdot \alpha$$ whenever the current step $$t \in s$$, where $$\alpha$$ is the designated decay factor.
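
The update rule can be unrolled in a few lines of plain Python. This is a minimal sketch: the function name `multifactor_schedule` and the example values for the base rate and $$\alpha$$ are assumptions chosen for illustration.

```python
def multifactor_schedule(base_lr, alpha, milestones, num_steps):
    """Return [eta_0, ..., eta_{num_steps-1}] for a piecewise constant schedule."""
    milestones = set(milestones)
    lrs = [base_lr]  # lrs[t] holds eta_t
    for t in range(num_steps - 1):
        # eta_{t+1} = eta_t * alpha if t is a milestone, otherwise unchanged.
        lrs.append(lrs[-1] * alpha if t in milestones else lrs[-1])
    return lrs


lrs = multifactor_schedule(base_lr=0.5, alpha=0.5, milestones={5, 10, 20}, num_steps=25)
```

Here the rate holds at 0.5 through step 5, then halves after each of the three milestones.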

Piecewise Constant Learning Rate Schedule

A cosine learning rate schedule, proposed by Loshchilov and Hutter (2016), adjusts the learning rate by following the shape of a cosine curve. It relies on the observation that the learning rate should not decrease too drastically at the beginning of training, and that the solution should be refined at the end using a very small learning rate. For update steps $$t \in [0, T]$$, this results in a schedule with the functional form:

$$\eta_t = \eta_T + \frac{\eta_0 - \eta_T}{2} \left(1 + \cos(\pi t/T)\right)$$

Here, $$\eta_0$$ is the initial learning rate and $$\eta_T$$ is the target rate at the maximum update step $$T$$. For steps $$t > T$$, the learning rate is simply pinned to $$\eta_T$$ without increasing it again.
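
The formula translates directly into code. The following is a minimal sketch without warmup; the function name `cosine_lr` and its parameter names are assumptions, with `base_lr` standing for $$\eta_0$$, `final_lr` for $$\eta_T$$, and `max_update` for $$T$$.

```python
import math


def cosine_lr(t, max_update, base_lr, final_lr=0.0):
    """Cosine schedule for step t; pinned to final_lr once t exceeds max_update."""
    if t > max_update:
        return final_lr
    return final_lr + (base_lr - final_lr) / 2 * (1 + math.cos(math.pi * t / max_update))


# Learning rates for steps 0..24 with T = 20; steps past 20 stay at final_lr.
lrs = [cosine_lr(t, max_update=20, base_lr=0.3, final_lr=0.01) for t in range(25)]
```

The curve is flat near $$t = 0$$ and near $$t = T$$, which is what keeps the early decrease gentle and the final steps small.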

Cosine Learning Rate Schedule

Optimizer Warmup

A factor learning rate scheduler is an alternative to polynomial decay that utilizes a multiplicative reduction strategy. The learning rate is updated at each step using the equation $$\eta_{t+1} \leftarrow \eta_t \cdot \alpha$$, where $$\alpha \in (0, 1)$$. To prevent the learning rate from decaying beyond a reasonable lower bound, the update equation is typically modified to $$\eta_{t+1} \leftarrow \mathop{\mathrm{max}}(\eta_{\mathrm{min}}, \eta_t \cdot \alpha)$$.
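
The clamped update can be sketched by iterating the recurrence directly. The function name `factor_schedule` and the example values are illustrative assumptions.

```python
def factor_schedule(base_lr, alpha, min_lr, num_steps):
    """Return [eta_0, ..., eta_{num_steps-1}] under multiplicative decay."""
    lrs = [base_lr]
    for _ in range(num_steps - 1):
        # eta_{t+1} = max(eta_min, eta_t * alpha)
        lrs.append(max(min_lr, lrs[-1] * alpha))
    return lrs


lrs = factor_schedule(base_lr=2.0, alpha=0.5, min_lr=0.2, num_steps=6)
print(lrs)  # [2.0, 1.0, 0.5, 0.25, 0.2, 0.2]
```

Once the product $$\eta_t \cdot \alpha$$ drops below $$\eta_{\mathrm{min}}$$, the schedule stays at the lower bound.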

Factor Learning Rate Scheduler

A basic learning rate adjustment can be performed explicitly at each step by modifying the optimizer's parameter groups directly. The following Python code demonstrates how to set a new learning rate manually in a PyTorch optimizer:

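```python
# Assumes `trainer` is an existing torch.optim optimizer, e.g. torch.optim.SGD.
lr = 0.1
trainer.param_groups[0]["lr"] = lr  # updates only the first parameter group
print(f'learning rate is now {trainer.param_groups[0]["lr"]:.2f}')
```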

Explicit Learning Rate Adjustment Implementation

To empirically study learning rate scheduling, a computationally efficient yet nontrivial toy problem can be constructed. A common setup involves training a modernized LeNet architecture on the Fashion-MNIST dataset. This modernized LeNet updates the classic design by replacing sigmoid activations with ReLU activations and substituting AveragePooling operations with MaxPooling.

Learning Rate Scheduler Toy Problem

A simple learning rate scheduler can be defined to decay the learning rate proportionally to the inverse square root of the number of updates. Specifically, at step $$t$$, the learning rate is set to $$\eta = \eta_0 (t + 1)^{-\frac{1}{2}}$$, where $$\eta_0$$ is the initial learning rate. This formulation ensures that the step size gently and continuously decreases as training progresses.
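
Such a scheduler can be written as a small callable object that maps the update count to a learning rate. This is a minimal sketch; the class name `SquareRootScheduler` and its attribute names are assumptions.

```python
class SquareRootScheduler:
    """Map the update count t to eta_0 * (t + 1) ** -0.5."""

    def __init__(self, lr=0.1):
        self.lr = lr  # eta_0, the initial learning rate

    def __call__(self, num_update):
        return self.lr * (num_update + 1) ** -0.5


scheduler = SquareRootScheduler(lr=0.1)
# Learning rates for the first ten updates.
lrs = [scheduler(t) for t in range(10)]
```

The returned value can then be assigned to the optimizer before each epoch or iteration, depending on which unit `num_update` counts.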
