When an optimizer runs with a constant, undecayed learning rate, the model often becomes prone to overfitting as training progresses. For example, if a modernized LeNet architecture is trained on Fashion-MNIST with a constant learning rate of $$0.3$$ for $$30$$ epochs, the training accuracy continues to rise while the test accuracy stalls after a certain point. The resulting gap between the training and test accuracy curves is a clear visual indicator of overfitting.

Overfitting with a Constant Learning Rate

To empirically study learning rate scheduling, a computationally efficient yet nontrivial toy problem can be constructed. A common setup involves training a modernized LeNet architecture on the Fashion-MNIST dataset. This modernized LeNet updates the classic design by replacing sigmoid activations with ReLU activations and substituting AveragePooling operations with MaxPooling.

Claude

A learning rate scheduler is a mechanism used to dynamically adjust the learning rate during the training of a model. Rather than keeping the learning rate constant, the scheduler acts as a function that takes the number of optimization updates (such as iterations or epochs) and outputs the appropriate learning rate value for the next step.

Learning Rate Scheduler

The classic LeNet-5 neural network can be modernized to improve performance and reflect contemporary deep learning practices. This modernization primarily involves two architectural updates: replacing the original sigmoid activation functions with Rectified Linear Unit (ReLU) activations, and substituting average pooling operations with max pooling. When applied to classification tasks like Fashion-MNIST, this slightly modernized LeNet provides an efficient, nontrivial baseline model for evaluating training dynamics and optimization algorithms.
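The two substitutions described above can be sketched in PyTorch roughly as follows. This is a minimal illustration rather than a reference implementation: the function name `modernized_lenet` is hypothetical, and the layer widths (6 and 16 channels, 120 and 84 hidden units) are assumed from the classic LeNet-5 layout, since the text does not specify them.

```python
import torch
from torch import nn


def modernized_lenet(num_classes: int = 10) -> nn.Sequential:
    """LeNet-style network with ReLU activations and max pooling.

    Layer sizes are assumptions taken from the classic LeNet-5 layout.
    """
    return nn.Sequential(
        # ReLU replaces the original sigmoid activations.
        nn.Conv2d(1, 6, kernel_size=5, padding=2), nn.ReLU(),
        # Max pooling replaces the original average pooling.
        nn.MaxPool2d(kernel_size=2, stride=2),
        nn.Conv2d(6, 16, kernel_size=5), nn.ReLU(),
        nn.MaxPool2d(kernel_size=2, stride=2),
        nn.Flatten(),
        nn.Linear(16 * 5 * 5, 120), nn.ReLU(),
        nn.Linear(120, 84), nn.ReLU(),
        nn.Linear(84, num_classes),
    )


net = modernized_lenet()
# Fashion-MNIST images are single-channel 28x28; a dummy batch checks the shapes.
dummy_batch = torch.zeros(2, 1, 28, 28)
logits = net(dummy_batch)
```

With a 28x28 input, the feature map entering the first linear layer is 16 channels of 5x5, which is where the `16 * 5 * 5` input size comes from.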

Modernized LeNet Architecture

Dive into Deep Learning

Applying a learning rate scheduler to gently decrease the learning rate over the course of training can lead to improved model accuracy and less overfitting compared to using a constant learning rate. While the exact cause is debated, one proposed explanation is that smaller step sizes lead to parameters that stay closer to zero and hence to a simpler model. This does not fully explain the phenomenon, however, because training is not stopped early; the learning rate is only reduced gradually.

Effect of Learning Rate Scheduling on Overfitting

A polynomial learning rate decay schedule gradually reduces the learning rate over the course of training according to a polynomial function of the step count, for example a rate that shrinks proportionally to $$(t + 1)^{-\frac{1}{2}}$$. It is one of the common policy choices for dynamically adjusting the learning rate.

Polynomial Learning Rate Decay

A piecewise constant learning rate schedule, often referred to as a multi-factor scheduler, maintains a steady learning rate for predefined intervals of training and abruptly drops the rate at specific milestones. Mathematically, given a set of milestone times $$s$$ (such as $$s = \{5, 10, 20\}$$), the learning rate is updated according to the rule $$\eta_{t+1} \leftarrow \eta_t \cdot \alpha$$ whenever the current step $$t \in s$$, where $$\alpha$$ is the designated decay factor.
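The update rule above can be sketched in plain Python. The function name `multi_factor_lr` and the default values are illustrative assumptions. The sketch follows the rule literally: the drop triggered at a milestone $$t \in s$$ takes effect from step $$t + 1$$, so the rate at step $$t$$ depends on how many milestones lie strictly before $$t$$.

```python
def multi_factor_lr(t, milestones=(5, 10, 20), factor=0.5, base_lr=0.5):
    """Piecewise constant learning rate at step t (illustrative sketch).

    Each milestone s multiplies the rate by `factor`, effective from step s + 1.
    """
    drops = sum(1 for s in milestones if s < t)
    return base_lr * factor ** drops


# The rate stays flat between milestones and halves just after each one.
schedule = [multi_factor_lr(t) for t in range(25)]
```

Library implementations may apply the drop at the milestone step itself rather than one step later, so the off-by-one convention is worth checking before comparing curves.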

Piecewise Constant Learning Rate Schedule

A cosine learning rate schedule, proposed by Loshchilov and Hutter (2016), adjusts the learning rate by following the shape of a cosine curve. It relies on the observation that the learning rate should not decrease too drastically at the beginning of training, and that the solution should be refined at the end using a very small learning rate. For update steps $$t \in [0, T]$$, the learning rate has the functional form:

$$\eta_t = \eta_T + \frac{\eta_0 - \eta_T}{2} \left(1 + \cos(\pi t/T)\right)$$

Here, $$\eta_0$$ is the initial learning rate and $$\eta_T$$ is the target rate at the maximum update step $$T$$. For steps $$t > T$$, the learning rate is simply pinned to $$\eta_T$$ without increasing it again.
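The formula and the pinning behavior for $$t > T$$ translate directly into a small function. This is a sketch only; the name `cosine_lr` and the default values for $$\eta_0$$, $$\eta_T$$, and $$T$$ are assumptions chosen for illustration.

```python
import math


def cosine_lr(t, base_lr=0.3, final_lr=0.01, max_update=20):
    """Cosine schedule: eta_T + (eta_0 - eta_T) / 2 * (1 + cos(pi * t / T))."""
    if t > max_update:
        # Past the final step the rate is pinned to eta_T and never rises again.
        return final_lr
    cosine_term = 1 + math.cos(math.pi * t / max_update)
    return final_lr + (base_lr - final_lr) / 2 * cosine_term


# Starts at base_lr, passes the midpoint of the two rates at t = T / 2,
# and reaches final_lr at t = T.
rates = [cosine_lr(t) for t in range(30)]
```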

Cosine Learning Rate Schedule

To address the dilemma of choosing an initial learning rate that is either too small (causing slow progress) or too large (causing divergence), a simple strategy called optimizer warmup is used. During a warmup period, the learning rate gradually increases, typically linearly, from a small value to its initial maximum, after which it decays until the end of the optimization process.
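A linear warmup can be sketched as a wrapper around any decay schedule. Everything named here (`with_linear_warmup`, `warmup_steps`, `warmup_begin_lr`, and the inverse square root decay used as the cooldown phase) is an illustrative assumption rather than a prescribed implementation.

```python
def with_linear_warmup(decay_fn, base_lr, warmup_steps, warmup_begin_lr=0.0):
    """Wrap `decay_fn` so the rate first ramps linearly up to `base_lr`."""

    def schedule(t):
        if t < warmup_steps:
            # Linear ramp from warmup_begin_lr toward base_lr.
            increase = (base_lr - warmup_begin_lr) * t / warmup_steps
            return warmup_begin_lr + increase
        # After warmup, the decay schedule restarts its own step count at zero.
        return decay_fn(t - warmup_steps)

    return schedule


base_lr = 0.3
# Cooldown phase chosen for illustration: inverse square root decay.
decay = lambda t: base_lr * (t + 1) ** -0.5
warmup_schedule = with_linear_warmup(decay, base_lr, warmup_steps=5)
warmup_rates = [warmup_schedule(t) for t in range(15)]
```

Shifting the step count by `warmup_steps` makes the decay phase begin exactly at `base_lr`, so the schedule has no jump where warmup ends.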

Optimizer Warmup

A factor learning rate scheduler is an alternative to polynomial decay that utilizes a multiplicative reduction strategy. The learning rate is updated at each step using the equation $$\eta_{t+1} \leftarrow \eta_t \cdot \alpha$$, where $$\alpha \in (0, 1)$$. To prevent the learning rate from decaying beyond a reasonable lower bound, the update equation is typically modified to $$\eta_{t+1} \leftarrow \mathop{\mathrm{max}}(\eta_{\mathrm{min}}, \eta_t \cdot \alpha)$$.
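Unrolling the clamped update gives a closed form that is easy to sketch. The function name `factor_lr` and the default values are assumptions for illustration, and the closed form matches the recursion only when the initial rate is at least $$\eta_{\mathrm{min}}$$.

```python
def factor_lr(t, base_lr=1.0, factor=0.5, min_lr=0.1):
    """Learning rate after t multiplicative updates, clamped at min_lr.

    Equivalent to applying eta <- max(min_lr, eta * factor) t times,
    assuming base_lr >= min_lr and 0 < factor < 1.
    """
    return max(min_lr, base_lr * factor ** t)


# The rate halves each step until it would fall below min_lr, then stays there.
factor_rates = [factor_lr(t) for t in range(8)]
```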

Factor Learning Rate Scheduler

A basic learning rate adjustment can be performed explicitly at each step by modifying the optimizer's parameter groups directly. The following Python code demonstrates how to set a new learning rate manually in a PyTorch optimizer:

```python
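import torch
trainer = torch.optim.SGD(torch.nn.Linear(2, 1).parameters(), lr=0.3)  # stand-in optimizer (assumed)
lr = 0.1
trainer.param_groups[0]["lr"] = lr  # overwrite the rate of the first parameter group
print(f'learning rate is now {trainer.param_groups[0]["lr"]:.2f}')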
```

Explicit Learning Rate Adjustment Implementation

Learning Rate Scheduler Toy Problem

A simple learning rate scheduler can be defined to decay the learning rate proportionally to the inverse square root of the number of updates. Specifically, at step $$t$$, the learning rate is set to $$\eta = \eta_0 (t + 1)^{-\frac{1}{2}}$$, where $$\eta_0$$ is the initial learning rate. This formulation ensures that the step size gently and continuously decreases as training progresses.
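As a minimal sketch, the scheduler can be written as a callable object that maps the update count to a learning rate. The class name `SquareRootScheduler` and the default initial rate are assumptions made for illustration.

```python
class SquareRootScheduler:
    """Returns eta_0 * (t + 1) ** -0.5 for update count t."""

    def __init__(self, lr=0.1):
        self.lr = lr  # initial learning rate eta_0

    def __call__(self, num_update):
        return self.lr * (num_update + 1) ** -0.5


scheduler = SquareRootScheduler(lr=0.1)
sqrt_rates = [scheduler(t) for t in range(10)]
```

In a training loop, the value returned for the current epoch could be written into the optimizer's parameter groups once per epoch, in the same way as the explicit learning rate adjustment.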
