Using a learning rate scheduler during training tends to result in less overfitting than using a constant learning rate. The exact theoretical reason is not fully resolved. One argument posits that a smaller step size leads to model parameters that are closer to zero and therefore simpler. However, this argument does not completely explain the phenomenon, because training is not stopped early: the learning rate is only reduced gently.

Overfitting Reduction via Learning Rate Scheduling

When an optimization algorithm is run with a constant, un-decayed learning rate, the model often becomes prone to overfitting as training progresses. For example, if a modernized LeNet architecture is trained on Fashion-MNIST with a default constant learning rate of $$0.3$$ for $$30$$ epochs, the training accuracy continues to rise while the test accuracy stalls after a certain point. The resulting gap between the training and test accuracy curves is a clear visual indicator of overfitting.
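To make the contrast concrete, the following minimal sketch compares the constant rate of 0.3 with one possible gently decaying schedule. The cosine shape, the floor `FINAL_LR`, the `max_update` horizon, and the reading of the 30 training passes as epochs are all assumptions made for illustration; the text does not prescribe a particular scheduler.

```python
import math

BASE_LR = 0.3      # constant learning rate quoted in the text
NUM_EPOCHS = 30    # 30 training passes from the text, assumed here to be epochs
FINAL_LR = 0.01    # hypothetical floor for the decayed schedule


def constant_lr(epoch):
    """Constant, un-decayed learning rate."""
    return BASE_LR


def cosine_lr(epoch, max_update=20):
    """Cosine decay from BASE_LR to FINAL_LR over max_update epochs, then flat.

    max_update is a hypothetical choice, not a value from the text.
    """
    if epoch >= max_update:
        return FINAL_LR
    cosine = (1 + math.cos(math.pi * epoch / max_update)) / 2
    return FINAL_LR + (BASE_LR - FINAL_LR) * cosine


# Print both schedules every five epochs for a side-by-side comparison.
for epoch in range(0, NUM_EPOCHS, 5):
    print(f"epoch {epoch:2d}  constant={constant_lr(epoch):.3f}  "
          f"cosine={cosine_lr(epoch):.3f}")
```

Both schedules start at the same rate. The decayed one shrinks the step size smoothly without ending training early, which is the situation the overfitting comparison refers to.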

To study learning rate scheduling empirically, a computationally efficient yet nontrivial toy problem can be constructed. A common setup involves training a modernized LeNet architecture on the Fashion-MNIST dataset. This modernized LeNet updates the classic design by replacing sigmoid activations with ReLU activations and average pooling with max pooling.
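The architecture can be sketched without any framework by listing its layers and tracing the feature-map shape for a single 28x28 grayscale Fashion-MNIST image. The layer sizes below follow the classic LeNet-5 layout and are an assumption, since the text names only the activation and pooling changes; `LAYERS` and `trace_shapes` are hypothetical names introduced for this illustration.

```python
# Assumed layer layout (classic LeNet-5 sizes); only the ReLU and max-pooling
# substitutions are taken from the text.
LAYERS = [
    ("conv", {"out_channels": 6, "kernel": 5, "padding": 2}),
    ("relu", {}),                               # sigmoid in the classic design
    ("maxpool", {"kernel": 2, "stride": 2}),    # average pooling in the classic design
    ("conv", {"out_channels": 16, "kernel": 5, "padding": 0}),
    ("relu", {}),
    ("maxpool", {"kernel": 2, "stride": 2}),
    ("flatten", {}),
    ("linear", {"out_features": 120}),
    ("relu", {}),
    ("linear", {"out_features": 84}),
    ("relu", {}),
    ("linear", {"out_features": 10}),           # 10 Fashion-MNIST classes
]


def trace_shapes(layers, shape=(1, 28, 28)):
    """Return (layer kind, output shape) after every layer for one example."""
    shapes = []
    for kind, cfg in layers:
        if kind == "conv":
            _, h, w = shape
            k, p = cfg["kernel"], cfg["padding"]
            shape = (cfg["out_channels"], h + 2 * p - k + 1, w + 2 * p - k + 1)
        elif kind == "maxpool":
            c, h, w = shape
            k, s = cfg["kernel"], cfg["stride"]
            shape = (c, (h - k) // s + 1, (w - k) // s + 1)
        elif kind == "flatten":
            size = 1
            for dim in shape:
                size *= dim
            shape = (size,)
        elif kind == "linear":
            shape = (cfg["out_features"],)
        # "relu" is elementwise and leaves the shape unchanged.
        shapes.append((kind, shape))
    return shapes


for kind, shape in trace_shapes(LAYERS):
    print(f"{kind:8s} -> {shape}")
```

Each entry corresponds to a standard layer in common deep learning frameworks (for instance `nn.Conv2d`, `nn.ReLU`, `nn.MaxPool2d`, `nn.Flatten`, and `nn.Linear` in PyTorch), so the list can be translated into a trainable model fairly directly.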
