Using a learning rate scheduler during training results in less overfitting compared to using a constant learning rate. Although the exact theoretical reason is not fully resolved, one argument posits that a smaller step size leads to model parameters that are closer to zero and therefore simpler. However, this argument does not completely explain the phenomenon, as the training does not stop early but simply reduces the learning rate gently.

Claude

When an optimization algorithm is run using a constant, un-decayed learning rate, the model often becomes prone to overfitting as training progresses. For example, if a modernized LeNet architecture is trained on Fashion-MNIST with a default constant learning rate of $$0.3$$ for $$30$$ iterations, the training accuracy will continue to rise while the test accuracy stalls after a certain point. The resulting gap between the training and test accuracy curves is a clear visual indicator of overfitting.

Learn Before

Related