When applying standard gradient descent to an ill-conditioned objective function, choosing the learning rate poses a dilemma. Because the gradient changes at drastically different rates along different dimensions, a small learning rate keeps the iterates from diverging in the steep directions (e.g., $$x_2$$) but makes convergence extremely slow in the flat directions (e.g., $$x_1$$). Conversely, a large learning rate speeds up progress in the flat directions but causes the iterates to oscillate or diverge in the steep directions, which degrades the overall quality of the solution.
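The dilemma can be illustrated with a minimal sketch in plain Python. It assumes the objective $$f(\mathbf{x}) = 0.1 x_1^2 + 2 x_2^2$$; the starting point `(-5, -2)`, the step count, the two learning rates, and the helper names `grad_f` and `gradient_descent` are illustrative choices, not prescribed by the text.

```python
def grad_f(x1, x2):
    """Gradient of f(x) = 0.1 * x1**2 + 2 * x2**2."""
    return 0.2 * x1, 4.0 * x2


def gradient_descent(eta, steps=20, start=(-5.0, -2.0)):
    """Run plain gradient descent with a fixed learning rate eta."""
    x1, x2 = start
    for _ in range(steps):
        g1, g2 = grad_f(x1, x2)
        # Each coordinate is scaled independently:
        # x1 by (1 - 0.2 * eta), x2 by (1 - 4 * eta).
        x1, x2 = x1 - eta * g1, x2 - eta * g2
    return x1, x2


# Smaller rate: x2 settles quickly, x1 is still far from 0 after 20 steps.
small = gradient_descent(eta=0.4)
# Larger rate: x1 shrinks slightly faster, but |1 - 4 * eta| > 1, so x2 blows up.
large = gradient_descent(eta=0.6)

print("eta=0.4:", small)
print("eta=0.6:", large)
```

For this particular function, the steep coordinate is stable only when $$|1 - 4\eta| < 1$$, that is, for $$\eta < 0.5$$, while the flat coordinate contracts by just $$1 - 0.2\eta$$ per step.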

Claude

An ill-conditioned optimization problem is one in which progress along some directions is much slower than along others, creating a landscape that resembles a narrow canyon. Accelerated gradient methods such as momentum are particularly effective in these scenarios because averaging over past gradients produces more stable directions of descent: components that flip sign from step to step largely cancel, while components that point consistently in one direction accumulate. This avoids the erratic zigzagging of standard gradient descent.
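As a rough sketch of this effect, the snippet below applies one common form of momentum (velocity `v = beta * v + gradient`, followed by `x = x - eta * v`) to the assumed objective $$f(\mathbf{x}) = 0.1 x_1^2 + 2 x_2^2$$. The function name `momentum_descent`, the starting point, and the values `eta = 0.6` and `beta = 0.5` are illustrative assumptions; other momentum variants scale the terms differently.

```python
def momentum_descent(eta, beta, steps=50, start=(-5.0, -2.0)):
    """Gradient descent with momentum on f(x) = 0.1 * x1**2 + 2 * x2**2."""
    x1, x2 = start
    v1, v2 = 0.0, 0.0  # velocity: a decaying sum of past gradients
    for _ in range(steps):
        g1, g2 = 0.2 * x1, 4.0 * x2  # gradient of f at the current point
        v1, v2 = beta * v1 + g1, beta * v2 + g2
        x1, x2 = x1 - eta * v1, x2 - eta * v2
    return x1, x2


# beta = 0 reduces to plain gradient descent, which diverges in x2 at eta = 0.6.
plain = momentum_descent(eta=0.6, beta=0.0)
# With beta = 0.5 the same learning rate converges in both coordinates.
with_momentum = momentum_descent(eta=0.6, beta=0.5)

print("plain:   ", plain)
print("momentum:", with_momentum)
```

In the steep direction the gradient alternates in sign, so consecutive contributions to the velocity partly cancel; in the flat direction they share a sign and add up, which lengthens the effective step there.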

Ill-conditioned Optimization Problem

Dive into Deep Learning

An illustrative example of an ill-conditioned objective function is the highly distorted ellipsoid given by $$f(\mathbf{x}) = 0.1 x_1^2 + 2 x_2^2$$. The function attains its global minimum at $$(0, 0)$$, but its geometry is very flat along the $$x_1$$ direction and steep along the $$x_2$$ direction. Consequently, the gradient changes much more rapidly with respect to $$x_2$$ than with respect to $$x_1$$, which is the characteristic geometric distortion of ill-conditioned optimization problems.
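The distortion can be quantified with a short worked calculation. The condition number $$\kappa$$ is used here in its usual sense for a quadratic, the ratio of the largest to the smallest Hessian eigenvalue; the symbols $$\mathbf{H}$$ and $$\kappa$$ are introduced only for this illustration.

$$\nabla f(\mathbf{x}) = \begin{bmatrix} 0.2\, x_1 \\ 4\, x_2 \end{bmatrix}, \qquad \mathbf{H} = \nabla^2 f(\mathbf{x}) = \begin{bmatrix} 0.2 & 0 \\ 0 & 4 \end{bmatrix}, \qquad \kappa = \frac{4}{0.2} = 20$$

The curvature along $$x_2$$ is therefore 20 times the curvature along $$x_1$$, so a unit change in $$x_2$$ changes the gradient 20 times as much as a unit change in $$x_1$$.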
