Because gradients approach zero at a local minimum, an optimization algorithm can become trapped. However, introducing some degree of noise can knock the model's parameters out of these sub-optimal valleys. One of the beneficial properties of minibatch stochastic gradient descent is that the natural statistical variation of gradients evaluated over different minibatches inherently provides this necessary noise, successfully dislodging the parameters from local minima.

Claude

For a given objective function $$f(x)$$, a local minimum occurs at a point $$x$$ if the function's value $$f(x)$$ is smaller than its values at all other points within its immediate vicinity. In deep learning, objective functions usually possess many of these local optima. As an optimization algorithm nears a local minimum, the gradient approaches zero, which can cause the final numerical solution to only minimize the function locally rather than discovering the global minimum.

Local Minimum

Dive into Deep Learning

A global minimum is the point at which an objective function $$f(x)$$ achieves its absolute lowest value across its entire domain. Unlike a local minimum, which only represents the lowest value within a limited vicinity, the global minimum represents the true, optimal numerical solution for the overall optimization problem.

Global Minimum

To visualize the distinction between minima, consider the objective function $$f(x) = x \cdot \cos(\pi x)$$ defined over the domain $$-1.0 \leq x \leq 2.0$$. By evaluating and plotting this function, one can approximate both its local minimum—a point lowest only in its immediate vicinity—and its global minimum, which represents the lowest value across the entire specified interval.

Local and Global Minima Approximation

Escaping Local Minima

For a function whose input is a multi-dimensional vector and output is a scalar, a zero-gradient position represents a local minimum if all the eigenvalues of the function's Hessian matrix at that position are positive.

Local Minimum via Hessian Eigenvalues

Optimization in deep learning is fraught with challenges, making it extremely difficult to consistently find the true global minimum of an objective function. However, finding the absolute best solution is rarely necessary in practice. Fortunately, there exists a robust range of optimization algorithms that are reliable and easy to use. The local optima or approximate solutions discovered by these algorithms are typically highly effective and perfectly sufficient for training successful models.

Acceptability of Suboptimal Solutions in Deep Learning

For nonconvex objective functions, gradient descent can converge to a local minimum rather than the global minimum, and the particular local minimum reached depends on both the learning rate and the problem's conditioning. As an illustration, consider the function $$f(x) = x \cdot \cos(cx)$$ for a constant $$c$$, which possesses infinitely many local minima due to its oscillatory structure. When gradient descent is applied with an unrealistically high learning rate, the algorithm takes large steps that skip over better-quality minima and settles into a poor local minimum. This demonstrates that the learning rate not only affects convergence speed but also influences which solution gradient descent ultimately finds on nonconvex landscapes.

Learn Before

Related