Concept

Derivation of the Gradient Descent Formula

The gradient $\nabla_x f(x)$ of a scalar function $f(x_1, x_2, x_3, \dots, x_n)$ is defined as the unique vector field whose dot product with any vector $v$ at each point $x$ gives the directional derivative of $f$ along $v$. That is, $\nabla_x f(x) \cdot v = \nabla_v f(x)$.

The directional derivative in direction $v$ (a unit vector) is the slope of the function $f$ in direction $v$, namely the rate of increase of $f$ per unit of distance moved in the direction given by $v$.
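As a quick sanity check of this definition, the sketch below compares a finite-difference estimate of the directional derivative against $\nabla_x f(x) \cdot v$. The toy function $f$, the point $x$, and the direction $v$ are illustrative assumptions, not part of the original note.

```python
# Numerical check that the directional derivative along a unit vector v
# matches the dot product of the gradient with v, for an assumed toy
# function f(x) = x_1^2 + 3*x_2.
import numpy as np

def f(x):
    return x[0] ** 2 + 3 * x[1]

def grad_f(x):
    # Analytic gradient of the toy function above.
    return np.array([2 * x[0], 3.0])

x = np.array([1.0, 2.0])
v = np.array([1.0, 1.0]) / np.sqrt(2.0)  # unit vector, ||v||_2 = 1

h = 1e-6
directional = (f(x + h * v) - f(x)) / h  # finite-difference slope of f along v
print(directional)    # ~3.5355
print(grad_f(x) @ v)  # 5 / sqrt(2) ~ 3.5355, agreeing with the definition
```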

To minimize $f$, we would like to find the direction in which $f$ decreases the fastest. We can do this using the directional derivative: $$\min_{v,\, v^T v = 1} \nabla_x f(x) \cdot v = \min_{v,\, v^T v = 1} \lVert \nabla_x f(x) \rVert_2 \, \lVert v \rVert_2 \cos\theta,$$ where $\theta$ is the angle between $v$ and the gradient. Substituting in $\lVert v \rVert_2 = 1$ and ignoring factors that do not depend on $v$, this simplifies to $\min_v \cos\theta$.

This is minimized when $v$ points in the opposite direction from the gradient. In other words, the gradient points directly uphill, and the negative gradient points directly downhill. We can decrease $f$ by moving in the direction of the negative gradient.
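Written out as an equation, $\cos\theta$ attains its minimum value of $-1$ at $\theta = \pi$, so the steepest-descent unit direction is $$v^* = -\frac{\nabla_x f(x)}{\lVert \nabla_x f(x) \rVert_2}.$$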

Hence we have the gradient descent update $$x' = x - \alpha \nabla_x f(x),$$ where $\alpha$ is the learning rate, a positive scalar determining the size of the step.
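A minimal sketch of this update rule in code, assuming a toy quadratic objective; the function, starting point, learning rate, and iteration count below are illustrative choices, not prescribed by the derivation.

```python
# Gradient descent on an assumed toy quadratic
# f(x) = (x_1 - 3)^2 + (x_2 + 1)^2, whose minimizer is (3, -1).
import numpy as np

def f(x):
    return (x[0] - 3) ** 2 + (x[1] + 1) ** 2

def grad_f(x):
    # Analytic gradient of the toy quadratic.
    return np.array([2 * (x[0] - 3), 2 * (x[1] + 1)])

alpha = 0.1               # learning rate: positive scalar step size
x = np.array([0.0, 0.0])  # arbitrary starting point

for _ in range(100):
    x = x - alpha * grad_f(x)  # the update x' = x - alpha * grad f(x)

print(x)     # approaches the minimizer (3, -1)
print(f(x))  # approaches 0
```

Because each step moves along the negative gradient, $f$ decreases as long as $\alpha$ is small enough; too large a learning rate can overshoot and diverge.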


Updated 2020-11-16

Tags

Data Science