Concept

Derivation of the Gradient Descent Formula

The gradient $\nabla_x f(x)$ of a scalar function $f(x_1, x_2, x_3, \dots, x_n)$ is defined as the unique vector field whose dot product with any vector $v$ at each point $x$ gives the directional derivative of $f$ along $v$. That is, $\nabla_x f(x) \cdot v = \nabla_v f(x)$.

The directional derivative in direction $v$ (a unit vector) is the slope of the function $f$ in direction $v$, namely the rate of increase of $f$ per unit of distance moved in the direction given by $v$.
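As a quick sanity check of this definition, the sketch below compares a finite-difference estimate of the directional derivative against $\nabla_x f(x) \cdot v$. The toy function $f$, the point $x$, and the direction $v$ are illustrative assumptions, not part of the original note.

```python
# Numerical check that the directional derivative along a unit vector v
# matches the dot product of the gradient with v, for an assumed toy
# function f(x) = x_1^2 + 3*x_2.
import numpy as np

def f(x):
    return x[0] ** 2 + 3 * x[1]

def grad_f(x):
    # Analytic gradient of the toy function above.
    return np.array([2 * x[0], 3.0])

x = np.array([1.0, 2.0])
v = np.array([1.0, 1.0]) / np.sqrt(2.0)  # unit vector, ||v||_2 = 1

h = 1e-6
directional = (f(x + h * v) - f(x)) / h  # finite-difference slope of f along v
print(directional)    # ~3.5355
print(grad_f(x) @ v)  # 5 / sqrt(2) ~ 3.5355, agreeing with the definition
```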

To minimize $f$, we would like to find the direction in which $f$ decreases the fastest. We can do this using the directional derivative: $$\min_{v,\, v^T v = 1} \nabla_x f(x) \cdot v = \min_{v,\, v^T v = 1} \lVert \nabla_x f(x) \rVert_2 \, \lVert v \rVert_2 \cos\theta,$$ where $\theta$ is the angle between $v$ and the gradient. Substituting in $\lVert v \rVert_2 = 1$ and ignoring factors that do not depend on $v$, this simplifies to $\min_v \cos\theta$.

This is minimized when $v$ points in the opposite direction from the gradient. In other words, the gradient points directly uphill, and the negative gradient points directly downhill. We can decrease $f$ by moving in the direction of the negative gradient.
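Written out as an equation, $\cos\theta$ attains its minimum value of $-1$ at $\theta = \pi$, so the steepest-descent unit direction is $$v^* = -\frac{\nabla_x f(x)}{\lVert \nabla_x f(x) \rVert_2}.$$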

Hence we have the gradient descent update $$x' = x - \alpha \nabla_x f(x),$$ where $\alpha$ is the learning rate, a positive scalar determining the size of the step.
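A minimal sketch of this update rule in code, assuming a toy quadratic objective; the function, starting point, learning rate, and iteration count below are illustrative choices, not prescribed by the derivation.

```python
# Gradient descent on an assumed toy quadratic
# f(x) = (x_1 - 3)^2 + (x_2 + 1)^2, whose minimizer is (3, -1).
import numpy as np

def f(x):
    return (x[0] - 3) ** 2 + (x[1] + 1) ** 2

def grad_f(x):
    # Analytic gradient of the toy quadratic.
    return np.array([2 * (x[0] - 3), 2 * (x[1] + 1)])

alpha = 0.1               # learning rate: positive scalar step size
x = np.array([0.0, 0.0])  # arbitrary starting point

for _ in range(100):
    x = x - alpha * grad_f(x)  # the update x' = x - alpha * grad f(x)

print(x)     # approaches the minimizer (3, -1)
print(f(x))  # approaches 0
```

Because each step moves along the negative gradient, $f$ decreases as long as $\alpha$ is small enough; too large a learning rate can overshoot and diverge.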


Updated 2020-11-16

Tags

Data Science