
Stochastic Gradient Descent Algorithm

If we choose the mini-batch size to be 1, we get the algorithm called Stochastic Gradient Descent (SGD).

In this case, on every iteration you take a gradient descent step using just a single training example:

$w = w - \alpha \nabla_w J(x^i, y^i; w)$
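A minimal sketch of this update loop in Python/NumPy. The model here (linear regression with a squared-error loss), the synthetic data, and the learning rate are illustrative assumptions, not from the original; only the update rule itself follows the formula above.

```python
import numpy as np

# Illustrative setup (assumption): linear model with squared-error loss
# J(x, y; w) = (w·x - y)^2 / 2, so ∇_w J = (w·x - y) x.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))            # 1000 training examples, 3 features
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=1000)

w = np.zeros(3)                           # parameters learned from data
alpha = 0.01                              # learning rate (a hyperparameter)

for step in range(10_000):
    i = rng.integers(len(X))              # pick ONE training example at random
    x_i, y_i = X[i], y[i]
    grad = (w @ x_i - y_i) * x_i          # ∇_w J(x_i, y_i; w) for this single example
    w = w - alpha * grad                  # the SGD update from the formula above

print(w)                                  # should end up close to true_w
```

Note that each step touches only one example, regardless of how many examples the training set contains.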

The most important property of SGD is that computation time per step does not grow with the number of examples. This makes SGD very efficient with large training sets.

The learning rate $\alpha$ is a hyperparameter that must be tuned. Unlike the regular parameters of a model (weights such as w and b), which are learned by the algorithm from the training set, hyperparameters are chosen by the algorithm designer and control how the learning algorithm behaves.

Updated 2021-10-03

Tags

Data Science