Learn Before
Stochastic Gradient Descent Algorithm
If we choose the mini-batch size to be 1, we get an algorithm called Stochastic Gradient Descent (SGD).
In this case, on every iteration you take a gradient descent step using just a single training example.
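As a rough sketch of that single-example step (using α for the learning rate discussed below, and the superscript notation x⁽ⁱ⁾, y⁽ⁱ⁾ for the i-th training example, neither of which appears in the original text):

$$
w := w - \alpha \, \nabla_w J\!\left(w;\, x^{(i)}, y^{(i)}\right)
$$

where J is the cost computed on that one randomly chosen example.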
The most important property of SGD is that the computation time per step does not grow with the number of training examples. This makes SGD very efficient on large training sets.
The learning rate is a hyperparameter that must be tuned. Unlike the regular parameters of a model (weights such as w and b), which are learned by the algorithm from the training set, hyperparameters are set by the algorithm designer and control how the learning algorithm behaves.
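To make the loop concrete, here is a minimal SGD sketch for linear regression in Python. The function name, the synthetic data, and the specific learning_rate and epochs values are illustrative assumptions, not part of the original material.

```python
# Minimal SGD sketch for linear regression (illustrative only; the names
# sgd_linear_regression, learning_rate, and epochs are assumptions).
import numpy as np

def sgd_linear_regression(X, y, learning_rate=0.01, epochs=10):
    """Fit y ~ X @ w + b with stochastic gradient descent (mini-batch size 1)."""
    n_examples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(epochs):
        # Each step uses a single training example, so its cost does not
        # depend on the total number of examples.
        for i in np.random.permutation(n_examples):
            prediction = X[i] @ w + b
            error = prediction - y[i]          # gradient of the squared-error loss
            w -= learning_rate * error * X[i]  # update weights
            b -= learning_rate * error         # update bias
    return w, b

# Usage on synthetic data
X = np.random.randn(1000, 3)
y = X @ np.array([2.0, -1.0, 0.5]) + 3.0
w, b = sgd_linear_regression(X, y, learning_rate=0.05, epochs=20)
```

Note that each inner-loop update touches only one example, which is why the per-step cost stays constant as the training set grows.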
Tags
Data Science
Related
An Example of Mini-Batches
Mini-Batch Gradient Descent Algorithm
Batch vs Stochastic vs Mini-Batch Gradient Descent
Example Using Mini-Batch Gradient Descent (Learning Rate Decay)
Mini-Batches Size
Which of these statements about mini-batch gradient descent do you agree with?
Why is the best mini-batch size usually not 1 and not m, but instead something in-between?
Suppose your learning algorithm’s cost J, plotted as a function of the number of iterations, looks like the image below:
Stochastic Gradient Descent Algorithm
Loss Gradient over a Mini-batch