Mini-Batch Gradient Descent
Batch Gradient Descent requires processing the entire dataset to complete a single update step. Since Gradient Descent often needs many steps to converge, this makes the procedure slow and inefficient for large datasets.
Mini-Batch Gradient Descent addresses this by sampling small random subsets of the data, called mini-batches, and using each one to estimate the gradient for a single update step. Mini-batch sizes are usually greater than 1 but much smaller than N (the dataset size). Because each step processes only a small fraction of the data, updates are much cheaper and the algorithm runs much faster.
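A minimal sketch of the idea, assuming a simple linear-regression (mean-squared-error) loss and NumPy; the function and variable names here are illustrative, not from the course material:
```python
import numpy as np

def minibatch_gradient_descent(X, y, lr=0.01, batch_size=32, epochs=10):
    """Fit a linear model by estimating the gradient on random mini-batches."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for epoch in range(epochs):
        # Shuffle once per epoch so each pass sees mini-batches in a new random order.
        perm = np.random.permutation(n_samples)
        for start in range(0, n_samples, batch_size):
            idx = perm[start:start + batch_size]
            X_batch, y_batch = X[idx], y[idx]
            # Gradient of the MSE loss estimated from the mini-batch only.
            error = X_batch @ w + b - y_batch
            grad_w = X_batch.T @ error / len(idx)
            grad_b = error.mean()
            # One parameter update per mini-batch, not per full pass over the data.
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

# Example usage on synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 3.0
w, b = minibatch_gradient_descent(X, y, lr=0.1, batch_size=32, epochs=20)
print(w, b)
```
Note that with a batch size of 1 this reduces to Stochastic Gradient Descent, and with a batch size of N it reduces to Batch Gradient Descent; mini-batch sizes in between trade gradient noise against the cost of each update.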
Tags
Data Science
Foundations of Large Language Models Course
Computing Sciences
D2L
Dive into Deep Learning @ D2L
Related
Mini-Batch Gradient Descent
Gradient Descent with Momentum
An overview of gradient descent optimization algorithms
Learning Rate Decay
Gradient Descent
AdaDelta (Deep Learning Optimization Algorithm)
Adam (Deep Learning Optimization Algorithm)
RMSprop (Deep Learning Optimization Algorithm)
AdaGrad (Deep Learning Optimization Algorithm)
Nesterov momentum (Deep Learning Optimization Algorithm)
Challenges with Deep Learning Optimizer Algorithms
Adam optimization algorithm
Difference between Adam and SGD
Logistic regression gradient descent
Derivation of the Gradient Descent Formula
Epoch in Gradient Descent
Batch vs Stochastic vs Mini-Batch Gradient Descent
For logistic regression, the gradient is given by $\frac{\partial}{\partial \theta_j} J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$. Which of these is a correct gradient descent update for logistic regression with a learning rate of $\alpha$?
Suppose you have the following training set, and fit a logistic regression classifier.
Backpropagation
Learn After
An Example of Mini-Batches
Mini-Batch Gradient Descent Algorithm
Batch vs Stochastic vs Mini-Batch Gradient Descent
Example Using Mini-Batch Gradient Descent (Learning Rate Decay)
Mini-Batches Size
Which of these statements about mini-batch gradient descent do you agree with?
Why is the best mini-batch size usually not 1 and not m, but instead something in-between?
Suppose your learning algorithm’s cost J, plotted as a function of the number of iterations, looks like the image below:
Stochastic Gradient Descent Algorithm
Loss Gradient over a Mini-batch