Mini-Batch Gradient Descent
Batch Gradient Descent requires processing the entire dataset to complete a single update step. Since Gradient Descent often needs many steps to converge, this makes the procedure slow and inefficient for large datasets.
Mini-Batch Gradient Descent addresses this by sampling small random subsets of the data, called mini-batches, and using each one to estimate the gradient for a single update step. Mini-batch sizes are usually greater than 1 but much smaller than N (the dataset size). Because each step processes only a small fraction of the data, updates are much cheaper and the algorithm runs much faster.
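A minimal sketch of the idea, assuming a simple linear-regression (mean-squared-error) loss and NumPy; the function and variable names here are illustrative, not from the course material:
```python
import numpy as np

def minibatch_gradient_descent(X, y, lr=0.01, batch_size=32, epochs=10):
    """Fit a linear model by estimating the gradient on random mini-batches."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for epoch in range(epochs):
        # Shuffle once per epoch so each pass sees mini-batches in a new random order.
        perm = np.random.permutation(n_samples)
        for start in range(0, n_samples, batch_size):
            idx = perm[start:start + batch_size]
            X_batch, y_batch = X[idx], y[idx]
            # Gradient of the MSE loss estimated from the mini-batch only.
            error = X_batch @ w + b - y_batch
            grad_w = X_batch.T @ error / len(idx)
            grad_b = error.mean()
            # One parameter update per mini-batch, not per full pass over the data.
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

# Example usage on synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 3.0
w, b = minibatch_gradient_descent(X, y, lr=0.1, batch_size=32, epochs=20)
print(w, b)
```
Note that with a batch size of 1 this reduces to Stochastic Gradient Descent, and with a batch size of N it reduces to Batch Gradient Descent; mini-batch sizes in between trade gradient noise against the cost of each update.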
Tags
Data Science
Foundations of Large Language Models Course
Computing Sciences
D2L
Dive into Deep Learning @ D2L
Related
Mini-Batch Gradient Descent
Gradient Descent with Momentum
An overview of gradient descent optimization algorithms
Learning Rate Decay
Gradient Descent
AdaDelta (Deep Learning Optimization Algorithm)
Adam (Deep Learning Optimization Algorithm)
RMSprop (Deep Learning Optimization Algorithm)
AdaGrad (Deep Learning Optimization Algorithm)
Nesterov momentum (Deep Learning Optimization Algorithm)
Challenges with Deep Learning Optimizer Algorithms
Adam optimization algorithm
Difference between Adam and SGD
Logistic regression gradient descent
Derivation of the Gradient Descent Formula
Epoch in Gradient Descent
Batch vs Stochastic vs Mini-Batch Gradient Descent
For logistic regression, the gradient is given by $\frac{\partial}{\partial \theta_j} J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$. Which of these is a correct gradient descent update for logistic regression with a learning rate of $\alpha$?
Suppose you have the following training set, and fit a logistic regression classifier.
Backpropagation
Learn After
An Example of Mini-Batches
Mini-Batch Gradient Descent Algorithm
Batch vs Stochastic vs Mini-Batch Gradient Descent
Example Using Mini-Batch Gradient Descent (Learning Rate Decay)
Mini-Batches Size
Which of these statements about mini-batch gradient descent do you agree with?
Why is the best mini-batch size usually not 1 and not m, but instead something in-between?
Suppose your learning algorithm’s cost J, plotted as a function of the number of iterations, looks like the image below:
Stochastic Gradient Descent Algorithm
Loss Gradient over a Mini-batch