(Batch) Gradient Descent (Deep Learning Optimization Algorithm)

Assume that the error function is $J(w)$ with one parameter $w$. To minimize the error, we can update the weight $w$ as follows:

$w = w - \alpha * \frac{dJ(w)}{dw}$

where $\alpha$ is the learning rate, and $\frac{dJ(w)}{dw}$ is the derivative of $J(w)$ with respect to $w$.
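As a quick sketch of this rule in code, the loop below minimizes a made-up error function $J(w) = (w - 3)^2$ (a hypothetical example, not from this card), whose derivative is $\frac{dJ(w)}{dw} = 2(w - 3)$:

```python
# Minimal gradient descent sketch for a hypothetical error function
# J(w) = (w - 3)**2, whose derivative is dJ/dw = 2 * (w - 3).

def dJ_dw(w):
    """Derivative of the example error function J(w) = (w - 3)**2."""
    return 2.0 * (w - 3.0)

w = 0.0      # arbitrary initial weight
alpha = 0.1  # learning rate

for step in range(100):
    w = w - alpha * dJ_dw(w)  # w = w - alpha * dJ(w)/dw

print(w)  # converges toward the minimum at w = 3
```

With $\alpha = 0.1$, each step shrinks the distance to the minimum by a factor of $1 - 2\alpha = 0.8$, so $w$ steadily approaches 3.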

If the error function has two or more parameters, for example a weight $w$ and a bias $b$, we update each of them with its own partial derivative:

$w = w - \alpha * \frac{\partial J(w,b)}{\partial w}$

$b = b - \alpha * \frac{\partial J(w,b)}{\partial b}$

where $\partial$ is a stylized cursive $d$, denoting a partial derivative.
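The two-parameter rule can be sketched the same way. Below is a small, illustrative batch gradient descent loop for a linear model $\hat{y} = w x + b$ with mean-squared error on toy data (all values are made up); it is the "batch" variant because each step averages the gradient over the entire dataset:

```python
import numpy as np

# Toy dataset generated by y = 2x + 1 (illustrative values only).
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])

w, b = 0.0, 0.0  # initial parameters
alpha = 0.05     # learning rate

for step in range(2000):
    error = (w * x + b) - y           # prediction error over the full batch
    dJ_dw = 2.0 * np.mean(error * x)  # dJ/dw of the mean squared error
    dJ_db = 2.0 * np.mean(error)      # dJ/db of the mean squared error
    w = w - alpha * dJ_dw
    b = b - alpha * dJ_db

print(w, b)  # approaches w = 2, b = 1
```

Note that both partial derivatives are evaluated at the current $(w, b)$ before either parameter is updated.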


Updated 2021-11-19

Tags

Data Science