A helpful website for understanding gradient descent:
https://towardsdatascience.com/gradient-descent-explained-9b953fc0d2c

Gradient Descent Reference

Gradient descent is used to properly update parameters in order to maximize the efficiency of linear regressions. Below is a step-by-step example of how to implement gradient descent.

Linear Regression and Gradient Descent

You can check your derivative computation to make sure that your implementation of back propagation is correct. You want to consider BOTH the right hand side and the left hand side derivatives. By taking a two sided derivative, you can numerically verify whether or not the function g of theta is a correct implementation of the derivative of f.

Numerical Approximation of Gradients

This is the technique used to check that our implementation is correct. There are different formulas for gradient checking and one of those is two-sided form:
          $$\frac {J(\theta + \epsilon) - J(\theta - \epsilon)} {2\epsilon}$$
Common choice for $\epsilon$ is $10^{-7}$. You shouldn't use gradient checking for the whole training data as it can be slow.

Gradient Checking

Assuming that the error function is $J(w)$ with one parameter $w$, to minimize the error, we can update the weight $w$ as follows.
$$w = w - \alpha * \frac{dJ(w)}{dw}$$
, where $\alpha$ is a learning rate, and $\frac{dJ(w)}{dw}$ is the derivative of $J(w)$ with respect to $w$.

If the error function has two or more parameters, for example, a weight $w$ and a bias $b$, we can update them one by one.
$$w = w - \alpha * \frac{\partial J(w,b)}{\partial w}$$
$$b = b - \alpha * \frac{\partial J(w,b)}{\partial b}$$
, where $\partial$ is a stylish cursive $d$, denoting the partial derivatives.

(Batch) Gradient Descent (Deep Learning Optimization Algorithm)

https://towardsdatascience.com/gradient-descent-explained-9b953fc0d2c

Gradient Descent Explained

Gradient Descent is an algorithm which is designed to find the optimal points, but these optimal points are not necessarily global. And yes if it happens that it diverges from a local location it may converge to another optimal point but its probability is not too much.

Consider the following "recliner chair" type of function(image below).

Obviously, this can be constructed so that there is a range in the middle where the gradient is the 0 vector, casuing the fail to find global optima. 


Why Gradient descent might fail?

DeepLearningAI. (2021, March 24). A Chat with Andrew on MLOps: From Model-centric to Data-centric AI. YouTube. https://www.youtube.com/watch?v=06-AZXmwHjo

A Chat with Andrew on MLOps: From Model-centric to Data-centric AI

Sagar, R. (2021, April 6). Big Data To Good Data: Andrew Ng Urges ML Community To Be More Data-Centric And Less Model-Centric. Analytics India Magazine. https://analyticsindiamag.com/big-data-to-good-data-andrew-ng-urges-ml-community-to-be-more-data-centric-and-less-model-centric/

Big Data to Good Data: Andrew Ng Urges ML Community To Be More Data-Centric and Less Model-Centric

Andrew Ng believes that for datasets with fewer than 10,000 data points, machine learning teams can make faster progress by focusing on improving data than improving code.

MLOps: Data-centric and Model-centric approaches

Points where the derivative of a function is 0 are known as critical points, or stationary points.

The derivative also has to be zero according to every possible directional derivative at that point in higher dimensional functions.

$$\triangledown f = 0$$

Critical Points

Optimization algorithms that use only the gradient, such as gradient descent,are called ﬁrst-order optimization algorithms.

First-order Optimization Algorithm

When the method of steepest descent is applied to a quadratic cost surface, it creates a zig-zag pattern that each line search direction is orthogonal to its previous line search direction.
Let $d_{t-1}$ be the previous search direction. At the minimum, we will find that:
$$\nabla _\theta f(\theta) \cdot d_{t-1} = 0$$
Then we know $d_t = \nabla _\theta f(\theta)$ is orthogonal to $d_{t-1}$

Method of Steepest Descent

Unlike first-order methods, second-order gradient methods use second-order derivatives. This improves optimization. 
- The most widely used second-order gradient method is the Newton method

Second-Order Gradient Methods

Gradient descent is an algorithm used to find the minimum value of a function. We will use the gradient descent algorithm to find the cost function. The idea of the algorithm is to randomly select a parameter combination at the beginning, calculate, and then find the next parameter combination that can reduce the cost function value the most, and continue to a minimum value.

https://machinelearningmastery.com/gradient-descent-for-machine-learning/

Gradient Descent Explanation

There are three variants of gradient descent, which differ in how much data we use to compute the gradient of the objective function: Batch Gradient Descent, Stochastic Gradient Descent and Mini-batch Gradient Descent. Depending on the amount of data, we make a trade-off between the accuracy of the parameter update and the time it takes to perform an update. 

Reference: https://ruder.io/optimizing-gradient-descent/

Gradient Descent Variants

Gradient descent methods use the slope of the surface. This will not necessarily point directly towards the extreme point. Local steepest direction may not be the same with the global optimum direction. 

Notes about gradient descent

Suppose you have built a neural network. You decide to initialize the weights and biases to be zero. Which of the following statements is true?

In a neural network with many time steps or layers, a gradient at the early layer is the product of all the terms from the later layers, which leads to an inherently unstable situation. Especially when the value of gradient has become so small, it no longer updates properly or is vanished eventually. Exploding gradient can be considered as the opposite of vanishing process. The updated weights using gradient descent become so large that they cause the whole network to become unstable, which leads to numerical overflow.

Vanishing/exploding gradient

The training of BERT models follows a standard iterative optimization procedure used for deep neural networks. First, a large collection of training data is gathered. During each iteration, a random batch of these samples is selected, and the cumulative loss, $$\mathrm{Loss}_{\mathrm{BERT}}$$, is computed over the batch. Next, the model's parameters are updated to minimize this loss using an optimization algorithm like gradient descent or one of its variants. This cycle continues until a specific stopping condition is met, such as the convergence of the training loss.

BERT Training Process

An objective or scoring function can be the source of an inference failure when it does not assign a higher score to the correct output than to the system output. In that case, the learning algorithm that estimates the score should be improved rather than the search algorithm.

Objective Function

Distributed training is an approach used when a single processor or GPU lacks the computational capacity or memory to process large amounts of training data. By distributing the workload across multiple processors, optimization algorithms like stochastic gradient descent can aggregate computations. For example, training across $$1,024$$ GPUs with a small minibatch size of $$32$$ per GPU results in an aggregate minibatch of $$32,000$$ observations, dramatically accelerating training times for massive neural networks.

Distributed Training

If all parameters of a hidden layer are initialized to a constant, such as $$\mathbf{W}^{(1)} = c$$, every hidden unit will receive the same inputs and parameters, producing identical activations during forward propagation. Consequently, during backpropagation, the gradients of the output with respect to the parameters $$\mathbf{W}^{(1)}$$ will all take the exact same value. Because gradient-based algorithms like minibatch stochastic gradient descent update the parameters using these uniform gradients, all elements of $$\mathbf{W}^{(1)}$$ will continue to have identical values after every iteration. The hidden layer will thus behave as if it has only a single unit, failing to realize the network's expressive power.

The Problem with Constant Initialization

Assuming a sufficiently smooth objective function $$f$$ is Lipschitz continuous with constant $$L$$ (meaning that for any $$\mathbf{x}$$ and $$\mathbf{y}$$, the objective satisfies $$|f(\mathbf{x}) - f(\mathbf{y})| \leq L \|\mathbf{x} - \mathbf{y}\|$$), the change in the objective value after a gradient descent update $$\mathbf{x} \gets \mathbf{x} - \eta \mathbf{g}$$ is bounded by the inequality $$|f(\mathbf{x}) - f(\mathbf{x} - \eta\mathbf{g})| \leq L \eta\|\mathbf{g}\|$$. This bound demonstrates that the maximum change in the loss during a single step is constrained by the learning rate $$\eta$$, the gradient norm $$\|\mathbf{g}\|$$ , and the Lipschitz constant $$L$$. A small value for this upper bound presents a trade-off: it limits the speed at which the objective value can be reduced, but it advantageously limits how much progress can go wrong or be undone in any single gradient step.

Objective Function Change Bounds in Gradient Descent

One-dimensional gradient descent provides a clear illustration of why moving in the negative gradient direction reduces the objective function. For a continuously differentiable function $$f: \mathbb{R} ightarrow \mathbb{R}$$, the first-order Taylor expansion gives $$f(x + \epsilon) = f(x) + \epsilon f'(x) + \mathcal{O}(\epsilon^2)$$. Setting the step as $$\epsilon = -\eta f'(x)$$, where $$\eta > 0$$ is a fixed learning rate, yields $$f(x - \eta f'(x)) = f(x) - \eta f'^2(x) + \mathcal{O}(\eta^2 f'^2(x))$$. When the derivative $$f'(x) 
eq 0$$, the term $$\eta f'^2(x) > 0$$ guarantees a decrease in $$f$$, provided $$\eta$$ is small enough for the higher-order terms to be negligible. This leads to the update rule $$x \leftarrow x - \eta f'(x)$$, which is applied iteratively from an initial value until a stopping condition is met, such as when the gradient magnitude $$|f'(x)|$$ becomes sufficiently small or a maximum number of iterations is reached.

One-Dimensional Gradient Descent

When the objective function maps a $$d$$-dimensional vector $$\mathbf{x} = [x_1, x_2, \ldots, x_d]^	op$$ to a scalar, i.e., $$f: \mathbb{R}^d 	o \mathbb{R}$$, its gradient becomes a vector of $$d$$ partial derivatives:

$$
abla f(\mathbf{x}) = \left[\frac{\partial f(\mathbf{x})}{\partial x_1}, \frac{\partial f(\mathbf{x})}{\partial x_2}, \ldots, \frac{\partial f(\mathbf{x})}{\partial x_d}ight]^	op$$

Each component $$\partial f(\mathbf{x})/\partial x_i$$ captures the rate at which $$f$$ changes with respect to $$x_i$$ alone. Using the first-order multivariate Taylor expansion,

$$f(\mathbf{x} + \boldsymbol{\epsilon}) = f(\mathbf{x}) + \boldsymbol{\epsilon}^	op 
abla f(\mathbf{x}) + \mathcal{O}(\|\boldsymbol{\epsilon}\|^2)$$

one can show that the steepest-descent direction (up to second-order terms) is the negative gradient $$-
abla f(\mathbf{x})$$. Choosing a suitable learning rate $$\eta > 0$$ yields the multivariate gradient descent update rule:

$$\mathbf{x} \leftarrow \mathbf{x} - \eta 
abla f(\mathbf{x})$$

This directly generalizes the scalar update $$x \leftarrow x - \eta f'(x)$$ to vector-valued parameters.

Multivariate Gradient Descent

First-order optimization algorithms rely solely on the value and gradient of the objective function. In contrast, second-order optimization algorithms also utilize information about the function's curvature, often represented by the Hessian matrix. By accounting for curvature, these methods can automatically adjust the optimization step, providing a way to circumvent the difficulties of manually tuning a learning rate.

Second-Order Optimization Algorithm

In deep learning, the objective function $$f(\mathbf{x})$$ is typically formulated as the average of the individual loss functions $$f_i(\mathbf{x})$$ across the $$n$$ examples in the training dataset, where $$\mathbf{x}$$ is the parameter vector. This formulation is given by: $$f(\mathbf{x}) = \frac{1}{n} \sum_{i = 1}^n f_i(\mathbf{x}).$$ Consequently, the full gradient of the objective function at $$\mathbf{x}$$ is the average of the gradients for each example: $$
abla f(\mathbf{x}) = \frac{1}{n} \sum_{i = 1}^n 
abla f_i(\mathbf{x}).$$

Average Objective Function in Deep Learning

Accelerated gradient methods, such as gradient descent with momentum, are a class of optimization algorithms that average over past gradients to obtain more stable directions of descent. They are particularly effective for solving ill-conditioned optimization problems, where the objective function landscape resembles a narrow canyon and progress in certain directions is much slower than in others.

Accelerated Gradient Methods

Gradient descent is a fundamental optimization algorithm that leverages gradients to minimize a model's loss function. Because the gradient of a function points in the direction of steepest ascent, moving the model's parameters in the opposite direction iteratively lowers the loss. Each step of such gradient-based optimization algorithms requires calculating the exact gradient of the loss with respect to the parameters.

University of Michigan - Ann Arbor

Claude

When a neural network trains, it uses an algorithm to determine the optimal weights for the model, called an optimizer. 
There are several optimizer algorithms, such as:
- Gradient descent
- Mini-batch gradient descent
- Gradient descent with momentum
- RMSprop
- Adam optimization algorithm
- Nesterov momentum
- AdaGrad

Deep Learning Optimizer Algorithms

The derivative of a function $$f: \mathbb{R} ightarrow \mathbb{R}$$ at a point $$x$$ represents the rate of change in the function with respect to changes in its arguments. Formally, it is defined as the limit of the ratio between a perturbation $$h$$ and the change in the function value as $$h$$ approaches zero: $$f'(x) = \lim_{h ightarrow 0} \frac{f(x+h) - f(x)}{h}$$. The derivative essentially measures how rapidly a function's value would increase or decrease given an infinitesimally small adjustment to its parameter.

Derivative of a Scalar Function

A helpful website that introduces neural networks:
https://missinglink.ai/guides/neural-network-concepts/

Neural Network Reference

Goodfellow, I., Bengio, Y., & Courville, A. (2016). $\mathit{Deep \ Learning.}$ MIT Press. Retrieved from [www.deeplearningbook.org](https://www.deeplearningbook.org) 

Deep Learning

Dive into Deep Learning

Batch Gradient Descent requires the entire dataset to be processed to complete one step. Gradient Descent can require many steps in some cases which causes this procedure to be very slow and inefficient for large datasets.

Mini-Batch Gradient Descent solves this problem by taking small groups of random data points called mini-batches and using them to estimate each step.  The mini-batches are usually a size greater than 1 but less than N (dataset size). The smaller steps allow for a much faster algorithm.

Mini-Batch Gradient Descent

The basic idea of gradient descent with momentum is to compute an exponentially weighted average of your gradients, and then use that gradient to update your weights instead. It almost always works faster than the standard gradient descent algorithm.


Gradient Descent with Momentum

Here is a very helpful article on different types of optimizer algorithms
https://ruder.io/optimizing-gradient-descent/index.html

An overview of gradient descent optimization algorithms

Learning rate decay is the gradual reduction of the learning rate as a function of time to speed up the learning algorithm. Decaying the learning rate as the gradient descent approaches completion reduces noise and facilitates a tighter convergence to a target.

Learning Rate Decay

Gradient Descent

Adam stands for adaptive moment estimation.
It combines gradient descent with momentum, and RMSProp. It brings the benefits from both sides - adaptive learning rate and faster convergence with momentum.

Adam (Deep Learning Optimization Algorithm)

- Stands for Root Mean Square Propagation
- RMSProp is an optimization algorithm closely related to AdaGrad, as both employ the square of the gradient to scale the update coefficients on a per-coordinate basis. However, RMSProp overcomes AdaGrad's tendency for radically diminishing learning rates by using a leaky (exponentially weighted) average of squared gradients rather than a cumulative sum.
- RMSProp also shares the leaky averaging mechanism with the momentum method, but applies it differently: whereas momentum uses leaky averaging to smooth the gradient direction, RMSProp uses the technique to adjust the coefficient-wise preconditioner that rescales the learning rate independently for each parameter.
- Because RMSProp does not automatically schedule the learning rate (unlike AdaGrad, whose learning rate decays implicitly through accumulation), the learning rate must be explicitly scheduled by the practitioner in practice.
- The decay coefficient $$\gamma$$ governs how long the gradient history is retained when adjusting the per-coordinate scale: a larger $$\gamma$$ produces a longer memory, while a smaller $$\gamma$$ makes the algorithm more responsive to recent gradients.

RMSprop (Deep Learning Optimization Algorithm)

In the momentum method we basically first moved our weight in the direction of the current gradient and then moved in the direction  of momentum (weighted sum of all previous steps). Now in the new method we first move in the direction of the momentum and then calculate the gradient at the new point. Using this gradient we move in the direction of the new gradient. 

The update rules are as follows:
$$v \leftarrow \alpha v - \epsilon \nabla_{\theta} [\frac{1}{m} \sum^{m}_{i=1} L(f(x^{(i)};\theta + \alpha v), y^{(i)})]$$
$$\theta \leftarrow \theta + v$$

Nesterov momentum (Deep Learning Optimization Algorithm)

- **Local optima**: it's actually unlikely to get stuck in local optima.
- **Cliffs**: on the face of an extremely steep cliﬀ structure, the
gradient update step can move the parameters extremely far
- **Inexact Gradients**: sometimes approximation is needed for gradients
- **Plateaus**: low cost function slope (close to flat) makes learning slow.

Challenges with Deep Learning Optimizer Algorithms

Adam stands for: adaptive moment estimation. Briefly, this method combines momentum and RMSprop (root mean squared prop).
Like momentum alone, RMSprop smooths the gradient, (it takes RMSProp and applies momentum to the rescaled gradients). This alternative approach is best explained mathematically:

Adam introduces four hyperparameters:
- learning rate alpha
- beta from momentum (usually 0.9)
- beta2 from RMSprop (usually 0.999)
- epsilon (usually 1e-8)

As mentioned above, you usually do not need to tune beta, beta2, and epsilon as the values listed above will generally work well. Only the learning rate is left to tune in order to accelerate training.


Adam combines the advantages of AdaGrad and RMSProp these two optimization algorithms. It comprehensively considers the first moment estimation of the gradient (First Moment Estimation, the mean value of the gradient) and the second moment estimation (Second Moment Estimation, the uncentered variance of the gradient), and calculates the update step size.

Adam optimization algorithm


Adam is different to classical stochastic gradient descent (SGD). SGD maintains a single learning rate (alpha) for all weight updates and the learning rate does not change during training. Adam combines the advantages of AdaGrad and RMSProp. It not only adapts the parameter learning rates based on the average first moment (the mean) as in RMAProp, but also makes use of the average of the second moments of the gradients (the uncentered variance).

Difference between Adam and SGD

The Adagrad optimization algorithm addresses the difficulty of tuning learning rates for sparse features by replacing simple feature occurrence counters with an aggregate of the squares of previously observed gradients. Specifically, it uses $$s(i, t+1) = s(i, t) + \left(\partial_i f(\mathbf{x})\right)^2$$ to adjust the learning rate. This automatically scales down the step size significantly for coordinates that frequently have large gradients, while applying a gentler treatment to coordinates with small gradients, thereby eliminating the need to manually decide when a gradient is considered large enough.

Adagrad

Adadelta is an optimization algorithm that has no explicit learning rate parameter. Instead, it uses the rate of change in the parameters themselves to dynamically adapt the learning rate. To accomplish this, the algorithm utilizes two specific state variables: $$\mathbf{s}_t$$ to track a leaky average of the second moment of the gradient, and $$\Delta\mathbf{x}_t$$ to track a leaky average of the second moment of the model's parameter changes. The algorithm retains standard naming conventions for these variables to maintain consistency with similar optimization methods like momentum, AdaGrad, and RMSProp.

Adadelta

On a straight line, the function's derivative...

https://www.wired.com/2015/04/crash-course-derivatives/

A crash course of derivatives

A derivative of a derivative is known as second derivative. Second derivative is often used to measure curvature.

Second Derivative

Hessian matrix is defined as a matrix containing second derivatives of function having multiple input dimensions. It's defined such that
$H(f)(x)_{i,j} = \frac{\partial^2}{\partial x_i \partial x_j}f(x)$

Hessian Matrix 

A Lipschitz continuous function is a functionfwhose rateof change is bounded by a Lipschitz constant $L$:
$L = \forall x \forall y. |f(x)-f(y)| \leq L ||x-y||_2$

Lipschitz Continuous

Functions that are composed of other differentiable functions can be differentiated using systematic rules. For differentiable functions $$f(x)$$ and $$g(x)$$ and a constant $$C$$, the primary rules are:
- Constant multiple rule: $$\frac{d}{dx} [C f(x)] = C \frac{d}{dx} f(x)$$
- Sum rule: $$\frac{d}{dx} [f(x) + g(x)] = \frac{d}{dx} f(x) + \frac{d}{dx} g(x)$$
- Product rule: $$\frac{d}{dx} [f(x) g(x)] = f(x) \frac{d}{dx} g(x) + g(x) \frac{d}{dx} f(x)$$
- Quotient rule: $$\frac{d}{dx} \frac{f(x)}{g(x)} = \frac{g(x) \frac{d}{dx} f(x) - f(x) \frac{d}{dx} g(x)}{g^2(x)}$$

Differentiation Rules

Several fundamental functions have well-known derivatives that serve as building blocks for more complex calculus operations. For any constant $$C$$ and any power $$n 
eq 0$$, the standard derivatives are:
- Constant: $$\frac{d}{dx} C = 0$$
- Power: $$\frac{d}{dx} x^n = n x^{n-1}$$
- Exponential: $$\frac{d}{dx} e^x = e^x$$
- Natural Logarithm: $$\frac{d}{dx} \ln x = x^{-1}$$

Derivatives of Common Functions

For functions of a single variable, the chain rule is used to compute the derivative of deeply nested functions. Suppose that $$y = f(g(x))$$ is a composite function, and that the underlying functions $$y = f(u)$$ and $$u = g(x)$$ are both differentiable. The chain rule states that the derivative of $$y$$ with respect to $$x$$ is the product of the derivative of $$y$$ with respect to $$u$$ and the derivative of $$u$$ with respect to $$x$$:

$$\frac{dy}{dx} = \frac{dy}{du} \frac{du}{dx}$$

Chain Rule for Single-Variable Functions

When a multivariate function outputs a vector rather than a scalar, the most natural representation of its derivative is the Jacobian matrix. Specifically, the derivative of a vector $$\mathbf{y}$$ with respect to an input vector $$\mathbf{x}$$ is a matrix that compiles the partial derivatives of each individual component of $$\mathbf{y}$$ with respect to each component of $$\mathbf{x}$$.

Jacobian Matrix

The partial derivative of a multivariate function $$y = f(x_1, x_2, \ldots, x_n)$$ with respect to its $$i^	extrm{th}$$ parameter $$x_i$$ measures how the function changes as $$x_i$$ increases while treating all other variables as constants. Formally, it is defined as the limit: $$\frac{\partial y}{\partial x_i} = \lim_{h ightarrow 0} \frac{f(x_1, \ldots, x_{i-1}, x_i+h, x_{i+1}, \ldots, x_n) - f(x_1, \ldots, x_i, \ldots, x_n)}{h}$$.

Partial Derivative

The gradient of a scalar-valued function with respect to a vector $$\mathbf{x}$$ is itself a vector-valued output that has the identical shape as the input vector $$\mathbf{x}$$. This property ensures that each element of the gradient vector directly corresponds to the partial derivative of the scalar output with respect to the matching element in the input vector.

Gradient of a Scalar-Valued Function with Respect to a Vector

Denote the function as $f(x)$, $g$ is the gradient and $H$ is is the Hessian at $x^{(0)}$. We calculate the new point $x = x^{(0)} - \epsilon g$. We can obtian that $f(x^{(0)} - \epsilon g) \approx f(x^{(0)}) - \epsilon g^Tg + \frac{1}{2} \epsilon^2 g^THg$ According to the above equation, the optimal step size when $g^THg$ is positive is $\epsilon^* = \frac{g^Tg}{g^THg}$

Learn Before

Related

Learn After