Newton's Method is a second-order optimization algorithm that relies on the second-order Taylor expansion of a multivariate function $$f(\mathbf{x})$$. By taking the expansion $$f(\mathbf{x} + \boldsymbol{\epsilon}) = f(\mathbf{x}) + \boldsymbol{\epsilon}^\top \nabla f(\mathbf{x}) + \frac{1}{2} \boldsymbol{\epsilon}^\top \mathbf{H} \boldsymbol{\epsilon} + \mathcal{O}(\|\boldsymbol{\epsilon}\|^3)$$, where $$\mathbf{H} = \nabla^2 f(\mathbf{x})$$ is the Hessian matrix, and setting the derivative with respect to the update step $$\boldsymbol{\epsilon}$$ to zero ($$\nabla f(\mathbf{x}) + \mathbf{H} \boldsymbol{\epsilon} = 0$$), the algorithm derives the optimal update step as $$\boldsymbol{\epsilon} = -\mathbf{H}^{-1} \nabla f(\mathbf{x})$$. This approach requires computing and inverting the Hessian matrix to directly jump toward the function's minimum.
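
As a minimal sketch of the update $$\boldsymbol{\epsilon} = -\mathbf{H}^{-1} \nabla f(\mathbf{x})$$, the following pure-Python example takes one Newton step on a two-dimensional quadratic. The test function, the helper names (`newton_step`, `grad_f`, `hess_f`), and the closed-form 2x2 solve are assumptions made for illustration; they are not taken from the source.

```python
def newton_step(grad, hess):
    """Return the Newton update -H^{-1} g for a 2x2 Hessian (closed-form solve)."""
    (a, b), (c, d) = hess
    det = a * d - b * c
    if det == 0:
        raise ValueError("Hessian is singular; the Newton step is undefined")
    g0, g1 = grad
    # Solve H * eps = -g using the explicit inverse of a 2x2 matrix.
    eps0 = -(d * g0 - b * g1) / det
    eps1 = -(-c * g0 + a * g1) / det
    return [eps0, eps1]


# Hypothetical test function: f(x, y) = x^2 + 3*y^2 - 2*x, minimized at (1, 0).
def grad_f(x):
    return [2.0 * x[0] - 2.0, 6.0 * x[1]]


def hess_f(x):
    return [[2.0, 0.0], [0.0, 6.0]]


x = [5.0, -4.0]
eps = newton_step(grad_f(x), hess_f(x))
x_new = [x[0] + eps[0], x[1] + eps[1]]
print(x_new)
```

Because the test function is exactly quadratic, the second-order expansion has no remainder and a single step lands on the minimizer. On a general function the step is only as good as the local quadratic approximation.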

Newton's Method

While second-order optimization algorithms, such as Newton's Method, offer the theoretical advantage of using curvature to determine step sizes, they are generally impractical for deep neural networks. The primary limitation is the prohibitive computational cost associated with the Hessian matrix, $$\mathbf{H}$$. For a model with $$d$$ parameters, the Hessian requires storing $$\mathcal{O}(d^2)$$ entries, and computing it via backpropagation is excessively expensive, making the direct application of pure second-order methods infeasible for large-scale deep learning tasks.

Applicability of Second-Order Methods in Deep Learning

The conjugate gradient method is an optimization algorithm that is typically faster than the method of steepest descent, and it avoids the calculation of the inverse Hessian matrix required by Newton's method. Rather than undoing progress made along previous search directions, as steepest descent can, the conjugate gradient method looks for a search direction that is conjugate to the previous line search direction. At iteration $$t$$, the next search direction $$d_t$$ is:

$$d_t = \nabla _\theta f(\theta_t) + \beta _t d_{t-1}$$

where $$\beta _t$$ is a coefficient that controls the direction. Two popular ways to calculate $$\beta _t$$ are the Fletcher-Reeves formula:

$$\beta _t = \frac{\nabla _\theta f(\theta _t)^\top \nabla _\theta f(\theta _t)}{\nabla _\theta f(\theta _{t-1})^\top \nabla _\theta f(\theta _{t-1})}$$

and the Polak-Ribière formula:

$$\beta _t = \frac{(\nabla _\theta f(\theta _t) - \nabla_\theta f(\theta _{t-1}))^\top \nabla _\theta f(\theta _t)}{\nabla _\theta f(\theta _{t-1})^\top \nabla _\theta f(\theta _{t-1})}$$
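
The two coefficient formulas can be sketched in a few lines of pure Python. The helper names (`dot`, `beta_fletcher_reeves`, `beta_polak_ribiere`, `next_direction`) and the sample gradient vectors are hypothetical, chosen only to show how the formulas differ on the same inputs.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def beta_fletcher_reeves(g, g_prev):
    # Ratio of squared gradient norms at the current and previous iterate.
    return dot(g, g) / dot(g_prev, g_prev)


def beta_polak_ribiere(g, g_prev):
    # Uses the change in gradient, (g - g_prev), in the numerator.
    diff = [a - b for a, b in zip(g, g_prev)]
    return dot(diff, g) / dot(g_prev, g_prev)


def next_direction(g, d_prev, beta):
    # d_t = grad f(theta_t) + beta_t * d_{t-1}, following the formula above.
    return [gi + beta * di for gi, di in zip(g, d_prev)]


g_prev = [2.0, 0.0]   # illustrative gradient at iteration t-1
g = [1.0, 1.0]        # illustrative gradient at iteration t
d_prev = g_prev       # first direction taken to be the gradient itself

beta_fr = beta_fletcher_reeves(g, g_prev)
beta_pr = beta_polak_ribiere(g, g_prev)
d_fr = next_direction(g, d_prev, beta_fr)
print(beta_fr, beta_pr, d_fr)
```

For these inputs the Polak-Ribière coefficient is zero, so the search direction resets to the current gradient, while Fletcher-Reeves keeps half of the previous direction.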

Conjugate Gradient Method

First-order optimization algorithms rely solely on the value and gradient of the objective function. In contrast, second-order optimization algorithms also utilize information about the function's curvature, often represented by the Hessian matrix. By accounting for curvature, these methods can automatically adjust the optimization step, providing a way to circumvent the difficulties of manually tuning a learning rate.

University of Michigan - Ann Arbor

Claude

Gradient descent is a fundamental optimization algorithm that leverages gradients to minimize a model's loss function. Because the gradient of a function points in the direction of steepest ascent, moving the model's parameters in the opposite direction iteratively lowers the loss. Each step of such gradient-based optimization algorithms requires calculating the exact gradient of the loss with respect to the parameters.

Gradient Descent

Optimization algorithms that use only the gradient, such as gradient descent, are called first-order optimization algorithms.

First-order Optimization Algorithm

The Hessian matrix is the matrix of second-order partial derivatives of a function with multiple input dimensions. It is defined such that
$H(f)(x)_{i,j} = \frac{\partial^2}{\partial x_i \partial x_j}f(x)$

Hessian Matrix 

Adaptive optimization methods aim to address the notoriously difficult problem of selecting an optimal learning rate, $$\eta$$. Because a learning rate that is too small leads to excessively slow progress and one that is too large causes oscillation or divergence, adaptive methods attempt to determine $$\eta$$ automatically or eliminate the need for manual tuning entirely during the optimization process.

Adaptive Optimization Methods

An objective or scoring function can be the source of an inference failure when it does not assign a higher score to the correct output than to the system output. In that case, the learning algorithm that estimates the score should be improved rather than the search algorithm.

Objective Function

In the context of optimization, the curvature of an objective function refers to the rate at which its gradient changes, conceptually captured by its second-order derivatives. Geometrically, curvature indicates how rapidly the surface of the objective function bends. Understanding this property provides useful intuition for adjusting optimization step sizes: in regions of high curvature where the gradient changes quickly, smaller step sizes help avoid overshooting the optimal solution or diverging; in regions of low curvature, larger step sizes can safely accelerate progress. While computing curvature directly is often too computationally expensive for deep learning, it forms the theoretical foundation for designing advanced adaptive optimization algorithms that automatically adjust their learning rates.

Objective Function Curvature

Goodfellow, I., Bengio, Y., & Courville, A. (2016). $\mathit{Deep \ Learning.}$ MIT Press. Retrieved from [www.deeplearningbook.org](https://www.deeplearningbook.org) 

Deep Learning

Dive into Deep Learning

A helpful website for understanding gradient descent:
https://towardsdatascience.com/gradient-descent-explained-9b953fc0d2c

Gradient Descent Reference

In linear regression, gradient descent iteratively updates the model parameters so as to minimize the cost function. Below is a step-by-step example of how to implement gradient descent.
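
One possible step-by-step sketch, assuming a mean squared error cost and a single input feature, is shown here. The function name `fit_linear_regression`, the synthetic data generated from $$y = 2x + 1$$, the learning rate, and the step count are all illustrative assumptions.

```python
def fit_linear_regression(xs, ys, lr=0.05, steps=2000):
    """Fit y ~ w*x + b by batch gradient descent on the mean squared error."""
    w, b = 0.0, 0.0  # Step 1: initialize the parameters.
    m = len(xs)
    for _ in range(steps):
        # Step 2: prediction errors under the current parameters.
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # Step 3: gradients of J = (1/m) * sum(error^2) w.r.t. w and b.
        grad_w = (2.0 / m) * sum(e * x for e, x in zip(errors, xs))
        grad_b = (2.0 / m) * sum(errors)
        # Step 4: move both parameters against their gradients.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Synthetic, noise-free data from the hypothetical line y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]
w, b = fit_linear_regression(xs, ys)
print(round(w, 4), round(b, 4))
```

On this noise-free data the fitted parameters should approach the generating slope and intercept. With a learning rate that is too large for the data scale, the same loop would diverge instead.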

Linear Regression and Gradient Descent

You can check your derivative computation to verify that your implementation of backpropagation is correct. Consider both the right-hand and the left-hand differences: by taking a two-sided difference, you can numerically verify whether the function $$g(\theta)$$ is a correct implementation of the derivative of $$f$$.

Numerical Approximation of Gradients

This is the technique used to check that our implementation is correct. There are different formulas for gradient checking; one of them is the two-sided form:
          $$\frac {J(\theta + \epsilon) - J(\theta - \epsilon)} {2\epsilon}$$
A common choice for $\epsilon$ is $10^{-7}$. You shouldn't run gradient checking on the whole training data, as it can be slow.
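
A minimal sketch of the two-sided check is given here, applied one coordinate at a time. The cost function `J`, its hand-written gradient `analytic_grad`, and the relative-error measure are hypothetical choices for illustration, not a prescribed procedure.

```python
import math


def numerical_gradient(J, theta, eps=1e-7):
    """Two-sided difference (J(theta+eps) - J(theta-eps)) / (2*eps), per coordinate."""
    grad = []
    for i in range(len(theta)):
        plus = list(theta)
        minus = list(theta)
        plus[i] += eps
        minus[i] -= eps
        grad.append((J(plus) - J(minus)) / (2.0 * eps))
    return grad


# Hypothetical cost and the gradient implementation we want to verify.
def J(theta):
    return theta[0] ** 2 + 3.0 * theta[0] * theta[1]


def analytic_grad(theta):
    return [2.0 * theta[0] + 3.0 * theta[1], 3.0 * theta[0]]


def norm(v):
    return math.sqrt(sum(a * a for a in v))


theta = [1.5, -0.5]
num = numerical_gradient(J, theta)
ana = analytic_grad(theta)
# Relative error: small values suggest the analytic gradient is implemented correctly.
rel_error = norm([n - a for n, a in zip(num, ana)]) / (norm(num) + norm(ana))
print(rel_error)
```

A relative error far above the rounding level would point to a bug in `analytic_grad`. The threshold used to decide what counts as "small" is a judgment call.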

Gradient Checking

https://towardsdatascience.com/gradient-descent-explained-9b953fc0d2c

Gradient Descent Explained

Gradient descent is designed to find optimal points, but these are not necessarily global optima. If the iterates leave the neighborhood of one local optimum they may converge to another optimal point, but this is not very likely.

Consider the following "recliner chair" type of function (image below).

Such a function can be constructed so that there is a range in the middle where the gradient is the zero vector, causing gradient descent to stall there and fail to find the global optimum.


Why Gradient descent might fail?

DeepLearningAI. (2021, March 24). A Chat with Andrew on MLOps: From Model-centric to Data-centric AI. YouTube. https://www.youtube.com/watch?v=06-AZXmwHjo

A Chat with Andrew on MLOps: From Model-centric to Data-centric AI

Sagar, R. (2021, April 6). Big Data To Good Data: Andrew Ng Urges ML Community To Be More Data-Centric And Less Model-Centric. Analytics India Magazine. https://analyticsindiamag.com/big-data-to-good-data-andrew-ng-urges-ml-community-to-be-more-data-centric-and-less-model-centric/

Big Data to Good Data: Andrew Ng Urges ML Community To Be More Data-Centric and Less Model-Centric

Andrew Ng believes that for datasets with fewer than 10,000 data points, machine learning teams can make faster progress by focusing on improving data than improving code.

MLOps: Data-centric and Model-centric approaches

Points where the derivative of a function is 0 are known as critical points, or stationary points.

For functions of several variables, the directional derivative must be zero in every direction at such a point, which is equivalent to the gradient vanishing:

$$\nabla f = 0$$

Critical Points

When the method of steepest descent is applied to a quadratic cost surface, it creates a zig-zag pattern where each line search direction is orthogonal to its previous line search direction. Let $$d_{t-1}$$ be the previous search direction. At the minimum along this line, the directional derivative along $$d_{t-1}$$ is zero: $$\nabla _\theta f(\theta) \cdot d_{t-1} = 0$$. This implies that the new steepest descent direction $$d_t = \nabla _\theta f(\theta)$$ is orthogonal to $$d_{t-1}$$.

Method of Steepest Descent

Unlike first-order methods, second-order gradient methods use second-order derivatives, which describe the curvature of the objective and can therefore improve the choice of update step.
- The most widely used second-order gradient method is Newton's method.

Second-Order Gradient Methods

There are three variants of gradient descent, which differ in how much data we use to compute the gradient of the objective function: Batch Gradient Descent, Stochastic Gradient Descent and Mini-batch Gradient Descent. Depending on the amount of data, we make a trade-off between the accuracy of the parameter update and the time it takes to perform an update. 

Reference: https://ruder.io/optimizing-gradient-descent/

Gradient Descent Variants

Gradient descent methods use the local slope of the surface, which does not necessarily point directly toward the extreme point: the locally steepest direction may differ from the direction of the global optimum.

Notes about gradient descent

Suppose you have built a neural network. You decide to initialize the weights and biases to be zero. Which of the following statements is true?

In a neural network with many time steps or layers, the gradient at an early layer is a product of terms from all the later layers, which makes it inherently unstable. When the gradient becomes very small, the early layers effectively stop updating; this is the vanishing gradient problem. The exploding gradient problem is the opposite: the gradients, and hence the weight updates made by gradient descent, become so large that the whole network becomes unstable, which can lead to numerical overflow.
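
The effect of multiplying many per-layer terms can be illustrated with a deliberately simplified sketch. It assumes every layer contributes the same scalar factor, which real networks do not; the function name `gradient_scale` and the factors 0.5 and 1.5 are hypothetical.

```python
def gradient_scale(per_layer_factor, num_layers):
    """Product of identical per-layer factors, as a toy model of backpropagation."""
    scale = 1.0
    for _ in range(num_layers):
        scale *= per_layer_factor
    return scale


# Factors below 1 shrink the gradient geometrically; factors above 1 blow it up.
vanishing = gradient_scale(0.5, 50)
exploding = gradient_scale(1.5, 50)
print(vanishing, exploding)
```

Even modest deviations from 1 compound quickly over 50 layers, which is why depth makes the problem worse.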

Vanishing/exploding gradient

The training of BERT models follows a standard iterative optimization procedure used for deep neural networks. First, a large collection of training data is gathered. During each iteration, a random batch of these samples is selected, and the cumulative loss, $$\mathrm{Loss}_{\mathrm{BERT}}$$, is computed over the batch. Next, the model's parameters are updated to minimize this loss using an optimization algorithm like gradient descent or one of its variants. This cycle continues until a specific stopping condition is met, such as the convergence of the training loss.

BERT Training Process

If all parameters of a hidden layer are initialized to a constant, such as $$\mathbf{W}^{(1)} = c$$, every hidden unit will receive the same inputs and parameters, producing identical activations during forward propagation. Consequently, during backpropagation, the gradients of the output with respect to the parameters $$\mathbf{W}^{(1)}$$ will all take the exact same value. Because gradient-based algorithms like minibatch stochastic gradient descent update the parameters using these uniform gradients, all elements of $$\mathbf{W}^{(1)}$$ will continue to have identical values after every iteration. The hidden layer will thus behave as if it has only a single unit, failing to realize the network's expressive power.
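
The symmetry argument can be checked numerically with a tiny network. This sketch assumes two inputs, two tanh hidden units, a linear output, a squared-error loss, and plain gradient descent on a single example; all names and values (`train_step`, the constant 0.5, the learning rate) are illustrative and not from the source.

```python
import math


def train_step(W1, w2, x, y, lr=0.1):
    """One gradient descent step on the loss 0.5 * (output - y)^2."""
    # Forward pass: hidden activations and scalar output.
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    out = sum(wj * hj for wj, hj in zip(w2, h))
    # Backward pass.
    d_out = out - y
    grad_w2 = [d_out * hj for hj in h]
    d_z = [d_out * wj * (1.0 - hj ** 2) for wj, hj in zip(w2, h)]
    grad_W1 = [[dz * xi for xi in x] for dz in d_z]
    # Parameter update.
    new_W1 = [[w - lr * g for w, g in zip(row, g_row)]
              for row, g_row in zip(W1, grad_W1)]
    new_w2 = [w - lr * g for w, g in zip(w2, grad_w2)]
    return new_W1, new_w2


# Constant initialization: every weight starts at the same value c = 0.5.
W1 = [[0.5, 0.5], [0.5, 0.5]]   # one row of incoming weights per hidden unit
w2 = [0.5, 0.5]
for _ in range(20):
    W1, w2 = train_step(W1, w2, x=[1.0, -2.0], y=1.0)

# Both hidden units received identical updates, so their weights still match.
units_identical = (W1[0] == W1[1]) and (w2[0] == w2[1])
print(units_identical, W1)
```

The weights do change during training, but the two hidden units remain copies of each other, so the layer behaves like a single unit.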

The Problem with Constant Initialization

Assuming a sufficiently smooth objective function $$f$$ is Lipschitz continuous with constant $$L$$ (meaning that for any $$\mathbf{x}$$ and $$\mathbf{y}$$, the objective satisfies $$|f(\mathbf{x}) - f(\mathbf{y})| \leq L \|\mathbf{x} - \mathbf{y}\|$$), the change in the objective value after a gradient descent update $$\mathbf{x} \gets \mathbf{x} - \eta \mathbf{g}$$ is bounded by the inequality $$|f(\mathbf{x}) - f(\mathbf{x} - \eta\mathbf{g})| \leq L \eta\|\mathbf{g}\|$$. This bound demonstrates that the maximum change in the loss during a single step is constrained by the learning rate $$\eta$$, the gradient norm $$\|\mathbf{g}\|$$, and the Lipschitz constant $$L$$. A small value for this upper bound presents a trade-off: it limits the speed at which the objective value can be reduced, but it advantageously limits how much progress can go wrong or be undone in any single gradient step.

Objective Function Change Bounds in Gradient Descent

One-dimensional gradient descent provides a clear illustration of why moving in the negative gradient direction reduces the objective function. For a continuously differentiable function $$f: \mathbb{R} \rightarrow \mathbb{R}$$, the first-order Taylor expansion gives $$f(x + \epsilon) = f(x) + \epsilon f'(x) + \mathcal{O}(\epsilon^2)$$. Setting the step as $$\epsilon = -\eta f'(x)$$, where $$\eta > 0$$ is a fixed learning rate, yields $$f(x - \eta f'(x)) = f(x) - \eta f'^2(x) + \mathcal{O}(\eta^2 f'^2(x))$$. When the derivative $$f'(x) \neq 0$$, the term $$\eta f'^2(x) > 0$$ guarantees a decrease in $$f$$, provided $$\eta$$ is small enough for the higher-order terms to be negligible. This leads to the update rule $$x \leftarrow x - \eta f'(x)$$, which is applied iteratively from an initial value until a stopping condition is met, such as when the gradient magnitude $$|f'(x)|$$ becomes sufficiently small or a maximum number of iterations is reached.
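
The update rule and both stopping conditions can be sketched as follows. The objective $$f(x) = x^2$$, the starting point, the learning rate, and the function name `gradient_descent_1d` are assumptions chosen for illustration.

```python
def gradient_descent_1d(f_prime, x0, eta, tol=1e-8, max_iters=1000):
    """Iterate x <- x - eta * f'(x) until |f'(x)| < tol or max_iters is reached."""
    x = x0
    steps = 0
    while steps < max_iters and abs(f_prime(x)) >= tol:
        x -= eta * f_prime(x)
        steps += 1
    return x, steps


# Hypothetical objective f(x) = x^2, whose derivative is f'(x) = 2x.
x_min, steps = gradient_descent_1d(lambda x: 2.0 * x, x0=10.0, eta=0.2)
print(x_min, steps)
```

For this objective each step multiplies $$x$$ by $$1 - 2\eta$$, so any $$\eta$$ between 0 and 1 converges, while a larger value would make the iterates grow.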

One-Dimensional Gradient Descent

When the objective function maps a $$d$$-dimensional vector $$\mathbf{x} = [x_1, x_2, \ldots, x_d]^\top$$ to a scalar, i.e., $$f: \mathbb{R}^d \to \mathbb{R}$$, its gradient becomes a vector of $$d$$ partial derivatives:

$$\nabla f(\mathbf{x}) = \left[\frac{\partial f(\mathbf{x})}{\partial x_1}, \frac{\partial f(\mathbf{x})}{\partial x_2}, \ldots, \frac{\partial f(\mathbf{x})}{\partial x_d}\right]^\top$$

Each component $$\partial f(\mathbf{x})/\partial x_i$$ captures the rate at which $$f$$ changes with respect to $$x_i$$ alone. Using the first-order multivariate Taylor expansion,

$$f(\mathbf{x} + \boldsymbol{\epsilon}) = f(\mathbf{x}) + \boldsymbol{\epsilon}^\top \nabla f(\mathbf{x}) + \mathcal{O}(\|\boldsymbol{\epsilon}\|^2)$$

one can show that the steepest-descent direction (up to second-order terms) is the negative gradient $$-\nabla f(\mathbf{x})$$. Choosing a suitable learning rate $$\eta > 0$$ yields the multivariate gradient descent update rule:

$$\mathbf{x} \leftarrow \mathbf{x} - \eta \nabla f(\mathbf{x})$$

This directly generalizes the scalar update $$x \leftarrow x - \eta f'(x)$$ to vector-valued parameters.
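
A minimal sketch of the vector update, assuming the two-dimensional objective $$f(\mathbf{x}) = x_1^2 + 2x_2^2$$ with a hand-written gradient, is given here. The function names, starting point, learning rate, and step count are illustrative.

```python
def f(x):
    # Hypothetical objective f(x) = x1^2 + 2*x2^2, minimized at the origin.
    return x[0] ** 2 + 2.0 * x[1] ** 2


def grad_f(x):
    # Vector of partial derivatives [df/dx1, df/dx2].
    return [2.0 * x[0], 4.0 * x[1]]


def gradient_descent(grad, x0, eta, steps):
    x = list(x0)
    for _ in range(steps):
        g = grad(x)
        # x <- x - eta * grad f(x), applied to every coordinate at once.
        x = [xi - eta * gi for xi, gi in zip(x, g)]
    return x


x0 = [-5.0, -2.0]
x_final = gradient_descent(grad_f, x0, eta=0.1, steps=100)
print(x_final, f(x_final))
```

Each coordinate shrinks at its own rate because the partial derivatives differ in scale, which already hints at why a single learning rate can be awkward for poorly scaled problems.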

Multivariate Gradient Descent

Second-Order Optimization Algorithm

In deep learning, the objective function $$f(\mathbf{x})$$ is typically formulated as the average of the individual loss functions $$f_i(\mathbf{x})$$ across the $$n$$ examples in the training dataset, where $$\mathbf{x}$$ is the parameter vector. This formulation is given by: $$f(\mathbf{x}) = \frac{1}{n} \sum_{i = 1}^n f_i(\mathbf{x}).$$ Consequently, the full gradient of the objective function at $$\mathbf{x}$$ is the average of the gradients for each example: $$\nabla f(\mathbf{x}) = \frac{1}{n} \sum_{i = 1}^n \nabla f_i(\mathbf{x}).$$

Average Objective Function in Deep Learning

Accelerated gradient methods, such as gradient descent with momentum, are a class of optimization algorithms that average over past gradients to obtain more stable directions of descent. They are particularly effective for solving ill-conditioned optimization problems, where the objective function landscape resembles a narrow canyon and progress in certain directions is much slower than in others.
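
The benefit on an ill-conditioned problem can be sketched with a narrow quadratic valley. This example assumes one common formulation of momentum, $$\mathbf{v} \leftarrow \beta \mathbf{v} + \mathbf{g}$$ followed by $$\mathbf{x} \leftarrow \mathbf{x} - \eta \mathbf{v}$$; the source does not fix a particular variant, and the objective, learning rate, and $$\beta$$ are illustrative choices.

```python
def grad(x):
    # Hypothetical ill-conditioned objective f(x) = 0.1*x1^2 + 2*x2^2.
    return [0.2 * x[0], 4.0 * x[1]]


def descend(x0, eta, beta, steps):
    """Gradient descent with momentum; beta = 0 recovers plain gradient descent."""
    x = list(x0)
    v = [0.0 for _ in x0]
    for _ in range(steps):
        g = grad(x)
        # Leaky average of past gradients.
        v = [beta * vi + gi for vi, gi in zip(v, g)]
        x = [xi - eta * vi for xi, vi in zip(x, v)]
    return x


start = [-5.0, -2.0]
plain = descend(start, eta=0.6, beta=0.0, steps=20)
with_momentum = descend(start, eta=0.6, beta=0.5, steps=100)
print(plain, with_momentum)
```

With this learning rate, plain gradient descent overshoots in the steep $$x_2$$ direction and its iterates grow, whereas the averaged direction keeps the same learning rate stable in both coordinates.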

Accelerated Gradient Methods

Assuming that the error function is $$J(w)$$ with one parameter $$w$$, to minimize the error, we can update the weight $$w$$ as follows:

$$w = w - \alpha \cdot \frac{dJ(w)}{dw}$$

where $$\alpha$$ is the learning rate, and $$\frac{dJ(w)}{dw}$$ is the derivative of $$J(w)$$ with respect to $$w$$. If the error function has two or more parameters, for example, a weight $$w$$ and a bias $$b$$, we can update them one by one:

$$w = w - \alpha \cdot \frac{\partial J(w,b)}{\partial w}$$

$$b = b - \alpha \cdot \frac{\partial J(w,b)}{\partial b}$$

where $$\partial$$ denotes the partial derivative.

Batch Gradient Descent Update Formula

Distributed deep learning training is an approach used when a single processor or GPU lacks the computational capacity or memory to process large amounts of training data. By distributing the workload across multiple processors, optimization algorithms like stochastic gradient descent can combine the gradients computed on each device into a single update. For example, training across 1,024 GPUs with a small minibatch size of 32 per GPU results in an aggregate minibatch of 32,768 (roughly 32,000) observations, dramatically accelerating training times for massive neural networks.

Distributed Deep Learning Training

Denote an objective function as $$f(x)$$, with $$g$$ as the gradient and $$H$$ as the Hessian matrix evaluated at an initial point $$x^{(0)}$$. In gradient descent, we calculate the updated point as $$x = x^{(0)} - \epsilon g$$, where $$\epsilon$$ is the step size. Using a second-order Taylor expansion, we obtain the approximation $$f(x^{(0)} - \epsilon g) \approx f(x^{(0)}) - \epsilon g^Tg + \frac{1}{2} \epsilon^2 g^THg$$. According to this equation, when $$g^THg$$ is positive, the optimal step size $$\epsilon^*$$ that minimizes this approximation is $$\epsilon^* = \frac{g^Tg}{g^THg}$$.
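
The formula $$\epsilon^* = \frac{g^Tg}{g^THg}$$ can be checked numerically on a quadratic, where the second-order expansion is exact. The Hessian, starting point, and helper names in this sketch are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(M, v):
    return [dot(row, v) for row in M]


def optimal_step_size(g, H):
    """Return g^T g / g^T H g, valid only when g^T H g is positive."""
    curvature = dot(g, matvec(H, g))
    if curvature <= 0:
        raise ValueError("g^T H g must be positive for a finite optimal step")
    return dot(g, g) / curvature


# Hypothetical quadratic f(x) = 0.5 * x^T H x with a fixed Hessian H.
H = [[2.0, 0.0], [0.0, 10.0]]


def f(x):
    return 0.5 * dot(x, matvec(H, x))


x0 = [1.0, 1.0]
g = matvec(H, x0)                 # gradient of f at x0 is H x0
eps_star = optimal_step_size(g, H)


def f_after_step(eps):
    return f([xi - eps * gi for xi, gi in zip(x0, g)])


print(eps_star, f(x0), f_after_step(eps_star))
```

Steps slightly shorter or longer than `eps_star` give a higher value along the same direction, which is what "optimal" means here. On non-quadratic functions the formula only minimizes the local approximation.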

Optimal Step Size for Gradient Descent via Taylor Expansion

Gradient descent is an algorithm used to find the minimum value of a function. It begins by randomly selecting an initial parameter combination. It then iteratively calculates the gradients and updates the parameters in the direction that reduces the cost function the most, continuing until it converges to a minimum.

Gradient Descent Intuition

The second derivative of a function in a specific direction, represented by a unit vector $$d$$, is given by $$d^T H d$$, where $$H$$ is the Hessian matrix of the function.

Second Derivative in a Specific Direction

When a model has millions of parameters, the full Hessian matrix is computationally expensive to calculate and store. Krylov methods offer an alternative optimization approach by only requiring the product between the Hessian and an arbitrary vector. For a function $$f:\mathbb{R}^n\rightarrow \mathbb{R}$$ with a Hessian $$\mathbf{H}$$ and an arbitrary vector $$v$$, this Hessian-vector product can be evaluated using only gradient operations: $$\mathbf{H}v=\nabla_{\mathbf{x}}[(\nabla_{\mathbf{x}}f(x))^{\top}v]$$.
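
The point that $$\mathbf{H}v$$ needs only gradient evaluations can be illustrated without forming the Hessian. Note that this sketch does not implement the exact identity above, which requires automatic differentiation; it swaps in a central finite difference of the gradient along $$v$$, which approximates the same product. The test function and names are hypothetical.

```python
def grad_f(x):
    # Gradient of the hypothetical function f(x) = x0^2 * x1 + x1^3.
    return [2.0 * x[0] * x[1], x[0] ** 2 + 3.0 * x[1] ** 2]


def hessian_vector_product(grad, x, v, eps=1e-5):
    """Approximate H v as (grad(x + eps*v) - grad(x - eps*v)) / (2*eps)."""
    x_plus = [xi + eps * vi for xi, vi in zip(x, v)]
    x_minus = [xi - eps * vi for xi, vi in zip(x, v)]
    g_plus = grad(x_plus)
    g_minus = grad(x_minus)
    return [(gp - gm) / (2.0 * eps) for gp, gm in zip(g_plus, g_minus)]


x = [1.0, 2.0]
v = [1.0, -1.0]
hv = hessian_vector_product(grad_f, x, v)
# For comparison: the exact Hessian at x is [[2*x1, 2*x0], [2*x0, 6*x1]] = [[4, 2], [2, 12]].
print(hv)
```

Only two gradient evaluations are needed regardless of the dimension, compared with the quadratic storage cost of the full Hessian. An autodiff framework would give the product exactly rather than approximately.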

Hessian-Vector Product Formula

The cross-entropy loss function works well for binary classification models, where the label $$y$$ is 0 or 1 and the predicted probability $$\hat{y}$$ lies between 0 and 1. It is defined as $$-[y \log \hat{y} + (1 - y) \log(1 - \hat{y})]$$. If $$y = 0$$, the first term vanishes and the loss reduces to $$-\log(1 - \hat{y})$$. If $$y = 1$$, the second term vanishes and the loss reduces to $$-\log \hat{y}$$. In both cases the loss is small when the predicted probability is close to the true label.
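
A short sketch of this loss for a single example is given here. The function name `binary_cross_entropy` and the clipping constant `eps`, which guards against taking the logarithm of zero, are assumptions added for illustration.

```python
import math


def binary_cross_entropy(y, y_hat, eps=1e-12):
    """Return -[y*log(y_hat) + (1-y)*log(1-y_hat)] for a label y in {0, 1}."""
    # Clip the prediction away from 0 and 1 so the logarithms stay finite.
    y_hat = min(max(y_hat, eps), 1.0 - eps)
    return -(y * math.log(y_hat) + (1.0 - y) * math.log(1.0 - y_hat))


confident_right = binary_cross_entropy(1, 0.9)   # label 1, prediction near 1
confident_wrong = binary_cross_entropy(1, 0.1)   # label 1, prediction near 0
negative_case = binary_cross_entropy(0, 0.1)     # label 0, prediction near 0
print(confident_right, confident_wrong, negative_case)
```

The loss grows without bound as a confident prediction moves toward the wrong label, which is what pushes the model toward well-calibrated probabilities.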

Cross-entropy loss

To train the parameters $$w$$ and $$b$$ of the logistic regression model, you need to define a cost function.

$$J(w, b) = \frac{1}{m} \sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)})$$

$$=-\frac{1}{m} \sum_{i=1}^{m} [y^{(i)}\log(\hat{y}^{(i)}) + (1 - y^{(i)})\log(1 - \hat{y}^{(i)})]$$

This cost function is convex in $$w$$ and $$b$$.

Logistic Regression Cost Function

A machine learning model is being trained for a prediction task. A key metric, the objective function, is tracked over time. The value of this function represents the magnitude of the model's error. A graph of this process shows the objective function's value consistently decreasing as the number of training iterations increases. What is the most accurate interpretation of this trend?

An engineer is training a model to predict housing prices. After running the training process for several hours, they plot the value of the model's error measurement over time. They observe that the error value remains very high and does not decrease, staying almost flat throughout the entire process. Based on this observation, analyze the effectiveness of the training process and explain what this trend indicates about the model's ability to achieve its primary goal.

Diagnosing Model Training Issues

A machine learning model is designed to predict the price of a product. For a small sample of three products, the model predicts prices of [$55, $90, $125], while the actual prices are [$50, $100, $120]. The model's performance is measured by an objective function defined as the average of the squared differences between the predicted and actual values. First, calculate the value of this objective function for the given sample. Second, explain what a lower value of this function would signify about the model's future predictions.

Calculating and Interpreting a Model's Objective Function

When a primary objective function—such as the error rate in classification—is difficult to optimize directly due to non-differentiability or other mathematical complications, machine learning models instead optimize a surrogate objective. This proxy function is chosen because it is easier to compute gradients for while still aligning with the ultimate goal.

Surrogate Objective

A loss function is a specific type of objective function where the convention is that lower values indicate better model performance. Optimization algorithms actively seek to minimize the loss function to improve the model; any objective where higher is better can be converted into a loss function by simply flipping its sign.

Loss Function

A fundamental requirement for training modern machine learning and deep learning models is the use of differentiable objectives. Because the optimization process typically relies on gradient-based methods, such as minibatch stochastic gradient descent, the objective function (or loss function) must be mathematically differentiable with respect to the model's parameters. This differentiability allows the optimization algorithm to compute gradients, which provide the direction and magnitude of the parameter updates needed to minimize the error and improve the model's predictive performance.

Differentiable Objectives

A convex quadratic objective function is defined by the general mathematical form $$h(\mathbf{x}) = \frac{1}{2} \mathbf{x}^\top \mathbf{Q} \mathbf{x} + \mathbf{x}^\top \mathbf{c} + b$$, where the matrix $$\mathbf{Q}$$ is positive definite ($$\mathbf{Q} \succ 0$$). Because $$\mathbf{Q}$$ possesses strictly positive eigenvalues, this function has a unique global minimizer located at $$\mathbf{x}^* = -\mathbf{Q}^{-1} \mathbf{c}$$. The function can be rewritten by centering it around this minimizer, yielding $$h(\mathbf{x}) = \frac{1}{2} (\mathbf{x} + \mathbf{Q}^{-1} \mathbf{c})^\top \mathbf{Q} (\mathbf{x} + \mathbf{Q}^{-1} \mathbf{c}) + b - \frac{1}{2} \mathbf{c}^\top \mathbf{Q}^{-1} \mathbf{c}$$. Furthermore, its gradient is given by $$\partial_{\mathbf{x}} h(\mathbf{x}) = \mathbf{Q} \mathbf{x} + \mathbf{c} = \mathbf{Q} (\mathbf{x} + \mathbf{Q}^{-1} \mathbf{c})$$, which geometrically represents the displacement of the point $$\mathbf{x}$$ from the minimizer, scaled by the curvature matrix $$\mathbf{Q}$$.
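
As a numerical sanity check, the sketch here evaluates a small instance of $$h$$ and confirms that the gradient $$\mathbf{Q}\mathbf{x} + \mathbf{c}$$ vanishes at $$\mathbf{x}^* = -\mathbf{Q}^{-1}\mathbf{c}$$, with minimum value $$b - \frac{1}{2}\mathbf{c}^\top\mathbf{Q}^{-1}\mathbf{c}$$. The particular $$\mathbf{Q}$$, $$\mathbf{c}$$, $$b$$ and the helper names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(M, v):
    return [dot(row, v) for row in M]


def solve_2x2(M, rhs):
    """Solve M z = rhs for a 2x2 matrix using the closed-form inverse."""
    (m00, m01), (m10, m11) = M
    det = m00 * m11 - m01 * m10
    return [(m11 * rhs[0] - m01 * rhs[1]) / det,
            (-m10 * rhs[0] + m00 * rhs[1]) / det]


# Hypothetical positive definite Q, linear term c, and offset b.
Q = [[3.0, 1.0], [1.0, 2.0]]
c = [1.0, -1.0]
b = 0.5


def h(x):
    return 0.5 * dot(x, matvec(Q, x)) + dot(x, c) + b


def grad_h(x):
    return [qx + ci for qx, ci in zip(matvec(Q, x), c)]


q_inv_c = solve_2x2(Q, c)
x_star = [-z for z in q_inv_c]          # minimizer -Q^{-1} c
min_value = b - 0.5 * dot(c, q_inv_c)   # b - 0.5 * c^T Q^{-1} c
print(x_star, grad_h(x_star), h(x_star), min_value)
```

The gradient at `x_star` is zero up to rounding and `h(x_star)` matches the closed-form minimum value, which is a quick way to confirm the sign conventions.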

Convex Quadratic Objective Function

Identifying an Objective Function Problem

Improving the Search Algorithm

An objective or scoring function can be the source of an inference failure when it does not assign a _____ score to the correct output than to the system output.
