A small startup is developing a model to classify customer reviews as positive or negative. Their dataset contains 50,000 reviews, and the model architecture is simple enough to be trained on a single high-end graphics card in under three hours. The lead engineer decides to spend a week configuring a system to train the model across ten separate machines, arguing it will accelerate the process. Evaluate the engineer's decision. Is this an effective use of resources? Justify your answer by explaining the primary trade-offs involved in using a multi-processor approach.

Evaluating a Training Strategy

A research team is training a language model with hundreds of billions of parameters on a dataset that is several terabytes in size. They find that training on their most powerful single processing unit would take several years to complete. Which statement best analyzes the core motivation for implementing a distributed training strategy in this scenario?

Match each distributed training scenario with the primary challenge it is designed to address.

Although sequence parallelism is primarily focused on handling long sequence modeling, much of its fundamental motivation stems from the distributed training methods used for deep networks. Because of this shared foundation, the implementation of sequence parallelism can often be built upon the same parallel processing libraries that were designed for distributed training.

Motivation for Sequence Parallelism

Distributed training is an approach used when a single processor or GPU lacks the computational capacity or memory to process large amounts of training data. By distributing the workload across multiple processors, optimization algorithms like stochastic gradient descent can aggregate computations. For example, training across $$1,024$$ GPUs with a small minibatch size of $$32$$ per GPU results in an aggregate minibatch of $$32,000$$ observations, dramatically accelerating training times for massive neural networks.

Google

Claude

Gradient descent is a fundamental optimization algorithm that leverages gradients to minimize a model's loss function. Because the gradient of a function points in the direction of steepest ascent, moving the model's parameters in the opposite direction iteratively lowers the loss. Each step of such gradient-based optimization algorithms requires calculating the exact gradient of the loss with respect to the parameters.

Gradient Descent

Reference of Foundations of Large Language Models Course

Dive into Deep Learning

A helpful website for understanding gradient descent:
https://towardsdatascience.com/gradient-descent-explained-9b953fc0d2c

Gradient Descent Reference

Gradient descent is used to properly update parameters in order to maximize the efficiency of linear regressions. Below is a step-by-step example of how to implement gradient descent.

Linear Regression and Gradient Descent

You can check your derivative computation to make sure that your implementation of back propagation is correct. You want to consider BOTH the right hand side and the left hand side derivatives. By taking a two sided derivative, you can numerically verify whether or not the function g of theta is a correct implementation of the derivative of f.

Numerical Approximation of Gradients

This is the technique used to check that our implementation is correct. There are different formulas for gradient checking and one of those is two-sided form:
          $$\frac {J(\theta + \epsilon) - J(\theta - \epsilon)} {2\epsilon}$$
Common choice for $\epsilon$ is $10^{-7}$. You shouldn't use gradient checking for the whole training data as it can be slow.

Gradient Checking

Assuming that the error function is $J(w)$ with one parameter $w$, to minimize the error, we can update the weight $w$ as follows.
$$w = w - \alpha * \frac{dJ(w)}{dw}$$
, where $\alpha$ is a learning rate, and $\frac{dJ(w)}{dw}$ is the derivative of $J(w)$ with respect to $w$.

If the error function has two or more parameters, for example, a weight $w$ and a bias $b$, we can update them one by one.
$$w = w - \alpha * \frac{\partial J(w,b)}{\partial w}$$
$$b = b - \alpha * \frac{\partial J(w,b)}{\partial b}$$
, where $\partial$ is a stylish cursive $d$, denoting the partial derivatives.

(Batch) Gradient Descent (Deep Learning Optimization Algorithm)

https://towardsdatascience.com/gradient-descent-explained-9b953fc0d2c

Gradient Descent Explained

Gradient Descent is an algorithm which is designed to find the optimal points, but these optimal points are not necessarily global. And yes if it happens that it diverges from a local location it may converge to another optimal point but its probability is not too much.

Consider the following "recliner chair" type of function(image below).

Obviously, this can be constructed so that there is a range in the middle where the gradient is the 0 vector, casuing the fail to find global optima. 


Why Gradient descent might fail?

DeepLearningAI. (2021, March 24). A Chat with Andrew on MLOps: From Model-centric to Data-centric AI. YouTube. https://www.youtube.com/watch?v=06-AZXmwHjo

A Chat with Andrew on MLOps: From Model-centric to Data-centric AI

Sagar, R. (2021, April 6). Big Data To Good Data: Andrew Ng Urges ML Community To Be More Data-Centric And Less Model-Centric. Analytics India Magazine. https://analyticsindiamag.com/big-data-to-good-data-andrew-ng-urges-ml-community-to-be-more-data-centric-and-less-model-centric/

Big Data to Good Data: Andrew Ng Urges ML Community To Be More Data-Centric and Less Model-Centric

Andrew Ng believes that for datasets with fewer than 10,000 data points, machine learning teams can make faster progress by focusing on improving data than improving code.

MLOps: Data-centric and Model-centric approaches

Points where the derivative of a function is 0 are known as critical points, or stationary points.

The derivative also has to be zero according to every possible directional derivative at that point in higher dimensional functions.

$$\triangledown f = 0$$

Critical Points

Optimization algorithms that use only the gradient, such as gradient descent,are called ﬁrst-order optimization algorithms.

First-order Optimization Algorithm

When the method of steepest descent is applied to a quadratic cost surface, it creates a zig-zag pattern that each line search direction is orthogonal to its previous line search direction.
Let $d_{t-1}$ be the previous search direction. At the minimum, we will find that:
$$\nabla _\theta f(\theta) \cdot d_{t-1} = 0$$
Then we know $d_t = \nabla _\theta f(\theta)$ is orthogonal to $d_{t-1}$

Method of Steepest Descent

Unlike first-order methods, second-order gradient methods use second-order derivatives. This improves optimization. 
- The most widely used second-order gradient method is the Newton method

Second-Order Gradient Methods

Gradient descent is an algorithm used to find the minimum value of a function. We will use the gradient descent algorithm to find the cost function. The idea of the algorithm is to randomly select a parameter combination at the beginning, calculate, and then find the next parameter combination that can reduce the cost function value the most, and continue to a minimum value.

https://machinelearningmastery.com/gradient-descent-for-machine-learning/

Gradient Descent Explanation

There are three variants of gradient descent, which differ in how much data we use to compute the gradient of the objective function: Batch Gradient Descent, Stochastic Gradient Descent and Mini-batch Gradient Descent. Depending on the amount of data, we make a trade-off between the accuracy of the parameter update and the time it takes to perform an update. 

Reference: https://ruder.io/optimizing-gradient-descent/

Gradient Descent Variants

Gradient descent methods use the slope of the surface. This will not necessarily point directly towards the extreme point. Local steepest direction may not be the same with the global optimum direction. 

Notes about gradient descent

Suppose you have built a neural network. You decide to initialize the weights and biases to be zero. Which of the following statements is true?

In a neural network with many time steps or layers, a gradient at the early layer is the product of all the terms from the later layers, which leads to an inherently unstable situation. Especially when the value of gradient has become so small, it no longer updates properly or is vanished eventually. Exploding gradient can be considered as the opposite of vanishing process. The updated weights using gradient descent become so large that they cause the whole network to become unstable, which leads to numerical overflow.

Vanishing/exploding gradient

The training of BERT models follows a standard iterative optimization procedure used for deep neural networks. First, a large collection of training data is gathered. During each iteration, a random batch of these samples is selected, and the cumulative loss, $$\mathrm{Loss}_{\mathrm{BERT}}$$, is computed over the batch. Next, the model's parameters are updated to minimize this loss using an optimization algorithm like gradient descent or one of its variants. This cycle continues until a specific stopping condition is met, such as the convergence of the training loss.

BERT Training Process

An objective or scoring function can be the source of an inference failure when it does not assign a higher score to the correct output than to the system output. In that case, the learning algorithm that estimates the score should be improved rather than the search algorithm.

Objective Function

Distributed Training

If all parameters of a hidden layer are initialized to a constant, such as $$\mathbf{W}^{(1)} = c$$, every hidden unit will receive the same inputs and parameters, producing identical activations during forward propagation. Consequently, during backpropagation, the gradients of the output with respect to the parameters $$\mathbf{W}^{(1)}$$ will all take the exact same value. Because gradient-based algorithms like minibatch stochastic gradient descent update the parameters using these uniform gradients, all elements of $$\mathbf{W}^{(1)}$$ will continue to have identical values after every iteration. The hidden layer will thus behave as if it has only a single unit, failing to realize the network's expressive power.

The Problem with Constant Initialization

Assuming a sufficiently smooth objective function $$f$$ is Lipschitz continuous with constant $$L$$ (meaning that for any $$\mathbf{x}$$ and $$\mathbf{y}$$, the objective satisfies $$|f(\mathbf{x}) - f(\mathbf{y})| \leq L \|\mathbf{x} - \mathbf{y}\|$$), the change in the objective value after a gradient descent update $$\mathbf{x} \gets \mathbf{x} - \eta \mathbf{g}$$ is bounded by the inequality $$|f(\mathbf{x}) - f(\mathbf{x} - \eta\mathbf{g})| \leq L \eta\|\mathbf{g}\|$$. This bound demonstrates that the maximum change in the loss during a single step is constrained by the learning rate $$\eta$$, the gradient norm $$\|\mathbf{g}\|$$ , and the Lipschitz constant $$L$$. A small value for this upper bound presents a trade-off: it limits the speed at which the objective value can be reduced, but it advantageously limits how much progress can go wrong or be undone in any single gradient step.

Objective Function Change Bounds in Gradient Descent

One-dimensional gradient descent provides a clear illustration of why moving in the negative gradient direction reduces the objective function. For a continuously differentiable function $$f: \mathbb{R} ightarrow \mathbb{R}$$, the first-order Taylor expansion gives $$f(x + \epsilon) = f(x) + \epsilon f'(x) + \mathcal{O}(\epsilon^2)$$. Setting the step as $$\epsilon = -\eta f'(x)$$, where $$\eta > 0$$ is a fixed learning rate, yields $$f(x - \eta f'(x)) = f(x) - \eta f'^2(x) + \mathcal{O}(\eta^2 f'^2(x))$$. When the derivative $$f'(x) 
eq 0$$, the term $$\eta f'^2(x) > 0$$ guarantees a decrease in $$f$$, provided $$\eta$$ is small enough for the higher-order terms to be negligible. This leads to the update rule $$x \leftarrow x - \eta f'(x)$$, which is applied iteratively from an initial value until a stopping condition is met, such as when the gradient magnitude $$|f'(x)|$$ becomes sufficiently small or a maximum number of iterations is reached.

One-Dimensional Gradient Descent

When the objective function maps a $$d$$-dimensional vector $$\mathbf{x} = [x_1, x_2, \ldots, x_d]^	op$$ to a scalar, i.e., $$f: \mathbb{R}^d 	o \mathbb{R}$$, its gradient becomes a vector of $$d$$ partial derivatives:

$$
abla f(\mathbf{x}) = \left[\frac{\partial f(\mathbf{x})}{\partial x_1}, \frac{\partial f(\mathbf{x})}{\partial x_2}, \ldots, \frac{\partial f(\mathbf{x})}{\partial x_d}ight]^	op$$

Each component $$\partial f(\mathbf{x})/\partial x_i$$ captures the rate at which $$f$$ changes with respect to $$x_i$$ alone. Using the first-order multivariate Taylor expansion,

$$f(\mathbf{x} + \boldsymbol{\epsilon}) = f(\mathbf{x}) + \boldsymbol{\epsilon}^	op 
abla f(\mathbf{x}) + \mathcal{O}(\|\boldsymbol{\epsilon}\|^2)$$

one can show that the steepest-descent direction (up to second-order terms) is the negative gradient $$-
abla f(\mathbf{x})$$. Choosing a suitable learning rate $$\eta > 0$$ yields the multivariate gradient descent update rule:

$$\mathbf{x} \leftarrow \mathbf{x} - \eta 
abla f(\mathbf{x})$$

This directly generalizes the scalar update $$x \leftarrow x - \eta f'(x)$$ to vector-valued parameters.

Multivariate Gradient Descent

First-order optimization algorithms rely solely on the value and gradient of the objective function. In contrast, second-order optimization algorithms also utilize information about the function's curvature, often represented by the Hessian matrix. By accounting for curvature, these methods can automatically adjust the optimization step, providing a way to circumvent the difficulties of manually tuning a learning rate.

Second-Order Optimization Algorithm

In deep learning, the objective function $$f(\mathbf{x})$$ is typically formulated as the average of the individual loss functions $$f_i(\mathbf{x})$$ across the $$n$$ examples in the training dataset, where $$\mathbf{x}$$ is the parameter vector. This formulation is given by: $$f(\mathbf{x}) = \frac{1}{n} \sum_{i = 1}^n f_i(\mathbf{x}).$$ Consequently, the full gradient of the objective function at $$\mathbf{x}$$ is the average of the gradients for each example: $$
abla f(\mathbf{x}) = \frac{1}{n} \sum_{i = 1}^n 
abla f_i(\mathbf{x}).$$

Average Objective Function in Deep Learning

Accelerated gradient methods, such as gradient descent with momentum, are a class of optimization algorithms that average over past gradients to obtain more stable directions of descent. They are particularly effective for solving ill-conditioned optimization problems, where the objective function landscape resembles a narrow canyon and progress in certain directions is much slower than in others.

Learn Before

Related

Learn After