Learn Before
Token-Level Loss Calculation in a Backward Pass
During the backward pass in training an autoregressive language model, the error signal is derived from a loss that compares the model's predictions to the actual target tokens. A key aspect of this process is that the loss is computed only for the output (target) portion of the sequence. For an input sequence x1, x2, x3 with a target output y1, y2, the loss at the input-token positions is masked to zero. Consequently, the gradients used to update the model's weights originate only from the target-token positions (y1, y2), since those are the only positions with a non-zero loss. These gradients are then propagated backward through the network to adjust the parameters.
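As an illustration, here is a minimal PyTorch-style sketch of this masking. It is a hypothetical toy setup, not from the source: the vocabulary size, the random logits, and the target token ids 7 and 2 are placeholders, and for simplicity it ignores the usual one-position shift between logits and labels in autoregressive training.

import torch
import torch.nn.functional as F

# Toy setup: 5 positions (x1, x2, x3, y1, y2) over a small vocabulary.
vocab_size = 10
logits = torch.randn(5, vocab_size, requires_grad=True)  # one row of scores per position

# Labels: -100 marks the input positions x1..x3, so cross_entropy skips them
# (ignore_index defaults to -100); only y1 and y2 contribute to the loss.
# Token ids 7 and 2 are arbitrary placeholders for y1 and y2.
labels = torch.tensor([-100, -100, -100, 7, 2])

loss = F.cross_entropy(logits, labels)
loss.backward()

# Gradient rows for the masked input positions are all zero; the error
# signal originates only from the two target positions.
print(logits.grad)

Instruction-tuning pipelines commonly implement exactly this pattern: the prompt positions of the label tensor are set to the ignore index before the loss is computed, so only completion tokens produce gradients.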

Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Ch.2 Generative Models - Foundations of Large Language Models
Related
Backpropagation Through Time (BPTT)
Back-Propagating through Discrete Stochastic Operations
Neural Network Learning Rate
Back-Propagation through Random Operations
Backward Propagation Formulation
True/False: During forward propagation, the forward function for a layer l needs to know which activation function (Sigmoid, tanh, ReLU, etc.) that layer uses. During backpropagation, the corresponding backward function also needs to know the layer's activation function, since the gradient depends on it.
Back Propagation Illustrated Example
A neural network is trained to distinguish between images of 'apples' and 'oranges'. During a training iteration, it is shown an image of an apple but predicts 'orange' with a high degree of certainty. This results in a significant error value. What is the primary computational goal of the backpropagation step that immediately follows this prediction?
Token-Level Loss Calculation in a Backward Pass
Consider a simple neural network with one input neuron, one hidden neuron, and one output neuron. The network has a weight w1 connecting the input to the hidden neuron, and a weight w2 connecting the hidden neuron to the output neuron. After a forward pass, an error is calculated based on the network's final output. To update w1 using the backpropagation algorithm, you must calculate the partial derivative of the error with respect to w1. Which of the following components is essential for determining how much of the final error is attributable to the hidden neuron's activity?
Allocating Gradient Memory
Chain Rule for Tensors
Storage of Intermediate Variables in Backpropagation
Learn After
An autoregressive language model is being trained on a single data instance. The model is provided with the input context tokens ['The', 'quick', 'brown'] and is trained to generate the target completion tokens ['fox', 'jumps']. During the backward pass for this specific training step, from which token positions will the error signals (gradients) used to update the model's weights primarily originate?
Debugging Language Model Training
When fine-tuning an autoregressive language model on a dataset where each example consists of an input prompt and a target completion, the training loss is calculated across all tokens in the combined sequence (prompt + completion) to ensure the model understands the full context.
Example of Loss Calculation in Instruction Fine-Tuning