Concept

Loss Masking via Forward and Backward Passes in SFT

In Supervised Fine-Tuning (SFT), training can be implemented with the standard language modeling setup by concatenating the input $\mathbf{x}$ and the target output $\mathbf{y}$ into a single sequence $\mathrm{seq}_{\mathbf{x},\mathbf{y}}$. During the forward pass, the model processes the entire sequence as usual. Then, during the backward pass, the loss corresponding to the input tokens $\mathbf{x}$ is forced to zero (masked). This restricts the loss computation, and thus the parameter updates, to the conditional log-probability of the output tokens, $\log \mathrm{Pr}_{\theta}(\mathbf{y}\mid\mathbf{x})$, while effectively setting the term $\log \mathrm{Pr}_{\theta}(\mathbf{x})$ to $0$.
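As a concrete illustration, the sketch below shows one common way to realize this masking in practice. It assumes a Hugging Face-style causal LM whose forward call returns a `.logits` tensor; the helper name `sft_loss` and its arguments are hypothetical, and PyTorch's `ignore_index=-100` convention for `cross_entropy` is used to zero out the loss on the input tokens.

```python
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # positions with this label contribute zero loss in cross_entropy


def sft_loss(model, x_ids, y_ids):
    """Sketch: compute -log Pr_theta(y | x) by masking the loss on the input x.

    x_ids, y_ids: 1-D LongTensors of token ids for the input x and target y.
    model: assumed to be a causal LM returning logits of shape (1, seq_len, vocab).
    """
    # Forward pass over the concatenated sequence seq_{x,y}.
    seq = torch.cat([x_ids, y_ids]).unsqueeze(0)   # shape (1, |x| + |y|)
    logits = model(seq).logits                     # shape (1, |x| + |y|, vocab)

    # Labels are the next-token targets; positions belonging to x are masked,
    # which effectively sets the log Pr_theta(x) term to 0.
    labels = seq.clone()
    labels[:, : x_ids.numel()] = IGNORE_INDEX

    # Standard causal shift: the token at position t is predicted from position t-1.
    shift_logits = logits[:, :-1, :]
    shift_labels = labels[:, 1:]

    # Only log Pr_theta(y | x) survives, so the backward pass updates
    # parameters based on the conditional probability of y alone.
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=IGNORE_INDEX,
    )
```

Note that after the causal shift, the prediction of the first output token (made from the last input position) keeps its label, while every prediction whose target lies inside $\mathbf{x}$ is ignored, which is exactly the masking described above.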

Updated 2026-05-01

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences
