Loss Masking via Forward and Backward Passes in SFT
In Supervised Fine-Tuning (SFT), training can be implemented with standard Large Language Models by concatenating the input $x$ and the target output $y$ into a single sequence $[x, y]$. During the forward pass, the model processes the entire sequence as usual. Then, during the backward pass, the loss corresponding to the input tokens is forced to zero (masked). This focuses the loss computation and the subsequent parameter updates solely on the conditional log-probability of the output tokens, $\log \Pr(y \mid x)$, while effectively setting the $\log \Pr(x)$ term to $0$.
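As a minimal sketch of how this masking is commonly implemented, the snippet below builds a label tensor for a concatenated [input, output] sequence and marks the input positions with an ignore index, so the cross-entropy loss, and therefore the gradient, covers only the output tokens (i.e., $-\log \Pr(y \mid x)$). The helper name build_masked_labels, the ignore value -100, and the toy token ids and logits are illustrative assumptions, not details from the text.

```python
import torch
import torch.nn.functional as F

# Assumed convention: positions labelled -100 are skipped by the loss,
# matching PyTorch's `ignore_index` argument to cross_entropy.
IGNORE_INDEX = -100

def build_masked_labels(input_ids: torch.Tensor, prompt_len: int) -> torch.Tensor:
    """Copy the concatenated [input, output] ids into labels, then mask the
    input positions so only the output tokens contribute to the loss."""
    labels = input_ids.clone()
    labels[:, :prompt_len] = IGNORE_INDEX  # the log P(x) part is forced to 0
    return labels

# Toy example: batch of 1, sequence of 6 token ids, the first 4 being the prompt.
vocab_size = 10
input_ids = torch.tensor([[3, 7, 1, 4, 2, 8]])   # [prompt_tokens, response_tokens]
labels = build_masked_labels(input_ids, prompt_len=4)

# Stand-in for the model's logits from the forward pass over the full
# sequence; the forward pass itself is unchanged by masking.
logits = torch.randn(1, 6, vocab_size, requires_grad=True)

# Standard next-token shift: the logits at position t predict token t+1.
shift_logits = logits[:, :-1, :].reshape(-1, vocab_size)
shift_labels = labels[:, 1:].reshape(-1)

# Cross-entropy ignores every position labelled IGNORE_INDEX, so gradients
# flow only through the response tokens.
loss = F.cross_entropy(shift_logits, shift_labels, ignore_index=IGNORE_INDEX)
loss.backward()
```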
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
SFT Objective as Maximizing Joint Log-Probability of Concatenated Sequences
In a common fine-tuning strategy, a prompt and its desired completion are concatenated into a single sequence (e.g., [prompt_tokens, completion_tokens]). The language model is then trained on this full sequence, but the training loss is calculated only for the model's predictions on the completion tokens. What is the most accurate analysis of the primary purpose of this specific loss calculation method?
During supervised fine-tuning, if a model is trained on concatenated [input, output] sequences and the training loss is calculated across the entire sequence (both input and output tokens), the model is still being optimized primarily to improve its conditional generation capabilities for the given input.
Diagnosing a Faulty Fine-Tuning Process
Loss Masking via Forward and Backward Passes in SFT
A language model is being trained on instruction-following data. For one specific training instance, the model processes the full tokenized sequence: ['User:', 'What', 'is', '2+2?', 'Assistant:', '4']. The goal is to train the model to provide the correct response ('4') when given the user's prompt. During the backpropagation step for this single instance, on which token(s) is the predictive loss calculated to update the model's weights?
Diagnosing a Faulty Language Model Training Process
A machine learning engineer is training a language model for a question-answering task. The training data consists of concatenated [question, answer] sequences. Due to a configuration error, the training loss is calculated across all tokens in the sequence (both question and answer), instead of only on the answer tokens. What is the most likely and significant negative consequence of this misconfiguration on the model's behavior?
Loss Masking via Forward and Backward Passes in SFT
Learn After
A machine learning engineer is fine-tuning a pre-trained language model to function as a helpful assistant. The training data consists of pairs of instructions and desired responses. For each pair, the instruction and response are combined into a single sequence, and the model is trained to predict the next token at each position. However, due to a configuration error, the training loss is calculated across the entire combined sequence (both the instruction and the response tokens), instead of only on the response tokens. What is the most likely undesirable outcome of this training setup?
Applying Loss Masking in SFT
Analyzing a Fine-Tuning Training Objective