Short Answer

Applying Loss Masking in SFT

You are fine-tuning a language model for a question-answering task. A single training example is formed by concatenating the prompt and the response: [PROMPT_START] What is the capital of France? [PROMPT_END] [RESPONSE_START] Paris. [RESPONSE_END]. During autoregressive training, the model computes a loss for its prediction at every token position. For which part of this sequence should the loss be computed, and for which part should it be masked (ignored)? Briefly explain the reasoning behind this choice.
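The masking the question refers to can be sketched in a few lines. This is a minimal illustration, not a full training loop: it assumes the common Hugging Face-style convention of using -100 as the label value that the loss function ignores, and the token ids are made up for the example.

```python
IGNORE_INDEX = -100  # conventional "ignore this position" label value

def build_labels(prompt_ids, response_ids):
    """Concatenate prompt and response; mask the prompt in the labels
    so the loss is computed only on response tokens."""
    input_ids = prompt_ids + response_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels

def masked_nll(labels, token_nlls):
    """Average per-token negative log-likelihood over unmasked positions only."""
    kept = [nll for lab, nll in zip(labels, token_nlls) if lab != IGNORE_INDEX]
    return sum(kept) / len(kept)

# Hypothetical token ids standing in for the tokenized prompt and response.
prompt = [101, 2054, 2003]   # e.g. "[PROMPT_START] What is ... [PROMPT_END]"
response = [3000, 102]       # e.g. "[RESPONSE_START] Paris. [RESPONSE_END]"
input_ids, labels = build_labels(prompt, response)
# labels == [-100, -100, -100, 3000, 102]: only response positions contribute.
```

The model still attends to the full sequence (the prompt conditions every prediction); masking only excludes the prompt positions from the loss so gradients come from the response tokens alone.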

Updated 2025-10-02
Tags: Ch.4 Alignment - Foundations of Large Language Models, Foundations of Large Language Models Course, Computing Sciences, Application in Bloom's Taxonomy, Cognitive Psychology