Applying Loss Masking in SFT
You are fine-tuning a language model for a question-answering task. A single training example is formed by concatenating the prompt and the response: [PROMPT_START] What is the capital of France? [PROMPT_END] [RESPONSE_START] Paris. [RESPONSE_END]. During autoregressive training, the model computes a loss for its next-token prediction at each position in the sequence. For which part of this sequence should the loss be calculated, and for which part should it be masked (ignored)? Briefly explain the reasoning behind this choice.
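The standard answer, and a minimal sketch of it: mask the loss on the prompt tokens and compute it only on the response tokens, since we want the model to learn to generate responses, not to reproduce prompts. A common implementation (e.g., the Hugging Face / PyTorch convention) sets the label at masked positions to -100, the default `ignore_index` of the cross-entropy loss. The token ids below are hypothetical stand-ins for the tokenized example:

```python
# Hypothetical token ids for the concatenated training example.
prompt_ids = [1, 42, 43, 44, 45, 46, 2]  # "[PROMPT_START] What is the capital of France? [PROMPT_END]"
response_ids = [3, 99, 4]                # "[RESPONSE_START] Paris. [RESPONSE_END]"

IGNORE_INDEX = -100  # positions with this label are skipped by the loss

# The model sees the full concatenated sequence as input...
input_ids = prompt_ids + response_ids

# ...but the labels mask out every prompt position, so the loss (and
# therefore the gradient) comes only from the response tokens.
labels = [IGNORE_INDEX] * len(prompt_ids) + response_ids
```

The prompt still conditions the predictions (it remains in `input_ids`); masking only removes it from the training signal.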
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Application in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
A machine learning engineer is fine-tuning a pre-trained language model to function as a helpful assistant. The training data consists of pairs of instructions and desired responses. For each pair, the instruction and response are combined into a single sequence, and the model is trained to predict the next token at each position. However, due to a configuration error, the training loss is calculated across the entire combined sequence (both the instruction and the response tokens), instead of only on the response tokens. What is the most likely undesirable outcome of this training setup?
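The likely outcome is that gradient updates also train the model to reproduce instruction text, diluting the signal from the desired responses and encouraging the model to echo or continue prompts rather than answer them. A toy calculation, using hypothetical per-token negative log-likelihoods, shows how the misconfigured average differs from the response-only average:

```python
# Hypothetical per-token negative log-likelihoods for a sequence of
# 4 instruction tokens followed by 2 response tokens.
token_nll = [2.0, 1.5, 1.8, 2.2, 0.7, 0.4]
n_prompt = 4

# Buggy setup: loss averaged over the entire sequence, so the gradient
# also pushes the model to predict instruction tokens.
buggy_loss = sum(token_nll) / len(token_nll)

# Corrected setup: loss averaged only over response positions.
correct_loss = sum(token_nll[n_prompt:]) / (len(token_nll) - n_prompt)
```

In this toy example the instruction tokens dominate the buggy average, so most of the training signal rewards modeling the prompt distribution instead of the desired responses.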
Applying Loss Masking in SFT
Analyzing a Fine-Tuning Training Objective