Visual Diagram of Combined Loss Training for Weak-to-Strong Generalization
This diagram illustrates a training process for a large model using a combined loss objective, a technique used in weak-to-strong generalization. The large model takes an input 'x' from a dataset and produces an output 'y'. It is trained by minimizing a weighted combination of two loss functions: 1) a standard Language Model (LM) loss, which compares the model's output against ground-truth data, and 2) a Knowledge Distillation (KD) loss, which encourages the model to match the output distribution of a smaller, weaker 'teacher' (supervisor) model. These losses are summed in the 'Compute Loss & Train' step, and the combined gradient updates the large model's parameters.
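The combined objective described above can be sketched as follows. This is a minimal NumPy illustration, not the diagram's exact implementation: the mixing weight `alpha`, the temperature, and the toy logits are illustrative assumptions.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def lm_loss(student_logits, target_ids):
    # Standard LM loss: cross-entropy against ground-truth token ids.
    probs = softmax(student_logits)
    n = len(target_ids)
    return -np.log(probs[np.arange(n), target_ids]).mean()

def kd_loss(student_logits, teacher_logits, temperature=2.0):
    # KD loss: KL(teacher || student) on temperature-softened distributions.
    t = softmax(teacher_logits / temperature)
    s = softmax(student_logits / temperature)
    return (t * (np.log(t) - np.log(s))).sum(axis=-1).mean()

def combined_loss(student_logits, teacher_logits, target_ids, alpha=0.5):
    # Weighted sum of the two objectives; both pull on the same
    # student parameters during the 'Compute Loss & Train' step.
    return alpha * lm_loss(student_logits, target_ids) \
        + (1 - alpha) * kd_loss(student_logits, teacher_logits)

# Toy example: 3 token positions over a 5-token vocabulary.
rng = np.random.default_rng(0)
student = rng.normal(size=(3, 5))   # large model's logits
teacher = rng.normal(size=(3, 5))   # weak supervisor's logits
targets = np.array([1, 0, 3])       # ground-truth token ids
loss = combined_loss(student, teacher, targets)
```

Setting `alpha=1.0` recovers pure supervised training on ground truth, while `alpha=0.0` reduces to pure distillation from the weak teacher; intermediate values balance the two signals.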

Tags
Ch.4 Alignment - Foundations of Large Language Models
Computing Sciences
Related
Diagnosing a Performance Plateau in Supervised Fine-Tuning
A team is fine-tuning a large language model. They have access to a small, high-quality dataset with verified ground-truth labels, as well as a much larger dataset where labels have been generated by a weaker, smaller model. To maximize the performance of the large model by using both data sources simultaneously, which training objective should they implement?
Rationale for a Hybrid Training Objective
A research team is fine-tuning a large language model using a combined loss objective, which includes both a standard language model (LM) loss against ground-truth data and a knowledge distillation (KD) loss from a weaker supervisor model. They observe that while the large model is very good at mimicking the style and general structure of the weak supervisor's outputs, it frequently makes factual errors that are not present in the ground-truth dataset. Which of the following is the most likely cause of this issue and the best corrective action?
Learn After
A large model is being trained using a combined objective. This objective includes a 'distillation loss,' which encourages the large model to mimic the outputs of a smaller, weaker 'teacher' model. It also includes a 'supervised loss,' which is calculated against a set of known correct answers (ground-truth). What is the primary analytical reason for including the 'supervised loss' in this training process?
A large model is being trained using a combined objective that incorporates signals from both ground-truth data and a smaller 'teacher' model. Based on a typical diagram of this process, arrange the following computational steps into the correct logical order for a single training update.
Diagnosing Training Imbalance