Combined Loss Objective in Weak-to-Strong Training
When fine-tuning a large, strong model with supervision from a weaker model, the training objective can be a composite loss. This objective typically combines a knowledge distillation (KD) loss, which encourages the strong model to imitate the weak supervisor's outputs, with a standard language model (LM) loss computed against ground-truth labels. Weighting the two terms lets the strong model learn simultaneously from the weak supervisor's generalized knowledge and from high-quality annotated data.
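As a rough sketch, the combined objective can be written as L_total = (1 - α) · L_KD + (1 - α is the LM weight) · L_LM, i.e. L_total = α · L_KD + (1 - α) · L_LM, where α controls how strongly the large model is pulled toward the weak supervisor. The PyTorch snippet below illustrates one plausible implementation; the function name, the mixing weight alpha, and the distillation temperature are illustrative assumptions, not a prescribed recipe.

```python
import torch
import torch.nn.functional as F

def combined_loss(strong_logits, weak_logits, labels, alpha=0.5, temperature=2.0):
    """Hypothetical weak-to-strong objective:
    alpha * KD loss (imitate the weak supervisor's distribution)
    + (1 - alpha) * LM loss (fit ground-truth token labels).
    """
    # LM loss: cross-entropy against ground-truth token labels.
    lm_loss = F.cross_entropy(
        strong_logits.view(-1, strong_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,  # skip padding / unlabeled positions
    )

    # KD loss: KL divergence from the temperature-softened weak
    # supervisor's distribution to the strong model's distribution.
    kd_loss = F.kl_div(
        F.log_softmax(strong_logits / temperature, dim=-1),
        F.softmax(weak_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)  # standard rescaling used in distillation

    return alpha * kd_loss + (1.0 - alpha) * lm_loss
```

In a training loop, weak_logits would typically be produced by the frozen weak supervisor under torch.no_grad(), so gradients flow only through the strong model; raising alpha emphasizes imitation of the supervisor, while lowering it emphasizes grounding in verified labels.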

Tags: Ch.4 Alignment - Foundations of Large Language Models; Foundations of Large Language Models Course; Computing Sciences
Related
Combined Loss Objective in Weak-to-Strong Training
A team is fine-tuning a large, powerful model on a specific task. Instead of using a dataset with pre-defined correct answers, they use a smaller, weaker model as a live supervisor: for each input, both the large model and the weak supervisor generate an output, and a loss is computed from the difference between the two. What is the direct and immediate purpose of this loss value within the training loop?
Transferring a Specialized Skill
Learn After
Diagnosing a Performance Plateau in Supervised Fine-Tuning
A team is fine-tuning a large language model. They have access to a small, high-quality dataset with verified ground-truth labels, as well as a much larger dataset where labels have been generated by a weaker, smaller model. To maximize the performance of the large model by using both data sources simultaneously, which training objective should they implement?
Visual Diagram of Combined Loss Training for Weak-to-Strong Generalization
Rationale for a Hybrid Training Objective
A research team is fine-tuning a large language model using a combined loss objective, which includes both a standard language model (LM) loss against ground-truth data and a knowledge distillation (KD) loss from a weaker supervisor model. They observe that while the large model is very good at mimicking the style and general structure of the weak supervisor's outputs, it frequently makes factual errors that are not present in the ground-truth dataset. Which of the following is the most likely cause of this issue and the best corrective action?