Combined Loss Objective in Weak-to-Strong Training
When fine-tuning a large, strong model with supervision from a weaker model, the training objective can be a composite loss. This objective typically combines a knowledge distillation (KD) loss, which encourages the strong model to imitate the weak supervisor's outputs, with a standard language model (LM) loss computed against ground-truth labels. Weighting the two terms lets the strong model learn simultaneously from the weak supervisor's generalized knowledge and from high-quality annotated data.
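As a rough sketch, the combined objective can be written as L_total = (1 - α) · L_KD + (1 - α is the LM weight) · L_LM, i.e. L_total = α · L_KD + (1 - α) · L_LM, where α controls how strongly the large model is pulled toward the weak supervisor. The PyTorch snippet below illustrates one plausible implementation; the function name, the mixing weight alpha, and the distillation temperature are illustrative assumptions, not a prescribed recipe.

```python
import torch
import torch.nn.functional as F

def combined_loss(strong_logits, weak_logits, labels, alpha=0.5, temperature=2.0):
    """Hypothetical weak-to-strong objective:
    alpha * KD loss (imitate the weak supervisor's distribution)
    + (1 - alpha) * LM loss (fit ground-truth token labels).
    """
    # LM loss: cross-entropy against ground-truth token labels.
    lm_loss = F.cross_entropy(
        strong_logits.view(-1, strong_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,  # skip padding / unlabeled positions
    )

    # KD loss: KL divergence from the temperature-softened weak
    # supervisor's distribution to the strong model's distribution.
    kd_loss = F.kl_div(
        F.log_softmax(strong_logits / temperature, dim=-1),
        F.softmax(weak_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)  # standard rescaling used in distillation

    return alpha * kd_loss + (1.0 - alpha) * lm_loss
```

In a training loop, weak_logits would typically be produced by the frozen weak supervisor under torch.no_grad(), so gradients flow only through the strong model; raising alpha emphasizes imitation of the supervisor, while lowering it emphasizes grounding in verified labels.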

Tags: Ch.4 Alignment - Foundations of Large Language Models; Foundations of Large Language Models Course; Computing Sciences
Related
Combined Loss Objective in Weak-to-Strong Training
A team is fine-tuning a large, powerful model on a specific task. Instead of using a dataset with pre-defined correct answers, they use a smaller, weaker model as a live supervisor: for each input, both the large model and the weak supervisor generate an output, and a loss is computed from the difference between the two. What is the direct and immediate purpose of this loss value within the training loop?
Transferring a Specialized Skill
Learn After
Diagnosing a Performance Plateau in Supervised Fine-Tuning
A team is fine-tuning a large language model. They have access to a small, high-quality dataset with verified ground-truth labels, as well as a much larger dataset where labels have been generated by a weaker, smaller model. To maximize the performance of the large model by using both data sources simultaneously, which training objective should they implement?
Visual Diagram of Combined Loss Training for Weak-to-Strong Generalization
Rationale for a Hybrid Training Objective
A research team is fine-tuning a large language model using a combined loss objective, which includes both a standard language model (LM) loss against ground-truth data and a knowledge distillation (KD) loss from a weaker supervisor model. They observe that while the large model is very good at mimicking the style and general structure of the weak supervisor's outputs, it frequently makes factual errors that are not present in the ground-truth dataset. Which of the following is the most likely cause of this issue and the best corrective action?