Learn Before
Rationale for Sub-sequence Loss Calculation
A language model is trained on the text sequence:

Input: What is the capital of France?
Output: The capital of France is Paris.

Explain why the model's training loss is calculated only on the Output portion of the sequence and not on the Input portion.
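In practice, this is implemented by masking the input positions so they are excluded from the loss average. The sketch below is illustrative only (the sentinel value and helper names are assumptions, following the common convention of marking ignored positions with -100 rather than any specific library's API):

```python
# Sketch: loss masking for the input portion of a training sequence.
# Assumption: positions labeled IGNORE_INDEX are skipped when averaging loss.
IGNORE_INDEX = -100

def build_labels(input_ids, output_ids):
    """Labels for a concatenated [input + output] sequence:
    input positions are masked out, output positions are predicted."""
    return [IGNORE_INDEX] * len(input_ids) + list(output_ids)

def masked_loss(per_token_losses, labels):
    """Average loss over unmasked (output) positions only."""
    kept = [l for l, lab in zip(per_token_losses, labels) if lab != IGNORE_INDEX]
    return sum(kept) / len(kept)

# Toy example: 3 input tokens, 2 output tokens
labels = build_labels([11, 12, 13], [21, 22])
print(labels)  # -> [-100, -100, -100, 21, 22]

# Pretend per-token cross-entropy values; input losses do not contribute
losses = [9.0, 9.0, 9.0, 2.0, 4.0]
print(masked_loss(losses, labels))  # -> 3.0
```

Only the output tokens' losses (2.0 and 4.0) are averaged, so gradients push the model to produce the answer, not to reproduce the question.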
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
A language model is being trained on the sequence:

⟨s⟩ Translate to Spanish: The cat sat. El gato se sentó. ⟨/s⟩

To effectively teach the model how to perform the translation, on which part of the sequence should the training loss be calculated?

Debugging a Chatbot Training Process