Multiple Choice

An encoder-decoder model is trained on a denoising task. It receives a corrupted input such as "The quick [M] fox jumps [M] the lazy dog." (where [M] is a mask token) and must generate the original, complete sentence "The quick brown fox jumps over the lazy dog." The decoder generates the output one word at a time. Why is the training loss typically calculated for each word the decoder generates, rather than as a single loss for the entire completed sentence?
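To make the distinction concrete, here is a minimal sketch of the per-token loss the question describes. The word list and probability values are hypothetical, illustrative numbers, not outputs of a real model: each entry stands for the decoder's predicted probability of the correct word at that position.

```python
import math

# Hypothetical target sentence and the decoder's (illustrative) predicted
# probability for the correct word at each position.
target = ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
p_correct = [0.9, 0.8, 0.3, 0.85, 0.7, 0.4, 0.9, 0.6, 0.95]

# Per-token cross-entropy: negative log-probability of each correct word.
per_token_loss = [-math.log(p) for p in p_correct]

# The training loss is the sum (or mean) over positions, so every position
# contributes its own gradient signal. Hard positions like "brown" (p=0.3)
# and "over" (p=0.4), which fill in the [M] masks, dominate the loss,
# whereas a single sentence-level score could not tell the model which
# individual words were wrong.
total_loss = sum(per_token_loss)
```

Because the loss decomposes by position, gradients flow back through every decoding step, which is the usual motivation for per-token (rather than whole-sentence) supervision.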


Updated 2025-09-26


Tags: Ch.1 Pre-training - Foundations of Large Language Models, Foundations of Large Language Models, Foundations of Large Language Models Course, Computing Sciences, Analysis in Bloom's Taxonomy, Cognitive Psychology, Psychology, Social Science, Empirical Science, Science