Multiple Choice

An AI team is fine-tuning a language model to write compelling short stories. The model generates a story one token at a time. However, they find the model's outputs are becoming repetitive and nonsensical. Their current process involves having a reward model evaluate the entire 500-token story only after it is fully completed, providing a single quality score at the very end. Which of the following best explains why this training setup is failing?
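The setup described above is a sparse-reward, credit-assignment problem: with one score delivered only at the end, the learning signal reaching early tokens is heavily attenuated. A minimal sketch, assuming a REINFORCE-style discounted return (the function name and discount factor here are illustrative, not from any specific library):

```python
# Sketch of the credit-assignment problem with a single terminal reward.
# Each token's learning signal is the discounted sum of future rewards.

def per_token_returns(rewards, gamma=0.99):
    """Compute the discounted return G_t for each token position."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

T = 500
# Sparse setup: one quality score (say 1.0) only after the full story.
sparse = [0.0] * (T - 1) + [1.0]
G = per_token_returns(sparse)

# The first token's return has decayed by gamma**(T-1), roughly 0.0066,
# so the earliest word choices receive almost no feedback.
print(f"G[0]  = {G[0]:.4f}")
print(f"G[-1] = {G[-1]:.4f}")
```

With 500 tokens and a single end-of-story score, the return seen by the first token is about 150x weaker than the one seen by the last, which is one common way to motivate denser or intermediate rewards.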

Updated 2025-10-05

Tags

Ch.4 Alignment - Foundations of Large Language Models, Foundations of Large Language Models, Foundations of Large Language Models Course, Computing Sciences, Analysis in Bloom's Taxonomy, Cognitive Psychology, Psychology, Social Science, Empirical Science, Science