1Cademy - A team is refining a large language model's conversational abilities. Their training process involves the model generating responses to a continuous stream of new prompts. After each response, a separate reward model provides a quality score. The language model is then immediately updated based on this score before it handles the next prompt. Which statement best characterizes the fundamental nature of this learning approach?

Learn Before

PPO as an Online Reinforcement Learning Method

Multiple Choice

A team is refining a large language model's conversational abilities. Their training process involves the model generating responses to a continuous stream of new prompts. After each response, a separate reward model provides a quality score. The language model is then immediately updated based on this score before it handles the next prompt. Which statement best characterizes the fundamental nature of this learning approach?
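
The loop described in the question (generate a response, score it, update, move to the next prompt) can be sketched with a toy policy. This is only an illustration under loud assumptions: the candidate list `RESPONSES`, the hand-written `reward_model`, and the function `train_online` are all hypothetical, and the update is a plain REINFORCE step on a softmax over three canned replies, not PPO and not a real language model.

```python
import math
import random

# Hypothetical candidate replies standing in for a language model's outputs.
RESPONSES = ["short reply", "helpful detailed reply", "off-topic reply"]


def softmax(logits):
    """Convert logits to probabilities in a numerically stable way."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reward_model(prompt, response):
    """Hypothetical stand-in for a separate, learned reward model."""
    return 1.0 if response == "helpful detailed reply" else 0.0


def train_online(prompts, lr=0.5, seed=0):
    """Update the policy after every single prompt, using only data
    that the current policy has just generated."""
    rng = random.Random(seed)
    logits = [0.0] * len(RESPONSES)
    for prompt in prompts:
        probs = softmax(logits)
        # 1. The current policy generates a response.
        action = rng.choices(range(len(RESPONSES)), weights=probs)[0]
        # 2. The reward model scores that response.
        reward = reward_model(prompt, RESPONSES[action])
        # 3. Immediate update (REINFORCE: gradient of log-prob is one_hot - probs).
        for i in range(len(logits)):
            grad = (1.0 if i == action else 0.0) - probs[i]
            logits[i] += lr * reward * grad
    return softmax(logits)


final_probs = train_online([f"prompt {i}" for i in range(200)])
```

The point of the sketch is the ordering inside the loop: each update uses a sample drawn from the policy as it stands at that moment, and the updated policy is the one that answers the next prompt.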

Updated 2025-10-02
