1Cademy - A team is improving a text-generation model. The process involves providing the model with an input prompt, to which the model generates a textual response. A human evaluator then assigns a numerical score to this response based on its quality. This score is used to adjust the model's behavior for future responses. If this entire process is described using the framework of a system learning from sequential decisions, what component of the text-generation process corresponds to the 'policy'?

Learn Before

Bridging Language Modeling and Reinforcement Learning Notations in RLHF

Multiple Choice

A team is improving a text-generation model. The process involves providing the model with an input prompt, to which the model generates a textual response. A human evaluator then assigns a numerical score to this response based on its quality. This score is used to adjust the model's behavior for future responses. If this entire process is described using the framework of a system learning from sequential decisions, what component of the text-generation process corresponds to the 'policy'?
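
The loop described in the question can be sketched in a few lines. This is only a toy illustration under stated assumptions, not how any real system is trained: `ToyTextModel`, `human_score`, and `run_feedback_loop` are hypothetical names, the "model" merely keeps one preference weight per canned response, and the scoring rule is invented for the example.

```python
import random


class ToyTextModel:
    """Hypothetical stand-in for a text-generation model.

    It keeps one positive preference weight per canned response.
    """

    def __init__(self, responses, seed=0):
        self.weights = {r: 1.0 for r in responses}
        self.rng = random.Random(seed)

    def generate(self, prompt):
        # Sample a response in proportion to its current weight.
        # (The prompt is ignored in this toy; a real model conditions on it.)
        responses = list(self.weights)
        w = [self.weights[r] for r in responses]
        return self.rng.choices(responses, weights=w)[0]

    def adjust(self, response, score, step=0.5):
        # Move the sampled response's weight in the direction of the score,
        # keeping it strictly positive so it can still be sampled.
        self.weights[response] = max(0.01, self.weights[response] + step * score)


def human_score(response):
    # Hypothetical evaluator: an invented rule standing in for a human rating.
    return 1.0 if "helpful" in response else -1.0


def run_feedback_loop(model, prompt, rounds):
    for _ in range(rounds):
        response = model.generate(prompt)   # model answers the prompt
        score = human_score(response)       # evaluator rates the answer
        model.adjust(response, score)       # rating changes future behavior
    return model.weights


model = ToyTextModel(["a helpful answer", "an off-topic answer"])
final_weights = run_feedback_loop(model, "Explain recursion.", rounds=50)
```

After the loop, the response that was rated positively carries more weight than the one rated negatively, so the toy model is more likely to produce it next time.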

Updated 2025-10-02
