A team is improving a text-generation model. The model is given an input prompt and generates a textual response. A human evaluator then assigns a numerical score to this response based on its quality, and the score is used to adjust the model's behavior for future responses. If this entire process is described using the framework of a system learning from sequential decisions, which component of the text-generation process corresponds to the 'policy'?
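The loop described in the question can be sketched as a toy program. This is a minimal illustration under stated assumptions, not the course's reference solution: the three-word vocabulary, the `ToyPolicy` class, the `human_score` function, and the REINFORCE-style update are all hypothetical stand-ins. In the usual reinforcement-learning reading, the 'policy' is the mapping from a state (the prompt) to a distribution over actions (responses), which here is played by the model itself.

```python
import math
import random

# Hypothetical toy "vocabulary": each entry stands in for a whole response.
VOCAB = ["good", "bad", "okay"]


def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


class ToyPolicy:
    """Stand-in for the text-generation model.

    It maps a prompt (state) to a probability distribution over
    responses (actions), which is the role of the policy.
    """

    def __init__(self):
        self.logits = {}  # prompt -> one logit per VOCAB entry

    def probs(self, prompt):
        return softmax(self.logits.setdefault(prompt, [0.0] * len(VOCAB)))

    def sample(self, prompt, rng):
        # Draw one response index according to the current distribution.
        return rng.choices(range(len(VOCAB)), weights=self.probs(prompt))[0]

    def update(self, prompt, action, reward, lr=0.5):
        # REINFORCE-style step (an assumption, not the only option):
        # d/d logit_i of log pi(action | prompt) = onehot(action)_i - p_i
        p = self.probs(prompt)
        for i in range(len(VOCAB)):
            grad = (1.0 if i == action else 0.0) - p[i]
            self.logits[prompt][i] += lr * reward * grad


def human_score(response):
    """Hypothetical evaluator: returns the numerical score (reward)."""
    return {"good": 1.0, "okay": 0.0, "bad": -1.0}[response]


def train(steps=200, seed=0):
    rng = random.Random(seed)
    policy = ToyPolicy()
    prompt = "How was the movie?"                # state
    for _ in range(steps):
        action = policy.sample(prompt, rng)      # model generates a response
        reward = human_score(VOCAB[action])      # evaluator scores it
        policy.update(prompt, action, reward)    # score adjusts the model
    return policy, prompt


if __name__ == "__main__":
    policy, prompt = train()
    for word, p in zip(VOCAB, policy.probs(prompt)):
        print(f"{word}: {p:.3f}")
```

After training, the distribution should shift toward the highest-scoring response, which shows how the score feeds back into the mapping from prompt to response.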
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Computing Sciences
Foundations of Large Language Models Course
Application in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
A text-generation model is being optimized to produce high-quality responses. The process starts with an input prompt. The model then generates a sequence of text. This generated text is passed to a separate automated scoring system, which outputs a single numerical value representing the response's quality. The model's internal configuration is then updated based on this score to improve its future outputs. Match each abstract component of a learning system (left column) to its concrete implementation in this text-generation scenario (right column).
LLM as the Agent in RLHF
The Agent-Environment Interaction Loop in Reinforcement Learning
Agent-Environment Interaction Loop in Reinforcement Learning
Deconstructing a Model Training Interaction