Multiple Choice

During the policy optimization phase of training a large language model, the model is being rewarded for providing detailed explanations. The 'reference policy' is a version of the model that typically gives concise, direct answers. The current policy generates two possible responses to a user's query:

Response A: 'Yes.' Response B: 'Affirmative, the data you have presented aligns with the expected parameters, and therefore, the conclusion you have reached is indeed correct and validated.'

Assuming the reference policy would have a very high probability of generating Response A and a near-zero probability of generating Response B, which response would incur a larger penalty term designed to prevent deviation from the reference policy, and why?

0

1

Updated 2025-10-02

Contributors are:

Who are from:

Tags

Ch.4 Alignment - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science

Related