Multiple Choice

A reward model for a generative text model calculates a quality score for a given output using the formula $r = \mathbf{h}_{\text{last}} \mathbf{W}_r$. In this formula, $\mathbf{h}_{\text{last}}$ is the vector representation of the final token in the generated text, and $\mathbf{W}_r$ is a learned weight matrix that transforms this vector into a scalar score, $r$. What is a primary conceptual limitation of this specific reward calculation method, especially when evaluating lengthy and complex text?
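
For concreteness, here is a minimal sketch of how such a last-token reward head is commonly implemented, assuming a PyTorch transformer backbone; the class and variable names are illustrative, not taken from the question. Note that the scalar score depends only on the final token's hidden state, which is the crux of the limitation being asked about.

```python
import torch
import torch.nn as nn

class LastTokenRewardHead(nn.Module):
    """Scores a sequence by projecting ONLY the last token's hidden state
    to a scalar: r = h_last @ W_r. Hypothetical illustration."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        # W_r: learned projection from hidden_dim to a single scalar (no bias).
        self.w_r = nn.Linear(hidden_dim, 1, bias=False)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_dim) from a transformer backbone.
        h_last = hidden_states[:, -1, :]      # keep only the final token's vector
        return self.w_r(h_last).squeeze(-1)   # (batch,) scalar rewards

# Toy usage: the reward ignores everything except the final hidden state,
# so any judgment about a long sequence must survive compression into h_last.
head = LastTokenRewardHead(hidden_dim=768)
fake_hidden = torch.randn(2, 512, 768)  # batch of 2 sequences, 512 tokens each
print(head(fake_hidden).shape)          # torch.Size([2])
```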

Tags

Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology