Learn Before
Consider a sequence generation model being trained with a policy gradient method. If the advantage function A(x, y_<t, y_t) returns a negative value for a specific token y_t at a given step, the training objective will encourage the model to increase the probability of selecting that token y_t in similar future situations.
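A minimal numerical check of the sign behavior described in this card (a sketch, assuming PyTorch and a toy three-token vocabulary; all names are illustrative): with the utility U = Σ_t A(x, y_<t, y_t) · log π(y_t | x, y_<t), gradient ascent scales the log-probability gradient by the advantage, so the sign of A determines whether P(y_t) is pushed up or down.

```python
import torch

# Minimal sketch (toy three-token vocabulary, PyTorch assumed): one
# gradient-ascent step on a single term A * log pi of the utility U.
logits = torch.zeros(3, requires_grad=True)
token = 1            # the sampled token y_t
advantage = -4.0     # a negative advantage A(x, y_<t, y_t)

log_pi = torch.log_softmax(logits, dim=-1)[token]
(advantage * log_pi).backward()          # gradient of this step's A * log pi

with torch.no_grad():
    logits += 0.1 * logits.grad          # gradient *ascent* on U

print(torch.softmax(logits, dim=-1)[token])
# Below the initial 1/3: a negative advantage pushes P(y_t) down,
# while a positive advantage would push it up.
```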
Tags
Ch.4 Alignment - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
A sequence generation model is being trained using a policy gradient method. At a specific step t, the model must choose between two possible next tokens: 'innovative' and 'effective'. The model's internal calculations provide the following values for this step:

For token 'innovative':
- Log-probability log π(y_t | ...): -3.0
- Advantage A(...): +4.0

For token 'effective':
- Log-probability log π(y_t | ...): -1.2
- Advantage A(...): +2.0

Based on the utility function U used in policy gradient methods, which is a sum of log π · A terms over the sequence, which token's selection results in a larger (i.e., less negative) contribution to the total utility U for the entire sequence at this specific step t?
Analyzing Policy Gradient Updates for Text Generation
Consider a sequence generation model being trained with a policy gradient method. If the advantage function A(x, y_<t, y_t) returns a negative value for a specific token y_t at a given step, the training objective will encourage the model to increase the probability of selecting that token y_t in similar future situations.

Basic A2C Formulation for LLMs
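For context on the "Basic A2C Formulation for LLMs" card referenced above: the utility used throughout these cards is typically U(θ) = Σ_t A(x, y_<t, y_t) · log π_θ(y_t | x, y_<t). Below is a minimal sketch of the corresponding training loss, assuming per-token log-probs and advantages have already been computed (tensor names and shapes are hypothetical, PyTorch assumed):

```python
import torch

def a2c_policy_loss(log_probs: torch.Tensor, advantages: torch.Tensor,
                    mask: torch.Tensor) -> torch.Tensor:
    """U = sum_t A(x, y_<t, y_t) * log pi(y_t | x, y_<t); we *minimize* -U.

    log_probs:  (batch, seq) log pi of the sampled tokens
    advantages: (batch, seq) A(x, y_<t, y_t), e.g. return minus a value baseline
    mask:       (batch, seq) 1 for response tokens, 0 for prompt/padding
    """
    # Advantages act as fixed per-token weights; they are not
    # differentiated through.
    per_token = advantages.detach() * log_probs
    return -(per_token * mask).sum(dim=-1).mean()
```

With this loss, a token carrying a negative advantage receives a gradient that lowers its probability, which is the sign behavior the cards above test.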