1Cademy - Dynamic State in LLM Policy

Learn Before

Policy Formula for LLMs in Reinforcement Learning

Short Answer

Dynamic State in LLM Policy

An autoregressive language model is generating a response. Its policy for choosing the next token is defined by the formula π(a|s) = Pr(y_t | x, y_<t), where x is the input prompt and y_<t is the sequence of tokens generated before step t. Explain how the 'state' (s) changes from one token generation step to the next, and describe why this change is fundamental to the model's ability to produce coherent text.
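To make the notation concrete, the sketch below shows one way the state could be represented in code. It is an illustration under stated assumptions, not a real language model: `toy_policy`, its lookup table, and the helper names `step` and `greedy_rollout` are all hypothetical. The state is modeled as the pair (x, y_<t), the action is the chosen token y_t, and each step appends the action to the generated prefix.

```python
from typing import Dict, List, Tuple

# State s = (x, y_<t): the fixed prompt plus every token generated so far.
State = Tuple[Tuple[str, ...], Tuple[str, ...]]


def toy_policy(state: State) -> Dict[str, float]:
    """Hypothetical stand-in for Pr(y_t | x, y_<t); not a real model."""
    prompt, generated = state
    context = prompt + generated
    last = context[-1]
    table = {
        "the": {"cat": 0.7, "mat": 0.3},
        "cat": {"sat": 0.9, "<eos>": 0.1},
        "sat": {"on": 1.0},
        "on": {"the": 1.0},
        "mat": {"<eos>": 1.0},
    }
    dist = dict(table.get(last, {"<eos>": 1.0}))
    # The distribution depends on the whole state, not only the last token:
    # once "cat" has appeared earlier, a later "the" favors "mat".
    if last == "the" and "cat" in context[:-1]:
        dist = {"mat": 0.9, "cat": 0.1}
    return dist


def step(state: State, action: str) -> State:
    """State transition: the chosen token is appended to y_<t; x is unchanged."""
    prompt, generated = state
    return (prompt, generated + (action,))


def greedy_rollout(prompt: List[str], max_steps: int = 10) -> List[State]:
    """Pick the most probable token at each step and record every state."""
    state: State = (tuple(prompt), ())
    trajectory = [state]
    for _ in range(max_steps):
        dist = toy_policy(state)
        action = max(dist, key=dist.get)  # a = y_t
        state = step(state, action)       # s_{t+1} = (x, y_<t followed by y_t)
        trajectory.append(state)
        if action == "<eos>":
            break
    return trajectory


if __name__ == "__main__":
    for t, (x, y_prev) in enumerate(greedy_rollout(["the"])):
        print(t, x, y_prev)
```

In this toy setup the same last token "the" leads to different next-token choices at different steps, because the earlier tokens are part of the state that the policy conditions on.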

Updated 2025-10-04
