Explain why the strategy a language model uses to select the next token is described as a 'policy' that is a probability distribution over its entire vocabulary, rather than a function that simply selects a single, predetermined 'best' token.


For a Large Language Model (LLM), the policy, denoted $$\pi$$, is the probability distribution over the vocabulary of possible next tokens, conditioned on the preceding sequence of tokens that constitutes the context. In essence, it is the strategy the LLM uses to decide which token to generate next: it assigns a probability to every candidate token rather than committing in advance to a single one.
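
The difference between a distribution and a fixed choice can be sketched with a toy example. The four-token vocabulary and the probabilities below are invented for illustration and are not taken from any real model; `greedy_token` and `sample_token` are hypothetical helper names.

```python
import random

# Hypothetical policy for the context "The cat sat on the": each candidate
# next token is mapped to an assumed probability (values are illustrative).
policy = {"mat": 0.55, "sofa": 0.25, "floor": 0.15, "moon": 0.05}


def greedy_token(dist):
    """Always return the single highest-probability token."""
    return max(dist, key=dist.get)


def sample_token(dist, rng):
    """Draw one token at random, weighted by its probability."""
    tokens = list(dist)
    weights = [dist[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed so the run is repeatable

# Collect the distinct tokens each strategy produces over many draws.
greedy_picks = {greedy_token(policy) for _ in range(1000)}
sampled_picks = {sample_token(policy, rng) for _ in range(1000)}

print("greedy :", greedy_picks)
print("sampled:", sampled_picks)
```

Greedy selection collapses the policy to one token every time, while sampling can produce any token that has non-zero probability. Both are ways of reading a token off the same underlying distribution, which is why the policy itself is defined as the distribution and not as either selection rule.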

Policy in the Context of LLMs

The policy for a Large Language Model is formally expressed as a conditional probability. Specifically, the probability of taking action $$a$$ in state $$s$$, denoted $$\pi(a|s)$$, is the probability of generating the next token $$y_t$$ given the input prompt $$\mathbf{x}$$ and the sequence of previously generated tokens $$\mathbf{y}_{< t}$$. The formula is: $$\pi(a|s) = \Pr(y_t | \mathbf{x}, \mathbf{y}_{< t})$$. In this formulation, the action $$a$$ corresponds to the next token $$y_t$$, and the state $$s$$ corresponds to the context sequence $$(\mathbf{x}, \mathbf{y}_{< t})$$.
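
A minimal sketch of the mapping from state to distribution follows. It assumes the model emits one score (logit) per vocabulary token and that a softmax converts those scores into probabilities, which is a common setup but is not stated in the text above. `VOCAB`, `toy_logits`, and `policy` are hypothetical names, and the scores are made up.

```python
import math

VOCAB = ["mat", "sofa", "floor", "moon"]  # toy vocabulary (assumption)


def toy_logits(prompt, generated):
    """Hypothetical stand-in for a model's forward pass: one score per token.

    A real model would compute these scores from the state; here the last
    score depends only on the state's length so the example is deterministic.
    """
    n = len(prompt) + len(generated)
    return [2.0, 1.0, 0.5, -1.0 + 0.1 * n]


def policy(prompt, generated):
    """Return pi(a | s) for every action a, where s = (prompt, generated)."""
    logits = toy_logits(prompt, generated)
    m = max(logits)  # subtract the max before exponentiating for stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return {tok: e / total for tok, e in zip(VOCAB, exps)}


x = ["The", "cat", "sat", "on", "the"]  # input prompt x
y_prev = []                             # previously generated tokens y_<t
pi = policy(x, y_prev)                  # distribution over actions a = y_t

for token, prob in pi.items():
    print(f"pi({token!r} | s) = {prob:.3f}")
```

Each call takes the state $$(\mathbf{x}, \mathbf{y}_{< t})$$ and returns a probability for every action in the vocabulary, and those probabilities sum to 1.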

Policy Formula for LLMs in Reinforcement Learning

An autoregressive language model has processed the input 'The cat sat on the' and is now deciding the next word to generate. At this specific step, which of the following best describes the model's 'policy'?

Based on the description of a language model's policy as a probability distribution over the vocabulary, analyze the behavior of Model A and Model B. Which model's generation process directly reflects this definition of a policy, and why is the other model's approach a more limited interpretation?
