Google

The notation $$\pi_\theta(y_t | \mathbf{x}, \mathbf{y}_{<t})$$ is often used to represent the policy of an autoregressive model. It denotes the conditional probability of selecting output $$y_t$$ at time step $$t$$, given an input context $$\mathbf{x}$$ and the sequence of previously generated outputs $$\mathbf{y}_{<t}$$. This policy is governed by the model's parameters $$\theta$$. This notation is functionally equivalent to the standard probability notation $$\Pr_\theta(y_t | \mathbf{x}, \mathbf{y}_{<t})$$.

Policy Notation for Autoregressive Models ($$\pi_\theta$$)

A language model, with parameters represented by θ, is translating the English sentence 'Hello, how are you?' into French. It has already generated the partial translation 'Bonjour, comment'. The model is now deciding the next word. What does the expression `π_θ('allez' | 'Hello, how are you?', 'Bonjour, comment')` represent in this context?

Match each component of the policy notation `π_θ(y_t | X, y_<t)` to its correct description in the context of an autoregressive language model.

A data scientist is building a model to predict a house's sale price. The available data for each house includes its square footage, number of bedrooms, and zip code. They propose to represent the model's prediction process using the notation π_θ(y_t | X, y_<t>), where X is the set of house features and y_t is a part of the final predicted price. Evaluate the appropriateness of using this specific notation for this task. Justify your reasoning.

Appropriateness of Autoregressive Notation

An objective function, denoted as $$U(\mathbf{x}, \mathbf{y}; \theta)$$, can be formulated to guide the training of a sequence generation model, often in the context of policy optimization. It is calculated by summing the weighted log-probabilities of the policy over each step of the generated sequence. The formula is $$U(\mathbf{x}, \mathbf{y}; \theta) = \sum_{t=1}^{T} A(\mathbf{x}, y_t, \mathbf{y}_{<t}) \log \pi_\theta(y_t|\mathbf{x}, \mathbf{y}_{<t})$$, where $$\pi_\theta$$ is the model's policy (a probability distribution over outputs) and $$A(\cdot)$$ is a function that assigns a weight or advantage to each step.

Learn Before

Related