Multiple Choice

A language model computes probability distributions over a sequence of tokens x in two stages: an encoder with parameters θ produces representations, which are passed to a Softmax layer with weight matrix W. The model consistently outputs a nearly uniform probability distribution at every token position; that is, every word in the vocabulary is assigned almost equal probability regardless of the input. Which of the following is the most direct and plausible explanation for this behavior?
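
The question turns on how the Softmax layer maps logits to probabilities. Below is a minimal sketch in Python with NumPy (the names, sizes, and the logits = h @ W formulation are assumptions for illustration, not the question's own notation). It shows that when the logits are all nearly equal (for instance, because W is effectively zero, as in an untrained or zero-initialized output layer), the Softmax output is uniform over the vocabulary, matching the behavior described above:

    import numpy as np

    def softmax(z):
        z = z - z.max()              # subtract the max logit for numerical stability
        e = np.exp(z)
        return e / e.sum()

    rng = np.random.default_rng(0)
    d, vocab = 8, 5                  # hypothetical hidden size and vocabulary size
    h = rng.normal(size=d)           # an encoder representation for one position

    W_degenerate = np.zeros((d, vocab))      # effectively-zero weights: every logit is 0
    print(softmax(h @ W_degenerate))         # [0.2 0.2 0.2 0.2 0.2], a uniform distribution

    W_healthy = rng.normal(size=(d, vocab))  # weights with some learned-like structure
    print(softmax(h @ W_healthy))            # peaked, input-dependent distribution

Since Softmax depends only on differences between logits, any set of identical logits (not just zeros) yields the same uniform output, so the symptom points to logits that carry no information about the input.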

Tags

Ch.1 Pre-training - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science