Comparison of Output Probability Meaning: Language Modeling vs. Encoder Pre-training
The interpretation of the output probability distribution, p_i, differs significantly between standard language models and encoder pre-training. In standard language modeling, which uses an auto-regressive decoding process, p_i represents the probability distribution over the next word, given that the model observes only the preceding tokens up to position i. By contrast, during encoder pre-training, the model has access to the entire input sequence at once, making it meaningless to predict tokens that are already observed; instead, p_i at a masked position is interpreted as a distribution over the token hidden there.
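As a minimal sketch of this contrast (with random NumPy arrays standing in for a trained encoder, so the names h_causal, h_bidir, W, and softmax are illustrative, not from the source), both settings compute p_i = Softmax(h_i W), but the distribution is read differently:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model, seq_len = 8, 4, 5

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Softmax layer weight matrix W (random stand-in for trained parameters).
W = rng.normal(size=(d_model, vocab_size))

# Auto-regressive language modeling: h_i may depend only on tokens up to
# position i, so p_i = Softmax(h_i W) is read as a distribution over the
# NEXT token, at position i+1.
h_causal = rng.normal(size=(seq_len, d_model))  # stand-in for causal encoder states
p_next = softmax(h_causal @ W)                  # p_next[i] predicts the token after i

# Encoder pre-training: the encoder reads the whole sequence at once, so
# predicting an already-observed token is meaningless; instead a position
# is masked, and p_i there is read as a distribution over the hidden token.
masked_pos = 2
h_bidir = rng.normal(size=(seq_len, d_model))   # stand-in for bidirectional states
p_masked = softmax(h_bidir[masked_pos] @ W)     # distribution for the masked token

assert np.isclose(p_masked.sum(), 1.0)
```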
Tags
Ch.1 Pre-training - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Simplified Notation for Parameterized Models
Evaluating Component Independence in a Language Model
A language model computes probability distributions for a sequence of tokens x using a two-stage process: an encoder with parameters θ generates representations, which are then passed to a Softmax layer with a weight matrix W. This model consistently outputs a nearly uniform probability distribution for every token position, meaning every word in the vocabulary is considered almost equally likely, regardless of the input. Which of the following is the most direct and plausible explanation for this behavior?
A language model calculates the probability distribution for each token in an input sequence, x, by first generating a sequence of numerical representations and then applying a final transformation. Arrange the following steps in the correct computational order to produce the probability vector, p_i, for the token at a specific position i.
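For concreteness, here is a minimal sketch of that computational order, again with random arrays in place of a trained encoder (the names h, W, and softmax are illustrative assumptions, not from the source):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(1)
seq_len, d_model, vocab_size = 6, 4, 10

# Step 1: encode the token sequence x into representations h_1 ... h_n
# (a random placeholder for an encoder with parameters theta).
h = rng.normal(size=(seq_len, d_model))

# Step 2: project the representation at position i with the Softmax
# layer's weight matrix W to get a vector of vocabulary scores.
W = rng.normal(size=(d_model, vocab_size))
i = 3
logits_i = h[i] @ W

# Step 3: normalize the scores with Softmax to obtain the probability
# vector p_i over the vocabulary.
p_i = softmax(logits_i)
assert np.isclose(p_i.sum(), 1.0)
```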
Learn After
Interpreting Model Output Probabilities
An engineer is working with two different text-processing systems. System A generates a story one word at a time. To choose the word at position i, it calculates a probability distribution over the vocabulary based only on the words from position 1 to i-1. System B is used for a fill-in-the-blank task. Given a sentence with a missing word at position i, it calculates a probability distribution for that position using all other words in the sentence (both before and after position i) as context. Which statement best analyzes the meaning of the probability distributions in these two systems?
Consider a model tasked with predicting a masked word within a complete sentence by looking at all surrounding words. The probability distribution calculated for this masked position has the same fundamental interpretation as the distribution from a model that generates a sentence one word at a time, where each new word is predicted based only on the words that came before it.