Probability Normalization over a Candidate Set
The conditional probability of a specific token, given a context, can be determined by normalizing its score against the scores of the other tokens. This is achieved by dividing the score of the target token by the sum of the scores for all tokens within a defined candidate set. This method ensures the resulting probabilities for all tokens in the set sum to 1. The general formula is:

Pr(y_i | x, y_{<i}) = score(y_i) / Σ_{y' ∈ C} score(y')

where C is the candidate set, each score is nonnegative, and every resulting probability lies between 0 and 1.
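The normalize-by-sum idea above can be sketched in a few lines, assuming nonnegative scores; the `normalize` helper and the example scores are illustrative, not from the source:

```python
def normalize(scores):
    """Turn nonnegative candidate scores into a probability distribution
    by dividing each score by the sum over the candidate set."""
    total = sum(scores.values())
    return {token: s / total for token, s in scores.items()}

# Illustrative candidate set and scores (assumed values):
probs = normalize({"mat": 6.0, "rug": 3.0, "floor": 0.5, "chair": 0.5})
# The resulting probabilities sum to 1 over the candidate set.
```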
Tags
Ch.5 Inference - Foundations of Large Language Models
Computing Sciences
Related
Interpreting Autoregressive Model Inputs
An autoregressive model is given an input prompt, x, which is the sequence 'The best movie I ever saw was'. The model has already generated the partial output sequence, y_{<i}, which is 'about a'. The model's next task is to predict the probability of the next token, y_i, based on the standard conditional probability notation Pr(y_i | x, y_{<i}). What is the actual, full sequence of tokens the model uses as its context to make this prediction?
In the context of autoregressive sequence generation, the notation Pr(y_i | x, y_{<i}) implies that the model treats the input x and the previously generated tokens y_{<i} as two separate, distinct sources of information for predicting the next token y_i.
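In practice the model conditions on the concatenation of the prompt and the tokens generated so far. A minimal sketch, using whitespace tokenization purely for illustration (real models use subword tokenizers):

```python
# Assumed example sequences from the question above.
prompt = "The best movie I ever saw was".split()   # x
generated = "about a".split()                      # y_{<i}

# The full context for predicting y_i is the prompt followed by
# everything generated so far, as one sequence.
context = prompt + generated
```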
Learn After
Conditional Probability Formula for Autoregressive Models using Softmax
A language model is predicting the next word in a sequence. After processing the context, it has assigned the following unnormalized scores to a set of four candidate words: 'mat' (score=6.0), 'rug' (score=3.0), 'floor' (score=0.5), and 'chair' (score=0.5). To convert these scores into a valid probability distribution over this set, what is the final probability assigned to the word 'mat'?
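A minimal sketch of the computation the question above asks for, assuming the scores are normalized directly by their sum as defined in this section (rather than passed through an exponential first):

```python
# Unnormalized scores from the question.
scores = {"mat": 6.0, "rug": 3.0, "floor": 0.5, "chair": 0.5}

total = sum(scores.values())       # 6.0 + 3.0 + 0.5 + 0.5 = 10.0
p_mat = scores["mat"] / total      # 6.0 / 10.0 = 0.6
```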
A language model is evaluating three candidate tokens (A, B, C) to follow a given context. Initially, their scores are: Token A = 4, Token B = 4, Token C = 2. If the score for Token C is increased to 12, while the scores for Token A and Token B remain unchanged, how does this affect the normalized probabilities of Token A and Token B?
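The scenario above can be checked numerically. Although the raw scores of Token A and Token B are unchanged, raising Token C's score grows the shared denominator, so the normalized probabilities of A and B both fall. A sketch, assuming direct score normalization:

```python
def normalize(scores):
    """Divide each score by the sum over the candidate set."""
    total = sum(scores.values())
    return {k: v / total for k, v in scores.items()}

before = normalize({"A": 4, "B": 4, "C": 2})    # denominator 10
after = normalize({"A": 4, "B": 4, "C": 12})    # denominator 20
# A and B each drop from 4/10 to 4/20 even though their scores are fixed.
```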
Comparing Model Confidence via Probability Normalization
Softmax Function
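The softmax function applies the same normalize-by-sum idea to exponentiated scores, which handles negative logits and guarantees nonnegative terms. A minimal sketch with the standard max-subtraction trick for numerical stability (example logits are assumed):

```python
import math

def softmax(logits):
    """Exponentiate each logit, then normalize by the sum.
    Subtracting the max logit first avoids overflow without
    changing the result."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([6.0, 3.0, 0.5, 0.5])
```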