Final Hidden States in a Transformer Language Model
In a Transformer-based language model with L layers, the final hidden states are the sequence of output vectors from the last Transformer block, denoted h_1, h_2, ..., h_n for an input of n tokens. Each vector h_i represents the contextualized embedding of the i-th token after processing through the entire stack of L layers. This sequence of vectors encapsulates the model's final understanding of the input sequence and serves as the basis for subsequent predictions, such as generating the logits for the next token.
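The path from tokens to final hidden states to logits can be made concrete in code. Below is a minimal sketch, assuming the Hugging Face transformers library and the GPT-2 checkpoint (an illustrative choice, not specified by this note); output_hidden_states=True exposes the per-layer vectors, and the last entry of the returned tuple holds the final hidden states described above.

```python
# Minimal sketch: extracting final hidden states and projecting them to logits.
# Assumes the Hugging Face transformers library and GPT-2 (illustrative choices).
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The river bank was steep", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# out.hidden_states is a tuple: (embedding output, layer 1, ..., layer L);
# the last element holds the final hidden states h_1 ... h_n.
final_hidden = out.hidden_states[-1]     # shape: (batch, n, hidden_dim)

# The LM head projects each final hidden state onto the vocabulary.
logits = model.lm_head(final_hidden)     # shape: (batch, n, vocab_size)
assert torch.allclose(logits, out.logits, atol=1e-4)
```

The logits at the last position are what the model uses to predict the next token.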

Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Logits in Transformer Language Models
Next-Token Probability Calculation in Autoregressive Decoders
Diagram of the Decoding Phase
Diagram of the Transformer Language Model Forward Pass
Diagram of the Autoregressive Generation Architectural Flow
A decoder-only language model generates text one token at a time in a step-by-step process. Arrange the following steps in the correct chronological order for generating a single new token, given an initial prompt and any previously generated tokens.
Diagnosing a Generation Failure in a Decoder-Only Model
In the step-by-step generation process of a decoder-only language model, consider a hypothetical modification at generation step i: instead of using the initial prompt combined with all previously generated tokens as input, the model is only given the initial prompt. What is the most likely consequence of this change on the generated text?
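Both generation questions above turn on the autoregressive input loop. The sketch below (assuming, as above, the Hugging Face transformers library and GPT-2) makes the data flow explicit: at every step the model's input is the prompt concatenated with all previously generated tokens, and the next token is read off the logits at the last position.

```python
# Minimal sketch of greedy autoregressive decoding; library and model are
# illustrative assumptions, not specified by these cards.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tokenizer("The river bank", return_tensors="pt").input_ids
for _ in range(10):
    with torch.no_grad():
        logits = model(ids).logits           # (1, seq_len, vocab_size)
    next_id = logits[0, -1].argmax()         # greedy choice at the LAST position
    # Append the new token: the next iteration sees prompt + all prior outputs.
    # (Feeding only the prompt each step would repeat the same first token.)
    ids = torch.cat([ids, next_id.view(1, 1)], dim=-1)

print(tokenizer.decode(ids[0]))
```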
Learn After
Logits in Transformer Language Models
A language model processes the following two sentences independently:
- 'The river bank was steep and muddy.'
- 'He withdrew cash from the bank.'
Considering the final layer of the model, how would the output vector (the final hidden state) for the word 'bank' in the first sentence compare to the output vector for 'bank' in the second sentence?
A language model with multiple layers processes an input sequence to predict the next token. For a single token within that sequence, arrange the following representations in the chronological order they are computed by the model.
A machine learning engineer is building a system to classify the sentiment of customer reviews (e.g., positive, negative). They decide to use the internal representations from a pre-trained, multi-layered language model as features for their classifier. Which of the following model outputs would provide the most contextually-rich and effective representation of an entire review for this classification task?
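The 'bank' question and the feature-extraction question can both be checked empirically: the final hidden state of a token depends on its context. A minimal sketch, again assuming the Hugging Face transformers library and GPT-2 (the helper name bank_vector is hypothetical):

```python
# Compare the final hidden state of 'bank' in two different contexts.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")

def bank_vector(sentence: str) -> torch.Tensor:
    enc = tokenizer(sentence, return_tensors="pt")
    # GPT-2's BPE encodes ' bank' (with leading space) as a single token.
    pos = enc.input_ids[0].tolist().index(tokenizer.encode(" bank")[0])
    with torch.no_grad():
        out = model(**enc, output_hidden_states=True)
    return out.hidden_states[-1][0, pos]     # final hidden state of 'bank'

v1 = bank_vector("The river bank was steep and muddy.")
v2 = bank_vector("He withdrew cash from the bank.")
# Cosine similarity well below 1.0: the vectors are context-dependent.
print(torch.cosine_similarity(v1, v2, dim=0).item())
```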