Learn Before
Global Nature of Standard Transformer LLMs
Large language models built on the standard Transformer architecture are global models: during inference they must retain the complete left-context, i.e., the entire history of previously generated tokens, in order to predict the next token. This storage is handled by a Key-Value (KV) cache, which keeps the key and value representations of every past token, so the caching cost grows linearly with the length of the generated sequence.
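A minimal sketch of this behavior, assuming a toy hidden size and random vectors in place of learned token representations and projection matrices (the names K_cache, V_cache, and the attention helper are illustrative, not from any specific library): each generation step appends one key/value row to the cache, and the new token attends over the entire cached left-context, so memory use grows with every token produced.

```python
import numpy as np

def attention(q, K, V):
    # Scaled dot-product attention of a single query over all cached keys/values.
    scores = q @ K.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

d_model = 8          # toy hidden size, for illustration only
rng = np.random.default_rng(0)

# The KV cache: one growing array of key rows and value rows, one row per past token.
K_cache = np.empty((0, d_model))
V_cache = np.empty((0, d_model))

for step in range(5):                      # generate 5 tokens autoregressively
    x = rng.normal(size=d_model)           # stand-in for the current token's hidden state
    q, k, v = x, x, x                      # a real model applies learned W_q, W_k, W_v projections

    # Append this token's key/value: the cache now holds the entire left-context.
    K_cache = np.vstack([K_cache, k])
    V_cache = np.vstack([V_cache, v])

    out = attention(q, K_cache, V_cache)   # the new token attends to all past tokens
    print(f"step {step}: cache holds {len(K_cache)} tokens, "
          f"{K_cache.nbytes + V_cache.nbytes} bytes")
```

Running the loop prints a cache size that increases by one token (and a fixed number of bytes) per step, which is exactly the progressively growing caching cost described above.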
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Training Decoder-Only Language Models with Cross-Entropy Loss
Output Probability Calculation in Transformer Language Models
Global Nature of Standard Transformer LLMs
Processing Flow of Autoregressive Generation in a Decoder-Only Transformer
Initial Input Representation for Transformer Layers
Greedy Decoding in Language Models
Structure of a Transformer Block
A generative language model is designed to produce text by predicting the next token based solely on the sequence of tokens that came before it. If you were to adapt a standard Transformer decoder block for this specific auto-regressive task, which of its sub-layers would you remove, and why is this modification functionally necessary?
A language model is constructed using a stack of modified Transformer decoder blocks. Each block contains a self-attention sub-layer and a feed-forward network sub-layer, but lacks the sub-layer that would process information from a separate, secondary input sequence. This model is capable of performing a machine translation task, such as translating a German sentence into English, without any further architectural changes.
Function of Self-Attention in Auto-regressive Generation
Neural Network-Based Next-Token Probability Distribution
Learn After
Key-Value (KV) Cache in Transformer Inference
A language model using a standard Transformer architecture is generating a long sequence of text one token at a time. How does the computational effort required to generate the 500th token compare to the effort required for the 10th token?
Diagnosing Memory Issues in a Language Model
Difficulty of Training Transformers on Long Sequences
Evaluating Context Handling in Language Models
Explicit Context Encoding via Additional Memory Models