Generalization of the Language Modeling Concept
Alongside the rise of the Transformer architecture, the concept of language modeling was generalized to encompass models that learn to predict words in various ways (for example, recovering masked tokens from their surrounding context), rather than strictly predicting the next token in a sequence. Many powerful Transformer-based models were pre-trained on these diverse word prediction tasks and then applied successfully to a wide variety of downstream tasks.
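To make the contrast concrete, the following minimal sketch builds training examples for two word prediction tasks: strict next-token prediction and masked-token prediction. It is an illustration under stated assumptions, not the procedure of any particular model: the function names, the `[MASK]` placeholder string, and the default masking rate of 0.15 are all hypothetical choices made for this example.

```python
import random


def next_token_pairs(tokens):
    """Next-token objective: each prefix is the context, the token after it is the target."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def masked_token_pairs(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Masked objective: hide some tokens and predict them from context on both sides.

    mask_prob and mask_token are illustrative assumptions, not values from the source.
    """
    if not tokens:
        return [], {}
    rng = random.Random(seed)
    # Choose each position independently with probability mask_prob.
    positions = [i for i in range(len(tokens)) if rng.random() < mask_prob]
    # Guarantee at least one masked position so every example has a target.
    if not positions:
        positions = [rng.randrange(len(tokens))]
    corrupted = [mask_token if i in positions else tok for i, tok in enumerate(tokens)]
    targets = {i: tokens[i] for i in positions}
    return corrupted, targets


if __name__ == "__main__":
    sentence = ["The", "dog", "ran", "home"]
    print(next_token_pairs(sentence))
    print(masked_token_pairs(sentence))
```

In the first task the model only ever sees the tokens to the left of the target. In the second, the corrupted sequence keeps the tokens on both sides of each hidden position, which is one of the ways the prediction task was broadened.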
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Types of Language Models
Evaluating language models
Shannon's Foundational Work on Language Modeling
Generalization of the Language Modeling Concept
Chain Rule for Sequence Probability
Deep Learning Approach to Language Modeling
Output Token Sequence in LLMs
Start of Sentence (SOS) Token
[CLS] Token as a Start Symbol
A system is designed to predict the probability of a sequence of words. For the sequence 'The dog ran', the system provides the following conditional probabilities:
- The probability of 'The' occurring at the start of a sequence is 0.2.
- The probability of 'dog' occurring after 'The' is 0.3.
- The probability of 'ran' occurring after 'The dog' is 0.7.
Based on the fundamental principle used by such systems to determine the likelihood of a full sequence, what is the overall probability of the sequence 'The dog ran'?
Analyzing Language Model Probability Assignments
A system's primary goal is to predict the probability of a sequence of tokens. To calculate the total probability for the sequence 'The quick brown fox', it breaks the problem down into a series of conditional probability calculations. Arrange the following calculations in the correct order that the system would use to find the total probability of the sequence.
Evaluating a Language Model's Probabilistic Output
Self-attention layers' first approach
Transformers in contextual generation and summarization
Huggingface Model Summary
A Survey of Transformers (Lin et al., 2021)
Overview of a Transformer
Model Usage of Transformers
Attention in vanilla Transformers
Transformer Variants (X-formers)
The Pre-training and Fine-tuning Paradigm
Architectural Categories of Pre-trained Transformers
Computational Cost of Self-Attention in Transformers
Quadratic Complexity's Impact on Transformer Inference Speed
Pre-Norm Architecture in Transformers
Critique of the Transformer Architecture's Core Limitation
A research team is building a model to summarize extremely long scientific papers. They are comparing two distinct architectural approaches:
- Approach 1: Processes the input text sequentially, token by token, updating an internal state that is passed from one step to the next.
- Approach 2: Processes all input tokens simultaneously, using a mechanism that directly relates every token to every other token in the input to determine context.
Which of the following statements best analyzes the primary trade-off between these two approaches for this specific task?
Architectural Design Choice for Machine Translation
Enablers of Universal Language Capabilities
Model Depth in Transformers
Transformer Block Sub-Layers
Standard Optimization Objective for Transformer Language Models