Learn Before
Transformer
The Transformer is a deep learning architecture built exclusively on attention mechanisms, foregoing traditional recurrent or convolutional layers. A defining property of the Transformer is its superior scaling behavior: its performance consistently improves as the dataset size, model size, and computational budget increase. This architecture has become foundational, driving state-of-the-art results across natural language processing, computer vision, speech recognition, and reinforcement learning.
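Since the architecture is defined entirely by attention, a minimal sketch of scaled dot-product self-attention, the core operation inside every Transformer layer, may help make the idea concrete. The shapes, the single-head formulation, and the use of PyTorch below are illustrative assumptions, not details taken from this note.

```python
# Minimal sketch of scaled dot-product self-attention (single head).
# Shapes and the choice of PyTorch are assumptions made for illustration.
import math
import torch

def self_attention(x, w_q, w_k, w_v):
    """x: (seq_len, d_model); w_q/w_k/w_v: (d_model, d_k) projection matrices."""
    q = x @ w_q                                  # queries
    k = x @ w_k                                  # keys
    v = x @ w_v                                  # values
    scores = q @ k.T / math.sqrt(k.shape[-1])    # every token scored against every token
    weights = torch.softmax(scores, dim=-1)      # attention weights, each row sums to 1
    return weights @ v                           # context as a weighted mix of values

# Toy usage: a "sentence" of 5 tokens with 16-dimensional embeddings.
seq_len, d_model, d_k = 5, 16, 8
x = torch.randn(seq_len, d_model)
w_q, w_k, w_v = (torch.randn(d_model, d_k) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)
print(out.shape)  # torch.Size([5, 8])
```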
Tags
Data Science
Ch.1 Pre-training - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
D2L
Dive into Deep Learning @ D2L
Related
Neural Machine Translation by Jointly Learning to Align and Translate
Effective Approaches to Attention-based Neural Machine Translation
Attention Motivation
Example of how Attention is used in Machine Translation
The Illustrated Transformer
Attention Is All You Need
Attention is all you need; Attentional Neural Network Models | Łukasz Kaiser | Masterclass
Tensor2Tensor Intro
Transformer model
Transformer
Efficient Transformers: A Survey
Evaluation of Efficient Transformers
Learn After
Self-attention layers' first approach
Transformers in contextual generation and summarization
Huggingface Model Summary
A Survey of Transformers (Lin et al., 2021)
Overview of a Transformer
Model Usage of Transformers
Attention in vanilla Transformers
Transformer Variants (X-formers)
The Pre-training and Fine-tuning Paradigm
Architectural Categories of Pre-trained Transformers
Computational Cost of Self-Attention in Transformers
Quadratic Complexity's Impact on Transformer Inference Speed
Pre-Norm Architecture in Transformers
Critique of the Transformer Architecture's Core Limitation
A research team is building a model to summarize extremely long scientific papers. They are comparing two distinct architectural approaches:
- Approach 1: Processes the input text sequentially, token by token, updating an internal state that is passed from one step to the next.
- Approach 2: Processes all input tokens simultaneously, using a mechanism that directly relates every token to every other token in the input to determine context.
Which of the following statements best analyzes the primary trade-off between these two approaches for this specific task?
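As a rough illustration of the two approaches described in the question (not an answer to it), the sketch below contrasts a sequential, state-passing update with an all-pairs attention computation; the tensor shapes and update rule are assumptions made for this example.

```python
# Illustrative contrast of the two approaches above (assumed shapes and weights).
import torch

seq_len, d = 6, 8
tokens = torch.randn(seq_len, d)

# Approach 1: sequential processing. Each step depends on the previous state,
# so the seq_len steps cannot run in parallel.
state = torch.zeros(d)
w = torch.randn(d, d)
for t in range(seq_len):
    state = torch.tanh(tokens[t] + state @ w)    # state carries context forward

# Approach 2: all tokens at once. An n x n score matrix relates every token
# to every other token; this parallelizes well but grows quadratically with n.
scores = tokens @ tokens.T / d ** 0.5
context = torch.softmax(scores, dim=-1) @ tokens
print(state.shape, context.shape)
```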
Architectural Design Choice for Machine Translation
Enablers of Universal Language Capabilities
Model Depth in Transformers
Generalization of the Language Modeling Concept
Transformer Block Sub-Layers
Standard Optimization Objective for Transformer Language Models