Learn Before
Transformer
Computational Cost of Self-Attention in Transformers
The self-attention mechanism, a core component of the Transformer architecture, has a computational cost that scales quadratically with the length of the input sequence: every token attends to every other token, so an n-token input requires computing an n × n matrix of attention scores. This quadratic growth in both compute and memory makes it expensive, and often impractical, to train or deploy Transformer-based models on tasks involving very long texts.
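A minimal NumPy sketch (not from the source) of where the quadratic term comes from: the score matrix Q Kᵀ has shape (n, n), so compute and memory grow with the square of the sequence length. All names here (self_attention, W_q, W_k, W_v) are illustrative, not a specific library's API.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Single-head self-attention over a sequence of n token embeddings.

    X: (n, d) input embeddings; W_q, W_k, W_v: (d, d) projection matrices.
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v              # each (n, d)
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # (n, n) -- quadratic in n
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the key axis
    return weights @ V                               # (n, d)

# Doubling the sequence length quadruples the score matrix:
n, d = 512, 64
rng = np.random.default_rng(0)
X = rng.standard_normal((n, d))
W = [rng.standard_normal((d, d)) for _ in range(3)]
out = self_attention(X, *W)   # score matrix: 512 x 512 = 262,144 entries
# At n = 1024 it would hold 1,048,576 entries -- 4x the compute and memory.
```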
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Self-attention layers' first approach
Transformers in contextual generation and summarization
Huggingface Model Summary
A Survey of Transformers (Lin et al., 2021)
Overview of a Transformer
Model Usage of Transformers
Attention in vanilla Transformers
Transformer Variants (X-formers)
The Pre-training and Fine-tuning Paradigm
Architectural Categories of Pre-trained Transformers
Transformer Blocks and Post-Norm Architecture
Model Depth (L) in Transformers
Transformers as Language Models
Computational Cost of Self-Attention in Transformers
Learn After
KV Cache during Transformer Inference
Architectural Adaptation of LLMs for Long Sequences
Cross-Layer Parameter Sharing in Transformers