Number of Attention Heads
When configuring multi-head self-attention sub-layers in Transformers, one must specify the number of heads, denoted as $n_\text{head}$. Increasing this hyperparameter expands the number of distinct subspaces over which attention is computed. In practical implementations, it is common to configure the model such that $d_\text{head} = d / n_\text{head}$, where $d$ is the model's hidden size, so each head attends over a lower-dimensional subspace while the total width of the layer stays fixed.
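As a minimal sketch of this convention (the concrete values and names such as `d_model` and `n_head` are illustrative assumptions, not taken from the course material), splitting the hidden size across heads might look like this in PyTorch:

```python
import torch

# Hypothetical configuration, for illustration only.
d_model = 512   # model / hidden size d
n_head = 8      # number of attention heads

# Common convention: each head gets a subspace of size
# d_head = d_model / n_head, keeping the total width unchanged.
assert d_model % n_head == 0, "d_model must be divisible by n_head"
d_head = d_model // n_head  # 512 / 8 = 64

batch, seq_len = 2, 16
x = torch.randn(batch, seq_len, d_model)

# Reshape the model dimension into n_head subspaces of size d_head,
# then move the head axis forward for per-head attention.
heads = x.view(batch, seq_len, n_head, d_head).transpose(1, 2)
print(heads.shape)  # torch.Size([2, 8, 16, 64])
```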
Tags
Ch.1 Pre-training - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Self-Attention layer understanding - Step 5 - Adding the time
Query, Key, and Value Projections in Multi-Head Attention
Scalar per Head in Multi-Head Attention
In a multi-head self-attention mechanism, what is the primary advantage of using multiple parallel attention 'heads'—each with its own unique set of learnable weight matrices—compared to using a single attention mechanism with the same total dimensionality?
Analysis of a Modified Attention Mechanism
Arrange the following computational steps of a multi-head self-attention layer in the correct chronological order, starting from the point where the layer receives its input representation matrix.
Diagnosing a Transformer Block Refactor: Attention/FFN Shapes and Norm Placement
Choosing Pre-Norm vs Post-Norm for a Deep Transformer: Stability, Shapes, and Sub-layer Semantics
Root-Cause Analysis of Training Instability After a “Minor” Transformer Block Change
Production Bug Triage: Transformer Block Norm Placement vs Attention/FFN Interface Contracts
Post-Norm vs Pre-Norm Migration: Verifying Tensor Shapes and Correct Sub-layer Wiring
Incident Review: Silent Performance Regression After “Optimization” of a Transformer Block
Design a Transformer Block Spec for a New Internal LLM Library (Shapes + Norm Placement)
You are reviewing a teammate’s implementation of a...
You’re debugging a Transformer block in an interna...
You’re implementing a single Transformer block in ...
Number of Attention Heads
Reducing KV Cache Complexity via Head Sharing
Embedding Size in Transformer Models
Evaluating Language Model Design Choices
A research team is tasked with building a language model to analyze a large collection of specialized legal contracts. These documents contain a unique vocabulary and sentence structure not commonly found in general web text. When deciding how to approach this task, which of the following considerations is the most critical to address first to ensure the model's effectiveness?
Trade-offs in Language Model Vocabulary Design
Hidden Size in Transformer Models
FFN Hidden Size in Transformers
Model Depth in Transformers
Vocabulary Size in Transformers
A machine learning engineer is designing a Transformer encoder for a complex language task. Their primary goal is to improve the model's ability to capture diverse and varied contextual relationships (e.g., syntactic, semantic, co-reference) from different parts of the input sequence simultaneously. Which hyperparameter adjustment would most directly address this specific goal?
Hyperparameter Tuning Trade-offs
An engineer is configuring a Transformer encoder. Match each key hyperparameter to its specific architectural role.
Learn After
A machine learning engineer observes that their language model struggles to understand sentences with multiple, distinct syntactic relationships (e.g., identifying both the subject-verb and modifier-noun relationships in 'The quick brown fox, which was very agile, jumps over the lazy dog.'). The model's self-attention mechanism is currently configured with a single attention head. Which of the following changes is most likely to directly address this specific problem, and why?
Evaluating the Trade-offs of the Number of Attention Heads
Choosing the Number of Attention Heads for a Specific Task