Learn Before
Scalar per Head in Multi-Head Attention
In multi-head attention mechanisms, each attention head can be assigned its own scalar value, allowing a distinct behavior or bias to be applied per head. ALiBi is a well-known example: each head's attention scores receive a linear positional penalty scaled by a fixed, head-specific slope.
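As a minimal sketch of this idea, the snippet below computes ALiBi's per-head slopes (for a power-of-two head count they form a geometric progression starting at 2^(-8/n_heads)) and builds the positional bias one head would add to its attention scores. The function names are illustrative, not from any particular library.

```python
def alibi_slopes(n_heads: int) -> list[float]:
    # ALiBi assigns each head a fixed, non-learned scalar slope.
    # For n_heads a power of two, the slopes are the geometric
    # progression 2^(-8/n), 2^(-16/n), ..., 2^(-8).
    ratio = 2.0 ** (-8.0 / n_heads)
    return [ratio ** (h + 1) for h in range(n_heads)]

def alibi_bias(slope: float, seq_len: int) -> list[list[float]]:
    # Bias added to one head's pre-softmax attention scores:
    # slope * -(i - j) for query position i and key position j,
    # so more distant keys are penalized more strongly.
    return [[slope * -(i - j) for j in range(seq_len)]
            for i in range(seq_len)]
```

For example, with 8 heads the slopes are 1/2, 1/4, ..., 1/256, so each head attends over a different effective context length while sharing the same attention computation.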
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Self-Attention layer understanding - Step 5 - Adding the time
Query, Key, and Value Projections in Multi-Head Attention
Scalar per Head in Multi-Head Attention
In a multi-head self-attention mechanism, what is the primary advantage of using multiple parallel attention 'heads'—each with its own unique set of learnable weight matrices—compared to using a single attention mechanism with the same total dimensionality?
Analysis of a Modified Attention Mechanism
Arrange the following computational steps of a multi-head self-attention layer in the correct chronological order, starting from the point where the layer receives its input representation matrix.
Diagnosing a Transformer Block Refactor: Attention/FFN Shapes and Norm Placement
Choosing Pre-Norm vs Post-Norm for a Deep Transformer: Stability, Shapes, and Sub-layer Semantics
Root-Cause Analysis of Training Instability After a “Minor” Transformer Block Change
Production Bug Triage: Transformer Block Norm Placement vs Attention/FFN Interface Contracts
Post-Norm vs Pre-Norm Migration: Verifying Tensor Shapes and Correct Sub-layer Wiring
Incident Review: Silent Performance Regression After “Optimization” of a Transformer Block
Design a Transformer Block Spec for a New Internal LLM Library (Shapes + Norm Placement)
You are reviewing a teammate’s implementation of a...
You’re debugging a Transformer block in an interna...
You’re implementing a single Transformer block in ...
Number of Attention Heads
Reducing KV Cache Complexity via Head Sharing
Learn After
Evaluating a Modification to Multi-Head Attention
An engineer modifies a standard multi-head attention layer by multiplying the output of each attention head by a unique, pre-defined (non-learnable) scalar value before the final concatenation and projection. What is the most significant functional consequence of this modification?
Rationale for Per-Head Scalars in Attention Mechanisms
Geometric Progression for ALiBi's Scalar per Head