Cross-layer Multi-head Attention
Cross-layer Multi-head Attention is an architectural variant of the Transformer in which an attention layer directly accesses the Key-Value (KV) cache of a lower layer. By sharing KV activations or attention weights across consecutive layers, a query in the current layer can reuse the keys and values already computed by the layer below rather than projecting its own. This sharing reduces both the computational cost and the overall memory footprint of the model, since fewer KV projections are learned and fewer KV caches need to be stored during inference.
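As a rough illustration of the mechanism, the PyTorch sketch below (not from the original text; the names CrossLayerAttention, has_kv, and shared_kv are illustrative assumptions) lets a "producer" layer compute and expose its K/V tensors while a "consumer" layer above it projects only queries and attends over that shared K/V. Residual connections, normalization, and feed-forward sublayers are omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossLayerAttention(nn.Module):
    """Attention layer that either computes its own K/V ("producer")
    or reuses the K/V produced by a lower layer ("consumer")."""

    def __init__(self, d_model: int, n_heads: int, has_kv: bool):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        # Only producer layers own K/V projections; consumers reuse them.
        self.kv_proj = nn.Linear(d_model, 2 * d_model) if has_kv else None
        self.o_proj = nn.Linear(d_model, d_model)

    def _split(self, t, B, T):
        # (B, T, d_model) -> (B, heads, T, d_head)
        return t.view(B, T, self.n_heads, self.d_head).transpose(1, 2)

    def forward(self, x, shared_kv=None):
        B, T, _ = x.shape
        q = self._split(self.q_proj(x), B, T)
        if self.kv_proj is not None:
            k, v = self.kv_proj(x).chunk(2, dim=-1)
            shared_kv = (self._split(k, B, T), self._split(v, B, T))
        k, v = shared_kv  # consumers attend over the lower layer's K/V
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        out = out.transpose(1, 2).reshape(B, T, -1)
        return self.o_proj(out), shared_kv

# Two-layer stack: the first layer builds the KV cache, the second reuses it.
layers = nn.ModuleList([
    CrossLayerAttention(512, 8, has_kv=True),
    CrossLayerAttention(512, 8, has_kv=False),
])
x, kv = torch.randn(2, 16, 512), None
for layer in layers:
    x, kv = layer(x, kv)
```

Because consumer layers own no K/V projections and add no new entries to the KV cache, both the parameter count and the decoding-time cache memory shrink roughly in proportion to the number of layers that reuse a lower layer's K/V.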

Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Computing Sciences
Foundations of Large Language Models Course
Related
Multi-Query Attention (MQA)
Grouped-Query Attention (GQA)
Cross-layer Multi-head Attention
Diagnosing Attention Head Redundancy
An engineer observes that during the training of a transformer-based model, several attention heads within the same layer consistently produce nearly identical attention patterns for a wide variety of inputs. Despite the model having many heads, this redundancy seems to limit the model's ability to capture diverse linguistic features. This scenario highlights a key motivation for developing more advanced attention mechanisms. What is the most direct problem with the standard multi-head attention design that this observation reveals?
Rationale for Advanced Attention Mechanisms
Cross-Layer Parameter Sharing in BERT
Cross-layer Multi-head Attention
A team of engineers is designing a deep neural network for a resource-constrained environment, such as a mobile device. To reduce the model's size, they implement a design where the same computational block, with its entire set of weights, is reused at every layer of the network. What is the most significant trade-off the engineers must consider with this approach? (A minimal sketch of this weight-reuse design follows at the end of the Related list.)
Analyzing a Novel Transformer Architecture
Comparing Parameter Sharing Strategies
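For the weight-reuse design described in the cross-layer parameter sharing question above, a minimal sketch is the ALBERT-style tying of one Transformer block across every layer. This is an illustrative assumption, not course code; the class name WeightTiedEncoder and the default sizes are made up for the example.

```python
import torch
import torch.nn as nn

class WeightTiedEncoder(nn.Module):
    """Reuses one Transformer block (a single set of weights) at every layer:
    parameter count stays constant with depth, but compute per pass does not."""

    def __init__(self, d_model: int = 256, n_heads: int = 4, n_layers: int = 6):
        super().__init__()
        self.shared_block = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.n_layers = n_layers

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for _ in range(self.n_layers):
            x = self.shared_block(x)  # the same weights applied at each depth
        return x

model = WeightTiedEncoder()
print(model(torch.randn(2, 16, 256)).shape)  # torch.Size([2, 16, 256])
```

The trade-off the question points at is visible here: parameter memory does not grow with depth, yet every forward pass still executes all n_layers block applications, and a single set of weights must serve every depth of the network.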