Cross-layer Multi-head Attention
Cross-layer Multi-head Attention is an architectural variant of the Transformer in which an attention layer directly accesses the Key-Value (KV) cache of a lower layer. By sharing KV activations or attention weights across consecutive layers, a query in the current layer can reuse the keys and values already computed by the layer below rather than projecting its own. This sharing reduces both the computational cost and the overall memory footprint of the model, since fewer KV projections are learned and fewer KV caches need to be stored during inference.
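As a rough illustration of the mechanism, the PyTorch sketch below (not from the original text; the names CrossLayerAttention, has_kv, and shared_kv are illustrative assumptions) lets a "producer" layer compute and expose its K/V tensors while a "consumer" layer above it projects only queries and attends over that shared K/V. Residual connections, normalization, and feed-forward sublayers are omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossLayerAttention(nn.Module):
    """Attention layer that either computes its own K/V ("producer")
    or reuses the K/V produced by a lower layer ("consumer")."""

    def __init__(self, d_model: int, n_heads: int, has_kv: bool):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        # Only producer layers own K/V projections; consumers reuse them.
        self.kv_proj = nn.Linear(d_model, 2 * d_model) if has_kv else None
        self.o_proj = nn.Linear(d_model, d_model)

    def _split(self, t, B, T):
        # (B, T, d_model) -> (B, heads, T, d_head)
        return t.view(B, T, self.n_heads, self.d_head).transpose(1, 2)

    def forward(self, x, shared_kv=None):
        B, T, _ = x.shape
        q = self._split(self.q_proj(x), B, T)
        if self.kv_proj is not None:
            k, v = self.kv_proj(x).chunk(2, dim=-1)
            shared_kv = (self._split(k, B, T), self._split(v, B, T))
        k, v = shared_kv  # consumers attend over the lower layer's K/V
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        out = out.transpose(1, 2).reshape(B, T, -1)
        return self.o_proj(out), shared_kv

# Two-layer stack: the first layer builds the KV cache, the second reuses it.
layers = nn.ModuleList([
    CrossLayerAttention(512, 8, has_kv=True),
    CrossLayerAttention(512, 8, has_kv=False),
])
x, kv = torch.randn(2, 16, 512), None
for layer in layers:
    x, kv = layer(x, kv)
```

Because consumer layers own no K/V projections and add no new entries to the KV cache, both the parameter count and the decoding-time cache memory shrink roughly in proportion to the number of layers that reuse a lower layer's K/V.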

Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Computing Sciences
Foundations of Large Language Models Course
Related
Multi-Query Attention (MQA)
Grouped-Query Attention (GQA)
Cross-layer Multi-head Attention
Diagnosing Attention Head Redundancy
An engineer observes that during the training of a transformer-based model, several attention heads within the same layer consistently produce nearly identical attention patterns for a wide variety of inputs. Despite the model having many heads, this redundancy seems to limit the model's ability to capture diverse linguistic features. This scenario highlights a key motivation for developing more advanced attention mechanisms. What is the most direct problem with the standard multi-head attention design that this observation reveals?
Rationale for Advanced Attention Mechanisms
Cross-Layer Parameter Sharing in BERT
Cross-layer Multi-head Attention
A team of engineers is designing a deep neural network for a resource-constrained environment, such as a mobile device. To reduce the model's size, they implement a design where the same computational block, with its entire set of weights, is reused at every layer of the network. What is the most significant trade-off the engineers must consider with this approach? (A minimal sketch of this weight-reuse design follows at the end of the Related list.)
Analyzing a Novel Transformer Architecture
Comparing Parameter Sharing Strategies
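For the weight-reuse design described in the cross-layer parameter sharing question above, a minimal sketch is the ALBERT-style tying of one Transformer block across every layer. This is an illustrative assumption, not course code; the class name WeightTiedEncoder and the default sizes are made up for the example.

```python
import torch
import torch.nn as nn

class WeightTiedEncoder(nn.Module):
    """Reuses one Transformer block (a single set of weights) at every layer:
    parameter count stays constant with depth, but compute per pass does not."""

    def __init__(self, d_model: int = 256, n_heads: int = 4, n_layers: int = 6):
        super().__init__()
        self.shared_block = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.n_layers = n_layers

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for _ in range(self.n_layers):
            x = self.shared_block(x)  # the same weights applied at each depth
        return x

model = WeightTiedEncoder()
print(model(torch.randn(2, 16, 256)).shape)  # torch.Size([2, 16, 256])
```

The trade-off the question points at is visible here: parameter memory does not grow with depth, yet every forward pass still executes all n_layers block applications, and a single set of weights must serve every depth of the network.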