Concept

Cross-layer Multi-head Attention

Cross-layer Multi-head Attention is an architectural variant of the Transformer in which an attention layer directly accesses the key-value (KV) cache of a lower layer. By sharing KV activations (or attention weights) across consecutive layers, a query in the current layer attends over keys and values that were already computed earlier in the stack, so they do not need to be recomputed or stored again. This reduces both the computational requirements and the overall memory footprint of the model.
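A minimal sketch of the idea, assuming PyTorch: a KV-producing layer computes and returns its keys and values, and the layer above reuses them instead of projecting its own. The module name SharedKVAttention, the owns_kv flag, and the two-layer usage are illustrative assumptions, not notation from the course.

```python
# Illustrative sketch (assumed): cross-layer KV sharing between two attention layers.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedKVAttention(nn.Module):
    """Multi-head attention that either produces its own K/V or reuses
    the K/V handed down from a lower layer (cross-layer sharing)."""
    def __init__(self, d_model: int, n_heads: int, owns_kv: bool):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.owns_kv = owns_kv
        self.q_proj = nn.Linear(d_model, d_model)
        if owns_kv:  # only KV-producing layers pay for K/V projections
            self.k_proj = nn.Linear(d_model, d_model)
            self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x, shared_kv=None):
        B, T, _ = x.shape
        q = self.q_proj(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        if self.owns_kv:
            k = self.k_proj(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
            v = self.v_proj(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        else:
            k, v = shared_kv  # reuse the lower layer's cached keys and values
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        out = attn.transpose(1, 2).reshape(B, T, -1)
        return self.out_proj(out), (k, v)

# Usage: layer 0 computes K/V; layer 1 queries against layer 0's KV cache,
# so the K/V tensors are stored (and projected) only once for both layers.
x = torch.randn(2, 8, 64)                        # (batch, seq_len, d_model)
layer0 = SharedKVAttention(64, 4, owns_kv=True)
layer1 = SharedKVAttention(64, 4, owns_kv=False)
h, kv = layer0(x)
h, _ = layer1(h, shared_kv=kv)
```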


Updated 2026-04-23


Tags

Ch.2 Generative Models - Foundations of Large Language Models

Foundations of Large Language Models

Computing Sciences

Foundations of Large Language Models Course