Learn Before
Cross-Layer Parameter Sharing in Transformers
Cross-layer sharing is an optimization method in Transformers that falls under the broader family of shared-weight and shared-activation methods. By sharing elements such as Key-Value (KV) activations or attention weights across different layers, this technique reduces both computational demands and memory footprint. For example, a query in a higher layer can directly access the KV cache of a lower layer, eliminating redundant KV computation and storage.
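Below is a minimal sketch of the KV-sharing variant of this idea, assuming a single-head PyTorch decoder stack in which every layer that does not own K/V projections reuses the KV activations produced by the layer below it. The names SharedKVAttention, owns_kv, and shared_kv are illustrative, not from any specific library or paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedKVAttention(nn.Module):
    """Single-head attention where only 'owner' layers compute K/V."""
    def __init__(self, d_model: int, owns_kv: bool):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.owns_kv = owns_kv
        if owns_kv:
            # Only owner layers carry K/V projection weights,
            # so non-owner layers hold fewer parameters.
            self.k_proj = nn.Linear(d_model, d_model)
            self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x, shared_kv=None):
        q = self.q_proj(x)
        if self.owns_kv:
            k, v = self.k_proj(x), self.v_proj(x)
        else:
            # Reuse the KV activations computed by a lower layer,
            # skipping this layer's own K/V computation entirely.
            k, v = shared_kv
        attn = F.softmax(q @ k.transpose(-2, -1) / k.size(-1) ** 0.5, dim=-1)
        return self.out_proj(attn @ v), (k, v)

# Two-layer stack: layer 0 produces KV, layer 1 reuses it.
layers = nn.ModuleList([
    SharedKVAttention(64, owns_kv=True),
    SharedKVAttention(64, owns_kv=False),
])
x = torch.randn(2, 16, 64)          # (batch, sequence, d_model)
kv = None
for layer in layers:
    x, kv = layer(x, shared_kv=kv)  # higher layer attends over shared KV
```

In this sketch, the non-owner layer stores no K/V projection matrices and writes nothing new to the KV cache, which is where the parameter and memory savings come from.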

Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Cross-Layer Parameter Sharing in Transformers
A team of engineers is building a deep neural network to analyze very long text sequences. They discover that the model's size exceeds their hardware's memory capacity. As a solution, they modify the architecture so that multiple layers use the exact same set of learnable parameters. What is the primary trade-off the engineers must consider with this parameter-sharing approach?
Optimizing a Transformer for a Low-Resource Environment
A key strategy for creating more efficient neural networks involves reusing parts of the model. Analyze the following concepts related to this strategy and match each term to its most accurate description.
Learn After
Cross-Layer Parameter Sharing in BERT
Cross-layer Multi-head Attention
A team of engineers is designing a deep neural network for a resource-constrained environment, such as a mobile device. To reduce the model's size, they implement a design where the same computational block, with its entire set of weights, is reused at every layer of the network. What is the most significant trade-off the engineers must consider with this approach?
Analyzing a Novel Transformer Architecture
Comparing Parameter Sharing Strategies