Concept

Hybrid Cache for Attention Mechanisms

A hybrid cache is a memory management strategy that combines two types of memory to efficiently handle long sequences. As illustrated in the diagram, it consists of a 'Local Memory' and a 'Compressed Memory'. The Local Memory (e.g., size 4x2) stores a fixed number of the most recent key-value pairs in their original, uncompressed form. As new data arrives, the oldest key-value pairs are evicted from the Local Memory. These evicted pairs are then passed through a compression function and stored in the Compressed Memory (e.g., size 2x2). This two-level approach allows a model to maintain high-fidelity information about the recent past while retaining a summarized, space-efficient representation of the more distant past.

Image 0

0

1

Updated 2025-10-10

Contributors are:

Who are from:

Tags

Ch.2 Generative Models - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences