Cache Eviction Policies for Prefix Caching
To manage the substantial memory overhead of prefix caching, practical systems employ cache eviction policies. These policies, such as least recently used (LRU), decide which cached prefixes to discard when the cache runs out of memory. Their objective is to balance the computational savings gained from caching against the system's inherent memory constraints, retaining the prefixes most likely to be reused and evicting the rest.
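As a rough illustration of how such a policy can be wired up, the sketch below implements LRU eviction over a prefix cache keyed by prompt text. The `PrefixCache` name, its `get`/`put` methods, and the slot-count capacity are illustrative assumptions (a real serving stack would account for memory in bytes or KV blocks rather than entries), not details from this note.

```python
from collections import OrderedDict

class PrefixCache:
    """Minimal LRU sketch: maps a prompt prefix to its cached states.

    Hypothetical names and a slot-count capacity are used for illustration only.
    """

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._entries = OrderedDict()  # prefix -> cached states, oldest entry first

    def get(self, prefix: str):
        """Return cached states on a hit and mark the prefix as recently used."""
        if prefix not in self._entries:
            return None  # cache miss: caller must recompute the prefix states
        self._entries.move_to_end(prefix)  # refresh recency on a hit
        return self._entries[prefix]

    def put(self, prefix: str, states) -> None:
        """Insert newly computed states, evicting the LRU entry if the cache is full."""
        if prefix in self._entries:
            self._entries.move_to_end(prefix)
        elif len(self._entries) >= self.capacity:
            _evicted_prefix, _ = self._entries.popitem(last=False)  # LRU victim
            # In a real system, the evicted entry's memory would be freed or reused here.
        self._entries[prefix] = states
```

On a hit, `move_to_end` refreshes the entry's recency; on a miss that overflows capacity, `popitem(last=False)` discards the least recently used prefix before the new one is inserted.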

Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Computing Sciences
Foundations of Large Language Models Course
Related
Process of Generating Prefix Caches
Process of Utilizing a Prefix Cache
Implementing Prefix Caching with a Key-Value Datastore
Memory Management Challenges in Prefix Caching
Cache Eviction Policies for Prefix Caching
An LLM inference system is designed to optimize performance by storing the intermediate hidden states generated from the initial tokens of user prompts. The system has just finished processing the request: 'Analyze the market trends for electric vehicles in North America.' Immediately after, it receives a new request: 'Analyze the market trends for electric vehicles in Europe.' How will the system leverage its optimization technique to process this second request?
Evaluating Caching Strategy Effectiveness
Choosing an Optimal Caching Strategy
Evaluating a serving design that combines prefix caching with paged KV memory under mixed prompt lengths
Choosing a KV-cache strategy for shared-prefix traffic under GPU memory pressure
Diagnosing and Redesigning KV-Cache Memory Behavior in a Multi-Tenant LLM Serving Stack
Stabilizing latency and GPU memory in a chat-completions service with shared system prompts
Root-cause and mitigation plan for OOMs and latency spikes during shared-prefix, long-generation traffic
Post-incident analysis: KV-cache growth, fragmentation, and shared-prefix reuse in a streaming LLM service
Learn After
An inference system for a large language model uses a cache for text prefixes to speed up processing. The cache has a capacity of 3 slots and uses a Least Recently Used (LRU) eviction policy. The cache is currently full, and its state, from most recently used to least recently used, is as follows:
- Prefix A: "The capital of France is"
- Prefix B: "Translate the following sentence to German:"
- Prefix C: "Once upon a time in a land far away,"
Now, a new user request arrives with the prompt: "The capital of France is Paris." This request is a 'hit' for Prefix A. Immediately after, another request arrives with a new, uncached prefix: "Summarize the main points of the article below:". To store this new prefix, one of the existing prefixes must be evicted. Which prefix will be removed from the cache?
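One way to check the reasoning for this scenario is to simulate it directly. The snippet below is a minimal, self-contained trace using Python's `OrderedDict`; the `"KV-..."` placeholders stand in for cached states and are purely illustrative.

```python
from collections import OrderedDict

# Cache state from least to most recently used: C, B, A (capacity 3).
cache = OrderedDict()
cache["Once upon a time in a land far away,"] = "KV-C"         # Prefix C (LRU)
cache["Translate the following sentence to German:"] = "KV-B"  # Prefix B
cache["The capital of France is"] = "KV-A"                     # Prefix A (MRU)

# Request 1: hit on Prefix A -> refresh its recency (it remains most recent).
cache.move_to_end("The capital of France is")

# Request 2: miss on a new prefix -> evict the least recently used entry.
if len(cache) >= 3:
    victim, _ = cache.popitem(last=False)
    print("Evicted:", victim)  # Prefix C: "Once upon a time in a land far away,"
cache["Summarize the main points of the article below:"] = "KV-new"
```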
Evaluating Cache Eviction Policy Suitability
An LLM inference system uses a prefix cache with a fixed capacity. The cache is currently full. A new user request arrives with a prefix that is not present in the cache (a 'cache miss'). To make space for this new prefix, the system must evict an existing one based on the Least Recently Used (LRU) policy. Arrange the following actions in the correct chronological order.