Learn Before
  • Key-Value (KV) Cache in Transformer Inference

Space Complexity of the KV Cache

During inference, the space complexity of the Key-Value (KV) cache is directly proportional to the number of tokens for which keys and values are stored. This relationship is captured by the formula O(L · τ · d_h · m), where L is the number of layers, τ is the number of attention heads, d_h is the head dimension, and m is the number of tokens being cached.
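A minimal sketch of how this bound translates into bytes. The model sizes, the fp16 element width, and the factor of 2 (keys and values are each cached) are illustrative assumptions, not figures from this page:

```python
def kv_cache_bytes(num_layers: int, num_heads: int, head_dim: int,
                   num_tokens: int, bytes_per_elem: int = 2) -> int:
    """Bytes held by the KV cache.

    The factor of 2 accounts for storing both keys AND values;
    bytes_per_elem=2 assumes fp16 activations (an assumption).
    """
    return 2 * num_layers * num_heads * head_dim * num_tokens * bytes_per_elem

# Hypothetical model: 32 layers, 32 heads, head dimension 128, fp16.
per_token = kv_cache_bytes(32, 32, 128, 1)
print(per_token)            # 524288 bytes = 0.5 MiB per cached token

# Growth is linear in m: a 4096-token context needs 4096x as much.
total = kv_cache_bytes(32, 32, 128, 4096)
print(total / 2**30)        # 2.0 GiB
```

Because every term in O(L · τ · d_h · m) enters the product once, doubling the cached token count m doubles the memory, which is the linear growth described above.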


Tags

Ch.2 Generative Models - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Related
  • Space Complexity of the KV Cache

  • Updating the KV Cache

  • Two-Phase Inference from a KV Cache Perspective

  • Single-Step Generation with a KV Cache

  • Memory Allocation for KV Caching in Standard Self-Attention

  • Multi-Dimensional Structure of the KV Cache

  • An autoregressive language model generates text one word at a time. To generate the 100th word, it must relate it to all 99 previous words. A common optimization involves storing in memory the intermediate representations for each of the first 99 words as they are generated.

    Which statement best analyzes the primary computational advantage of this optimization compared to re-computing everything from scratch at step 100?

  • Chatbot Performance Degradation

  • Computational Steps in Cached Inference

  • Diagnosing and Redesigning KV-Cache Memory Behavior in a Multi-Tenant LLM Serving Stack

  • Choosing a KV-cache strategy for shared-prefix traffic under GPU memory pressure

  • Evaluating a serving design that combines prefix caching with paged KV memory under mixed prompt lengths

  • Stabilizing latency and GPU memory in a chat-completions service with shared system prompts

  • Post-incident analysis: KV-cache growth, fragmentation, and shared-prefix reuse in a streaming LLM service

  • Root-cause and mitigation plan for OOMs and latency spikes during shared-prefix, long-generation traffic

  • You run an internal LLM inference service for empl...

  • Your company’s internal LLM service handles many c...

  • You operate a GPU-backed LLM service that uses con...

  • You’re on-call for an internal LLM chat service. M...

Learn After
  • Reducing KV Cache Complexity via Windowed Caching

  • An engineer is deploying a large autoregressive model for a chatbot. They observe that as a conversation with a user gets longer, the model's memory consumption increases steadily, eventually leading to performance issues. This is because the model stores key and value vectors for every token in the conversation history to speed up the generation of the next token. Based on this mechanism, what is the fundamental relationship between the length of the conversation history (in tokens) and the amount of memory required for this storage?

  • KV Cache Memory Footprint Comparison

  • Calculating Memory Growth for Token Caching