Concept

Two-Phase Inference from a KV Cache Perspective

Transformer-based language models are autoregressive: each new token is generated conditioned on all preceding tokens. To avoid recomputing the representations of past tokens at every step, inference maintains a Key-Value (KV) cache that stores the keys and values of all tokens seen so far, letting the model attend to its history efficiently. Viewed from the standpoint of KV cache computation, generating a sequence y given a prompt x, i.e., Pr(y|x), naturally separates into two distinct phases: a prefill phase, which processes the entire prompt x in parallel and populates the cache with one (key, value) pair per prompt token, and a decoding phase, which generates y one token at a time, appending a single (key, value) pair to the cache at each step.
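The two phases can be sketched as follows. This is a minimal illustrative toy, not a real Transformer: the embedding function and the projection matrices `W_k`, `W_v` are stand-ins, and the cache is just a pair of NumPy arrays that grows by one row per decoded token.

```python
import numpy as np

D = 4  # toy embedding/head dimension (hypothetical)

rng = np.random.default_rng(0)
W_k = rng.normal(size=(D, D))  # hypothetical key projection
W_v = rng.normal(size=(D, D))  # hypothetical value projection

def embed(token_id: int) -> np.ndarray:
    # Deterministic toy embedding for a token id (stand-in for real embeddings).
    return np.sin(np.arange(D) + token_id)

def prefill(prompt_ids):
    """Phase 1 (prefill): process the whole prompt x in parallel,
    populating the KV cache with one (K, V) pair per prompt token."""
    K = np.stack([embed(t) @ W_k for t in prompt_ids])
    V = np.stack([embed(t) @ W_v for t in prompt_ids])
    return K, V

def decode_step(K, V, token_id):
    """Phase 2 (decoding): for each newly generated token, append a single
    (K, V) pair so attention can cover the full history without recompute."""
    k = (embed(token_id) @ W_k)[None, :]
    v = (embed(token_id) @ W_v)[None, :]
    return np.concatenate([K, k]), np.concatenate([V, v])

prompt = [3, 1, 4]
K, V = prefill(prompt)       # cache holds 3 entries after prefill
K, V = decode_step(K, V, 7)  # cache grows by exactly 1 per decoded token
print(K.shape)  # (4, 4): 3 prompt tokens + 1 generated token, each of dim 4
```

The asymmetry this exposes is the practical point: prefill is a single parallel pass over x (compute-bound), while decoding is a sequence of small incremental steps whose cost is dominated by reading the growing cache (memory-bound).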

Updated 2026-05-03

Ch.5 Inference - Foundations of Large Language Models