Concept

Sequence-Level Caching for LLM Inference

A basic caching method where complete input sequences are mapped to their corresponding LLM-generated outputs in a key-value datastore, such as a hash table. This cache can be populated by pre-computing and storing responses for frequently encountered queries. The system then bypasses LLM inference for any incoming request that is an exact match for a cached query, serving the stored response directly.
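A minimal sketch of this idea is shown below, assuming a plain Python dict as the key-value store and a hypothetical generate_fn standing in for the actual LLM inference call; the class and function names are illustrative, not from any particular library.

```python
from typing import Callable, Dict, Iterable, Optional

class SequenceLevelCache:
    """Exact-match cache mapping complete input sequences to LLM-generated outputs."""

    def __init__(self, generate_fn: Callable[[str], str]):
        self.generate_fn = generate_fn        # underlying LLM inference call (assumed)
        self.store: Dict[str, str] = {}       # key-value store: full prompt -> response

    def prepopulate(self, frequent_queries: Iterable[str]) -> None:
        # Pre-compute and store responses for frequently encountered queries.
        for query in frequent_queries:
            self.store[query] = self.generate_fn(query)

    def respond(self, prompt: str) -> str:
        # On an exact match, bypass LLM inference and serve the stored response.
        cached: Optional[str] = self.store.get(prompt)
        if cached is not None:
            return cached
        # Cache miss: run inference, then cache the new result for future requests.
        response = self.generate_fn(prompt)
        self.store[prompt] = response
        return response

# Usage with a stand-in generation function:
def fake_llm(prompt: str) -> str:
    return f"response to: {prompt}"

cache = SequenceLevelCache(fake_llm)
cache.prepopulate(["What is the capital of France?"])
print(cache.respond("What is the capital of France?"))  # served from cache, no inference
```

Because lookups require an exact match on the full input sequence, even a trivially rephrased query misses the cache; that limitation is what motivates fuzzier, similarity-based caching schemes.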

Updated 2026-05-05

Tags

Ch.5 Inference - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences