The Search Problem in LLM Inference
In Large Language Model (LLM) inference, beyond the model computation problem of calculating the output probability Pr(y|x), lies the search problem. This problem focuses on how to efficiently find the best output sequence y for a given input sequence x (or its generated KV cache). A naive approach is exhaustive search, which enumerates every possible output sequence and selects the one with the highest predicted probability. While this method guarantees a globally optimal solution, a direct exhaustive search is impractical for LLMs because the number of potential output sequences grows exponentially with the length of y.
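A minimal sketch of why exhaustive search is infeasible: for a vocabulary of size V and an output of length n, the number of candidate sequences is V**n. The function name and the example sizes below are illustrative assumptions, not from the source.

```python
def num_candidate_sequences(vocab_size: int, length: int) -> int:
    """Count the distinct output sequences of a given length.

    Each of the `length` positions can hold any of `vocab_size` tokens,
    so the search space is vocab_size ** length.
    """
    return vocab_size ** length

# Even a modest vocabulary makes the search space explode exponentially:
for n in (3, 5, 10):
    print(f"length {n}: {num_candidate_sequences(10_000, n):e} sequences")
```

With a 10,000-token vocabulary, growing the output from 3 to 5 tokens multiplies the search space by 10,000^2, i.e. from 10^12 to 10^20 candidate sequences.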

Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
The Search Problem in LLM Inference
Next-Token Probability Calculation in a Transformer Decoder
In an autoregressive language model, after processing a sequence of input tokens, a corresponding sequence of hidden state vectors is produced by the final decoder layer. To predict the probability distribution for the single token that will come next, what is the correct procedure and why?
An autoregressive model generates text one token at a time. Arrange the following computational steps in the correct order to calculate the probability distribution for the very next token, given the current sequence of tokens.
Debugging a Language Model's Output Distribution
Learn After
Hypothesis in LLM Inference
Mathematical Formulation of the Search Problem in LLM Inference
Exploration vs. Exploitation in LLM Search
Search Tree Structure in Token Generation
Heuristic Search Algorithms for LLM Inference
Efficient Generation of Candidate Solutions via Search Algorithms
Search for Optimal or Sub-optimal Sequences in LLM Inference
Root of the Search Space as a Representation of Input (x)
A text generation model has a vocabulary of 10,000 possible words it can choose from for each position in a sequence. If this model were to find the optimal output by evaluating every single possible sequence, how would the total number of sequences to check change if the desired output length is increased from 3 words to 5 words?
Evaluating an Inference Strategy
The Impracticality of Exhaustive Search
Historical Context and Computational Challenges of Maximum Probability Prediction
Mathematical Representation of an Output Sequence