Learn Before
Visual Representation of k-NN Language Model Inference
This diagram illustrates the inference process of a k-NN Language Model. The architecture operates with two parallel streams originating from a query vector q_i. The first stream represents the base Large Language Model (LLM), where the query interacts with the local Key-Value (KV) cache to produce a standard probability distribution over the vocabulary, denoted as Distribution Pr(.). In parallel, the second stream uses the same query to search an external datastore and retrieve its k nearest neighbors. These neighbors, which consist of keys and their corresponding next tokens, are then used to form a k-NN probability distribution, Distribution Pr_knn(.). In the final step, these two distributions are combined to generate the final Output Distribution for next-token prediction.
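The combination step described above can be sketched in code. This is a minimal illustration, not the diagram's exact implementation: it assumes Euclidean distance between the query and stored keys, a softmax over negative distances to weight the retrieved neighbors, and a linear interpolation weight `lam` for merging the two distributions (all common choices for k-NN LMs, but assumptions here; the function names are hypothetical).

```python
import numpy as np

def knn_distribution(query, keys, next_tokens, vocab_size, k=4, temperature=1.0):
    """Form Pr_knn(.) from the k nearest datastore entries (sketch)."""
    # Distance from the query vector to every stored key.
    dists = np.linalg.norm(keys - query, axis=1)
    nn = np.argsort(dists)[:k]  # indices of the k nearest neighbors

    # Softmax over negative distances: closer keys get larger weights.
    w = np.exp(-dists[nn] / temperature)
    w /= w.sum()

    # Scatter each neighbor's weight onto its recorded next token.
    p_knn = np.zeros(vocab_size)
    for idx, weight in zip(nn, w):
        p_knn[next_tokens[idx]] += weight
    return p_knn

def combine_distributions(p_lm, p_knn, lam=0.25):
    """Merge the base-LM stream and the k-NN stream by linear interpolation."""
    return lam * p_knn + (1.0 - lam) * p_lm
```

For example, if two of the query's nearest stored keys both recorded the same next token, `p_knn` concentrates its mass on that token, and `combine_distributions` then shifts the final output distribution toward it in proportion to `lam`.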

Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Retrieving Reference Tokens in k-NN LM Inference
A language model architecture is designed to predict the next token by using two parallel computational streams that originate from the same query vector. The first stream uses the immediate, local context to generate a probability distribution over the vocabulary. The second stream uses the query vector to search a large external datastore, find the most similar historical contexts, and generate a second probability distribution based on the tokens that followed those contexts. The two distributions are then combined to produce the final prediction. What is the primary functional distinction between the information provided by these two streams?
Visual Representation of k-NN Language Model Inference
Diagnosing an Error in a Hybrid Language Model
A language model architecture enhances its predictions by combining information from its immediate context with knowledge from a large external repository. Arrange the following steps to accurately describe the data flow during its inference process.
Learn After
Analyzing Factual Recall in a Dual-Stream Language Model
A diagram of a language model's inference process shows two parallel streams originating from a single query vector. The first stream processes the query against a local cache of recent context to produce a probability distribution. The second stream uses the same query to search a large external datastore, retrieving similar past examples to form a second probability distribution. Finally, these two distributions are combined for the final prediction. What is the primary advantage of this dual-stream architecture as depicted?
A diagram of a k-NN Language Model's inference process shows several key components. Match each component with its correct function in the process.