Example

Visual Representation of k-NN Language Model Inference

This diagram illustrates the inference process of a k-NN Language Model. The architecture operates with two parallel streams originating from a query vector q_i. The first stream is the base Large Language Model (LLM): the query attends over the local Key-Value (KV) cache and produces the standard probability distribution over the vocabulary, denoted Pr(·). In parallel, the second stream uses the same query to search an external datastore and retrieve its k nearest neighbors. These neighbors, each consisting of a stored key and its corresponding next token, are then used to form a k-NN probability distribution, Pr_knn(·). In the final step, the two distributions are combined, typically by linear interpolation with a weight λ, to produce the final Output Distribution used for next-token prediction.
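The second stream and the final combination step can be sketched in code. This is a minimal illustration, not the course's reference implementation: the function names (`knn_distribution`, `interpolate`), the Euclidean distance metric, the softmax temperature, and the interpolation weight λ are all assumptions chosen for clarity.

```python
import numpy as np

def knn_distribution(query, keys, next_tokens, vocab_size, temperature=1.0):
    """Form Pr_knn(.) from the k retrieved neighbors.

    query       : (d,) query vector q_i
    keys        : (k, d) keys of the retrieved neighbors
    next_tokens : length-k list of each neighbor's stored next-token id
    """
    # Distance of the query to each retrieved key (Euclidean, by assumption).
    dists = np.linalg.norm(keys - query, axis=1)
    # Closer neighbors get larger weight via a softmax over negative distances.
    weights = np.exp(-dists / temperature)
    weights /= weights.sum()
    # Aggregate neighbor weights onto their next tokens to get a
    # distribution over the full vocabulary.
    p_knn = np.zeros(vocab_size)
    for token, w in zip(next_tokens, weights):
        p_knn[token] += w
    return p_knn

def interpolate(p_lm, p_knn, lam=0.25):
    """Combine the base LLM distribution Pr(.) with Pr_knn(.)."""
    return lam * p_knn + (1.0 - lam) * p_lm
```

Both `p_knn` and the interpolated result remain valid probability distributions (non-negative, summing to 1), since each is a convex combination of distributions.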


Updated 2026-05-02


Tags: Ch.2 Generative Models, Foundations of Large Language Models, Computing Sciences