
Inference Architecture of k-NN Language Models

The inference architecture of a k-NN Language Model runs two parallel computational streams from the same query vector q_i. The first stream, the base model pathway, processes q_i with standard attention over the local KV cache to produce the base probability distribution Pr(.) over the vocabulary. The second stream, the k-NN pathway, uses q_i to search an external datastore and retrieve the k nearest neighbor keys; the target tokens paired with those neighbors are then aggregated into a k-NN probability distribution Pr_knn(.). In the final step, the two distributions are linearly interpolated to produce the output distribution, so the prediction of the next token leverages both the local context and the long-term memory stored in the datastore.
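The two pathways and their interpolation can be sketched as a small NumPy function. This is a minimal illustration, not a reference implementation: the function name, the L2 distance metric, the softmax temperature, and the interpolation weight `lam` are assumptions chosen for clarity, and a real datastore would use an approximate-nearest-neighbor index rather than a brute-force search.

```python
import numpy as np

def knn_lm_next_token_distribution(q, base_probs, keys, values, vocab_size,
                                   k=4, temperature=1.0, lam=0.25):
    """Hypothetical sketch of kNN-LM inference.

    q          : query vector for the current position, shape (d,)
    base_probs : base model distribution Pr(.), shape (vocab_size,)
    keys       : datastore key vectors, shape (n, d)
    values     : target-token ids paired with each key, shape (n,)
    lam        : interpolation weight on the kNN distribution (assumed)
    """
    # k-NN pathway: brute-force search for the k nearest keys (L2 distance).
    dists = np.linalg.norm(keys - q, axis=1)
    nn = np.argsort(dists)[:k]

    # Convert negative distances to weights via a softmax, then accumulate
    # each neighbor's weight onto its target token to form Pr_knn(.).
    w = np.exp(-dists[nn] / temperature)
    w /= w.sum()
    knn_probs = np.zeros(vocab_size)
    for token, weight in zip(values[nn], w):
        knn_probs[token] += weight

    # Final step: interpolate the k-NN and base model distributions.
    return lam * knn_probs + (1.0 - lam) * base_probs
```

The base pathway is represented here only by its output `base_probs`, since the sketch focuses on how the k-NN distribution is built and combined with it.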


Updated 2025-10-10

