Learn Before
Inference Architecture of k-NN Language Models
The inference architecture of a k-NN Language Model involves two parallel computational streams that originate from the same query vector q_i. The first stream, the base-model pathway, processes q_i using standard attention over the local KV cache to produce the base probability distribution, Pr(·), over the vocabulary. The second stream, the k-NN pathway, uses q_i to search an external datastore and retrieve the k nearest neighbors; the target tokens stored with those neighbors are then converted into a k-NN probability distribution, Pr_knn(·), typically via a softmax over the negative distances to the retrieved keys. In the final step, the two distributions are linearly interpolated, Pr_final(w) = λ · Pr_knn(w) + (1 − λ) · Pr(w) for a tunable weight λ, so the final output distribution leverages both local context and long-term memory from the datastore to predict the next token.
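To make the two pathways concrete, here is a minimal NumPy sketch of a single k-NN LM decoding step. It assumes squared-L2 distance and a softmax over negative distances, as in the standard k-NN LM formulation (Khandelwal et al., 2020); all function names, array sizes, and the values k=4 and λ=0.25 are illustrative assumptions, not part of this card.

import numpy as np

def knn_lm_step(q, base_probs, keys, values, vocab_size, k=4, lam=0.25):
    """Combine the base LM distribution with a k-NN distribution.

    q          : (d,) query vector for the current timestep
    base_probs : (V,) base-model distribution Pr(.) over the vocabulary
    keys       : (N, d) datastore keys (stored context representations)
    values     : (N,) datastore values (the token that followed each context)
    lam        : interpolation weight given to the k-NN distribution
    """
    # k-NN pathway: squared-L2 distance from q to every stored key.
    dists = np.sum((keys - q) ** 2, axis=1)        # (N,)
    nn = np.argsort(dists)[:k]                     # indices of k nearest keys

    # Softmax over negative distances -> one weight per retrieved neighbor.
    logits = -dists[nn]
    w = np.exp(logits - np.max(logits))            # numerically stabilized
    w /= w.sum()

    # Aggregate neighbor weights by their target token to form Pr_knn(.).
    knn_probs = np.zeros(vocab_size)
    np.add.at(knn_probs, values[nn], w)

    # Final step: linear interpolation of the two distributions.
    return lam * knn_probs + (1.0 - lam) * base_probs

# Toy usage: d=3, vocabulary of 5 tokens, datastore of 6 stored contexts.
rng = np.random.default_rng(0)
keys = rng.normal(size=(6, 3))
values = np.array([1, 2, 2, 4, 0, 2])
q = keys[1] + 0.01 * rng.normal(size=3)            # query close to key 1
base = np.full(5, 0.2)                             # uniform base Pr(.)
print(knn_lm_step(q, base, keys, values, vocab_size=5))

Because the neighbor weights are aggregated by target token before interpolation, a token that follows several retrieved contexts receives proportionally more k-NN probability mass, which is what lets the datastore sharpen predictions for rare or memorized continuations.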

Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Next-Token Prediction with External Memory
A language model is enhanced by searching a large datastore of past internal states and their corresponding next words. When the model generates a new word, it finds the 'k' most similar past states from the datastore and uses their associated next words to adjust its prediction. What is the key principle that makes this technique effective?
Foundational Principle of k-NN Language Modeling
A language model is designed to improve its next-word predictions by consulting a large external database of past contexts. Arrange the following steps to accurately describe how this model generates its final output after receiving an input.
A language model is designed to enhance its next-token prediction by referencing a large external datastore of context representations and their corresponding subsequent tokens. During generation, for a given input, the model identifies the 'k' most similar context representations from this datastore. Which of the following best describes how this information is integrated to produce the final prediction?
You’re on-call for an internal engineering assista...
You are reviewing two proposed designs for an inte...
Your team is building an internal “Release Notes Q...
You’re designing an internal LLM assistant for a c...
Design Review: Choosing Between RAG and k-NN LM for a Regulated Support Assistant
Post-Incident Analysis: Why a RAG Assistant Hallucinated Despite “Having the Docs”
Architecture Decision Memo: Unifying Vector-DB RAG and k-NN LM for a Global Policy Assistant
Case Study: Root-Cause Analysis of “Correct Source, Wrong Answer” in a RAG + k-NN LM Assistant
Case Study: Debugging a RAG Assistant with a Vector DB and a k-NN LM Memory
Case Review: Diagnosing Conflicting Answers in a Hybrid Retrieval System
Learn After
Retrieving Reference Tokens in k-NN LM Inference
A language model architecture is designed to predict the next token by using two parallel computational streams that originate from the same query vector. The first stream uses the immediate, local context to generate a probability distribution over the vocabulary. The second stream uses the query vector to search a large external datastore, find the most similar historical contexts, and generate a second probability distribution based on the tokens that followed those contexts. The two distributions are then combined to produce the final prediction. What is the primary functional distinction between the information provided by these two streams?
Visual Representation of k-NN Language Model Inference
Diagnosing an Error in a Hybrid Language Model
A language model architecture enhances its predictions by combining information from its immediate context with knowledge from a large external repository. Arrange the following steps to accurately describe the data flow during its inference process.