Learn Before
Visual Representation of k-NN Language Model Inference
This diagram illustrates the inference process of a k-NN Language Model. The architecture operates with two parallel streams originating from a query vector q_i. The first stream represents the base Large Language Model (LLM), where the query interacts with the local Key-Value (KV) cache to produce a standard probability distribution over the vocabulary, denoted as Distribution Pr(.). In parallel, the second stream uses the same query to search an external datastore and retrieve its k nearest neighbors. These neighbors, which consist of keys and their corresponding next tokens, are then used to form a k-NN probability distribution, Distribution Pr_knn(.). In the final step, these two distributions are combined to generate the final Output Distribution for next-token prediction.
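The combination step described above can be sketched in code. This is a minimal illustration, not the diagram's exact implementation: it assumes Euclidean distance between the query and stored keys, a softmax over negative distances to weight the retrieved neighbors, and a linear interpolation weight `lam` for merging the two distributions (all common choices for k-NN LMs, but assumptions here; the function names are hypothetical).

```python
import numpy as np

def knn_distribution(query, keys, next_tokens, vocab_size, k=4, temperature=1.0):
    """Form Pr_knn(.) from the k nearest datastore entries (sketch)."""
    # Distance from the query vector to every stored key.
    dists = np.linalg.norm(keys - query, axis=1)
    nn = np.argsort(dists)[:k]  # indices of the k nearest neighbors

    # Softmax over negative distances: closer keys get larger weights.
    w = np.exp(-dists[nn] / temperature)
    w /= w.sum()

    # Scatter each neighbor's weight onto its recorded next token.
    p_knn = np.zeros(vocab_size)
    for idx, weight in zip(nn, w):
        p_knn[next_tokens[idx]] += weight
    return p_knn

def combine_distributions(p_lm, p_knn, lam=0.25):
    """Merge the base-LM stream and the k-NN stream by linear interpolation."""
    return lam * p_knn + (1.0 - lam) * p_lm
```

For example, if two of the query's nearest stored keys both recorded the same next token, `p_knn` concentrates its mass on that token, and `combine_distributions` then shifts the final output distribution toward it in proportion to `lam`.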

Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Retrieving Reference Tokens in k-NN LM Inference
A language model architecture is designed to predict the next token by using two parallel computational streams that originate from the same query vector. The first stream uses the immediate, local context to generate a probability distribution over the vocabulary. The second stream uses the query vector to search a large external datastore, find the most similar historical contexts, and generate a second probability distribution based on the tokens that followed those contexts. The two distributions are then combined to produce the final prediction. What is the primary functional distinction between the information provided by these two streams?
Visual Representation of k-NN Language Model Inference
Diagnosing an Error in a Hybrid Language Model
A language model architecture enhances its predictions by combining information from its immediate context with knowledge from a large external repository. Arrange the following steps to accurately describe the data flow during its inference process.
Learn After
Analyzing Factual Recall in a Dual-Stream Language Model
A diagram of a language model's inference process shows two parallel streams originating from a single query vector. The first stream processes the query against a local cache of recent context to produce a probability distribution. The second stream uses the same query to search a large external datastore, retrieving similar past examples to form a second probability distribution. Finally, these two distributions are combined for the final prediction. What is the primary advantage of this dual-stream architecture as depicted?
A diagram of a k-NN Language Model's inference process shows several key components. Match each component with its correct function in the process.