Concept

k-NN Search Augmented Attention

k-NN search augmented attention is an architecture that processes a query vector, q_i, through two parallel computational streams. The first stream computes standard attention over a local memory, such as the KV cache, to capture immediate context. Concurrently, the second stream uses q_i to perform a k-NN search on an external datastore, retrieving the k nearest neighbors. A separate attention calculation is then performed over these retrieved neighbors. Finally, the outputs of the two streams are integrated to produce the result. This integration can occur at different levels: for instance, by combining the attention output vectors directly, or by converting each stream's output into a probability distribution and then merging the distributions.
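The two-stream computation above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function names are hypothetical, the brute-force nearest-neighbor scan stands in for a real approximate-search index (e.g. FAISS), and the fixed scalar gate is an assumed example of vector-level integration.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attend(q, K, V):
    # Scaled dot-product attention of a single query q over keys K and values V.
    scores = (K @ q) / np.sqrt(q.shape[-1])
    return softmax(scores) @ V

def knn_augmented_attention(q, K_local, V_local, K_store, V_store, k=4, gate=0.5):
    # Stream 1: standard attention over the local memory (the KV cache).
    local_out = attend(q, K_local, V_local)
    # Stream 2: k-NN search over the external datastore.
    # Brute-force distances here; a real system would use an ANN index.
    dists = np.linalg.norm(K_store - q, axis=1)
    idx = np.argsort(dists)[:k]
    # Separate attention computed only over the k retrieved neighbors.
    retrieved_out = attend(q, K_store[idx], V_store[idx])
    # Integrate the two streams at the output-vector level with a fixed gate.
    return gate * local_out + (1.0 - gate) * retrieved_out
```

With `gate=1.0` this reduces to pure local attention; the alternative integration mentioned above would instead project each stream's output to a probability distribution over the vocabulary and interpolate the distributions.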

Updated 2026-05-02

Tags

Ch.2 Generative Models - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences