Short Answer

Efficiency of Aggregated State in Attention

An attention mechanism calculates the output for the $i$-th token using the formula:

$$Att_{output} = \frac{\mathbf{q}'_i \mu_i}{\mathbf{q}'_i \nu_i}$$

In this formula, $\mu_i$ and $\nu_i$ are state variables that aggregate information from all tokens from position $1$ to $i$. Explain how computing the output using these two aggregated state variables, rather than by directly comparing the query $\mathbf{q}'_i$ with every individual prior key, contributes to the mechanism's memory efficiency for long sequences.
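
A minimal sketch of the idea, assuming a kernelized (linear) attention with a simple feature map $\phi$; the function and variable names (`elu_plus_one`, `mu`, `nu`, `linear_attention`) are illustrative choices, not taken from the source:

```python
import numpy as np

def elu_plus_one(x):
    # A simple positive feature map phi; other choices are possible.
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    """Compute each output from the running states mu and nu only.

    Q, K: (n, d) queries and keys; V: (n, d_v) values.
    The state is a (d, d_v) matrix plus a (d,) vector, so memory per step
    is independent of the sequence length n: no prior keys are revisited.
    """
    n, d = Q.shape
    d_v = V.shape[1]
    mu = np.zeros((d, d_v))   # mu_i = sum_{j<=i} phi(k_j)^T v_j
    nu = np.zeros(d)          # nu_i = sum_{j<=i} phi(k_j)^T
    out = np.zeros((n, d_v))
    for i in range(n):
        k_feat = elu_plus_one(K[i])              # phi(k_i)
        q_feat = elu_plus_one(Q[i])              # q'_i
        mu += np.outer(k_feat, V[i])             # constant-size state update
        nu += k_feat
        out[i] = (q_feat @ mu) / (q_feat @ nu)   # Att_output = (q'_i mu_i) / (q'_i nu_i)
    return out

# Usage: the per-step memory stays the same no matter how long the sequence is.
Q, K, V = np.random.randn(6, 4), np.random.randn(6, 4), np.random.randn(6, 3)
print(linear_attention(Q, K, V).shape)  # (6, 3)
```

The contrast with standard attention is that nothing proportional to the number of prior tokens is stored: the per-token comparison against every earlier key is replaced by two fixed-size running sums.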

Updated 2025-10-04

Tags

Ch.2 Generative Models - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science