Formula

Attention Head Output with Grouped Queries and Causal Masking

This formula gives the output of a single attention head, $\text{head}_j$, in a transformer that uses Grouped-Query Attention (GQA) with causal masking:

$\text{head}_j = \text{Att}_{\text{qkv}}(\mathbf{q}_i^{[j]}, \mathbf{K}_{\le i}^{[g(j)]}, \mathbf{V}_{\le i}^{[g(j)]})$

Here $\text{Att}_{\text{qkv}}$ is the attention function, and $\mathbf{q}_i^{[j]}$ is the query vector for the token at position $i$ in head $j$. The keys $\mathbf{K}$ and values $\mathbf{V}$ are shared among a group of query heads, with the mapping from head $j$ to its group given by $g(j)$. The subscript $\le i$ makes the attention causal: only tokens up to position $i$ are attended to.
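As a concrete illustration, below is a minimal NumPy sketch of this computation. The function names (`att_qkv`, `gqa_head_output`) and the particular head-to-group mapping $g(j) = \lfloor j / (\text{heads per group}) \rfloor$ are assumptions made here for illustration; the formula itself does not fix how $g$ is defined.

```python
import numpy as np

def att_qkv(q, K, V):
    """Scaled dot-product attention for a single query vector.

    q: (d_k,)      query vector q_i^[j] for token i and head j
    K: (i+1, d_k)  keys K_{<=i}^[g(j)] for positions 0..i (causal prefix)
    V: (i+1, d_v)  values V_{<=i}^[g(j)] for positions 0..i
    """
    scores = K @ q / np.sqrt(q.shape[-1])   # one score per position <= i
    weights = np.exp(scores - scores.max()) # numerically stable softmax
    weights /= weights.sum()
    return weights @ V                      # weighted sum of values, shape (d_v,)

def gqa_head_output(j, i, Q, K_groups, V_groups, heads_per_group):
    """head_j = Att_qkv(q_i^[j], K_{<=i}^[g(j)], V_{<=i}^[g(j)]).

    Q:        (num_heads, seq_len, d_k)   per-head queries
    K_groups: (num_groups, seq_len, d_k)  keys shared within each group
    V_groups: (num_groups, seq_len, d_v)  values shared within each group
    """
    g = j // heads_per_group      # assumed mapping g(j): query head -> K/V group
    q = Q[j, i]                   # query for token at position i, head j
    K = K_groups[g, : i + 1]      # causal slice: keys up to position i
    V = V_groups[g, : i + 1]      # causal slice: values up to position i
    return att_qkv(q, K, V)

# Example usage with random projections (shapes are illustrative):
rng = np.random.default_rng(0)
num_heads, num_groups, seq_len, d = 8, 2, 16, 64
Q = rng.standard_normal((num_heads, seq_len, d))
K = rng.standard_normal((num_groups, seq_len, d))
V = rng.standard_normal((num_groups, seq_len, d))
out = gqa_head_output(j=3, i=5, Q=Q, K_groups=K, V_groups=V,
                      heads_per_group=num_heads // num_groups)
```

Sharing $\mathbf{K}$ and $\mathbf{V}$ across the heads of a group is what distinguishes GQA: with one group per head it reduces to standard multi-head attention, and with a single group it reduces to multi-query attention, trading a small quality cost for a smaller key-value cache.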
