Concept

Vector Products per Self-Attention Step

During a single step of standard autoregressive generation, attending a position $i'$ to all previous context positions requires exactly $2i'$ vector products: $i'$ products for the query-key dot products $\mathbf{q}_{i'} \mathbf{K}^{\mathrm{T}}$, plus another $i'$ products to multiply the Softmax-normalized attention scores with the value matrix, $\mathrm{Softmax}(\frac{\mathbf{q}_{i'} \mathbf{K}^{\mathrm{T}}}{\sqrt{d}}) \mathbf{V}$.
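To make the count concrete, below is a minimal NumPy sketch of one decoding step. The shapes, random data, and names (i_prime, q, K, V) are illustrative assumptions, not from the source; the two matrix products correspond to the two groups of $i'$ vector products described above.

```python
import numpy as np

# Assumed setup: decoding at position i', attending to the i' cached
# key/value vectors, each of dimension d (values are illustrative).
i_prime, d = 5, 8
rng = np.random.default_rng(0)
q = rng.normal(size=(d,))            # query for the current position
K = rng.normal(size=(i_prime, d))    # cached keys of previous positions
V = rng.normal(size=(i_prime, d))    # cached values of previous positions

# Group 1: i' query-key dot products (q K^T), one per row of K.
scores = K @ q / np.sqrt(d)          # shape (i',)

# Softmax-normalize the attention scores over the i' positions.
weights = np.exp(scores - scores.max())
weights /= weights.sum()

# Group 2: i' score-value products (Softmax(.) V), a weighted
# sum over the i' value vectors.
output = weights @ V                 # shape (d,)

print(f"vector products: {i_prime} (qK^T) + {i_prime} (attn*V) = {2 * i_prime}")
```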
