Learn Before
Predicting Attention Computation Time
An autoregressive language model takes 50 milliseconds to compute the attention for the 100th token in a sequence. Based on the computational scaling of the causal attention mechanism at a single generation step, approximately how long would you expect it to take to compute the attention for the 400th token in the same sequence? Explain your reasoning.
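A minimal sketch (not part of the original card) illustrating the reasoning the question asks for: at generation step t, causal attention computes the new query's scores against all t cached keys and then a weighted sum over all t cached values, so per-step cost grows linearly with t. The head dimension and array shapes below are illustrative assumptions.

```python
import numpy as np

d = 64                      # head dimension (illustrative assumption)
rng = np.random.default_rng(0)

def attention_step(keys, values, query):
    """One decoding step: the new query attends over all t cached keys/values."""
    scores = keys @ query / np.sqrt(d)     # t dot products -> O(t) work
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()               # softmax over t scores
    return weights @ values                # weighted sum of t value vectors -> O(t) work

for t in (100, 400):
    keys = rng.standard_normal((t, d))     # simulated KV cache at position t
    values = rng.standard_normal((t, d))
    query = rng.standard_normal(d)
    out = attention_step(keys, values, query)
    print(t, out.shape)

# Since per-step cost is proportional to t, the 400th token should take
# roughly 400/100 = 4x as long as the 100th: about 4 * 50 ms = 200 ms.
```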
Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Application in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Time Complexity of Self-Attention in Autoregressive Generation
Claimed Linear Time Complexity of Self-Attention in Autoregressive Generation
In a model that generates text one token at a time, suppose it has already produced a sequence of length N and is now calculating the next token (at position N+1). Which of the following best identifies the two primary computational operations within the attention mechanism that cause the cost of this single step to scale linearly with the current sequence length N?
Analyzing Generation Latency