Reducing KV Cache Complexity via Head Sharing
The memory footprint of the Key-Value (KV) cache can be reduced not only by caching fewer tokens (i.e., shortening the effective sequence length) but also along other architectural dimensions. A widely adopted approach is to share keys and values across the attention heads of the multi-head self-attention mechanism, so that fewer distinct key and value tensors need to be cached per layer.
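For concreteness, the sketch below (not from the text; all dimensions and variable names are illustrative assumptions) shows grouped-query attention, where several query heads share one cached key/value head. With a single shared key/value head this reduces to multi-query attention, and the KV cache shrinks by the ratio of query heads to key/value heads.

```python
# Minimal sketch of head sharing (grouped-query attention), assuming
# hypothetical sizes; with n_kv_heads == 1 this becomes multi-query attention.
import torch
import torch.nn.functional as F

batch, seq_len, head_dim = 1, 16, 64
n_q_heads, n_kv_heads = 8, 2          # 8 query heads share 2 key/value heads

q = torch.randn(batch, n_q_heads, seq_len, head_dim)
k = torch.randn(batch, n_kv_heads, seq_len, head_dim)   # cached: only 2 heads
v = torch.randn(batch, n_kv_heads, seq_len, head_dim)   # cached: only 2 heads

# Expand the shared K/V so each group of query heads attends to its K/V head.
group = n_q_heads // n_kv_heads
k_exp = k.repeat_interleave(group, dim=1)   # [batch, n_q_heads, seq_len, head_dim]
v_exp = v.repeat_interleave(group, dim=1)

scores = q @ k_exp.transpose(-2, -1) / head_dim ** 0.5
out = F.softmax(scores, dim=-1) @ v_exp     # [batch, n_q_heads, seq_len, head_dim]

# Only k and v (not k_exp/v_exp) are stored in the cache, a 4x saving here.
print(out.shape, "cached K/V elements per layer:", 2 * k.numel())
```

In this sketch the cache stores 2 key/value heads instead of 8, so the per-layer KV memory drops by a factor of n_q_heads / n_kv_heads while the query heads remain unchanged.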
Tags
Foundations of Large Language Models
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Reducing KV Cache Complexity via Windowed Caching
An engineer is deploying a large autoregressive model for a chatbot. They observe that as a conversation with a user gets longer, the model's memory consumption increases steadily, eventually leading to performance issues. This is because the model stores key and value vectors for every token in the conversation history to speed up the generation of the next token. Based on this mechanism, what is the fundamental relationship between the length of the conversation history (in tokens) and the amount of memory required for this storage?
KV Cache Memory Footprint Comparison
Calculating Memory Growth for Token Caching
Formula for KV Cache Memory Size
Self-Attention layer understanding - Step 5 - Adding the time
Query, Key, and Value Projections in Multi-Head Attention
Scalar per Head in Multi-Head Attention
In a multi-head self-attention mechanism, what is the primary advantage of using multiple parallel attention 'heads'—each with its own unique set of learnable weight matrices—compared to using a single attention mechanism with the same total dimensionality?
Analysis of a Modified Attention Mechanism
Arrange the following computational steps of a multi-head self-attention layer in the correct chronological order, starting from the point where the layer receives its input representation matrix.
Diagnosing a Transformer Block Refactor: Attention/FFN Shapes and Norm Placement
Choosing Pre-Norm vs Post-Norm for a Deep Transformer: Stability, Shapes, and Sub-layer Semantics
Root-Cause Analysis of Training Instability After a “Minor” Transformer Block Change
Production Bug Triage: Transformer Block Norm Placement vs Attention/FFN Interface Contracts
Post-Norm vs Pre-Norm Migration: Verifying Tensor Shapes and Correct Sub-layer Wiring
Incident Review: Silent Performance Regression After “Optimization” of a Transformer Block
Design a Transformer Block Spec for a New Internal LLM Library (Shapes + Norm Placement)
You are reviewing a teammate’s implementation of a...
You’re debugging a Transformer block in an interna...
You’re implementing a single Transformer block in ...
Number of Attention Heads