Learn Before
Transformer
Computational Cost of Self-Attention in Transformers
The self-attention mechanism, a core component of the Transformer architecture, has a computational cost that scales quadratically with the length of the input sequence: every token attends to every other token, so an n-token input requires computing an n × n matrix of attention scores. This quadratic growth in both compute and memory makes it expensive, and often impractical, to train or deploy Transformer-based models on tasks involving very long texts.
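A minimal NumPy sketch (not from the source) of where the quadratic term comes from: the score matrix Q Kᵀ has shape (n, n), so compute and memory grow with the square of the sequence length. All names here (self_attention, W_q, W_k, W_v) are illustrative, not a specific library's API.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Single-head self-attention over a sequence of n token embeddings.

    X: (n, d) input embeddings; W_q, W_k, W_v: (d, d) projection matrices.
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v              # each (n, d)
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # (n, n) -- quadratic in n
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the key axis
    return weights @ V                               # (n, d)

# Doubling the sequence length quadruples the score matrix:
n, d = 512, 64
rng = np.random.default_rng(0)
X = rng.standard_normal((n, d))
W = [rng.standard_normal((d, d)) for _ in range(3)]
out = self_attention(X, *W)   # score matrix: 512 x 512 = 262,144 entries
# At n = 1024 it would hold 1,048,576 entries -- 4x the compute and memory.
```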
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Self-attention layers' first approach
Transformers in contextual generation and summarization
Huggingface Model Summary
A Survey of Transformers (Lin et al., 2021)
Overview of a Transformer
Model Usage of Transformers
Attention in vanilla Transformers
Transformer Variants (X-formers)
The Pre-training and Fine-tuning Paradigm
Architectural Categories of Pre-trained Transformers
Transformer Blocks and Post-Norm Architecture
Model Depth (L) in Transformers
Transformers as Language Models
Computational Cost of Self-Attention in Transformers
Learn After
KV Cache during Transformer Inference
Architectural Adaptation of LLMs for Long Sequences
Cross-Layer Parameter Sharing in Transformers