Learn Before
General Formula for a Transformer Layer
In a multi-layer model such as a Transformer, the computation proceeds sequentially through its layers. The output of layer l-1, which is a sequence of hidden states denoted as H_{l-1}, serves as the input to the subsequent layer. The transformation is captured by the general formula:

H_l = Layer_l(H_{l-1})
This equation indicates that the hidden states H_l for layer l are generated by applying the specific operations of the Layer_l function (e.g., self-attention, feed-forward network) to the hidden states H_{l-1} of layer l-1.
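The sequential rule H_l = Layer_l(H_{l-1}) can be sketched in a few lines of toy code. This is a hypothetical illustration, not a real Transformer: toy_layer stands in for Layer_l (in a real model it would be self-attention plus a feed-forward network), and the key point it demonstrates is that each layer maps a sequence of per-token hidden-state vectors to a new sequence of the same shape, which then feeds the next layer.

```python
# Hypothetical toy model: each "layer" transforms a sequence of
# hidden-state vectors (num_tokens x d_model) into a new sequence
# of the SAME shape, preserving the one-vector-per-token structure.

def toy_layer(h, bias):
    # Stand-in for Layer_l; a real layer would apply self-attention
    # and a feed-forward network. Shape is preserved.
    return [[x + bias for x in vec] for vec in h]

def run_model(h0, num_layers):
    h = h0
    for l in range(1, num_layers + 1):
        h = toy_layer(h, bias=float(l))  # H_l = Layer_l(H_{l-1})
    return h

h0 = [[0.0, 0.0], [1.0, 1.0]]  # 2 tokens, d_model = 2
h3 = run_model(h0, num_layers=3)
print(h3)  # still 2 vectors of size 2: [[6.0, 6.0], [7.0, 7.0]]
```

Note that the output after three layers is still a sequence of two vectors, one per token; the layers transform the representations but never collapse them into a single vector.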
Tags
Ch.3 Prompting - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Transformer Layer Output Formula
General Formula for a Transformer Layer
Input Composition in a Prefix-Tuned Transformer Layer
A language model is processing an input sentence that has been broken down into 5 distinct tokens. The input to the first processing layer is represented as a matrix containing 5 separate vectors, one for each token. Why is it fundamentally important for the model to maintain this structure—a sequence of individual vectors—as the input to each subsequent layer, rather than, for example, averaging or concatenating them into a single vector?
Structure of a Transformer Layer's Input
When a Transformer model processes a sentence with 12 tokens, the input to the fifth layer is a single, high-dimensional vector that represents the aggregated meaning of the entire sentence as computed by the first four layers.
Learn After
In a standard multi-layer model, the output of a given layer serves as the direct input to the next, creating a sequential chain of processing. Consider an alternative architecture where the input to any given layer (beyond the first) is a combination of the initial input to the entire network and the output of the immediately preceding layer. What is the primary computational difference introduced by this alternative design compared to the standard sequential model?
A multi-layer model processes information sequentially. Given an initial input matrix of hidden states, denoted as H_0, and the outputs of three subsequent layers, H_1, H_2, and H_3, arrange these matrices in the correct order of their generation and processing within the model, from start to finish.
In a deep, multi-layer model, a computational error occurs during the processing of the 5th layer, causing its output matrix of hidden states, H_5, to become corrupted. Based on the standard sequential processing flow where the output of one layer becomes the input for the next, which subsequent layers will be directly impacted by this corrupted data?