1Cademy - Positionwise Nature of Transformer Feed-Forward Networks

Learn Before

Purpose and Structure of the Feed-Forward Network (FFN) in Transformers

Concept

Positionwise Nature of Transformer Feed-Forward Networks

In a Transformer architecture, the feed-forward network is called positionwise because it applies the same multi-layer perceptron (MLP), with the same weights, to the representation at every sequence position independently. For an input tensor $X$ of shape (batch size, number of time steps, number of hidden units), this two-layer MLP processes each time step's vector in isolation, so no information is exchanged between positions inside the FFN. Consequently, only the innermost dimension is transformed, giving an output tensor of shape (batch size, number of time steps, $\text{ffn\_num\_outputs}$). Because the exact same MLP transforms all positions, identical input vectors at different positions produce identical outputs.
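The shape behavior and the "same input, same output" property can be checked with a small NumPy sketch. This is an illustration under assumptions rather than a reference implementation: the ReLU activation, the zero biases, the random weights, the specific sizes, and the names `positionwise_ffn` and `ffn_num_hiddens` are chosen here for the example and are not specified in the text above.

```python
import numpy as np

def positionwise_ffn(X, W1, b1, W2, b2):
    """Two-layer MLP applied along the last axis of X only.

    The matrix products broadcast over the leading (batch, time step) axes,
    so every position is transformed by the same weights, independently.
    """
    hidden = np.maximum(X @ W1 + b1, 0.0)  # ReLU is an assumption of this sketch
    return hidden @ W2 + b2

rng = np.random.default_rng(0)

# Illustrative sizes (assumed for this example)
batch_size, num_steps, num_hiddens = 2, 3, 4
ffn_num_hiddens, ffn_num_outputs = 8, 5

W1 = rng.normal(size=(num_hiddens, ffn_num_hiddens))
b1 = np.zeros(ffn_num_hiddens)
W2 = rng.normal(size=(ffn_num_hiddens, ffn_num_outputs))
b2 = np.zeros(ffn_num_outputs)

X = rng.normal(size=(batch_size, num_steps, num_hiddens))
X[0, 2] = X[0, 0]  # place the same vector at positions 0 and 2 of the first sequence

Y = positionwise_ffn(X, W1, b1, W2, b2)
print(Y.shape)                        # (2, 3, 5)
print(np.allclose(Y[0, 0], Y[0, 2]))  # True
```

Only the last axis changes size (from 4 to 5 here), while the batch and time-step axes pass through untouched. Mixing information across positions is left to the attention sublayer, not the FFN.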

Updated 2026-05-15

References

Dive into Deep Learning
