Concept

Positionwise Nature of Transformer Feed-Forward Networks

In a Transformer, the feed-forward network is called positionwise because it applies the same multilayer perceptron (MLP), with the same weights, to the representation at every sequence position independently. For an input tensor $X$ of shape (batch size, number of time steps, number of hidden units), this two-layer MLP processes each time step's vector in isolation, with no interaction between positions. Consequently, only the innermost dimension is transformed, and the output tensor has shape (batch size, number of time steps, $\text{ffn\_num\_outputs}$). Because the same MLP transforms all positions, identical input vectors at different positions produce identical outputs.
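The shape behavior and the position independence described above can be checked with a small NumPy sketch. It is only an illustration under stated assumptions: the ReLU activation, the zero biases, the random weights, the intermediate width ffn_num_hiddens, and all concrete sizes are hypothetical choices, not values taken from the source.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not from the source).
batch_size, num_steps, num_hiddens = 2, 3, 4
ffn_num_hiddens, ffn_num_outputs = 8, 6

# One set of parameters, shared by every sequence position.
W1 = rng.normal(size=(num_hiddens, ffn_num_hiddens))
b1 = np.zeros(ffn_num_hiddens)
W2 = rng.normal(size=(ffn_num_hiddens, ffn_num_outputs))
b2 = np.zeros(ffn_num_outputs)


def positionwise_ffn(X):
    """Two-layer MLP applied along the last axis only.

    ReLU between the layers is an assumed activation.
    """
    hidden = np.maximum(X @ W1 + b1, 0.0)
    return hidden @ W2 + b2


X = rng.normal(size=(batch_size, num_steps, num_hiddens))
# Place the same vector at two different positions of the first sequence.
X[0, 2] = X[0, 0]

Y = positionwise_ffn(X)
print(Y.shape)  # (2, 3, 6)

# Identical inputs at different positions give identical outputs.
same_output = np.allclose(Y[0, 0], Y[0, 2])

# Running the MLP on one position alone matches the batched result,
# so no information flows between positions.
single_position = np.allclose(positionwise_ffn(X[1, 1]), Y[1, 1])

print(same_output, single_position)  # True True
```

Because the matrix products act only on the last axis, the batch and time-step axes pass through unchanged, which is why only the innermost dimension changes from num_hiddens to ffn_num_outputs.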

Updated 2026-05-15

Tags

D2L

Dive into Deep Learning @ D2L