Feed-Forward Network (FFN) Formula and Component Dimensions in Transformers
In a Transformer architecture, the Feed-Forward Network (FFN) sub-layer is typically implemented as a two-layer network. The standard mathematical formulation for this FFN is:

FFN(h) = σ(h * W_h + b_h) * W_f + b_f

Here, h is the input vector. The network's parameters consist of:
- W_h (shape d × d_h) and b_h (shape d_h): the weight matrix and bias vector for the initial linear transformation.
- W_f (shape d_h × d) and b_f (shape d): the weight matrix and bias vector for the subsequent linear transformation.
The dimension d represents the input and output size, whereas d_h indicates the hidden layer's size. The function σ is the non-linear activation function utilized in the hidden layer, with the Rectified Linear Unit (ReLU) being a widespread choice.
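The formula above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation; it assumes ReLU as σ and uses the d = 512, d_h = 2048 dimensions that appear in the related card below. The parameter names mirror the formula (W_h, b_h, W_f, b_f).

```python
import numpy as np

d, d_h = 512, 2048  # model (input/output) dimension and hidden dimension

# Parameters: W_h projects up to the hidden size, W_f projects back down.
rng = np.random.default_rng(0)
W_h, b_h = rng.normal(scale=0.02, size=(d, d_h)), np.zeros(d_h)
W_f, b_f = rng.normal(scale=0.02, size=(d_h, d)), np.zeros(d)

def ffn(h):
    """FFN(h) = ReLU(h W_h + b_h) W_f + b_f, applied position-wise."""
    hidden = np.maximum(0.0, h @ W_h + b_h)  # ReLU activation
    return hidden @ W_f + b_f

h = rng.normal(size=d)
out = ffn(h)
assert out.shape == h.shape  # output dimension matches input dimension d
```

Note that W_f must have shape d_h × d so the hidden vector of size d_h maps back to the model dimension d.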

Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Feed-Forward Network (FFN) Formula and Component Dimensions in Transformers
An engineer is building a deep neural network for sequence processing. Each layer of the network consists of a self-attention mechanism followed by a position-wise sub-layer. The engineer designs this position-wise sub-layer to be composed of two consecutive linear transformations. What is the most significant negative consequence of omitting a non-linear activation function between these two linear transformations?
Analysis of a Position-Wise Sub-Layer
A researcher modifies the position-wise sub-layer within a sequence processing model. The standard design for this sub-layer is a sequence of: a linear transformation, a non-linear activation, and a second linear transformation. The researcher's modification adds a second non-linear activation function immediately after the final linear transformation. Which of the following best evaluates the impact of this architectural change?
FFN Hidden Size in Transformers
Learn After
ReLU (Rectified Linear Unit)
Importance of Activation Function Design in Wide FFNs
In a standard two-layer feed-forward network (FFN) within a Transformer, an input vector h has a dimension of d = 512. The network's hidden layer has a dimension of d_h = 2048. The FFN is defined by the operation: Output = σ(h * W_h + b_h) * W_f + b_f, where σ is a non-linear activation function. What must be the dimensions of the weight matrix W_f for the output vector to have the same dimension as the input vector h?
Troubleshooting FFN Dimension Mismatch
A standard Feed-Forward Network (FFN) in a Transformer model processes an input vector h of dimension d using the formula: FFN(h) = σ(h * W_h + b_h) * W_f + b_f. The intermediate hidden layer has a dimension d_h. Match each component from the formula to its correct description.
You’re debugging a Transformer block in an interna...
You are reviewing a teammate’s implementation of a...
You’re implementing a single Transformer block in ...
Design a Transformer Block Spec for a New Internal LLM Library (Shapes + Norm Placement)
Diagnosing a Transformer Block Refactor: Attention/FFN Shapes and Norm Placement
Choosing Pre-Norm vs Post-Norm for a Deep Transformer: Stability, Shapes, and Sub-layer Semantics
Root-Cause Analysis of Training Instability After a “Minor” Transformer Block Change
Production Bug Triage: Transformer Block Norm Placement vs Attention/FFN Interface Contracts
Post-Norm vs Pre-Norm Migration: Verifying Tensor Shapes and Correct Sub-layer Wiring
Incident Review: Silent Performance Regression After “Optimization” of a Transformer Block