Learn Before
  • Purpose and Structure of the Feed-Forward Network (FFN) in Transformers

Feed-Forward Network (FFN) Formula and Component Dimensions in Transformers

The standard Feed-Forward Network (FFN) used in Transformer sub-layers is a two-layer network. Its operation can be expressed by the formula:

\text{FFN}(\mathbf{h}) = \sigma(\mathbf{h}\mathbf{W}_h + \mathbf{b}_h)\mathbf{W}_f + \mathbf{b}_f

In this equation, \mathbf{h} is the input vector of dimension d. The components are defined with the following dimensions:

  • First Layer: A linear transformation with a weight matrix \mathbf{W}_h \in \mathbb{R}^{d \times d_h} and a bias vector \mathbf{b}_h \in \mathbb{R}^{d_h}, followed by a non-linear activation function \sigma.
  • Second Layer: A linear transformation with a weight matrix \mathbf{W}_f \in \mathbb{R}^{d_h \times d} and a bias vector \mathbf{b}_f \in \mathbb{R}^{d}, which projects the hidden representation back to the original dimension d.

The dimension of the hidden layer, d_h, is typically larger than the input/output dimension d; a factor of 4 is common (for example, d = 512 with d_h = 2048 in the original Transformer). A common choice for the activation function \sigma(\cdot) in the hidden layer is the Rectified Linear Unit (ReLU).
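The formula and dimensions above can be sketched in NumPy. This is a minimal illustration, not a production implementation; the sizes d = 512 and d_h = 2048 and the small random initialization scale are assumptions chosen for demonstration.

```python
import numpy as np

# Assumed sizes for illustration: model dimension d, hidden dimension d_h.
d, d_h = 512, 2048
rng = np.random.default_rng(0)

W_h = rng.standard_normal((d, d_h)) * 0.02  # first-layer weights, shape (d, d_h)
b_h = np.zeros(d_h)                         # first-layer bias, shape (d_h,)
W_f = rng.standard_normal((d_h, d)) * 0.02  # second-layer weights, shape (d_h, d)
b_f = np.zeros(d)                           # second-layer bias, shape (d,)

def relu(x):
    # ReLU activation: max(0, x) elementwise.
    return np.maximum(0.0, x)

def ffn(h):
    # FFN(h) = ReLU(h @ W_h + b_h) @ W_f + b_f
    return relu(h @ W_h + b_h) @ W_f + b_f

h = rng.standard_normal(d)   # input vector of dimension d
out = ffn(h)
assert out.shape == (d,)     # output dimension matches the input dimension
```

Because the second layer maps from d_h back to d, the FFN's output has the same shape as its input, which is what allows it to be wrapped in a residual connection inside a Transformer block.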

Tags
  • Ch.2 Generative Models - Foundations of Large Language Models
  • Foundations of Large Language Models
  • Foundations of Large Language Models Course
  • Computing Sciences

Related
  • FFN Hidden Size (d_{ffn}) in Transformers


  • An engineer is building a deep neural network for sequence processing. Each layer of the network consists of a self-attention mechanism followed by a position-wise sub-layer. The engineer designs this position-wise sub-layer to be composed of two consecutive linear transformations. What is the most significant negative consequence of omitting a non-linear activation function between these two linear transformations?

  • Analysis of a Position-Wise Sub-Layer

  • A researcher modifies the position-wise sub-layer within a sequence processing model. The standard design for this sub-layer is a sequence of: a linear transformation, a non-linear activation, and a second linear transformation. The researcher's modification adds a second non-linear activation function immediately after the final linear transformation. Which of the following best evaluates the impact of this architectural change?

Learn After
  • ReLU (Rectified Linear Unit)

  • Importance of Activation Function Design in Wide FFNs

  • In a standard two-layer feed-forward network (FFN) within a Transformer, an input vector h has a dimension of d = 512. The network's hidden layer has a dimension of d_h = 2048. The FFN is defined by the operation: Output = σ(h * W_h + b_h) * W_f + b_f, where σ is a non-linear activation function. What must be the dimensions of the weight matrix W_f for the output vector to have the same dimension as the input vector h?

  • Troubleshooting FFN Dimension Mismatch

  • A standard Feed-Forward Network (FFN) in a Transformer model processes an input vector h of dimension d using the formula: FFN(h) = σ(h * W_h + b_h) * W_f + b_f. The intermediate hidden layer has a dimension d_h. Match each component from the formula to its correct description.

  • You’re debugging a Transformer block in an interna...

  • You are reviewing a teammate’s implementation of a...

  • You’re implementing a single Transformer block in ...

  • Design a Transformer Block Spec for a New Internal LLM Library (Shapes + Norm Placement)

  • Diagnosing a Transformer Block Refactor: Attention/FFN Shapes and Norm Placement

  • Choosing Pre-Norm vs Post-Norm for a Deep Transformer: Stability, Shapes, and Sub-layer Semantics

  • Root-Cause Analysis of Training Instability After a “Minor” Transformer Block Change

  • Production Bug Triage: Transformer Block Norm Placement vs Attention/FFN Interface Contracts

  • Post-Norm vs Pre-Norm Migration: Verifying Tensor Shapes and Correct Sub-layer Wiring

  • Incident Review: Silent Performance Regression After “Optimization” of a Transformer Block