Troubleshooting FFN Dimension Mismatch
A Transformer's Feed-Forward Network (FFN) takes an input vector of dimension d = 768 and processes it through a hidden layer of dimension d_h = 3072 before producing an output vector of the same dimension as the input (d = 768). The intermediate vector, after the first linear transformation and activation function, correctly has a dimension of 3072. If a dimension mismatch error occurs during the second linear transformation, what are the required dimensions for the second weight matrix (W_f) to resolve this error? Explain your reasoning based on the rules of matrix multiplication.
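By the rules of matrix multiplication, the inner dimensions of the two operands must match: the intermediate activation is a 1 × 3072 row vector, so W_f needs 3072 rows, and producing a 768-dimensional output requires 768 columns. W_f must therefore be 3072 × 768. Below is a minimal NumPy sketch of this check (an illustrative example; the variable names are not taken from any particular implementation):

    import numpy as np

    d, d_h = 768, 3072

    h   = np.random.randn(1, d)     # input vector, shape (1, 768)
    W_h = np.random.randn(d, d_h)   # first weight matrix, shape (768, 3072)
    b_h = np.zeros(d_h)
    W_f = np.random.randn(d_h, d)   # second weight matrix: must be (3072, 768)
    b_f = np.zeros(d)

    hidden = np.maximum(h @ W_h + b_h, 0.0)  # ReLU activation, shape (1, 3072)
    out = hidden @ W_f + b_f                 # shape (1, 768), same as the input

    assert out.shape == (1, d)

    # Any other leading dimension for W_f, e.g. (768, 3072), raises a
    # ValueError because the inner dimensions (3072 vs. 768) do not match.

If W_f were accidentally created with the same shape as W_h, the second multiplication would attempt (1, 3072) @ (768, 3072), which is exactly the dimension mismatch described above.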
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
ReLU (Rectified Linear Unit)
Importance of Activation Function Design in Wide FFNs
In a standard two-layer feed-forward network (FFN) within a Transformer, an input vector h has a dimension of d = 512. The network's hidden layer has a dimension of d_h = 2048. The FFN is defined by the operation: Output = σ(h * W_h + b_h) * W_f + b_f, where σ is a non-linear activation function. What must be the dimensions of the weight matrix W_f for the output vector to have the same dimension as the input vector h?
Troubleshooting FFN Dimension Mismatch
A standard Feed-Forward Network (FFN) in a Transformer model processes an input vector h of dimension d using the formula: FFN(h) = σ(h * W_h + b_h) * W_f + b_f. The intermediate hidden layer has a dimension d_h. Match each component from the formula to its correct description.
You're debugging a Transformer block in an interna...
You are reviewing a teammate's implementation of a...
You're implementing a single Transformer block in ...
Design a Transformer Block Spec for a New Internal LLM Library (Shapes + Norm Placement)
Diagnosing a Transformer Block Refactor: Attention/FFN Shapes and Norm Placement
Choosing Pre-Norm vs Post-Norm for a Deep Transformer: Stability, Shapes, and Sub-layer Semantics
Root-Cause Analysis of Training Instability After a “Minor” Transformer Block Change
Production Bug Triage: Transformer Block Norm Placement vs Attention/FFN Interface Contracts
Post-Norm vs Pre-Norm Migration: Verifying Tensor Shapes and Correct Sub-layer Wiring
Incident Review: Silent Performance Regression After “Optimization” of a Transformer Block