In a standard two-layer feed-forward network (FFN) within a Transformer, an input vector h has a dimension of d = 512. The network's hidden layer has a dimension of d_h = 2048. The FFN is defined by the operation: Output = σ(h * W_h + b_h) * W_f + b_f, where σ is a non-linear activation function. What must be the dimensions of the weight matrix W_f for the output vector to have the same dimension as the input vector h?
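Since the hidden activation σ(h * W_h + b_h) has dimension d_h = 2048 and the output must return to d = 512, W_f must have shape d_h × d, i.e. 2048 × 512. A minimal NumPy shape check (illustrative variable names; ReLU is assumed here as the non-linearity σ):

```python
import numpy as np

d, d_h = 512, 2048

h   = np.random.randn(d)        # input vector, shape (512,)
W_h = np.random.randn(d, d_h)   # first projection, shape (512, 2048)
b_h = np.zeros(d_h)
W_f = np.random.randn(d_h, d)   # second projection must be (2048, 512)
b_f = np.zeros(d)

hidden = np.maximum(h @ W_h + b_h, 0.0)  # ReLU, shape (2048,)
out    = hidden @ W_f + b_f              # shape (512,), same as the input h
assert out.shape == h.shape
```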
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Application in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
ReLU (Rectified Linear Unit)
Importance of Activation Function Design in Wide FFNs
Troubleshooting FFN Dimension Mismatch
You’re debugging a Transformer block in an interna...
You are reviewing a teammate’s implementation of a...
You’re implementing a single Transformer block in ...
Design a Transformer Block Spec for a New Internal LLM Library (Shapes + Norm Placement)
Diagnosing a Transformer Block Refactor: Attention/FFN Shapes and Norm Placement
Choosing Pre-Norm vs Post-Norm for a Deep Transformer: Stability, Shapes, and Sub-layer Semantics
Root-Cause Analysis of Training Instability After a “Minor” Transformer Block Change
Production Bug Triage: Transformer Block Norm Placement vs Attention/FFN Interface Contracts
Post-Norm vs Pre-Norm Migration: Verifying Tensor Shapes and Correct Sub-layer Wiring
Incident Review: Silent Performance Regression After “Optimization” of a Transformer Block