An engineer is designing a neural network for a large language model and observes that the two-layer Feed-Forward Network (FFN) component is the primary computational bottleneck during training. The design specifies that the FFN's internal hidden layer dimension must be significantly larger than its input and output dimensions to ensure high model capacity. Given the goal of reducing the computational cost of the FFN while preserving its expressive power, which of the following design choices for the non-linear activation function (applied after the first linear layer) would be most effective?
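To ground the trade-off the question describes, here is a minimal sketch (illustrative dimensions, not values from the question) of a per-token FLOP estimate for a two-layer FFN of the form FFN(x) = W2 · act(W1 · x). It shows that the two projections through the wide hidden layer dominate compute, while the elementwise activation itself is negligible.

```python
def ffn_flops(d_model: int, d_hidden: int) -> int:
    """Approximate FLOPs per token for FFN(x) = W2 @ act(W1 @ x).

    Each linear layer costs roughly 2 * in_dim * out_dim FLOPs
    (one multiply and one add per weight); the elementwise
    activation costs O(d_hidden), which is negligible next to
    the matrix multiplies when d_hidden is large.
    """
    up = 2 * d_model * d_hidden    # first projection W1: d_model -> d_hidden
    down = 2 * d_hidden * d_model  # second projection W2: d_hidden -> d_model
    act = d_hidden                 # elementwise activation (e.g. ReLU/GELU)
    return up + down + act

# With a typical 4x expansion (illustrative: d_model=1024, d_hidden=4096),
# the two projections account for essentially all of the cost:
print(ffn_flops(1024, 4096))  # 16781312, of which only 4096 is the activation
```

This is why the question frames the activation choice as a lever on total FFN cost: an activation (or gating scheme) that lets the hidden dimension shrink without losing expressiveness cuts both projection terms, whereas merely swapping one cheap elementwise function for another barely moves the total.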
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Evaluation in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Evaluating FFN Design Trade-offs in a Resource-Constrained LLM Project
Analyzing the Impact of FFN Width on Activation Function Choice
Computational Impact of Activation Functions in Wide FFNs