Multiple Choice

An engineer is designing a neural network for a large language model and observes that the two-layer Feed-Forward Network (FFN) component is the primary computational bottleneck during training. The design specifies that the FFN's internal hidden layer dimension must be significantly larger than its input and output dimensions to ensure high model capacity. Given the goal of reducing the computational cost of the FFN while preserving its expressive power, which of the following design choices for the non-linear activation function (applied after the first linear layer) would be most effective?
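For orientation, the cost structure described in the question can be sketched as a rough operation count. This is a minimal sketch, not part of the original item: the function names `ffn_matmul_flops` and `activation_evals` and the dimensions 1024 and 4096 are hypothetical, chosen for illustration only. It counts a multiply-add as two floating-point operations and ignores biases.

```python
def ffn_matmul_flops(d_model: int, d_hidden: int, n_tokens: int = 1) -> int:
    """Approximate FLOPs spent in the two linear layers of a two-layer FFN.

    Assumes each multiply-add counts as 2 FLOPs and ignores bias terms.
    """
    # First linear layer: (n_tokens x d_model) @ (d_model x d_hidden)
    first = 2 * n_tokens * d_model * d_hidden
    # Second linear layer: (n_tokens x d_hidden) @ (d_hidden x d_model)
    second = 2 * n_tokens * d_hidden * d_model
    return first + second


def activation_evals(d_hidden: int, n_tokens: int = 1) -> int:
    """Number of element-wise activation evaluations (one per hidden unit per token)."""
    return n_tokens * d_hidden


# Hypothetical sizes: hidden layer 4x wider than the input/output dimension.
d_model, d_hidden = 1024, 4096
matmul_cost = ffn_matmul_flops(d_model, d_hidden)
act_count = activation_evals(d_hidden)
print(matmul_cost, act_count)
```

Under these assumed sizes, the activation function is evaluated once per hidden unit per token, while both linear layers scale with the product of the input/output dimension and the hidden dimension.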

Updated 2025-09-28

Tags

Ch.2 Generative Models - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Evaluation in Bloom's Taxonomy

Cognitive Psychology

Psychology

Social Science

Empirical Science

Science