Learn Before
Optimizing Transformer Model Size
Evaluate the two strategies described in the case study. Which one is more likely to preserve the model's performance, and why? Justify your answer based on the role of the feed-forward network's intermediate layer.
Tags
Ch.1 Pre-training - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Evaluation in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
A team of engineers is designing a large neural network for a complex language task. Within each block of their model, they use a sub-network composed of two linear transformations with a non-linearity in between. They are debating whether to make the dimensionality of the intermediate layer in this sub-network significantly larger (e.g., four times larger) than the model's primary embedding and hidden state dimension. What is the primary trade-off they must consider when making this decision?
Optimizing Transformer Model Size
Calculating Parameter Impact of FFN Expansion
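The trade-off in the related question above is quantifiable: the feed-forward sub-network's parameter count grows linearly with the expansion factor of its intermediate layer. The sketch below is a minimal illustration of that arithmetic; the function name `ffn_params` and the example dimensions are assumptions for illustration, not part of the original question.

```python
def ffn_params(d_model, expansion=4, bias=True):
    """Parameter count for a two-layer FFN: d_model -> d_ff -> d_model.

    `expansion` is the ratio of the intermediate dimension d_ff
    to the model's embedding/hidden dimension d_model (4x is common
    in Transformer architectures).
    """
    d_ff = expansion * d_model
    weights = d_model * d_ff + d_ff * d_model  # up- and down-projection matrices
    biases = (d_ff + d_model) if bias else 0   # one bias vector per projection
    return weights + biases

# Hypothetical d_model = 1024: a 4x intermediate layer holds roughly
# twice the FFN parameters of a 2x one.
print(ffn_params(1024, expansion=4))  # 8,393,728
print(ffn_params(1024, expansion=2))  # 4,197,376
```

This makes the engineers' dilemma concrete: a wider intermediate layer increases the FFN's representational capacity but inflates parameter count, memory footprint, and compute per token in direct proportion to the expansion factor.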