Short Answer

Hyperparameter Tuning Trade-offs

An engineer is trying to increase the capacity of a Transformer encoder to better handle a complex language task. They are considering two options:

Option A: Double the hidden size (d).

Option B: Double the number of attention heads (n_head).

Compare these two options. Explain the likely impact of each on the model's learning capabilities and computational cost. Which option would you recommend if the primary goal is to allow the model to focus on more varied and nuanced relationships within the input text? Justify your choice.
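A rough way to see the computational-cost side of the comparison is to count the parameters in a single encoder layer under each option. The sketch below is illustrative, not part of the question: it assumes the standard multi-head design in which the per-head dimension is d / n_head (so the attention projections are d x d regardless of head count) and the feed-forward inner size is 4d; the function name and the example sizes are hypothetical.

```python
# Hypothetical sketch: parameter count of one standard Transformer encoder
# layer, biases and layer norms omitted for simplicity.
def encoder_layer_params(d: int, n_head: int) -> int:
    assert d % n_head == 0, "d must be divisible by n_head"
    # Q, K, V, and output projections: four d x d weight matrices.
    # Splitting into heads slices these matrices; it does not enlarge them.
    attn = 4 * d * d
    # Position-wise FFN: d -> 4d -> d.
    ffn = d * (4 * d) + (4 * d) * d
    return attn + ffn

base     = encoder_layer_params(d=512,  n_head=8)   # baseline
option_a = encoder_layer_params(d=1024, n_head=8)   # double hidden size
option_b = encoder_layer_params(d=512,  n_head=16)  # double head count

print(option_a / base)  # 4.0 -- doubling d quadruples layer parameters
print(option_b / base)  # 1.0 -- doubling heads leaves parameters unchanged
```

Under these assumptions, doubling d roughly quadruples parameters and compute per layer, while doubling n_head keeps the parameter count fixed (each head just becomes narrower), which is one way to frame the trade-off the question asks about.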


Updated 2025-10-03


Tags

Ch.1 Pre-training - Foundations of Large Language Models
