Short Answer

Hyperparameter Tuning Trade-offs

An engineer is trying to increase the capacity of a Transformer encoder to better handle a complex language task. They are considering two options:

Option A: Double the hidden size (d).

Option B: Double the number of attention heads (n_head).

Compare these two options. Explain the likely impact of each on the model's learning capabilities and computational cost. Which option would you recommend if the primary goal is to allow the model to focus on more varied and nuanced relationships within the input text? Justify your choice.
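A rough way to see the computational-cost side of the comparison is to count the parameters in a single encoder layer under each option. The sketch below is illustrative, not part of the question: it assumes the standard multi-head design in which the per-head dimension is d / n_head (so the attention projections are d x d regardless of head count) and the feed-forward inner size is 4d; the function name and the example sizes are hypothetical.

```python
# Hypothetical sketch: parameter count of one standard Transformer encoder
# layer, biases and layer norms omitted for simplicity.
def encoder_layer_params(d: int, n_head: int) -> int:
    assert d % n_head == 0, "d must be divisible by n_head"
    # Q, K, V, and output projections: four d x d weight matrices.
    # Splitting into heads slices these matrices; it does not enlarge them.
    attn = 4 * d * d
    # Position-wise FFN: d -> 4d -> d.
    ffn = d * (4 * d) + (4 * d) * d
    return attn + ffn

base     = encoder_layer_params(d=512,  n_head=8)   # baseline
option_a = encoder_layer_params(d=1024, n_head=8)   # double hidden size
option_b = encoder_layer_params(d=512,  n_head=16)  # double head count

print(option_a / base)  # 4.0 -- doubling d quadruples layer parameters
print(option_b / base)  # 1.0 -- doubling heads leaves parameters unchanged
```

Under these assumptions, doubling d roughly quadruples parameters and compute per layer, while doubling n_head keeps the parameter count fixed (each head just becomes narrower), which is one way to frame the trade-off the question asks about.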


Updated 2025-10-03


Tags

Ch.1 Pre-training - Foundations of Large Language Models
