Learn Before
Hyperparameter Tuning Trade-offs
An engineer is trying to increase the capacity of a Transformer encoder to better handle a complex language task. They are considering two options:
Option A: Double the hidden size (d_model).
Option B: Double the number of attention heads (h).
Compare these two options. Explain the likely impact of each on the model's learning capabilities and computational cost. Which option would you recommend if the primary goal is to allow the model to focus on more varied and nuanced relationships within the input text? Justify your choice.
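For a concrete sense of the computational-cost side of the comparison, the following is a minimal sketch assuming PyTorch's nn.TransformerEncoderLayer; the baseline sizes (hidden size 512, 8 heads, feed-forward size 2048) are illustrative assumptions, not values given in the question. It counts the trainable parameters of a single encoder layer under each option.

```python
import torch.nn as nn

def encoder_layer_params(d_model: int, nhead: int, dim_ff: int) -> int:
    """Count the trainable parameters in one Transformer encoder layer."""
    layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead,
                                       dim_feedforward=dim_ff)
    return sum(p.numel() for p in layer.parameters())

# Baseline sizes below are assumed for illustration only.
base     = encoder_layer_params(d_model=512,  nhead=8,  dim_ff=2048)
option_a = encoder_layer_params(d_model=1024, nhead=8,  dim_ff=2048)  # Option A: double d_model
option_b = encoder_layer_params(d_model=512,  nhead=16, dim_ff=2048)  # Option B: double heads

print(f"baseline (d_model=512, h=8):  {base:,}")
print(f"Option A (d_model=1024, h=8): {option_a:,}")  # parameter count grows sharply (~2.7x here)
print(f"Option B (d_model=512, h=16): {option_b:,}")  # unchanged: each head just gets a narrower slice
```

Under these assumptions, Option A multiplies the layer's parameters (and roughly its per-token compute) by about 2.7x, while Option B leaves the parameter count unchanged because the projection matrices keep the same shape and each head simply attends over a narrower slice of the hidden vector.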
Tags
Ch.1 Pre-training - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Evaluation in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Hidden Size in Transformer Models
A machine learning engineer is designing a Transformer encoder for a complex language task. Their primary goal is to improve the model's ability to capture diverse and varied contextual relationships (e.g., syntactic, semantic, co-reference) from different parts of the input sequence simultaneously. Which hyperparameter adjustment would most directly address this specific goal?
Hyperparameter Tuning Trade-offs
An engineer is configuring a Transformer encoder. Match each key hyperparameter to its specific architectural role.
FFN Hidden Size in Transformers
Vocabulary Size in Transformers
Model Depth in Transformers
Number of Attention Heads
Embedding Size in Transformer Models