Learn Before
Optimizing Attention Mechanisms for Different Applications
An engineering team is designing two different language models, both of which will use a grouped-query attention architecture with 32 query heads. The number of key-value groups can be adjusted to balance model quality against computational cost.
- Model A is intended for a real-time translation application on a mobile device, where inference speed and low memory usage are the highest priorities.
- Model B is being trained for a scientific discovery platform to analyze complex research papers, where achieving the maximum possible accuracy is the most important goal, and computational resources are not a major constraint.
For each model, recommend an appropriate number of key-value groups and justify your choice by explaining the trade-off you are making.
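To reason about the trade-off concretely, it helps to see how the key-value-cache size during inference scales with the number of key-value groups. The sketch below is illustrative: the layer count, head dimension, sequence length, and bytes-per-value are hypothetical values, not part of the question.

```python
def kv_cache_bytes(num_kv_groups, seq_len=2048, num_layers=24,
                   head_dim=128, bytes_per_value=2):
    """Approximate KV-cache size for one sequence: two tensors (K and V)
    per layer, each of shape [seq_len, num_kv_groups, head_dim].
    All dimensions here are hypothetical, chosen for illustration."""
    return 2 * num_layers * seq_len * num_kv_groups * head_dim * bytes_per_value

# One extreme: 1 key-value group (multi-query attention).
# All 32 query heads share a single K/V projection: smallest cache,
# fastest decoding, but the least representational flexibility.
mqa_cache = kv_cache_bytes(num_kv_groups=1)

# Other extreme: 32 groups (standard multi-head attention).
# Every query head gets its own K/V projection: maximum quality,
# but the cache is 32x larger.
mha_cache = kv_cache_bytes(num_kv_groups=32)

print(mha_cache // mqa_cache)  # cache size grows linearly with group count
```

The cache (and the memory bandwidth needed to read it at every decoding step) scales linearly with the number of groups, which is why a small group count favors Model A's latency and memory constraints, while a large group count favors Model B's accuracy goal.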
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Evaluation in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
An engineer is designing a large language model and is deciding on the architecture for its attention layers. The model is configured to have 64 query heads. The engineer uses an attention variant where these query heads are partitioned into groups, and all heads within a group share the same key and value projections. If the engineer sets the number of key-value groups to 1, which statement best analyzes the resulting configuration?
Optimizing Attention Mechanisms for Different Applications
An engineer is configuring an attention layer with 32 query heads. This layer uses a grouped-query approach where query heads are partitioned into groups, with each group sharing a single key and value projection. Match each configuration for the number of key-value groups to its resulting characteristic.