Case Study

Optimizing Attention Mechanisms for Different Applications

An engineering team is designing two different language models, both of which will use a grouped-query attention (GQA) architecture with 32 query heads. The number of key-value groups can be tuned to balance model quality against computational cost.

  • Model A is intended for a real-time translation application on a mobile device, where inference speed and low memory usage are the highest priorities.
  • Model B is being trained for a scientific discovery platform to analyze complex research papers, where achieving the maximum possible accuracy is the most important goal, and computational resources are not a major constraint.

For each model, recommend an appropriate number of key-value groups and justify your choice by explaining the trade-off you are making.
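One way to make the trade-off concrete is to estimate the key-value cache footprint at inference time, since it grows linearly with the number of key-value groups. The sketch below uses hypothetical dimensions (32 layers, a 4096-token context, 128-dimensional heads, fp16 weights) that are not given in the case study; it is meant only to illustrate the scale of the memory difference between the extremes of 1 group (multi-query attention, a natural candidate for Model A) and 32 groups (full multi-head attention, a natural candidate for Model B).

```python
def kv_cache_bytes(num_kv_groups, num_layers=32, seq_len=4096,
                   head_dim=128, bytes_per_param=2):
    # Per layer, the cache stores one key tensor and one value tensor
    # (hence the leading factor of 2) of shape
    # [seq_len, num_kv_groups, head_dim] at bytes_per_param each.
    return 2 * num_layers * seq_len * num_kv_groups * head_dim * bytes_per_param

# Compare candidate configurations for the 32-query-head models.
for groups, label in [(1, "MQA  (Model A candidate)"),
                      (8, "GQA  (middle ground)"),
                      (32, "MHA  (Model B candidate)")]:
    gib = kv_cache_bytes(groups) / 2**30
    print(f"{groups:>2} KV groups -> {gib:5.2f} GiB  {label}")
```

Under these assumed dimensions, cutting from 32 groups to 1 shrinks the cache by a factor of 32 (roughly 2 GiB down to 64 MiB), which is why fewer key-value groups favor memory-constrained, latency-sensitive deployments like Model A, while more groups preserve the per-head key-value diversity that benefits accuracy-critical workloads like Model B.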

Updated 2025-10-05

Tags: Ch.2 Generative Models - Foundations of Large Language Models; Foundations of Large Language Models Course; Evaluation in Bloom's Taxonomy