Short Answer

Determining the Number of Attention Heads

In a multi-head attention mechanism, the input representation has a dimension of 768. The weight matrix used to compute the 'key' vector for a single attention head has a shape of $768 \times 96$. Assuming the total dimension of the 'key' projection across all heads is equal to the input representation dimension, how many attention heads are being used? Explain your reasoning.
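Since the question reduces to a single division, here is a minimal sketch of the reasoning, assuming the standard multi-head attention convention that the $h$ per-head key projections of width $d_k$ concatenate to the full $d_{\text{model}}$-dimensional key, so $h \cdot d_k = d_{\text{model}}$. The NumPy shape check is illustrative only; the variable names are not from the source.

```python
import numpy as np

# Given quantities from the question.
d_model = 768  # input representation dimension
d_k = 96       # per-head key dimension, from the 768 x 96 weight matrix

# Under the stated assumption, the h per-head key projections together
# span d_model, so h = d_model / d_k.
assert d_model % d_k == 0, "total key dimension must split evenly across heads"
num_heads = d_model // d_k
print(num_heads)  # -> 8

# Shape check: stacking one 768 x 96 key matrix per head and concatenating
# the per-head key vectors reproduces the full 768-dimensional key.
rng = np.random.default_rng(0)
W_k = [rng.standard_normal((d_model, d_k)) for _ in range(num_heads)]
x = rng.standard_normal(d_model)
keys = np.concatenate([x @ W for W in W_k])
assert keys.shape == (d_model,)  # 8 heads x 96 dims = 768
```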



Tags

Ch.2 Generative Models - Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Analysis in Bloom's Taxonomy