Short Answer

Determining Weight Matrix Dimensions in Multi-Head Attention

A multi-head attention layer is configured with 12 parallel attention heads. The output of each head is a 64-dimensional vector. After concatenating the outputs from all heads, the resulting vector is multiplied by a final weight matrix to produce the layer's final 768-dimensional output vector. What must be the dimensions of this final weight matrix?
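The dimensions follow from the concatenation step: 12 heads each emit a 64-dimensional vector, so the concatenated vector has 12 × 64 = 768 dimensions, and mapping it to a 768-dimensional output requires a 768 × 768 weight matrix. A minimal NumPy sketch of the shape arithmetic (variable names are illustrative, not from any particular library):

```python
import numpy as np

num_heads, head_dim, d_model = 12, 64, 768

# Each head produces a 64-dimensional output vector.
head_outputs = [np.random.randn(head_dim) for _ in range(num_heads)]

# Concatenating 12 heads of size 64 gives a 768-dimensional vector.
concat = np.concatenate(head_outputs)  # shape: (768,)

# The final projection maps 768 -> 768, so W_O must be 768 x 768.
W_O = np.random.randn(num_heads * head_dim, d_model)
output = concat @ W_O  # shape: (768,)

print(concat.shape, W_O.shape, output.shape)
```

Note that the matrix is square only because the model dimension (768) happens to equal the concatenated head dimension; in general the output projection has shape (num_heads × head_dim, d_model).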

Updated 2025-10-08

Tags

Ch.2 Generative Models - Foundations of Large Language Models
