Learn Before
Determining Weight Matrix Dimensions in Multi-Head Attention
A multi-head attention layer is configured with 12 parallel attention heads. The output of each head is a 64-dimensional vector. After concatenating the outputs from all heads, the resulting vector is multiplied by a final weight matrix to produce the layer's final 768-dimensional output vector. What must be the dimensions of this final weight matrix?
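To make the shape reasoning concrete, here is a minimal NumPy sketch of the setup above; the head outputs and the weight matrix W_O are random stand-ins, and the variable names are illustrative:

    import numpy as np

    num_heads, head_dim, model_dim = 12, 64, 768  # values given in the question

    # Each head produces a 64-dimensional output for one token (random stand-ins).
    head_outputs = [np.random.randn(head_dim) for _ in range(num_heads)]

    # Concatenating 12 vectors of 64 dimensions yields a 768-dimensional vector.
    concat = np.concatenate(head_outputs)                   # shape: (768,)

    # To map the 768-dimensional concatenation to the 768-dimensional output,
    # the final weight matrix must be 768 x 768.
    W_O = np.random.randn(num_heads * head_dim, model_dim)  # shape: (768, 768)

    output = concat @ W_O                                   # shape: (768,)
    print(concat.shape, W_O.shape, output.shape)            # (768,) (768, 768) (768,)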
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Application in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
A multi-head attention layer in a model has 8 parallel attention heads. For a single input token, the output from each of these 8 heads is a vector with 64 dimensions. The mechanism's next step is to concatenate these 8 vectors into a single, larger vector. This larger vector is then multiplied by a final weight matrix to produce the layer's final output vector for that token. What is the dimensionality of the single vector that results from the concatenation step, before the final matrix multiplication is applied?
After each parallel attention head has computed its individual output vector, what is the correct sequence of operations to produce the final output of the multi-head attention layer?
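The two related questions above trace the same pipeline: concatenate the per-head outputs, then multiply by the final weight matrix. Here is a small NumPy sketch of that sequence for the 8-head setup; the vectors and W_O are random stand-ins, and the 512-dimensional output size is an assumption, since the related question does not specify it:

    import numpy as np

    num_heads, head_dim = 8, 64  # setup from the first related question

    # Step 1: each head computes its own 64-dimensional output vector.
    head_outputs = [np.random.randn(head_dim) for _ in range(num_heads)]

    # Step 2: concatenate the 8 vectors: 8 * 64 = 512 dimensions.
    concat = np.concatenate(head_outputs)       # shape: (512,)

    # Step 3: multiply by the final weight matrix to produce the layer's
    # output (assumed square here, 512 -> 512).
    W_O = np.random.randn(concat.shape[0], concat.shape[0])
    output = concat @ W_O                       # shape: (512,)
    print(concat.shape, output.shape)           # (512,) (512,)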