1Cademy - Introduce weight matrices in the transformer

Concept

Introduce weight matrices in the transformer

Once weight matrices are introduced into the transformer, each input $x_{i}$ is projected into a query vector $q_{i}$, a key vector $k_{i}$, and a value vector $v_{i}$, and the output $y_{i}$ is computed as a weighted sum over the value vectors. The dot product of a query and a key can become arbitrarily large, and exponentiating large values in the softmax can cause numerical problems and an effective loss of gradients during training, so the dot product needs to be scaled: $score(x_{i},x_{j}) = \frac{q_{i} \cdot k_{j}}{\sqrt{d_{k}}}$, where $q_{i}$ is the query vector of the current element, $k_{j}$ is the key vector of a preceding element, and $d_{k}$ is the dimensionality of the query and key vectors. Taking this one step further, the whole computation can be written in matrix form: we scale the scores $QK^{T}$ by $\sqrt{d_{k}}$, take the softmax, and multiply the result by $V$, giving $softmax\left(\frac{QK^{T}}{\sqrt{d_{k}}}\right)V$, a matrix of shape $N \times d$ that holds a vector embedding representation for each token in the input. This gives the transformer's self-attention described in an earlier node. Because each layer must compute dot products between every pair of tokens in the input, the cost grows quadratically with the input length $N$, which makes it extremely expensive for the input to a transformer to consist of long documents.
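The computation described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the helper names (`matmul`, `softmax`, `self_attention`), the toy input `X`, and the identity weight matrices are all assumptions made for the example. The `causal` flag is also an assumption, reflecting the text's reference to preceding elements.

```python
import math

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X, W_q, W_k, W_v, causal=True):
    """Scaled dot-product self-attention over N input vectors (hypothetical helper).

    Returns the N output vectors and the attention weights for each position.
    """
    Q = matmul(X, W_q)  # query vectors q_i
    K = matmul(X, W_k)  # key vectors k_j
    V = matmul(X, W_v)  # value vectors v_j
    d_k = len(K[0])     # dimensionality of the query and key vectors
    outputs, weights = [], []
    for i, q in enumerate(Q):
        # With causal=True, position i only scores positions j <= i.
        last = i + 1 if causal else len(K)
        scores = [
            sum(a * b for a, b in zip(q, K[j])) / math.sqrt(d_k)
            for j in range(last)
        ]
        alpha = softmax(scores)
        # y_i is the weighted sum over the value vectors.
        y = [sum(alpha[j] * V[j][c] for j in range(last)) for c in range(len(V[0]))]
        outputs.append(y)
        weights.append(alpha)
    return outputs, weights

# Toy input: N = 3 tokens with d = 2; identity weights are an assumption.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
I2 = [[1.0, 0.0], [0.0, 1.0]]
Y, A = self_attention(X, I2, I2, I2)
```

Here `Y` has one output vector per input token, so its shape is $N \times d$. The nested loop over positions makes the pairwise cost visible: without the causal restriction, every one of the $N$ queries is scored against all $N$ keys.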

Updated 2026-05-02

Contributors are:

JM

Who are from:

University of Michigan - Ann Arbor

Google
