Concept

Introduce weight matrices in the transformer

After introducing weight matrices to the transformer, the output $y_i$ is computed as a weighted sum over the value vectors. To avoid an effective loss of gradients during training, the dot product needs to be scaled in a suitable fashion: $\text{score}(x_i, x_j) = \frac{q_i \cdot k_j}{\sqrt{d_k}}$, where $q_i$ is the query vector for $x_i$, $k_j$ is the key vector of the preceding element $x_j$, and $d_k$ is the dimensionality of the query and key vectors. Taking this one step further, we can stack the queries, keys, and values into matrices $Q$, $K$, and $V$, scale the scores, take the softmax, and multiply the result by $V$, yielding a matrix of shape $N \times d$: a vector embedding representation for each token in the input. This is the self-attention computation of the transformer, building on the attention mechanism from the previous note. Since at each layer we must compute a dot product between every pair of tokens in the input, the cost grows quadratically with sequence length, which makes running a transformer over long documents extremely expensive.
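As a concrete illustration (not from the original note), here is a minimal NumPy sketch of this scaled dot-product self-attention. The function name `self_attention`, the causal mask, and the randomly initialized weight matrices in the usage snippet are assumptions made for the example, not trained parameters.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product (causal) self-attention for a sequence of N tokens.

    X        : (N, d_model) input embeddings
    W_q, W_k : (d_model, d_k) query/key weight matrices
    W_v      : (d_model, d_v) value weight matrix
    returns  : (N, d_v) -- one output vector y_i per input token
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    N, d_k = Q.shape

    # score(x_i, x_j) = q_i . k_j / sqrt(d_k), computed for all pairs at once;
    # this (N, N) matrix is why the cost grows quadratically with sequence length
    scores = Q @ K.T / np.sqrt(d_k)

    # causal mask: each position attends only to itself and preceding tokens
    scores = np.where(np.triu(np.ones((N, N), dtype=bool), k=1), -np.inf, scores)

    # row-wise softmax turns the scaled scores into attention weights
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)

    # each output y_i is a weighted sum over the value vectors
    return weights @ V

# toy usage with random weights (illustrates shapes only)
rng = np.random.default_rng(0)
N, d_model, d = 4, 16, 16
X = rng.normal(size=(N, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d)) for _ in range(3))
Y = self_attention(X, W_q, W_k, W_v)   # Y.shape == (4, 16): one embedding per token
```

The causal mask reflects the "preceding element" restriction used in generative (decoder-only) models; dropping it gives the bidirectional form of self-attention.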


Updated 2026-05-02

Tags

Data Science

Ch.2 Generative Models - Foundations of Large Language Models
