Concept

Self-Attention layer understanding - Step 3 - Values

The previous algorithm can work well, but the paper introduces one more modification. We still compute the scores from the keys and a query and take their softmax. But instead of summing the input vectors weighted by those scores, we add another MLP/matrix, called values. We pass every input vector through this matrix, multiply each resulting value vector by its corresponding score, and add them all up.
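To make the step concrete, here is a minimal sketch in plain Python for a single query. The names `attend` and `w_value`, the toy numbers, and the choice to use the raw inputs as keys are all assumptions made for illustration. The scores are plain dot products, with no extra scaling.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def matvec(matrix, vector):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [dot(row, vector) for row in matrix]


def attend(query, keys, inputs, w_value):
    """Weighted sum of value vectors for one query (illustrative helper)."""
    # Scores: one dot product per key, normalised with softmax.
    weights = softmax([dot(query, key) for key in keys])
    # Values: every input vector is passed through the value matrix.
    values = [matvec(w_value, x) for x in inputs]
    # Output: each value vector times its score, all added up.
    output = [0.0] * len(values[0])
    for weight, value in zip(weights, values):
        for i, component in enumerate(value):
            output[i] += weight * component
    return output, weights


# Toy data (assumed): three 2-d inputs; the keys are simply the inputs here.
inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
keys = inputs
query = [1.0, 0.0]
w_value = [[2.0, 0.0], [0.0, 3.0]]  # hypothetical value matrix

output, weights = attend(query, keys, inputs, w_value)
print(weights)
print(output)
```

If `w_value` is the identity matrix, the value vectors equal the inputs and the result reduces to the previous step's weighted sum of the inputs.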

Updated 2025-09-17

Tags

Data Science