Concept

Self-Attention layer understanding - Step 5 - Adding the time

As you may have noticed, in the current state the order of the words does not matter at all: we can permute the sentence and the result stays the same. Instead of using an RNN to account for order, we can calculate a positional encoding for each timestep and simply add it to the word embeddings (note that this is done once, right after the embedding layer). The positional encoding is constructed so that the vectors projected into Q/K/V keep a meaningful distance between them. Here is an example of how it is calculated for 20 words (rows) with an embedding size of 512 (columns).

[Figure: positional encoding heatmap for 20 positions (rows) and 512 embedding dimensions (columns)]
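Below is a minimal NumPy sketch of how such an encoding can be computed, assuming the standard sinusoidal scheme from the original Transformer paper (sine on even dimensions, cosine on odd ones). The function name positional_encoding and the random word_embeddings array are illustrative placeholders, not part of the original note.

```python
import numpy as np

def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal positional encoding: sin on even dims, cos on odd dims."""
    positions = np.arange(seq_len)[:, np.newaxis]   # (seq_len, 1)
    dims = np.arange(d_model)[np.newaxis, :]        # (1, d_model)
    # Each pair of dimensions shares the frequency 1 / 10000^(2i / d_model)
    angle_rates = 1.0 / np.power(10000, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates                # (seq_len, d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])           # even dimensions
    pe[:, 1::2] = np.cos(angles[:, 1::2])           # odd dimensions
    return pe

# 20 positions (rows) with an embedding size of 512 (columns), as in the figure
pe = positional_encoding(seq_len=20, d_model=512)

# The encoding is added to the word embeddings once, right after the embedding layer
word_embeddings = np.random.randn(20, 512)  # stand-in for the embedding layer output
inputs_with_position = word_embeddings + pe
```

Because neighbouring positions get slowly varying sine/cosine values, positions that are close in the sentence end up with similar encoding vectors, which is what gives the projected Q/K/V vectors a meaningful notion of distance.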

Updated 2025-10-06

Tags

Data Science

Ch.2 Generative Models - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Related