Learn Before
  • Data Point as a d-dimensional Vector

Combining Token and Positional Embeddings

In sequence models like the Transformer, the final input representation for a token is created by summing its semantic embedding with an embedding that encodes its position. The formula is $\mathbf{e}_i = \mathbf{x}_i + \mathbf{PE}(i)$, where $\mathbf{x}_i$ is the token embedding vector for the $i$-th token, $\mathbf{PE}(i)$ is the positional encoding vector for position $i$, and $\mathbf{e}_i$ is the resulting combined vector. This process allows the model to use information about the order of tokens.
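The sketch below illustrates this sum with toy values, assuming the standard sinusoidal positional encoding; the array shapes, `d_model` size, and random token embeddings are illustrative choices, not part of the original text.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Sinusoidal PE(i) for positions 0..seq_len-1 (a common choice of positional encoding)."""
    positions = np.arange(seq_len)[:, None]          # shape (seq_len, 1)
    dims = np.arange(d_model)[None, :]               # shape (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates                 # shape (seq_len, d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])            # even dimensions use sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])            # odd dimensions use cosine
    return pe

# Toy example: 4 tokens, each represented by a d_model-dimensional embedding x_i.
seq_len, d_model = 4, 8
rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(seq_len, d_model))             # x_i (hypothetical values)
pos_encodings = sinusoidal_positional_encoding(seq_len, d_model)   # PE(i)

# e_i = x_i + PE(i): element-wise sum combines token meaning and position.
combined = token_embeddings + pos_encodings
print(combined.shape)  # (4, 8)
```

Because the two vectors share the same dimensionality, the combination is a simple element-wise addition, so the model receives a single vector per token that carries both its meaning and its position.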

Tags

Ch.2 Generative Models - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences

Related
  • Combining Token and Positional Embeddings

  • Positional Encoding Vector