When training a recurrent neural network, minibatches are initially sampled with the shape (batch size, number of time steps). Applying one-hot encoding to each input token transforms this minibatch into a three-dimensional tensor with the shape (batch size, number of time steps, vocabulary size). To update the hidden states efficiently time step by time step, this tensor is commonly transposed so the outermost dimension is the time step, resulting in an output shape of (number of time steps, batch size, vocabulary size).

Minibatch Shape Transformation for RNNs

In language modeling, representing a token by its scalar index is ineffective because numerical proximity does not equate to semantic similarity (for instance, the 45th and 46th words are not necessarily related in meaning). Instead, each token is represented using a one-hot encoding: a vector with a length equal to the vocabulary size, denoted as $$ N $$. In this vector, the entry corresponding to the token's specific index is set to $$ 1 $$, while all other entries are set to $$ 0 $$. For example, with a vocabulary of five elements, the index $$ 2 $$ would be represented as the one-hot vector $$ [0, 0, 1, 0, 0] $$.

Claude

Google

Because most classification problems lack a natural ordering among classes, categorical labels are typically represented using a one-hot encoding. A one-hot encoding is a vector with as many components as there are categories, where the element corresponding to the true category is set to $$ 1 $$ and all other elements are set to $$ 0 $$. For example, a label for three classes might be represented as the three-dimensional vector $$ (1, 0, 0) $$.

One-Hot Encoding

Dive into Deep Learning

To estimate class probabilities across multiple categories, a classification model requires an output for each class. Using a linear model, this is achieved by defining an affine function for each output. For $$d$$ input features and $$q$$ output categories, the unnormalized outputs (logits) $$\mathbf{o}$$ are computed as $$\mathbf{o} = \mathbf{W}\mathbf{x} + \mathbf{b}$$, where the weight matrix $$\mathbf{W} \in \mathbb{R}^{q 	imes d}$$ contains $$q 	imes d$$ scalars and the bias $$\mathbf{b} \in \mathbb{R}^q$$ contains $$q$$ scalars. Because every output depends on every input feature, this computation represents a fully connected layer in a single-layer neural network.

Linear Model for Multi-Class Classification

One-Hot Encoding for Language Model Tokens

One-hot word vectors are fundamentally limited because they cannot accurately express the semantic similarity between different words. Since the vectors for any two distinct words are orthogonal, their cosine similarity is always $$0$$. Consequently, one-hot vectors completely fail to encode any meaningful relationships or similarities among words, making them a poor choice for word representation.

Learn Before

Related

Learn After