When training a recurrent neural network, each minibatch of token indices is initially sampled with the shape (batch size, number of time steps). Applying one-hot encoding to every token transforms this minibatch into a three-dimensional tensor with the shape (batch size, number of time steps, vocabulary size). To update the hidden states efficiently one time step at a time, this tensor is commonly transposed so that the outermost dimension is the time step, giving the shape (number of time steps, batch size, vocabulary size). Indexing the outermost dimension then yields the full (batch size, vocabulary size) input for a single time step.
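The shape change described above can be traced with a minimal sketch in plain Python. Nested lists stand in for tensors so that every dimension is explicit; the function name `one_hot_time_major` and the sample batch are hypothetical, chosen only for illustration, and a real training loop would normally use a tensor library instead.

```python
def one_hot_time_major(batch, vocab_size):
    """Convert a (batch_size, num_steps) nested list of token indices into a
    (num_steps, batch_size, vocab_size) nested list of one-hot vectors."""

    def encode(index):
        # One-hot vector: all zeros except a 1 at the token's index.
        vec = [0] * vocab_size
        vec[index] = 1
        return vec

    # zip(*batch) swaps the batch and time axes, so the outer loop runs
    # over time steps and the inner loop over sequences in the batch.
    return [[encode(index) for index in step] for step in zip(*batch)]


# Hypothetical minibatch: batch_size = 2, num_steps = 3, vocab_size = 5.
batch = [[0, 1, 2],
         [3, 4, 0]]
encoded = one_hot_time_major(batch, vocab_size=5)

shape = (len(encoded), len(encoded[0]), len(encoded[0][0]))
print(shape)       # (3, 2, 5)
print(encoded[0])  # inputs for the first time step, one row per sequence
```

Here `encoded[t]` holds the inputs of every sequence at time step `t`, which is the slice a recurrent update consumes at each step.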

Claude

In language modeling, representing a token by its scalar index is ineffective because numerical proximity does not imply semantic similarity (for instance, the 45th and 46th words in the vocabulary are not necessarily related in meaning). Instead, each token is represented using a one-hot encoding: a vector whose length equals the vocabulary size, denoted as $$ N $$. In this vector, the entry at the token's index is set to $$ 1 $$, while all other entries are set to $$ 0 $$. For example, with a vocabulary of five elements and indices counted from zero, the index $$ 2 $$ would be represented as the one-hot vector $$ [0, 0, 1, 0, 0] $$.
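As a small illustration of this definition, the sketch below builds a one-hot vector from a zero-based index using plain Python lists. The helper name `one_hot` is an assumption made for this example rather than a reference to any particular library function.

```python
def one_hot(index, vocab_size):
    """Return a list of length vocab_size with a 1 at the zero-based
    position `index` and 0 everywhere else."""
    if not 0 <= index < vocab_size:
        raise ValueError("index must satisfy 0 <= index < vocab_size")
    vec = [0] * vocab_size
    vec[index] = 1
    return vec


example = one_hot(2, 5)
print(example)  # [0, 0, 1, 0, 0]
```

Any two distinct one-hot vectors differ in exactly two positions, so neighbouring indices such as 45 and 46 are no closer to each other than any other pair of tokens.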
