In language modeling, the input tokens and the predicted output tokens are drawn from the same vocabulary. Consequently, when tokens are one-hot encoded, the input representation and the output layer have the same dimensionality, which equals the vocabulary size. This property distinguishes language models from sequence-to-sequence tasks such as machine translation, where the source and target vocabularies may differ.
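As a minimal sketch of this shared dimensionality, assuming one-hot input encoding and a toy character vocabulary, the snippet below builds a vocabulary and checks that an input vector and a vector of output scores have the same length. The corpus, the `one_hot` helper, and the placeholder logits are hypothetical and only serve the illustration.

```python
# Hypothetical toy corpus; any text would do.
text = "hello world"

# Build a character vocabulary: each distinct character gets one index.
vocab = sorted(set(text))
char_to_idx = {ch: i for i, ch in enumerate(vocab)}
vocab_size = len(vocab)


def one_hot(index, size):
    """Return a list of length `size` with 1.0 at `index` and 0.0 elsewhere."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


# Input side: a one-hot vector whose length is the vocabulary size.
x = one_hot(char_to_idx["h"], vocab_size)

# Output side: one score (logit) per vocabulary entry, hence the same length.
# Placeholder zeros stand in for what a real model would produce.
logits = [0.0] * vocab_size

print(len(x), len(logits), vocab_size)
```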

Recurrent neural networks (RNNs) can be used to build character-level language models. At each time step, the RNN reads the current character and, through its hidden state, uses all previous characters as context to predict the next character. The recurrence is what allows information about the earlier part of the sequence to be retained and used for these character-by-character predictions.
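The prediction task can be sketched as a list of (context, target) pairs in which the target is always the character that follows the context. The function name `next_char_pairs` and the sample string are hypothetical. Note that an actual RNN does not receive the whole prefix as input at each step. It reads one character at a time and relies on its hidden state to summarize the earlier ones.

```python
def next_char_pairs(text):
    """Yield (context, target) pairs: context is text[:t+1], target is text[t+1]."""
    for t in range(len(text) - 1):
        yield text[: t + 1], text[t + 1]


pairs = list(next_char_pairs("machine"))
for context, target in pairs:
    print(f"{context!r} -> {target!r}")
```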

RNN-Based Character-Level Language Models

Dive into Deep Learning

Training a recurrent neural network (RNN) language model applies a softmax operation to the output layer's result at each time step, producing a probability distribution over the vocabulary. The model's error is then computed with the cross-entropy loss, which compares this distribution against the actual target character for that time step.
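A minimal sketch of this objective in plain Python follows. The logits and target indices are made-up values, and averaging the per-step losses into a single number is an assumption (a common convention, not something stated above).

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_idx):
    """Negative log-probability that the softmax assigns to the target index."""
    return -math.log(softmax(logits)[target_idx])


# Hypothetical logits for 3 time steps over a 4-character vocabulary.
logits_per_step = [
    [2.0, 0.5, 0.1, -1.0],
    [0.0, 0.0, 0.0, 0.0],
    [-0.5, 1.5, 0.3, 0.2],
]
targets = [0, 2, 1]  # index of the actual next character at each step

# One loss value per time step, then (by assumption) their mean.
losses = [cross_entropy(z, y) for z, y in zip(logits_per_step, targets)]
mean_loss = sum(losses) / len(losses)
print(losses, mean_loss)
```

With all-zero logits the softmax is uniform, so the loss at that step equals the logarithm of the vocabulary size, which is a useful sanity check for an untrained model.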

Training Objective for RNN Language Models

In recurrent neural network (RNN) language models, each input token is typically represented mathematically by a $$d$$-dimensional vector. When processing a minibatch of size $$n > 1$$, the complete input at any given time step $$t$$, denoted as $$\mathbf{X}_t$$, is formatted as an $$n \times d$$ matrix, where each row corresponds to the vector representation of a token for one of the sequences in the minibatch.
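To make the shape concrete, the sketch below builds the input matrix for one time step from a minibatch of token indices. It assumes one-hot encoding (so $$d$$ equals the vocabulary size). The batch contents and the helper `input_at_step` are hypothetical.

```python
def one_hot(index, size):
    """Return a list of length `size` with 1.0 at `index` and 0.0 elsewhere."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


# Hypothetical minibatch of token indices: n = 2 sequences, 3 time steps each.
batch = [
    [0, 3, 1],
    [2, 2, 4],
]
d = 5  # vector dimension; equals the vocabulary size under one-hot encoding


def input_at_step(batch, t, d):
    """Build X_t: one row per sequence, each row a d-dimensional vector."""
    return [one_hot(seq[t], d) for seq in batch]


X_1 = input_at_step(batch, 1, d)
n = len(X_1)
print(n, len(X_1[0]))  # number of rows and columns of X_1
```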

Input Representation in RNN Language Models

The RNNLMScratch class implements an RNN-based language model from scratch by composing a previously defined RNN module with an output projection layer. It extends a Classifier base class and accepts an RNN instance, the vocabulary size, and a learning rate as constructor arguments. Because a language model's inputs and outputs are drawn from the same vocabulary, both share the same dimensionality, which equals the vocabulary size. The output layer is defined by a learnable weight matrix $$\mathbf{W}_{hq} \in \mathbb{R}^{h \times q}$$ (where $$h$$ is the number of hidden units and $$q$$ is the vocabulary size), initialized from a scaled normal distribution, and a bias vector $$\mathbf{b}_q \in \mathbb{R}^{q}$$, initialized to zeros. These parameters project each hidden state to a vector of logits over the vocabulary.
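The output projection alone can be sketched in plain Python as below. This is not the RNNLMScratch class itself, only an illustration of its output layer under stated assumptions: the helper names `init_output_layer` and `project`, the scale `sigma=0.01`, the seed, and the sample hidden states are all hypothetical.

```python
import random


def init_output_layer(h, q, sigma=0.01, seed=0):
    """Create W_hq (h x q) from a scaled normal and b_q (length q) of zeros."""
    rng = random.Random(seed)
    W_hq = [[rng.gauss(0.0, 1.0) * sigma for _ in range(q)] for _ in range(h)]
    b_q = [0.0] * q
    return W_hq, b_q


def project(H, W_hq, b_q):
    """Map hidden states H (n x h) to logits (n x q), i.e. H @ W_hq + b_q."""
    return [
        [sum(row[k] * W_hq[k][j] for k in range(len(row))) + b_q[j]
         for j in range(len(b_q))]
        for row in H
    ]


h, q = 4, 6  # hidden units, vocabulary size
W_hq, b_q = init_output_layer(h, q)

# Hypothetical hidden states for a minibatch of n = 2 sequences.
H = [[0.1, -0.2, 0.3, 0.0], [0.5, 0.5, -0.5, 0.1]]
logits = project(H, W_hq, b_q)
print(len(logits), len(logits[0]))  # rows and columns of the logit matrix
```

Each row of `logits` holds one score per vocabulary entry, so applying a softmax to a row yields the next-token distribution for that sequence.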

RNNLMScratch Class

Shared Vocabulary for Input and Output in Language Models

A standard recurrent neural network (RNN) language model is composed of three stages: input encoding, which transforms raw tokens into vectors; RNN modeling, which processes the sequence of input vectors and updates the hidden state at each time step; and output generation, which maps each hidden state to a probability distribution over the vocabulary to predict the next token.
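The three stages can be traced in a small, untrained forward pass. This is a sketch under several assumptions: one-hot input encoding, a tanh state update, small random weights, and a toy vocabulary. All names (`W_xh`, `W_hh`, `forward`, and so on) are chosen for illustration.

```python
import math
import random


def one_hot(i, size):
    v = [0.0] * size
    v[i] = 1.0
    return v


def matvec(x, W):
    """Row vector x (length a) times matrix W (a x b), giving a list of length b."""
    return [sum(x[k] * W[k][j] for k in range(len(x))) for j in range(len(W[0]))]


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def rand_matrix(rows, cols, rng, sigma=0.01):
    return [[rng.gauss(0.0, 1.0) * sigma for _ in range(cols)] for _ in range(rows)]


vocab = sorted(set("hello"))  # hypothetical toy vocabulary
q, h = len(vocab), 8          # vocabulary size, number of hidden units
rng = random.Random(0)
W_xh = rand_matrix(q, h, rng)
W_hh = rand_matrix(h, h, rng)
W_hq = rand_matrix(h, q, rng)
b_h, b_q = [0.0] * h, [0.0] * q


def forward(text):
    """Return one next-character distribution per input character."""
    state = [0.0] * h
    probs_per_step = []
    for ch in text:
        # Stage 1: input encoding (token -> vector).
        x = one_hot(vocab.index(ch), q)
        # Stage 2: RNN modeling (update the hidden state).
        pre = [a + b + c
               for a, b, c in zip(matvec(x, W_xh), matvec(state, W_hh), b_h)]
        state = [math.tanh(v) for v in pre]
        # Stage 3: output generation (hidden state -> distribution).
        logits = [a + b for a, b in zip(matvec(state, W_hq), b_q)]
        probs_per_step.append(softmax(logits))
    return probs_per_step


probs = forward("hell")
print(len(probs), len(probs[0]))
```

Because the weights are untrained, each distribution is close to uniform. Training would adjust the weights so that the probability of the true next character increases.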
