Training a Deep GRU Language Model
Training a deep Gated Recurrent Unit (GRU) language model follows the same recipe as training a single-layer one. The only architectural change is that several recurrent layers are stacked, with each layer's sequence of hidden states serving as the input to the layer above it. A stacked GRU is instantiated by setting the num_layers parameter to a value greater than one, while the other hyperparameters, such as the number of hidden units, stay the same as in the single-layer case. A fully connected output layer whose width equals the vocabulary size maps each hidden state to one score per token. This multilayer recurrent block is then wrapped in a language model and optimized with a training loop that applies gradient clipping to maintain numerical stability.
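The setup described above can be sketched in PyTorch as follows. This is a minimal illustration rather than the source's implementation: the class name `DeepGRULM`, the helper `train_step`, the one-hot input encoding, the random toy data, and all numeric values (vocabulary size, hidden units, layer count, learning rate, clipping threshold) are assumptions chosen only to make the example runnable.

```python
import torch
from torch import nn
from torch.nn import functional as F


class DeepGRULM(nn.Module):
    """Hypothetical wrapper: stacked GRU plus a vocabulary-sized output layer."""

    def __init__(self, vocab_size, num_hiddens, num_layers):
        super().__init__()
        self.vocab_size = vocab_size
        # num_layers > 1 stacks GRU layers; each layer feeds the one above it.
        self.rnn = nn.GRU(vocab_size, num_hiddens, num_layers=num_layers)
        # One output score per token in the vocabulary.
        self.out = nn.Linear(num_hiddens, vocab_size)

    def forward(self, tokens, state=None):
        # tokens: (num_steps, batch_size) integer ids, one-hot encoded here
        # (an assumption; an embedding layer would also work).
        inputs = F.one_hot(tokens, self.vocab_size).float()
        hidden, state = self.rnn(inputs, state)
        # logits: (num_steps, batch_size, vocab_size)
        return self.out(hidden), state


def train_step(model, optimizer, tokens, targets, clip_value=1.0):
    """One optimization step with gradient-norm clipping."""
    optimizer.zero_grad()
    logits, _ = model(tokens)
    loss = F.cross_entropy(logits.reshape(-1, logits.shape[-1]),
                           targets.reshape(-1))
    loss.backward()
    # Rescale gradients so their global L2 norm is at most clip_value.
    nn.utils.clip_grad_norm_(model.parameters(), max_norm=clip_value)
    optimizer.step()
    return loss.item()


torch.manual_seed(0)
# Illustrative values only; they are not taken from the source text.
vocab_size, num_hiddens, num_layers = 30, 16, 2
model = DeepGRULM(vocab_size, num_hiddens, num_layers)
optimizer = torch.optim.SGD(model.parameters(), lr=1.0)

# Toy data: random token ids; targets are the inputs shifted by one step.
seq = torch.randint(0, vocab_size, (9, 4))  # (num_steps + 1, batch_size)
tokens, targets = seq[:-1], seq[1:]
losses = [train_step(model, optimizer, tokens, targets) for _ in range(20)]

logits, state = model(tokens)
```

Here `state` has one slice per stacked layer, so its leading dimension equals `num_layers`. On real text the random `seq` would be replaced by minibatches of tokenized data, and the clipping threshold would be tuned like any other hyperparameter.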
Tags
D2L
Dive into Deep Learning @ D2L