Concept

Dense Scaling

In dense scaling, we increase the depth and width of standard Transformer architectures.

Two directions:

The first direction focuses on fitting a larger model onto a single device by reducing the memory required by activations and optimizer states during training. The second direction focuses on efficiently training even larger models through model parallelism, i.e. splitting a model across multiple devices.
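
As a rough sketch of the first direction, activation checkpointing is one common way to cut activation memory: each block's activations are recomputed during the backward pass instead of being stored. The sketch below assumes PyTorch; the block structure, layer sizes, and names are illustrative and not taken from the text above.

```python
# Minimal sketch: per-block activation checkpointing to reduce the
# activation memory of a densely scaled Transformer. Sizes and module
# names are assumptions for illustration only.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


class Block(nn.Module):
    """A standard Transformer block: self-attention followed by an MLP."""

    def __init__(self, d_model=1024, n_heads=16):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):
        h = self.ln1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.ln2(x))
        return x


class DenseTransformer(nn.Module):
    """Dense scaling = more and wider blocks; checkpointing trades extra
    compute in the backward pass for lower peak activation memory."""

    def __init__(self, depth=24, d_model=1024, use_checkpoint=True):
        super().__init__()
        self.blocks = nn.ModuleList(Block(d_model) for _ in range(depth))
        self.use_checkpoint = use_checkpoint

    def forward(self, x):
        for block in self.blocks:
            if self.use_checkpoint and self.training:
                # Do not store this block's activations; recompute them
                # when gradients are needed.
                x = checkpoint(block, x, use_reentrant=False)
            else:
                x = block(x)
        return x
```

For the second direction, the simplest version places different blocks on different devices and moves activations between them; tensor- and pipeline-parallel refinements of this idea are what libraries such as Megatron-LM and DeepSpeed provide. Optimizer-state memory can likewise be reduced by sharding it across data-parallel workers, as in ZeRO.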

Updated 2022-06-05

Tags

Science