Concept

Dense Scaling

In dense scaling, we increase the depth and width of standard Transformer architectures.

Two directions:

The first direction focuses on fitting a larger model onto a single device by reducing the memory required by activations and optimizer states during training. The second direction focuses on efficiently training even larger models through model parallelism, i.e. splitting a model across multiple devices.
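
As a rough sketch of the first direction, activation checkpointing is one common way to cut activation memory: each block's activations are recomputed during the backward pass instead of being stored. The sketch below assumes PyTorch; the block structure, layer sizes, and names are illustrative and not taken from the text above.

```python
# Minimal sketch: per-block activation checkpointing to reduce the
# activation memory of a densely scaled Transformer. Sizes and module
# names are assumptions for illustration only.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


class Block(nn.Module):
    """A standard Transformer block: self-attention followed by an MLP."""

    def __init__(self, d_model=1024, n_heads=16):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):
        h = self.ln1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.ln2(x))
        return x


class DenseTransformer(nn.Module):
    """Dense scaling = more and wider blocks; checkpointing trades extra
    compute in the backward pass for lower peak activation memory."""

    def __init__(self, depth=24, d_model=1024, use_checkpoint=True):
        super().__init__()
        self.blocks = nn.ModuleList(Block(d_model) for _ in range(depth))
        self.use_checkpoint = use_checkpoint

    def forward(self, x):
        for block in self.blocks:
            if self.use_checkpoint and self.training:
                # Do not store this block's activations; recompute them
                # when gradients are needed.
                x = checkpoint(block, x, use_reentrant=False)
            else:
                x = block(x)
        return x
```

For the second direction, the simplest version places different blocks on different devices and moves activations between them; tensor- and pipeline-parallel refinements of this idea are what libraries such as Megatron-LM and DeepSpeed provide. Optimizer-state memory can likewise be reduced by sharding it across data-parallel workers, as in ZeRO.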

Updated 2022-06-05

Tags

Science