Learn Before
Comparison
Comparison of ELMo, GPT, and BERT
The architectures of ELMo, GPT, and BERT offer distinct approaches to modeling context for downstream adaptation. ELMo provides bidirectional context encoding but requires customized, task-specific architectures for different applications. In contrast, GPT employs a task-agnostic approach that uses the same architecture across tasks, but it is limited to left-to-right context encoding. BERT unites the best of both methodologies: by utilizing a pretrained Transformer encoder, it achieves deep bidirectional context representations while remaining entirely task-agnostic, allowing it to be adapted to a wide variety of tasks with minimal structural changes.
0
1
Updated 2026-05-26
Tags
D2L
Dive into Deep Learning @ D2L