Learn Before
Analysis of Activation Function Choice in Transformer Architectures
A key architectural decision in influential language models was the choice of activation function. Analyze why the Gaussian Error Linear Unit (GELU) is often considered more suitable than the Rectified Linear Unit (ReLU) for these large, deep neural networks. In your analysis, connect the mathematical properties of GELU to potential benefits during model training and performance.
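The contrast the question asks about can be made concrete numerically. Below is an illustrative sketch (not part of the original question) comparing ReLU with the exact GELU formula, GELU(x) = x · Φ(x), where Φ is the standard normal CDF. Note how ReLU hard-zeroes every negative input, while GELU passes small negative values through with a smooth, probabilistic weighting:

```python
import math

def relu(x):
    # ReLU: hard threshold at zero; gradient is exactly 0 for all x < 0
    return max(0.0, x)

def gelu(x):
    # Exact GELU: x * Phi(x), where Phi is the standard normal CDF.
    # Smooth everywhere, and non-zero (slightly negative) for small
    # negative inputs, which keeps gradients flowing during training.
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Compare the two activations across a range of inputs
for x in (-3.0, -0.5, 0.0, 0.5, 2.0):
    print(f"x={x:+.1f}  ReLU={relu(x):+.4f}  GELU={gelu(x):+.4f}")
```

For large positive inputs GELU approaches the identity (like ReLU), and for large negative inputs it decays toward zero, but the transition is smooth and differentiable rather than a hard kink at x = 0.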
Tags
Ch.1 Pre-training - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
A team is tasked with adapting a large, pre-trained language model to summarize legal documents. One developer designs a method where each summarization request includes a detailed set of instructions and examples of high-quality summaries, which are provided to the original, unchanged model. Another developer uses a large dataset of legal documents and their corresponding summaries to make small, permanent adjustments to the model's internal configuration before deploying it. What is the most significant difference between these two approaches regarding the pre-trained model itself?
A researcher is preparing a training example for a language model that uses a prefix-based objective. The goal is for the model to learn to complete the sentence 'The sun is shining brightly in the sky.' after being given the first three words as context. Which of the following options correctly partitions the sentence into a prefix and a subsequent sequence for this task?