Learn Before
Applications of GELU in Large Language Models
The Gaussian Error Linear Unit (GELU) activation function has been widely adopted in the architecture of several influential Large Language Models. Notable examples of models that utilize GELU include BERT, GPT-3, and BLOOM.
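As background for the models named above, GELU can be written as x · Φ(x), where Φ is the cumulative distribution function of the standard normal distribution. A minimal sketch (the helper name `gelu` is illustrative, not from this page):

```python
import math

def gelu(x: float) -> float:
    """Exact GELU: weight x by Phi(x), the standard normal CDF."""
    phi = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))  # Phi(x)
    return x * phi

# Large positive inputs pass through almost unchanged;
# large negative inputs are squashed toward zero.
print(gelu(3.0))   # close to 3
print(gelu(-3.0))  # close to 0 (slightly negative)
```

Frameworks used to train models like BERT and GPT-3 also ship a faster tanh-based approximation of this same function, but the erf form above is the exact definition.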
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
GELU (Gaussian Error Linear Unit) Formula
Applications of GELU in Large Language Models
An activation function is defined by its behavior of weighting an input value by that value's corresponding cumulative probability from a standard normal distribution (mean = 0, variance = 1). Given two inputs, x = -3 and y = 3, which statement best describes their respective outputs, f(x) and f(y)?
Hendrycks and Gimpel [2016] on GELU
An activation function is designed to scale its input value by the probability that a randomly drawn value from a standard normal distribution (mean=0, variance=1) is less than or equal to that input. How does this function's output for a small negative input (e.g., -0.1) compare to the output of a function that simply sets all negative inputs to zero?
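The comparison posed in the question above can be checked numerically. This sketch (function names illustrative) contrasts the erf-based GELU with a function that zeroes all negative inputs (ReLU) at x = -0.1:

```python
import math

def gelu(x: float) -> float:
    # x weighted by Phi(x), the standard normal CDF
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def relu(x: float) -> float:
    # zeroes out all negative inputs
    return max(0.0, x)

x = -0.1
# ReLU discards the input entirely; GELU keeps a small negative
# value, since Phi(-0.1) is roughly 0.46.
print(relu(x))  # 0.0
print(gelu(x))  # about -0.046
```

So for a small negative input, GELU's output is a small negative number rather than exactly zero, which lets some gradient signal flow where ReLU would block it.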
Activation Function Selection for a Language Model
Diagnosing Training Instability When Changing Normalization and FFN Activations
Choosing an FFN Activation and Normalization Pair Under Deployment Constraints
Explaining a Distribution Shift Caused by Swapping LayerNorm for RMSNorm and GELU for SwiGLU
Root-Cause Analysis of FFN Output Drift After Swapping Normalization and Activation
Selecting a Normalization + FFN Activation Change After Quantization Regressions
Interpreting Activation/Normalization Interactions from FFN Telemetry
Learn After
Analysis of Activation Function Choice in Transformer Architectures
A financial analyst asks a large language model, 'What was the closing stock price for ACME Corp today?' The model, with a knowledge cutoff of last year, responds: 'I cannot provide real-time information. My data is not current.' To make the model more useful for this task without retraining it, it is integrated with an external tool that can access a live stock market data feed. Which statement best analyzes the primary advantage of this approach for this specific problem?
A research team is developing a new large-scale transformer-based language model and is deciding on the activation function for the feed-forward networks. A senior engineer advocates for using the Gaussian Error Linear Unit (GELU). Which statement best evaluates the rationale for this choice, considering its historical application in influential models?