Learn Before
  • Activation function of the FFN in transformers

Gaussian Error Linear Unit (GELU)

The Gaussian Error Linear Unit (GELU) is an activation function often used as an alternative to ReLU, particularly in Large Language Models. It can be conceptualized as a smoother variant of ReLU. Instead of gating inputs based on their sign, GELU weights an input value, h, by its cumulative distribution function (CDF) value under a standard normal distribution, N(0, 1). In other words, the function scales its input by the probability that a randomly drawn value from a standard Gaussian distribution is less than or equal to that input.
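The weighting described above can be sketched in a few lines of Python, using the identity that the standard normal CDF can be written in terms of the error function (function name and structure are illustrative, not from the source):

```python
import math

def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF.

    Phi(x) = 0.5 * (1 + erf(x / sqrt(2))), so large positive inputs pass
    through almost unchanged, while large negative inputs are scaled
    toward zero (but, unlike ReLU, not hard-clipped to exactly zero).
    """
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
```

For example, gelu(3.0) is close to 3 (since almost all of the Gaussian mass lies below 3), while gelu(-3.0) is a small negative number rather than exactly 0, which is the smooth-gating behavior that distinguishes GELU from ReLU.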


Tags

  • Ch.2 Generative Models - Foundations of Large Language Models
  • Foundations of Large Language Models
  • Foundations of Large Language Models Course
  • Computing Sciences

Related
  • Gaussian Error Linear Unit (GELU)

  • Gated Linear Unit (GLU)

  • A machine learning engineer is analyzing the feed-forward network (FFN) component of a transformer model. They want to replace the standard Rectified Linear Unit (ReLU) activation function with a more modern alternative to potentially improve model performance. Which of the following statements best analyzes the rationale for using a function like the Gaussian Error Linear Unit (GELU) or Swish over ReLU in this context?

  • Match each activation function, which can be used in the feed-forward network of a transformer model, with its corresponding description.

  • Evaluating an Activation Function Change in a Transformer FFN

Learn After
  • GELU (Gaussian Error Linear Unit) Formula

  • Applications of GELU in Large Language Models

  • An activation function is defined by its behavior of weighting an input value by that value's corresponding cumulative probability from a standard normal distribution (mean=0, variance=1). Given two inputs, x = -3 and y = 3, which statement best describes their respective outputs, f(x) and f(y)?

  • Hendrycks and Gimpel [2016] on GELU

  • An activation function is designed to scale its input value by the probability that a randomly drawn value from a standard normal distribution (mean=0, variance=1) is less than or equal to that input. How does this function's output for a small negative input (e.g., -0.1) compare to the output of a function that simply sets all negative inputs to zero?

  • Activation Function Selection for a Language Model

  • Diagnosing Training Instability When Changing Normalization and FFN Activations

  • Choosing an FFN Activation and Normalization Pair Under Deployment Constraints

  • Explaining a Distribution Shift Caused by Swapping LayerNorm for RMSNorm and GELU for SwiGLU

  • Root-Cause Analysis of FFN Output Drift After Swapping Normalization and Activation

  • Selecting a Normalization + FFN Activation Change After Quantization Regressions

  • Interpreting Activation/Normalization Interactions from FFN Telemetry

  • You are reviewing a teammate’s proposed Transforme...

  • In a transformer feed-forward block, your team is ...

  • You’re debugging a transformer FFN refactor where ...

  • You’re reviewing a PR that changes a transformer b...