Learn Before
  • Gated Linear Unit (GLU)

SwiGLU (Swish-based Gated Linear Unit)

SwiGLU is a variant within the Gated Linear Unit (GLU) family of activation functions. It is obtained by using the Swish function as the gate's internal non-linearity, i.e., the role denoted σ(·) in the general GLU formula.
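The construction above can be sketched in a few lines. This is a minimal NumPy illustration, not a reference implementation: the function names `swish` and `swiglu`, the bias-free layout, and the toy dimensions are all assumptions for demonstration.

```python
import numpy as np

def swish(x, beta=1.0):
    # Swish: x * sigmoid(beta * x); beta = 1 recovers SiLU.
    return x / (1.0 + np.exp(-beta * x))

def swiglu(x, W, V):
    # GLU family: element-wise product of a gated path and a linear path.
    # SwiGLU uses Swish as the gate's non-linearity:
    #   SwiGLU(x) = Swish(x W) * (x V)
    return swish(x @ W) * (x @ V)

# Toy example (illustrative shapes): batch of 2, input dim 4, hidden dim 3.
rng = np.random.default_rng(0)
x = rng.normal(size=(2, 4))
W = rng.normal(size=(4, 3))
V = rng.normal(size=(4, 3))
out = swiglu(x, W, V)  # shape (2, 3)
```

Swapping `swish` for a sigmoid gate would recover the original GLU; swapping in GELU would give GeGLU, the other variant listed under Related below.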


Tags
  • Ch.2 Generative Models - Foundations of Large Language Models
  • Foundations of Large Language Models
  • Foundations of Large Language Models Course
  • Computing Sciences

Related
  • Gated Linear Unit (GLU) Formula

  • GeGLU (GELU-based Gated Linear Unit)

  • SwiGLU (Swish-based Gated Linear Unit)

  • Shazeer [2020] on Gated Linear Units

  • Structural Analysis of Gated Linear Units

  • The Gated Linear Unit (GLU) architecture processes an input through two parallel linear transformations. One of these transformed outputs is then passed through a non-linear function before being combined with the other via an element-wise product. What is the analytical purpose of this non-linearly transformed path in the overall mechanism?

  • A standard feed-forward network layer applies a non-linear activation function after a single linear transformation. The Gated Linear Unit (GLU) architecture, however, processes an input through two parallel linear transformations, where one path acts as a 'gate' for the other after being passed through a non-linear function. What is the primary analytical advantage of this gating mechanism compared to using a single, non-gated activation function?

Learn After
  • SwiGLU (Swish-based Gated Linear Unit) Formula

  • Applications of SwiGLU in Large Language Models

  • The family of Gated Linear Unit (GLU) activation functions creates different variants by incorporating a specific non-linear function to control an information 'gate'. Based on this principle, what is the key distinguishing feature of the SwiGLU variant compared to other possible variants in the same family?

  • Deconstructing the SwiGLU Activation Function

  • The gating component of the SwiGLU activation function is controlled by a non-linear function that is strictly increasing across its entire domain.

  • You are reviewing a teammate’s proposed Transforme...

  • In a transformer feed-forward block, your team is ...

  • You’re reviewing a PR that changes a transformer b...

  • You’re debugging a transformer FFN refactor where ...

  • Explaining a Distribution Shift Caused by Swapping LayerNorm for RMSNorm and GELU for SwiGLU

  • Choosing an FFN Activation and Normalization Pair Under Deployment Constraints

  • Diagnosing Training Instability When Changing Normalization and FFN Activations

  • Interpreting Activation/Normalization Interactions from FFN Telemetry

  • Root-Cause Analysis of FFN Output Drift After Swapping Normalization and Activation

  • Selecting a Normalization + FFN Activation Change After Quantization Regressions