Learn Before
  • Gated Linear Unit (GLU)

SwiGLU (Swish-based Gated Linear Unit)

SwiGLU is a variant within the Gated Linear Unit (GLU) family of activation functions. It is obtained by using the Swish function as the gate's internal non-linearity, i.e., the role denoted σ(·) in the general GLU formula.
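The construction above can be sketched in a few lines. This is a minimal NumPy illustration, not a reference implementation: the function names `swish` and `swiglu`, the bias-free layout, and the toy dimensions are all assumptions for demonstration.

```python
import numpy as np

def swish(x, beta=1.0):
    # Swish: x * sigmoid(beta * x); beta = 1 recovers SiLU.
    return x / (1.0 + np.exp(-beta * x))

def swiglu(x, W, V):
    # GLU family: element-wise product of a gated path and a linear path.
    # SwiGLU uses Swish as the gate's non-linearity:
    #   SwiGLU(x) = Swish(x W) * (x V)
    return swish(x @ W) * (x @ V)

# Toy example (illustrative shapes): batch of 2, input dim 4, hidden dim 3.
rng = np.random.default_rng(0)
x = rng.normal(size=(2, 4))
W = rng.normal(size=(4, 3))
V = rng.normal(size=(4, 3))
out = swiglu(x, W, V)  # shape (2, 3)
```

Swapping `swish` for a sigmoid gate would recover the original GLU; swapping in GELU would give GeGLU, the other variant listed under Related below.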


Tags
  • Ch.2 Generative Models - Foundations of Large Language Models
  • Foundations of Large Language Models
  • Foundations of Large Language Models Course
  • Computing Sciences

Related
  • Gated Linear Unit (GLU) Formula

  • GeGLU (GELU-based Gated Linear Unit)

  • SwiGLU (Swish-based Gated Linear Unit)

  • Shazeer [2020] on Gated Linear Units

  • Structural Analysis of Gated Linear Units

  • The Gated Linear Unit (GLU) architecture processes an input through two parallel linear transformations. One of these transformed outputs is then passed through a non-linear function before being combined with the other via an element-wise product. What is the analytical purpose of this non-linearly transformed path in the overall mechanism?

  • A standard feed-forward network layer applies a non-linear activation function after a single linear transformation. The Gated Linear Unit (GLU) architecture, however, processes an input through two parallel linear transformations, where one path acts as a 'gate' for the other after being passed through a non-linear function. What is the primary analytical advantage of this gating mechanism compared to using a single, non-gated activation function?

Learn After
  • SwiGLU (Swish-based Gated Linear Unit) Formula

  • Applications of SwiGLU in Large Language Models

  • The family of Gated Linear Unit (GLU) activation functions creates different variants by incorporating a specific non-linear function to control an information 'gate'. Based on this principle, what is the key distinguishing feature of the SwiGLU variant compared to other possible variants in the same family?

  • Deconstructing the SwiGLU Activation Function

  • The gating component of the SwiGLU activation function is controlled by a non-linear function that is strictly increasing across its entire domain.

  • You are reviewing a teammate’s proposed Transforme...

  • In a transformer feed-forward block, your team is ...

  • You’re reviewing a PR that changes a transformer b...

  • You’re debugging a transformer FFN refactor where ...

  • Explaining a Distribution Shift Caused by Swapping LayerNorm for RMSNorm and GELU for SwiGLU

  • Choosing an FFN Activation and Normalization Pair Under Deployment Constraints

  • Diagnosing Training Instability When Changing Normalization and FFN Activations

  • Interpreting Activation/Normalization Interactions from FFN Telemetry

  • Root-Cause Analysis of FFN Output Drift After Swapping Normalization and Activation

  • Selecting a Normalization + FFN Activation Change After Quantization Regressions