Learn Before
SwiGLU (Swish-based Gated Linear Unit) Formula
The SwiGLU function is defined by using the Swish function, Swish(x) = x · σ(x), as its internal non-linear activation. The formula is expressed as:
SwiGLU(h) = Swish(hW₁ + b₁) ⊙ (hW₂ + b₂)
where h is the input vector and ⊙ indicates the element-wise product.
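The formula above can be sketched in plain Python (a minimal illustration; the function and variable names are my own, not from the card):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def swish(x):
    # Swish(x) = x * sigmoid(x)
    return x * sigmoid(x)

def swiglu(h, W1, b1, W2, b2):
    """SwiGLU(h) = Swish(hW1 + b1) ⊙ (hW2 + b2).

    h: length-n input vector; W1, W2: n×m weight matrices
    (lists of rows); b1, b2: length-m bias vectors.
    """
    n, m = len(h), len(b1)
    # Gate path: non-linear Swish activation of one linear projection
    gate = [swish(sum(h[i] * W1[i][j] for i in range(n)) + b1[j])
            for j in range(m)]
    # Linear path: a second, un-activated linear projection
    linear = [sum(h[i] * W2[i][j] for i in range(n)) + b2[j]
              for j in range(m)]
    # Element-wise product of the two paths
    return [g * l for g, l in zip(gate, linear)]
```

For example, `swiglu([2, 1], [[3], [1]], [0], [[2], [-1]], [1])` computes Swish(7) ⊙ 4, matching the worked exercise later in this card.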

Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
SwiGLU (Swish-based Gated Linear Unit) Formula
Applications of SwiGLU in Large Language Models
The family of Gated Linear Unit (GLU) activation functions creates different variants by incorporating a specific non-linear function to control an information 'gate'. Based on this principle, what is the key distinguishing feature of the SwiGLU variant compared to other possible variants in the same family?
Deconstructing the SwiGLU Activation Function
The gating component of the SwiGLU activation function is controlled by a non-linear function that is strictly increasing across its entire domain.
You are reviewing a teammate’s proposed Transforme...
In a transformer feed-forward block, your team is ...
You’re reviewing a PR that changes a transformer b...
You’re debugging a transformer FFN refactor where ...
Explaining a Distribution Shift Caused by Swapping LayerNorm for RMSNorm and GELU for SwiGLU
Choosing an FFN Activation and Normalization Pair Under Deployment Constraints
Diagnosing Training Instability When Changing Normalization and FFN Activations
Interpreting Activation/Normalization Interactions from FFN Telemetry
Root-Cause Analysis of FFN Output Drift After Swapping Normalization and Activation
Selecting a Normalization + FFN Activation Change After Quantization Regressions
Learn After
Analysis of a Gated Activation Function
Consider a simplified SwiGLU activation function where the input vector h is [2, 1]. The learnable parameters are defined as follows:
- W1 = [[3], [1]], b1 = [0]
- W2 = [[2], [-1]], b2 = [1]
- The Swish activation function is defined as swish(x) = x * sigmoid(x).
- Assume sigmoid(7) ≈ 0.999.
Given the formula output = swish(hW1 + b1) ⊙ (hW2 + b2), where ⊙ is the element-wise product, calculate the output. Which of the following is the correct result?
The SwiGLU activation function is defined by the formula σ_swish(hW₁ + b₁) ⊙ (hW₂ + b₂). Match each component of this formula to its primary role in the computation.
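The arithmetic in the exercise above can be checked step by step with a short script (a sketch using the stated parameters; with the exact sigmoid the result is ≈ 27.974, and with the given approximation sigmoid(7) ≈ 0.999 it is ≈ 27.972):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

h = [2, 1]

# Gate path: hW1 + b1 = 2*3 + 1*1 + 0 = 7, then swish(7) = 7 * sigmoid(7)
gate_pre = 2 * 3 + 1 * 1 + 0          # 7
gate = gate_pre * sigmoid(gate_pre)   # ≈ 7 * 0.999 ≈ 6.993

# Linear path: hW2 + b2 = 2*2 + 1*(-1) + 1 = 4
linear = 2 * 2 + 1 * (-1) + 1         # 4

# Element-wise product (both paths are scalars here)
output = gate * linear                # ≈ 27.97
```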