The GeGLU (GELU-based Gated Linear Unit) activation function is defined by the following formula: 

$\sigma_{\text{geglu}}(\mathbf{h}) = \sigma_{\text{gelu}}(\mathbf{hW}_1 + \mathbf{b}_1) \odot (\mathbf{hW}_2 + \mathbf{b}_2)$

In this equation, $\mathbf{h}$ represents the input, while $\mathbf{W}_1, \mathbf{W}_2, \mathbf{b}_1$, and $\mathbf{b}_2$ are learnable model parameters (weights and biases). The function $\sigma_{\text{gelu}}$ is the Gaussian Error Linear Unit (GELU) activation, and $\odot$ signifies the element-wise product.
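The formula can be sketched in plain Python to make the two branches concrete. This is a minimal illustration, not a reference implementation: it assumes $\mathbf{h}$ is a row vector, the weights are matrices stored as nested lists, and the exact (erf-based) form of GELU is used. The helper names (`gelu`, `affine`, `geglu`) and the toy parameter values are hypothetical.

```python
import math

def gelu(x):
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def affine(h, W, b):
    """Compute hW + b for a row vector h (length d), W (d x m), b (length m)."""
    return [sum(h[i] * W[i][j] for i in range(len(h))) + b[j]
            for j in range(len(b))]

def geglu(h, W1, b1, W2, b2):
    """GeGLU: gelu(hW1 + b1) multiplied element-wise by (hW2 + b2)."""
    gate = [gelu(v) for v in affine(h, W1, b1)]   # non-linear gating branch
    value = affine(h, W2, b2)                     # linear branch
    return [g * v for g, v in zip(gate, value)]

# Hypothetical toy parameters (d = 2, m = 2), chosen only for illustration.
h = [1.0, -2.0]
W1 = [[1.0, 0.0], [0.0, 1.0]]
b1 = [0.0, 0.0]
W2 = [[2.0, 0.0], [0.0, 3.0]]
b2 = [1.0, 1.0]

out = geglu(h, W1, b1, W2, b2)
print(out)
```

Note that both branches read the same input $\mathbf{h}$: the first is passed through GELU and acts as a gate on the second, which stays linear.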

GeGLU (GELU-based Gated Linear Unit) Formula

The GeGLU activation function is used in modern Large Language Models. For instance, the Gemma family of models incorporates GeGLU.

Applications of GeGLU in Large Language Models

GeGLU is a variant of the Gated Linear Unit (GLU) family of activation functions. It is created by specifying the internal non-linear activation function, $\sigma(\cdot)$, to be the Gaussian Error Linear Unit (GELU). This choice distinguishes GeGLU from other GLU variants.

Google

The Gated Linear Unit (GLU) is a family of activation functions commonly used in Large Language Models (LLMs). The specific function within this family is determined by the choice of its internal non-linear activation function, denoted as $\sigma$. Varying this function, for instance by using GELU or Swish, results in different GLU variants such as GeGLU and SwiGLU. For more in-depth information on GLUs, the work of Shazeer (2020) is a key reference.

Gated Linear Unit (GLU)

Reference of Foundations of Large Language Models Course

The Gated Linear Unit (GLU) activation function is defined by the formula: 

$\sigma_{\text{glu}}(h) = \sigma(hW_1 + b_1) \odot (hW_2 + b_2)$

Here, $h$ is the input. The parameters $W_1, W_2$ (weight matrices) and $b_1, b_2$ (bias vectors) are real-valued and learned during training. The function $\sigma$ represents a non-linear activation function, and $\odot$ denotes the element-wise product.
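Because the family is parameterized by $\sigma$, one way to sketch it in code is a single function that takes the activation as an argument. The example below is a rough illustration under stated assumptions: $h$ is a row vector, the weights are matrices stored as nested lists, and the logistic sigmoid is used as the default $\sigma$. The names `glu`, `sigmoid`, and `swish`, and the toy parameter values, are hypothetical.

```python
import math

def sigmoid(x):
    """Logistic sigmoid, used here as the default choice of sigma."""
    return 1.0 / (1.0 + math.exp(-x))

def swish(x):
    """Swish with beta = 1 (an assumed, commonly used setting)."""
    return x * sigmoid(x)

def glu(h, W1, b1, W2, b2, sigma=sigmoid):
    """Generic GLU: sigma(hW1 + b1) multiplied element-wise by (hW2 + b2)."""
    d, m = len(h), len(b1)
    gate = [sigma(sum(h[i] * W1[i][j] for i in range(d)) + b1[j])
            for j in range(m)]
    value = [sum(h[i] * W2[i][j] for i in range(d)) + b2[j]
             for j in range(m)]
    return [g * v for g, v in zip(gate, value)]

# Hypothetical toy parameters: identity weights and zero biases.
h = [1.0, 2.0]
W1 = [[1.0, 0.0], [0.0, 1.0]]
b1 = [0.0, 0.0]
W2 = [[1.0, 0.0], [0.0, 1.0]]
b2 = [0.0, 0.0]

glu_out = glu(h, W1, b1, W2, b2)                  # sigmoid-gated GLU
swiglu_out = glu(h, W1, b1, W2, b2, sigma=swish)  # same formula, Swish gate
print(glu_out, swiglu_out)
```

Only the `sigma` argument changes between the two calls; the parameters and the overall structure of the formula stay the same, which is what makes GeGLU and SwiGLU members of one family.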

Gated Linear Unit (GLU) Formula

GeGLU (GELU-based Gated Linear Unit)

SwiGLU is a variant within the Gated Linear Unit (GLU) family of activation functions. It is formed by using the Swish function as the internal non-linear activation, denoted as $\sigma(\cdot)$ in the general GLU formula.
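
Written out in the same notation as the general GLU formula, and assuming the usual definition of Swish with a scale parameter $\beta$ (often set to 1), the substitution would look like this:

$\sigma_{\text{swiglu}}(h) = \sigma_{\text{swish}}(hW_1 + b_1) \odot (hW_2 + b_2)$

$\sigma_{\text{swish}}(x) = x \cdot \text{sigmoid}(\beta x)$

The subscript names $\sigma_{\text{swiglu}}$ and $\sigma_{\text{swish}}$ are chosen here by analogy with $\sigma_{\text{geglu}}$ and $\sigma_{\text{gelu}}$ and are an assumption about notation.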
