Learn Before
Activation function of the FFN in transformers
Vanilla transformers use the ReLU activation function in the FFN. Other functions used in its place include (a short code sketch follows the list):
- Swish function: f(x) = x · sigmoid(x)
- Gaussian Error Linear Unit (GELU)
- Gated Linear Units (GLU)
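The formulas above can be illustrated with a minimal NumPy sketch. This is only an illustrative example, not part of the original card; the GLU parameter names (W, V, b, c) are assumed for demonstration.

```python
import numpy as np

def relu(x):
    # ReLU: max(0, x), the FFN activation in the vanilla transformer
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def swish(x):
    # Swish: f(x) = x * sigmoid(x)
    return x * sigmoid(x)

def gelu(x):
    # GELU: x * Phi(x), shown here with the common tanh approximation
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def glu(x, W, V, b, c):
    # GLU: (xW + b) * sigmoid(xV + c); a sigmoid gate modulates a linear projection.
    # W, V, b, c are hypothetical parameter names used only for this sketch.
    return (x @ W + b) * sigmoid(x @ V + c)
```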
Tags
Data Science
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Learn After
Gaussian Error Linear Unit (GELU)
Gated Linear Unit (GLU)
A machine learning engineer is analyzing the feed-forward network (FFN) component of a transformer model. They want to replace the standard Rectified Linear Unit (ReLU) activation function with a more modern alternative to potentially improve model performance. Which of the following statements best analyzes the rationale for using a function like the Gaussian Error Linear Unit (GELU) or Swish over ReLU in this context?
Match each activation function, which can be used in the feed-forward network of a transformer model, with its corresponding description.
Evaluating an Activation Function Change in a Transformer FFN