Definition

Core Function F(·) in Transformer Sub-layers

In a Transformer sub-layer, the primary computation is represented by the function F(·). The specific design of this function depends on the type of sub-layer. In a feed-forward network (FFN) sub-layer, F(·) is a multi-layer FFN. In a self-attention sub-layer, F(·) is a multi-head self-attention mechanism, standardly formulated as Query-Key-Value (QKV) attention.
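The two forms of F(·) can be sketched as follows. This is a minimal NumPy illustration, not a production implementation: the shapes, weight names, and the ReLU activation are assumptions, and a single attention head is shown for brevity (a multi-head version would run several such heads in parallel and concatenate their outputs).

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    # FFN sub-layer core: F(x) = activation(x W1 + b1) W2 + b2
    # (ReLU is assumed here; other activations are common.)
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

def softmax(z, axis=-1):
    # Numerically stable softmax over the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def qkv_attention(x, Wq, Wk, Wv):
    # Self-attention sub-layer core (one head):
    # F(x) = softmax(Q K^T / sqrt(d_k)) V, with Q, K, V projected from x.
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))
    return weights @ V
```

With an input x of shape (sequence length, model dimension), both functions return a tensor of the same sequence length, which is what lets the surrounding sub-layer add it back to its input via a residual connection.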


Updated 2026-05-02


Tags

Ch.2 Generative Models - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences
