Core Function in Transformer Sub-layers
In a Transformer sub-layer, the primary computation is represented by the function F(·). The specific design of F(·) depends on the type of sub-layer: in a feed-forward network (FFN) sub-layer, F(·) is a multi-layer FFN, whereas in a self-attention sub-layer, F(·) is a multi-head self-attention mechanism, standardly formulated as query-key-value (QKV) attention.
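As a concrete sketch of the two forms of F(·) (assuming a PyTorch environment; the class names FeedForwardF and SelfAttentionF and the dimensions d_model=512, d_ff=2048 are illustrative, not taken from the source):

import torch
import torch.nn as nn

class FeedForwardF(nn.Module):
    """F(.) for an FFN sub-layer: a position-wise multi-layer network."""
    def __init__(self, d_model: int = 512, d_ff: int = 2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),   # expand to the hidden width
            nn.ReLU(),                  # non-linearity (other activations are also common)
            nn.Linear(d_ff, d_model),   # project back to the model width
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

class SelfAttentionF(nn.Module):
    """F(.) for a self-attention sub-layer: multi-head QKV attention with Q = K = V = x."""
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.mha = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.mha(x, x, x)      # queries, keys, and values all derive from the same input
        return out

Either module can then be dropped into a sub-layer's residual-plus-normalization wrapper in place of F(·).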

Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Related
Generalized Formula for Pre-Norm Architecture
A single sub-layer within a deep neural network processes an input matrix. To improve training stability, a specific architectural pattern is used where a normalization operation is applied to the sub-layer's input before its main function is computed, and the function's output is then combined with the original input via a residual connection. Arrange the following operations in the correct sequence to reflect this design.

Architectural Analysis for Training Stability
An engineer is training a very deep sequence-processing model and observes that the gradients are becoming unstable, causing training to fail. The current architecture of each sub-layer computes its output as output = Normalize(input + Function(input)). Which of the following modifications to the sub-layer's computational flow is most likely to resolve the instability by ensuring a cleaner information flow through the residual connections?
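For concreteness, the two placements being contrasted differ only in where the normalization sits relative to the residual connection. A minimal sketch, where f and normalize are hypothetical callables standing in for Function and Normalize:

def post_norm_sublayer(x, f, normalize):
    # Post-norm: output = Normalize(input + Function(input));
    # the normalization sits on the residual path itself.
    return normalize(x + f(x))

def pre_norm_sublayer(x, f, normalize):
    # Pre-norm: output = input + Function(Normalize(input));
    # the skip connection is an identity, so gradients can pass through unchanged.
    return x + f(normalize(x))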
Design a Transformer Block Spec for a New Internal LLM Library (Shapes + Norm Placement)
Diagnosing a Transformer Block Refactor: Attention/FFN Shapes and Norm Placement
Choosing Pre-Norm vs Post-Norm for a Deep Transformer: Stability, Shapes, and Sub-layer Semantics
Root-Cause Analysis of Training Instability After a “Minor” Transformer Block Change
Production Bug Triage: Transformer Block Norm Placement vs Attention/FFN Interface Contracts
Post-Norm vs Pre-Norm Migration: Verifying Tensor Shapes and Correct Sub-layer Wiring
Incident Review: Silent Performance Regression After “Optimization” of a Transformer Block
Prevalence of Pre-Norm Architecture in LLMs
Generalized Formula for Post-Norm Architecture
A single sub-layer within a neural network block receives an input tensor x and applies a function F to it. The block's architecture specifies that a residual connection and layer normalization are used. Which of the following sequences of operations correctly implements the post-normalization scheme for this sub-layer?

A standard processing block in a neural network consists of two main sub-layers: a self-attention module and a feed-forward network (FFN). This block uses a post-normalization architecture, where a residual connection is followed by a normalization step for each sub-layer. Arrange the following computational steps in the correct sequence for a single input passing through one complete block.
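The ordering described in that question stem can be written as a minimal sketch (attention, ffn, norm1, and norm2 are hypothetical callables, not names from the source):

def post_norm_block(x, attention, ffn, norm1, norm2):
    # Sub-layer 1: self-attention, then residual add, then normalization.
    h = norm1(x + attention(x))
    # Sub-layer 2: feed-forward network, then residual add, then normalization.
    return norm2(h + ffn(h))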
Debugging a Transformer Block Implementation
In a Transformer block sub-layer that uses a post-normalization architecture, the layer normalization operation is applied to the input before the sub-layer's primary function (e.g., self-attention or feed-forward network) is executed.
Contextual Token Representation in Sub-layers

Identifying Core Functions in a Transformer Block
A standard processing block in a Transformer model consists of two main sub-layers applied in sequence. The first sub-layer's primary role is to relate different positions of the input sequence to compute a new representation for each position. The second sub-layer then applies an identical non-linear transformation to each position's representation independently. How does the core computational function F(·) implemented within each of these sub-layers differ?
A standard processing block in a certain neural network architecture consists of two main sub-layers. Each sub-layer's computation can be described as applying a core function F(·) within a structure that also includes a residual connection and layer normalization. Match each sub-layer type with the correct description of its core computational function F(·).

Learn After
Scaled Dot-Product Attention
Multi-Head Self-Attention Function
Purpose and Structure of the Feed-Forward Network (FFN) in Transformers
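For orientation before those follow-up cards, the standard forms of the two core functions from the original Transformer formulation are (d_k is the key dimension; W1, b1, W2, b2 are the FFN's weights and biases; this is the general literature form, not notation specific to this card):

Attention(Q, K, V) = Softmax(Q K^T / sqrt(d_k)) V
FFN(x) = ReLU(x W1 + b1) W2 + b2

The ReLU here follows the original Transformer; many modern LLMs substitute other activations such as GELU.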