Standard Optimization Objective for Transformer Language Models
The training of Transformer-based language models is generally formulated as a standard neural network optimization task. The goal is to find optimal model parameters $\hat{\theta}$ by maximizing a likelihood-based objective over a dataset $\mathcal{D}$, expressed as $\hat{\theta} = \arg\max_{\theta} \sum_{\mathbf{x} \in \mathcal{D}} \log \Pr_{\theta}(\mathbf{x})$, i.e., the sum of the log-probabilities the model assigns to the sequences in the dataset. In practice, this is implemented by minimizing the equivalent negative log-likelihood loss with gradient descent algorithms, which are well supported by standard deep learning toolkits.
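As a concrete illustration, below is a minimal sketch of this objective, assuming PyTorch and a toy next-token model standing in for a real Transformer; all names, sizes, and data are illustrative, not the course's actual setup. It sums the per-token negative log-probabilities over the dataset and applies gradient descent to that loss.

```python
# Minimal sketch (assumed PyTorch, toy next-token model): maximize the dataset
# log-likelihood by minimizing the summed negative log-likelihood with gradient descent.
import torch
import torch.nn as nn

vocab_size = 100
# Toy stand-in for a Transformer: embed a token, predict the next one.
model = nn.Sequential(nn.Embedding(vocab_size, 32), nn.Linear(32, vocab_size))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss(reduction="sum")  # sum of per-token -log P(token | context)

# Toy "dataset" of token-id sequences; real training uses tokenized text.
dataset = [torch.randint(0, vocab_size, (12,)), torch.randint(0, vocab_size, (8,))]

for step in range(3):                           # a few illustrative training iterations
    optimizer.zero_grad()
    nll = torch.tensor(0.0)
    for seq in dataset:
        logits = model(seq[:-1])                # predict each next token from its prefix token
        nll = nll + loss_fn(logits, seq[1:])    # accumulate negative log-probabilities
    nll.backward()                              # gradients of the negative log-likelihood
    optimizer.step()                            # one gradient-descent parameter update
    print(f"step {step}: dataset NLL = {nll.item():.2f}")
```

Note that minimizing this summed negative log-likelihood is the same as maximizing the dataset log-likelihood in the objective above; longer sequences contribute more terms to the sum, which is why they weigh more heavily on the objective.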
Tags
Data Science
Foundations of Large Language Models Course
Computing Sciences
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Related
Self-attention layers' first approach
Transformers in contextual generation and summarization
Huggingface Model Summary
A Survey of Transformers (Lin et al., 2021)
Overview of a Transformer
Model Usage of Transformers
Attention in vanilla Transformers
Transformer Variants (X-formers)
The Pre-training and Fine-tuning Paradigm
Architectural Categories of Pre-trained Transformers
Computational Cost of Self-Attention in Transformers
Quadratic Complexity's Impact on Transformer Inference Speed
Pre-Norm Architecture in Transformers
Critique of the Transformer Architecture's Core Limitation
A research team is building a model to summarize extremely long scientific papers. They are comparing two distinct architectural approaches:
- Approach 1: Processes the input text sequentially, token by token, updating an internal state that is passed from one step to the next.
- Approach 2: Processes all input tokens simultaneously, using a mechanism that directly relates every token to every other token in the input to determine context.
Which of the following statements best analyzes the primary trade-off between these two approaches for this specific task?
Architectural Design Choice for Machine Translation
Enablers of Universal Language Capabilities
Model Depth in Transformers
Generalization of the Language Modeling Concept
Transformer Block Sub-Layers
General Objective for Parameter Optimization via Loss Minimization
BERT Training Process
Diagnosing a Model Training Issue
A neural network is trained by repeatedly showing it examples from a dataset. Arrange the following core steps of a single training iteration into the correct logical sequence.
During the training of a neural network, an optimization algorithm iteratively adjusts the model's parameters. If the value of the loss function is consistently decreasing over many iterations, what is the most direct interpretation of this trend?
Maximum Likelihood Estimation for Sequential Data
Fine-Tuning as Maximum Likelihood Estimation
Log-Probability Decomposition for Efficient Multi-Turn Dialogue Training
A language model is being trained on a dataset containing a mix of very short sequences and a few extremely long sequences. A developer observes that the overall training objective, which is the sum of the log-probabilities of all sequences in the dataset, seems to be disproportionately influenced by the model's performance on the few long sequences. Which of the following best explains this observation?
Model Parameter Selection via Likelihood
A language model is being trained on a large dataset of text sequences. After a single parameter update, the model's calculated log-probability for one specific sequence in the dataset increases by 2.5, while the log-probabilities for all other sequences in the dataset remain exactly the same. How does this change affect the overall maximum likelihood training objective for the entire dataset?
Learn After
Efficient Attention Models
An engineer is training a neural network for a next-word prediction task. During each training iteration, the model is provided with the correct preceding words from the training data to predict the next word at each position in a sequence. The model is designed to calculate the prediction errors for all positions in the sequence simultaneously within a single computational pass. Which of the following best explains the architectural property that is essential for this parallel and efficient training approach?
Diagnosing Training Instability in a Language Model
A team is training a large neural network for a text generation task. The training process involves iteratively adjusting the network's internal parameters to maximize the likelihood of the text in a large dataset. Arrange the following core steps of a single training iteration into the correct chronological order.