Removing bias terms from the affine transformations in components such as feed-forward networks has been reported to improve the training stability of Large Language Models (LLMs). Several recent models, including LLaMA and Gemma, use this architectural technique.


A common design choice in Large Language Models (LLMs) is to remove the bias terms from affine transformations. This choice can be applied to several components, including layer normalization, the query, key, and value (QKV) projections of the attention inputs, and feed-forward networks (FFNs).
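To make concrete what is removed, here is a minimal sketch that counts the parameters of an affine transformation with and without its bias vector. The helper name `affine_param_count` and the layer sizes (`d_model = 512`, `d_ff = 2048`) are hypothetical values chosen for illustration, not taken from any particular model.

```python
def affine_param_count(d_in, d_out, use_bias):
    """Parameters in y = xW (+ b): d_in * d_out weights, plus d_out biases if kept."""
    return d_in * d_out + (d_out if use_bias else 0)

# Hypothetical sizes for the two transformations of one FFN block.
d_model, d_ff = 512, 2048

with_bias = (affine_param_count(d_model, d_ff, True)
             + affine_param_count(d_ff, d_model, True))
without_bias = (affine_param_count(d_model, d_ff, False)
                + affine_param_count(d_ff, d_model, False))

# Dropping the biases removes exactly one vector per transformation.
removed = with_bias - without_bias  # d_ff + d_model
```

Under these assumed sizes, the bias vectors account for only a small share of the block's parameters, so removing them changes the form of the transformation much more than the model's size.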

Omission of Bias Terms in LLM Affine Transformations

Reference of Foundations of Large Language Models Course

A Feed-Forward Network (FFN) without bias terms modifies the standard FFN structure by omitting the bias parameters in its affine transformations. For an input vector $$\mathbf{h}$$, the computation can be mathematically expressed as:

$$\mathrm{FFN}(\mathbf{h}) = \sigma(\mathbf{h} \mathbf{W}_h) \mathbf{W}_f$$

where $$\mathbf{W}_h$$ and $$\mathbf{W}_f$$ represent the weight matrices, and $$\sigma$$ denotes the non-linear activation function.
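As a minimal sketch of the formula above, the following pure-Python code computes the bias-free FFN for a single row vector. The choice of ReLU for $$\sigma$$, the helper names (`vecmat`, `relu`, `ffn_no_bias`), and the toy weights are assumptions made for illustration; the source does not fix a particular activation function.

```python
def vecmat(h, W):
    """Multiply row vector h (length d) by matrix W (d x k); returns a length-k list."""
    return [sum(h[i] * W[i][j] for i in range(len(h))) for j in range(len(W[0]))]

def relu(v):
    """Element-wise ReLU, used here as an assumed choice for sigma."""
    return [max(0.0, x) for x in v]

def ffn_no_bias(h, W_h, W_f, sigma=relu):
    """FFN(h) = sigma(h W_h) W_f, with no bias added in either transformation."""
    return vecmat(sigma(vecmat(h, W_h)), W_f)

# Toy example: model dimension 2, hidden dimension 3 (illustrative values).
h = [1.0, 2.0]
W_h = [[1.0, 0.0, -1.0],
       [1.0, 1.0, 0.0]]   # shape 2 x 3
W_f = [[1.0, 0.0],
       [0.0, 1.0],
       [2.0, -1.0]]       # shape 3 x 2

out = ffn_no_bias(h, W_h, W_f)
zero_out = ffn_no_bias([0.0, 0.0], W_h, W_f)
```

With ReLU as the activation, an all-zero input maps to an all-zero output, because neither transformation has a bias to add an offset.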
