Probability Distribution Formula for an Encoder-Softmax Language Model

When an encoder model parameterized by $\theta$ processes an input sequence $\mathbf{x}$ and is followed by a Softmax layer parameterized by a weight matrix $\mathbf{W}$, it outputs a sequence of probability distributions. This operation is expressed as:

$$\begin{bmatrix} \mathbf{p}_1^{\mathbf{W},\theta} \\ \vdots \\ \mathbf{p}_m^{\mathbf{W},\theta} \end{bmatrix} = \mathrm{Softmax}_{\mathbf{W}}(\mathrm{Encoder}_{\theta}(\mathbf{x}))$$

In this formula, each $\mathbf{p}_i^{\mathbf{W},\theta}$ represents the conditional output distribution $\Pr(\cdot \mid \mathbf{x})$ at sequence position $i$. For notational simplicity, the superscripts $\mathbf{W}$ and $\theta$ affixed to each probability distribution are sometimes dropped.
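The shapes involved can be made concrete with a minimal NumPy sketch. The encoder here is a stand-in (a random hidden-state matrix, since the formula does not specify its internals), and the dimensions `m`, `d`, and `vocab` are illustrative assumptions; the point is only that $\mathrm{Softmax}_{\mathbf{W}}$ maps one hidden state per position to one probability distribution per position.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax along the vocabulary axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
m, d, vocab = 4, 8, 10  # sequence length, hidden size, vocab size (assumed)

# Stand-in for Encoder_theta(x): any map from the input sequence to one
# d-dimensional hidden state per position, i.e. an (m x d) matrix.
hidden = rng.normal(size=(m, d))

# Softmax_W: project each hidden state to vocabulary logits with W,
# then normalize each row into a probability distribution.
W = rng.normal(size=(d, vocab))
p = softmax(hidden @ W)  # shape (m, vocab); row i plays the role of p_i^{W,theta}

assert p.shape == (m, vocab)
assert np.allclose(p.sum(axis=1), 1.0)  # each row sums to 1
```

Each row of `p` corresponds to one $\mathbf{p}_i^{\mathbf{W},\theta}$ in the stacked vector on the left-hand side of the formula.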

Updated 2026-05-02

Tags

Ch.1 Pre-training - Foundations of Large Language Models
