Role of the Final Softmax Layer in Transformers

In a Transformer architecture composed of L stacked blocks, a final Softmax layer is placed after the last (L-th) block. This layer takes the output of the stack, a sequence of m hidden vectors where m is the length of the input sequence, maps each vector to a score (logit) for every vocabulary entry, and normalizes those scores with Softmax. The result is a sequence of m probability distributions, each defined over the entire vocabulary.
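A minimal sketch of this step in PyTorch, assuming the standard design in which a learned linear projection produces the vocabulary logits before Softmax (the dimensions, tensor names, and the nn.Linear module below are illustrative assumptions, not taken from the text):

```python
import torch
import torch.nn as nn

# Illustrative sizes (assumptions, not from the source text)
m, d_model, vocab_size = 6, 512, 32000

# Stand-in for the output of the L-th Transformer block:
# one d_model-dimensional hidden vector per input position.
hidden = torch.randn(m, d_model)

# Final layer: project each hidden vector to vocabulary logits,
# then normalize with Softmax along the vocabulary dimension.
output_proj = nn.Linear(d_model, vocab_size)
logits = output_proj(hidden)            # shape: (m, vocab_size)
probs = torch.softmax(logits, dim=-1)   # shape: (m, vocab_size)

# Each of the m rows is a probability distribution over the vocabulary.
assert probs.shape == (m, vocab_size)
print(probs.sum(dim=-1))  # every entry is ~1.0
```

In next-token prediction, the distribution produced at position i is typically read as the model's prediction for the token at position i + 1.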

Tags

Ch.2 Generative Models - Foundations of Large Language Models

Foundations of Large Language Models

Foundations of Large Language Models Course

Computing Sciences