In a multi-layer Transformer model, the final output representation for an input token is taken from the topmost layer of the stack, with each layer's output feeding the next; averaging the output vectors from all individual layers is an optional pooling strategy sometimes used for feature extraction, not the default behavior.
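A minimal NumPy sketch contrasting the two options. The `layer` function below is a toy stand-in (a residual nonlinear map, not a real Transformer block), used only to show that taking the last layer's output and averaging across all layers generally give different vectors:

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(x, W):
    # Toy stand-in for a Transformer layer: residual + nonlinearity.
    return x + np.tanh(x @ W)

d = 8                       # hidden size (illustrative)
n_layers = 4                # depth of the stack
x = rng.normal(size=(d,))   # toy embedding for a single token
Ws = [rng.normal(scale=0.1, size=(d, d)) for _ in range(n_layers)]

# Run the stack, keeping every layer's output.
hidden = [x]
for W in Ws:
    hidden.append(layer(hidden[-1], W))

final_repr = hidden[-1]                 # standard choice: last layer's output
avg_repr = np.mean(hidden[1:], axis=0)  # alternative: average over all layers

print(final_repr.shape, np.allclose(final_repr, avg_repr))
```

With any nontrivial layers, the averaged vector differs from the last layer's output, which is why the two extraction strategies are not interchangeable.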
Tags
Ch.5 Inference - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Comprehension in Revised Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
Role of the Final Softmax Layer in Transformers
A language model is constructed with a deep stack of 24 identical processing layers, where the output of one layer becomes the input for the next. For the sentence 'The driver turned the steering wheel to park the car', how would the numerical representation for the word 'park' generated by layer 3 likely compare to the representation generated by the final layer, layer 24?
Optimal Representation Extraction