Learn Before
Concept

Transformer Decoder

The image shows the structure of the decoder, which is very similar to the encoder part we just described. One difference is that in each encoder-decoder attention layer the keys K and values V come from the encoder output, while the queries come from the previous decoder layer; the decoder query is compared with the encoder keys, just as in the usual seq2seq attention. The other difference is the layer of so-called masked self-attention: at each time step the query is not compared with future keys, so each position can only attend to itself and earlier positions.

Image 0: structure of the Transformer decoder
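To make the two attention types concrete, here is a minimal NumPy sketch (an illustration, not the full decoder: it omits the learned Q/K/V projection matrices, multiple heads, residual connections, and layer normalization, and the dimensions are made up). It shows masked self-attention, where a causal mask keeps each query from seeing future keys, and encoder-decoder attention, where the queries come from the decoder and the keys/values come from the encoder output.

```python
import numpy as np

def attention(Q, K, V, mask=None):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # (len_q, len_k)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # block attention to masked-out keys
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V

# Hypothetical toy sizes: 4 decoder positions, 5 encoder positions, model width 8
d_model = 8
dec_x = np.random.randn(4, d_model)    # input to a decoder layer
enc_out = np.random.randn(5, d_model)  # encoder output

# 1) Masked self-attention: Q, K, V all come from the decoder input;
#    a lower-triangular mask stops each query from comparing with future keys.
causal_mask = np.tril(np.ones((4, 4), dtype=bool))
self_attended = attention(dec_x, dec_x, dec_x, mask=causal_mask)

# 2) Encoder-decoder attention: Q from the decoder, K and V from the
#    encoder output, just as in classic seq2seq attention.
cross_attended = attention(self_attended, enc_out, enc_out)
print(cross_attended.shape)  # (4, 8)
```

In a real Transformer decoder each of these steps also applies learned linear projections to form Q, K, and V, splits them into multiple heads, and wraps the result in residual connections and layer normalization; the sketch above only isolates where the queries, keys, and values come from and how the causal mask works.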

Updated 2025-10-06

Tags

Data Science

Foundations of Large Language Models Course

Computing Sciences