The standard procedure for processing an input sequence with a Transformer encoder begins by representing each input token, $$x_i$$, as its corresponding embedding, $$\mathbf{e}_i$$. This sequence of embeddings, $$\mathbf{e}_{0},...,\mathbf{e}_m$$, is then fed into the encoder. The encoder processes this input to produce a sequence of contextualized output vectors, or hidden states, $$\mathbf{h}_{0},...,\mathbf{h}_m$$.

Standard Transformer Encoding Procedure

The architecture of a Transformer encoder is defined by several essential hyperparameters. These include the vocabulary size ($$|V|$$) and the embedding size ($$d_e$$) used for token representations. Additionally, the hidden size ($$d$$) specifies the input and output dimensionality for both the self-attention and the Feed-Forward Network (FFN) sub-layers. Other crucial hyperparameters are the number of attention heads ($$n_{\mathrm{head}}$$) for the multi-head self-attention mechanism, the internal hidden layer size of the FFN ($$d_{\textrm{ffn}}$$), and the model depth ($$L$$), which indicates the number of stacked layers.

Key Hyperparameters of a Transformer Encoder

This example illustrates the encoding process for a masked bilingual sentence pair. The input sequence, `[CLS] [MASK]是 [MASK]动物。 [SEP] Whales [MASK] [MASK] . [SEP]`, is first converted into a series of token embeddings, denoted as `e0` through `e11`. This embedding sequence is then processed by a Transformer Encoder, which outputs a corresponding sequence of contextualized hidden states, `h0` through `h11`. These hidden states serve as the basis for predicting the original masked tokens: '鲸鱼', '哺乳', 'are', and 'mammals'.

Transformer Encoding of a Masked Bilingual Sentence Pair

Prefix tuning is a parameter-efficient fine-tuning (PEFT) technique for large language models. Instead of fine-tuning all the model's parameters, it keeps the original model frozen and introduces a small number of trainable vectors, called a prefix. This prefix is prepended to the sequence of hidden states at each transformer layer, and only these prefix parameters are optimized during training for a specific task. This approach allows the model to be adapted to new tasks by learning a small, task-specific prefix that steers the behavior of the larger frozen model.

Prefix Tuning

In a sequence-to-sequence model, the input is processed by a stack of six encoder layers that have identical structures. A proposal is made to modify this architecture so that all six encoder layers share the exact same set of weights, with the goal of reducing the total number of model parameters. Which statement best analyzes the primary consequence of this change on the model's ability to process information?

A sentence is fed into the encoder side of a Transformer model. Arrange the following steps in the correct sequence to describe how the initial input is processed by the stack of encoders.

Given the following scenario, identify the most probable limitation of the current encoder design and propose a specific, justifiable change to the encoder stack to improve its ability to capture long-range dependencies.

Improving a Transformer's Contextual Understanding

Unlike Recurrent Neural Networks (RNNs) that process tokens sequentially one-by-one, self-attention ditches sequential operations in favor of parallel computation and does not naturally preserve the order of the input sequence. To address this order-insensitivity, the dominant approach is to represent the sequence order as an additional input associated with each token, called a positional encoding. These encodings can be either learned during training or fixed a priori.

Positional Encoding

Every individual layer within the Transformer encoder stack contains two primary sublayers: a multi-head self-attention pooling sublayer and a positionwise feed-forward network. In the encoder's self-attention mechanism, the queries, keys, and values are all sourced directly from the outputs of the immediately preceding encoder layer.

Transformer Encoder Sublayers

The Transformer encoder is structured as a stack of multiple identical layers. It processes an input sequence that has been combined with positional encoding, and ultimately outputs a $$d$$-dimensional vector representation for every position within that source sequence.

University of California, Berkeley

Claude

Google

The concept of attention helped dramatically improve the seq_to_seq model. There was a lot of improvement and development on that concept since those two main papers on attention got released. One of them is the paper “Attention is All you Need.” The paper introduced a model called “Transformer” that uses only attention mechanisms to work with sequential data - no RNNs. The model they created is subject to parallelization and works very well with GPU. One of the main problems with RNN is that the algorithm is hardly parallelizable because before we can evaluate one time stamp in the encoder we need the previous one. You will see how we can easily parallelize the Transformer model. As before let’s consider the example of machine translation. As before the model consists from two parts:
- Encoders
- Decoders


Transformer model

A very good article on transformer model. A very in-depth analysis

http://jalammar.github.io/illustrated-transformer/

The Illustrated Transformer

A very influential paper that introduced the concept of Transformer model.
https://arxiv.org/abs/1706.03762

Attention Is All You Need

A very good video explaining the transformer model:
https://www.youtube.com/watch?v=rBCqOTEfxvg

Attention is all you need; Attentional Neural Network Models | Łukasz Kaiser | Masterclass

Reference of Foundations of Large Language Models Course

Dive into Deep Learning

Here on the image you can see the structure of the decoder which is very similar to the encoder part we just described with only difference that in this case we pass K, V from the input to the each decoder attention layer(encoder -decoder layer) and now we take queries from previous decoder layers and compare the decoder query with encoder keys just like in the usual se2seq model. Also the difference is that we have here a layer of so called masked self-attention. It is just the layer at each time stamp we do not compare the query with future keys

Transformer Decoder

https://www.youtube.com/watch?v=S27pHKBEp30

Learn Before

Related

Learn After