Given the following scenario, identify the most probable limitation of the current encoder design and propose a specific, justifiable change to the encoder stack to improve its ability to capture long-range dependencies.

Google

The Transformer encoder is structured as a stack of multiple identical layers. It processes an input sequence that has been combined with positional encoding, and ultimately outputs a $$d$$-dimensional vector representation for every position within that source sequence.

Transformer Encoder Stack

The standard procedure for processing an input sequence with a Transformer encoder begins by representing each input token, $$x_i$$, as its corresponding embedding, $$\mathbf{e}_i$$. This sequence of embeddings, $$\mathbf{e}_{0},...,\mathbf{e}_m$$, is then fed into the encoder. The encoder processes this input to produce a sequence of contextualized output vectors, or hidden states, $$\mathbf{h}_{0},...,\mathbf{h}_m$$.

Standard Transformer Encoding Procedure

The architecture of a Transformer encoder is defined by several essential hyperparameters. These include the vocabulary size ($$|V|$$) and the embedding size ($$d_e$$) used for token representations. Additionally, the hidden size ($$d$$) specifies the input and output dimensionality for both the self-attention and the Feed-Forward Network (FFN) sub-layers. Other crucial hyperparameters are the number of attention heads ($$n_{\mathrm{head}}$$) for the multi-head self-attention mechanism, the internal hidden layer size of the FFN ($$d_{\textrm{ffn}}$$), and the model depth ($$L$$), which indicates the number of stacked layers.

Key Hyperparameters of a Transformer Encoder

This example illustrates the encoding process for a masked bilingual sentence pair. The input sequence, `[CLS] [MASK]是 [MASK]动物。 [SEP] Whales [MASK] [MASK] . [SEP]`, is first converted into a series of token embeddings, denoted as `e0` through `e11`. This embedding sequence is then processed by a Transformer Encoder, which outputs a corresponding sequence of contextualized hidden states, `h0` through `h11`. These hidden states serve as the basis for predicting the original masked tokens: '鲸鱼', '哺乳', 'are', and 'mammals'.

Transformer Encoding of a Masked Bilingual Sentence Pair

Prefix tuning is a parameter-efficient fine-tuning (PEFT) technique for large language models. Instead of fine-tuning all the model's parameters, it keeps the original model frozen and introduces a small number of trainable vectors, called a prefix. This prefix is prepended to the sequence of hidden states at each transformer layer, and only these prefix parameters are optimized during training for a specific task. This approach allows the model to be adapted to new tasks by learning a small, task-specific prefix that steers the behavior of the larger frozen model.

Prefix Tuning

In a sequence-to-sequence model, the input is processed by a stack of six encoder layers that have identical structures. A proposal is made to modify this architecture so that all six encoder layers share the exact same set of weights, with the goal of reducing the total number of model parameters. Which statement best analyzes the primary consequence of this change on the model's ability to process information?

A sentence is fed into the encoder side of a Transformer model. Arrange the following steps in the correct sequence to describe how the initial input is processed by the stack of encoders.

Improving a Transformer's Contextual Understanding

Unlike Recurrent Neural Networks (RNNs) that process tokens sequentially one-by-one, self-attention ditches sequential operations in favor of parallel computation and does not naturally preserve the order of the input sequence. To address this order-insensitivity, the dominant approach is to represent the sequence order as an additional input associated with each token, called a positional encoding. These encodings can be either learned during training or fixed a priori.

Positional Encoding

Every individual layer within the Transformer encoder stack contains two primary sublayers: a multi-head self-attention pooling sublayer and a positionwise feed-forward network. In the encoder's self-attention mechanism, the queries, keys, and values are all sourced directly from the outputs of the immediately preceding encoder layer.

Learn Before

Related