Learn Before
Arrange the following steps, which describe how a standard Transformer encoder processes a sequence of tokens, into the correct chronological order.
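For reference, a standard Transformer encoder proceeds in this order: token IDs are embedded, positional information is added, and each layer then applies multi-head self-attention followed by a position-wise feed-forward network (each with a residual connection and layer normalization), yielding one contextualized vector per token. Below is a minimal sketch of that pipeline, assuming a post-norm encoder with sinusoidal positions as in the original Transformer; all sizes (vocab_size, d_model, n_heads, d_ff, n_layers) are illustrative, not taken from the card.

```python
# Minimal sketch of the standard (post-norm) Transformer encoder pipeline.
# Hyperparameters here are illustrative assumptions only.
import math
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    def __init__(self, vocab_size=1000, d_model=64, n_heads=4, d_ff=256, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)   # step 1: token embedding
        self.d_model = d_model
        self.attn = nn.ModuleList(
            nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            for _ in range(n_layers))
        self.ffn = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_layers))
        self.norm1 = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(n_layers))
        self.norm2 = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(n_layers))

    def positional_encoding(self, seq_len):
        # step 2: sinusoidal position information, as in the original paper
        pos = torch.arange(seq_len).unsqueeze(1)
        div = torch.exp(torch.arange(0, self.d_model, 2) * (-math.log(10000.0) / self.d_model))
        pe = torch.zeros(seq_len, self.d_model)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        return pe

    def forward(self, token_ids):                             # token_ids: [batch, seq_len]
        x = self.embed(token_ids)                             # 1. embed tokens
        x = x + self.positional_encoding(token_ids.size(1))   # 2. add positions
        for attn, ffn, n1, n2 in zip(self.attn, self.ffn, self.norm1, self.norm2):
            a, _ = attn(x, x, x)                              # 3. multi-head self-attention
            x = n1(x + a)                                     #    residual + layer norm
            x = n2(x + ffn(x))                                # 4. feed-forward, residual + norm
        return x                                              # 5. contextualized vectors

out = TinyEncoder()(torch.randint(0, 1000, (1, 15)))
print(out.shape)  # torch.Size([1, 15, 64]): one vector per input token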
Tags
Ch.1 Pre-training - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Comprehension in Revised Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
A language model's encoder processes an input sequence consisting of 15 tokens. The model is configured with a hidden size of 768. What will be the dimensions of the final sequence of contextualized vectors produced by this encoder? (A shape check follows this list.)
Self-Supervised Pre-training of Encoders via Masked Language Modeling
Applying a Pre-trained Encoder to Downstream Tasks
Interpreting a Transformer Encoder's Output
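The first related card above asks for the output dimensions when a 15-token sequence passes through an encoder with hidden size 768. Since the encoder produces one contextualized vector per input token, and each vector has the hidden size as its dimension, the output is a 15 × 768 matrix. A quick sketch using PyTorch's built-in encoder confirms this; the layer count and head count below are illustrative assumptions, not a specific model's configuration.

```python
# Shape check: 15 tokens in, 15 contextualized 768-dim vectors out.
# Layer and head counts are illustrative assumptions.
import torch
import torch.nn as nn

layer = nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)

hidden_states = torch.randn(1, 15, 768)   # [batch=1, tokens=15, hidden=768] after embedding
out = encoder(hidden_states)
print(out.shape)                          # torch.Size([1, 15, 768]) -> 15 x 768 per sequence
```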