Learn Before
Text Regression with BERT Models
BERT models can be adapted for regression tasks, where the goal is to predict a continuous, real-valued score rather than a discrete class label. The adaptation only modifies the final prediction network; the underlying BERT encoder is identical to the one used for classification. For instance, to score the similarity between two sentences, a Sigmoid layer can be appended to the prediction network so that the output falls within a fixed range, in this case between 0 and 1.
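
Below is a minimal sketch of this setup, assuming PyTorch and the Hugging Face transformers library (neither is specified by this card). The class name BertForRegression and the choice of the [CLS] vector as the sequence summary are illustrative assumptions, not a prescribed recipe:

```python
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class BertForRegression(nn.Module):
    """BERT encoder with a regression head (illustrative sketch).

    The encoder is unchanged from the classification setup; only the
    final prediction network differs: a linear layer followed by a
    Sigmoid squashes the score into (0, 1).
    """

    def __init__(self, model_name="bert-base-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        self.head = nn.Sequential(
            nn.Linear(self.bert.config.hidden_size, 1),
            nn.Sigmoid(),  # constrains the output to (0, 1)
        )

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        # Use the [CLS] token's final hidden state as the sequence summary.
        cls_vec = outputs.last_hidden_state[:, 0]
        return self.head(cls_vec).squeeze(-1)

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForRegression()

# Sentence-pair input: the tokenizer joins the two sentences with [SEP].
batch = tokenizer(
    ["A man is playing a guitar."],
    ["Someone is strumming a guitar."],
    return_tensors="pt", padding=True, truncation=True,
)
score = model(batch["input_ids"], batch["attention_mask"])
# Training would minimize a regression loss, e.g. MSE between `score`
# and gold similarity labels rescaled to [0, 1]:
#   loss = nn.MSELoss()(score, target)
```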

Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Ch.1 Pre-training - Foundations of Large Language Models
Related
General Evaluation Benchmark
Named Entity Recognition
Single-Text Classification with BERT Models
Selecting the Appropriate NLP Task for a Business Need
Match each description of a natural language processing task with the most appropriate application name.
A company uses a fine-tuned pre-trained model to automatically process thousands of customer product reviews. When a review states, 'I am extremely disappointed with this purchase; it stopped working after just one use,' the system assigns it a 'Negative' label. Which primary application of a pre-trained model does this system exemplify?
Learn After
Sentence Similarity Calculation using BERT-based Regression
Illustration of BERT for Text-Pair Tasks (Classification and Regression)
Training BERT-based Regression Models via Loss Minimization
Adapting a Language Model for a New Task
A data science team has a pre-trained transformer model that has been successfully fine-tuned for a text classification task, predicting whether a product review is 'positive' or 'negative'. They now want to adapt this model for a new regression task: predicting a continuous 'star rating' for reviews, on a scale from 1.0 to 5.0. Which of the following modifications represents the most direct and essential change to the model's architecture to enable this new task?
Comparing Model Architectures for Different NLP Tasks