BERT-based Architecture for Span Prediction
A common architecture for span prediction tasks uses BERT by concatenating the query and the context text into a single input sequence. To identify the answer span, two separate prediction networks are placed on top of BERT's output layer. For each token in the context text, the first network produces the probability that the token is the start of the answer span, and the second produces the probability that it is the end of the span. Both prediction networks are applied only to the outputs corresponding to context tokens.
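The two prediction heads described above can be sketched as follows. This is a minimal NumPy sketch with made-up dimensions and random weights standing in for BERT's outputs and trained parameters; in practice each head is a learned linear layer followed by a softmax over the context tokens.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, n_ctx = 8, 5                  # hidden size, number of context tokens
H = rng.normal(size=(n_ctx, d_model))  # BERT output vectors for context tokens only

# Two separate prediction networks: one linear scorer for "start", one for "end".
w_start = rng.normal(size=d_model)
w_end = rng.normal(size=d_model)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

p_start = softmax(H @ w_start)  # probability each context token starts the span
p_end = softmax(H @ w_end)      # probability each context token ends the span

print(p_start.sum(), p_end.sum())  # each distribution sums to 1
```

Note that the softmax is taken only over the context positions, matching the point that the heads are applied exclusively to context-token outputs, not to the query tokens.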
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Ch.1 Pre-training - Foundations of Large Language Models
Related
Illustration of BERT-based Architecture for Named Entity Recognition
Training BERT-based NER Models
BERT-based Architecture for Span Prediction
An engineer is using a pre-trained transformer model to build a system that assigns a grammatical tag (e.g., Noun, Verb, Adjective) to every word in a sentence. After the model processes the input and generates a final hidden state vector for each token, which of the following is the most appropriate architectural choice to generate the tag for each specific word?
A developer is building a model to assign a specific category (e.g., 'Person', 'Location', 'Organization') to each word in a sentence. The model's architecture involves using a large, pre-trained component to understand the context of each word. Arrange the following steps in the correct chronological order that describes how this model processes an input sentence to generate a label for each word.
An engineer is building a system to identify and tag specific medical terms (e.g., 'symptom', 'disease', 'medication') within clinical notes. They are using a large, pre-trained transformer-based model that processes an entire sentence and outputs a contextualized vector representation for each input token. Which of the following describes the most effective and standard final layer design for this token-level classification task?
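The standard design these token-level questions point at is a single shared linear classifier applied to every token's contextualized vector, followed by a softmax over the tag set. A minimal NumPy sketch (the tag names, dimensions, and random weights are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)

tags = ["O", "symptom", "disease", "medication"]  # hypothetical tag set
d_model, n_tokens = 8, 6

H = rng.normal(size=(n_tokens, d_model))   # one contextualized vector per token
W = rng.normal(size=(d_model, len(tags)))  # one shared linear layer for all tokens
b = np.zeros(len(tags))

logits = H @ W + b
probs = np.exp(logits - logits.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)       # softmax over tags, per token
pred = [tags[i] for i in probs.argmax(axis=1)]  # one tag per input token
```

Because the same weights are applied at every position, the head adds only d_model × |tags| parameters regardless of sentence length, and the pre-trained encoder does the contextual work.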
A language model is tasked with answering a question by identifying the correct text span from a given context. The model works by calculating a probability for each token being the 'start' of the answer and a separate probability for each token being the 'end' of the answer. Consider the following scenario:
Context: 'The first modern Olympic Games were held in Athens, Greece, in 1896. The International Olympic Committee (IOC) was founded in 1894 by Pierre de Coubertin.' Question: 'When was the IOC established?'
The model produces the following highest probabilities:
- Highest Probability Start Token: '1896' (Probability: 0.85)
- Highest Probability End Token: '1894' (Probability: 0.91)
Based on this output, what is the most fundamental reason the model failed to produce a valid answer span?
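The failure mode in the scenario is that the independently chosen start token ('1896') appears after the chosen end token ('1894'), so the two argmaxes do not form a valid span. Valid inference must search over start/end pairs jointly, subject to start ≤ end. A toy sketch with made-up probabilities that mirror the scenario (the start distribution peaks after the end distribution):

```python
import numpy as np

# Toy start/end distributions over 6 context tokens (made-up numbers).
p_start = np.array([0.02, 0.05, 0.03, 0.04, 0.85, 0.01])  # peaks at index 4
p_end   = np.array([0.01, 0.91, 0.03, 0.02, 0.02, 0.01])  # peaks at index 1

# Naive decoding: independent argmaxes can yield end before start (invalid).
s, e = int(p_start.argmax()), int(p_end.argmax())  # here s=4, e=1

# Constrained decoding: maximize p_start[i] * p_end[j] subject to i <= j.
best, best_score = (0, 0), -1.0
for i in range(len(p_start)):
    for j in range(i, len(p_end)):
        score = float(p_start[i] * p_end[j])
        if score > best_score:
            best, best_score = (i, j), score
print(best)  # a valid span with start <= end
```

The constrained search may settle on a pair where neither index is the global argmax of its own distribution, which is exactly why taking the two maxima independently is not a valid decoding strategy.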
Framing a Clinical Information Extraction Task
Applicability of Span Prediction
BERT-based Architecture for Span Prediction
Learn After
Span Prediction Loss Function
Inference for Span Prediction
Illustration of BERT-based Architecture for Span Prediction
Input Sequence Formatting for Span Prediction
Applying Prediction Networks to Context Token Outputs
An engineer is designing a model to extract answers from a paragraph. The model must identify a continuous segment of text (a 'span') that answers a given question. The model's base component processes the input and produces a contextualized vector representation for each token in the paragraph. Considering the task is to identify the start and end points of the answer, which of the following architectural designs for the final prediction layer is most appropriate?
Debugging a Question-Answering Model Architecture
Comparing Model Architectures for Text Extraction Tasks