Learn Before
Diagnosing Model Training Issues from Data Formatting
A data scientist is training a language model for a text summarization task. They have combined the original long text and its corresponding summary into a single sequence for each training example. However, the model is struggling to learn the task and is generating incoherent outputs. Below is an example of how one data point was formatted:
'The Industrial Revolution was the transition to new manufacturing processes... [full long text] ... The Industrial Revolution transformed economies.'
Based on common practices for preparing sequence data, identify the likely error in this formatting and explain why it would cause the model to perform poorly.
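The conventional fix is to mark the boundary between the source text and its summary explicitly. The sketch below is illustrative only: the token strings `<sep>` and `<eos>` are placeholders, and real pipelines would use whatever special tokens the model's tokenizer defines.

```python
# Illustrative convention for packing a source text and its summary
# into one training sequence. "<sep>" and "<eos>" are placeholder
# special tokens, not tied to any particular tokenizer.
SEP = "<sep>"  # boundary between the input text and the target summary
EOS = "<eos>"  # end-of-sequence marker

def format_example(long_text: str, summary: str) -> str:
    """Concatenate source and target with an explicit boundary marker."""
    return f"{long_text} {SEP} {summary} {EOS}"

# Without a separator, the model has no signal for where the article
# ends and the summary begins, so it cannot learn what to generate.
example = format_example(
    "The Industrial Revolution was the transition to new manufacturing processes...",
    "The Industrial Revolution transformed economies.",
)
```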
Tags
Ch.2 Generative Models - Foundations of Large Language Models
Foundations of Large Language Models
Foundations of Large Language Models Course
Computing Sciences
Analysis in Bloom's Taxonomy
Cognitive Psychology
Psychology
Social Science
Empirical Science
Science
Related
A language model is being prepared for a question-answering task. The model must process both the question and its corresponding answer as a single, combined sequence. If the question is 'What is the capital of France?' and the answer is 'Paris', how should these two sequences be formatted for the model using a special separator token to distinguish between them?
Diagnosing Model Training Issues from Data Formatting
Debugging Data Preprocessing for a Summarization Model
Example of Sequence Packing for Translation