1Cademy - Generate Additional Labeled Data to overcome Data Sparsity

Generate Additional Labeled Data to overcome Data Sparsity

This method is best suited to task-specific data. The following are some approaches for generating new labeled data.

Data Augmentation: Assumes there is a set of existing labeled instances. New instances are generated by modifying features in ways that preserve the label.

Example: Rewriting a sentence so that it keeps the same sentiment. For instance, ‘the film is great’ has a positive sentiment. It can be rewritten as ‘the movie is great’ and ‘the show is awesome’, which retain the same sentiment while adding new data. Other techniques include verb replacement, tree modification, and language models.
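The rewriting example above can be sketched as a simple synonym substitution. This is a minimal illustration, not a reference implementation: the `SYNONYMS` table and the `augment` function are hypothetical names introduced here, and the sketch assumes that every listed substitution preserves the label.

```python
from itertools import product

# Hypothetical, hand-written synonym table (an assumption for illustration);
# a real system would need a task- and language-specific resource.
SYNONYMS = {
    "film": ["movie", "show"],
    "great": ["awesome"],
}

def augment(sentence, label, synonyms=SYNONYMS):
    """Return new (sentence, label) pairs made by swapping in synonyms.

    The label is copied unchanged, which is only valid if every
    substitution preserves it.
    """
    tokens = sentence.split()
    # For each token, the candidates are the token itself plus its synonyms.
    options = [[tok] + synonyms.get(tok, []) for tok in tokens]
    variants = []
    for combo in product(*options):
        candidate = " ".join(combo)
        if candidate != sentence:  # skip the unmodified original
            variants.append((candidate, label))
    return variants

augmented = augment("the film is great", "positive")
for text, tag in augmented:
    print(tag, text)
```

With this table, one labeled sentence yields five new positive instances, including ‘the movie is great’ and ‘the show is awesome’.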

Some open issues with this approach: There is no unified framework across tasks and languages, because data augmentation requires an understanding of the task and the language. According to some studies, the performance of this method is equivalent to that of pre-training, but it cannot be ruled out, as it provides interesting insights.

Distant & Weak Supervision: This approach assumes access to unlabeled text and some form of auxiliary data. Automatic or semi-automatic methods are then used to label the text. These labeling functions are usually created by experts.

Example: Consider the sentence ‘NCAA will be held in Chicago’ and a table of location names that lists different cities. When the words of the sentence are matched against the table, the city in the sentence is easily identified and labeled as a location.
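The table lookup in this example can be sketched as a single labeling function. The `LOCATIONS` set, the `label_locations` function, and the `LOC`/`O` tag names are assumptions made for illustration; the sketch matches single whitespace-separated tokens exactly and does not handle multi-word names or ambiguous words.

```python
# Hypothetical auxiliary table of location names (an assumption for illustration).
LOCATIONS = {"Chicago", "Boston", "Berlin"}

def label_locations(sentence, locations=LOCATIONS):
    """Labeling function: tag a token LOC if it appears in the table, else O."""
    return [(tok, "LOC" if tok in locations else "O") for tok in sentence.split()]

labeled = label_locations("NCAA will be held in Chicago")
for token, tag in labeled:
    print(token, tag)
```

Only ‘Chicago’ receives the `LOC` tag here; every other token is tagged `O`. The labels are only as good as the table, which is why such automatically labeled data is treated as noisy.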

Some open issues with this approach: This approach is most popular in information extraction tasks such as Named Entity Recognition and relation extraction. This raises the question of whether a task needs specific properties to be suitable for distant supervision. In low-resource settings, the lack of auxiliary data can make data generation hard. Creating the labeling functions requires human effort, and that time could instead be spent labeling more data.

Use of Non-Experts and Noisy Labels: Non-native speakers can be used as annotators, and native speakers who are not NLP experts can be included in model development.

Recently developed techniques for handling noisy labels include noise filtering and noise modeling. Filtering methods classify labels as noisy or clean in order to remove the incorrect ones. Noise modeling methods estimate the relationship between noisy and clean labels so that the noisy label distribution can be shifted toward the clean distribution.
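One way to picture noise modeling is a noise transition matrix that records how often each clean label turns into each noisy label. The sketch below is an assumption-laden illustration rather than a specific published method: it assumes a small sample where both the clean and the noisy label are known, and the names `estimate_noise_matrix`, `to_noisy_distribution`, and the example `pairs` are hypothetical.

```python
from collections import Counter, defaultdict

def estimate_noise_matrix(pairs):
    """Estimate P(noisy label | clean label) from (clean, noisy) label pairs."""
    counts = defaultdict(Counter)
    for clean, noisy in pairs:
        counts[clean][noisy] += 1
    return {
        clean: {noisy: n / sum(row.values()) for noisy, n in row.items()}
        for clean, row in counts.items()
    }

def to_noisy_distribution(clean_probs, noise_matrix):
    """Map a distribution over clean labels to the implied noisy-label distribution."""
    noisy_probs = defaultdict(float)
    for clean, p in clean_probs.items():
        for noisy, q in noise_matrix.get(clean, {}).items():
            noisy_probs[noisy] += p * q
    return dict(noisy_probs)

# Hypothetical sample: one of four true LOC tokens was mislabeled as O.
pairs = [("LOC", "LOC"), ("LOC", "LOC"), ("LOC", "O"), ("LOC", "LOC"),
         ("O", "O"), ("O", "O"), ("O", "O"), ("O", "O")]
matrix = estimate_noise_matrix(pairs)
noisy = to_noisy_distribution({"LOC": 0.5, "O": 0.5}, matrix)
print(matrix)
print(noisy)
```

In this sketch the matrix links the two label distributions: a model that predicts clean labels can be compared against noisy annotations by passing its predictions through the matrix, so the noise is accounted for instead of being learned as if it were signal.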

Updated 2025-09-16

Contributors are:

Vidheesh Kumar Nacode
