Memory Efficient Models
Most modern neural language models require substantial memory for training and inference. To meet the computation and storage constraints of edge applications, these models must be compressed, either by training smaller student models via knowledge distillation or by applying model compression techniques such as pruning and quantization. Developing a task-agnostic model compression method remains an active research topic.
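As a minimal sketch of the knowledge-distillation idea mentioned above: the student is trained to match the teacher's temperature-softened output distribution by minimizing a KL-divergence loss. The function names and temperature value here are illustrative, not from any particular library.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with temperature scaling; higher T gives softer distributions."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence between softened teacher and student distributions.

    Minimizing this pushes the student to mimic the teacher's soft targets.
    The T^2 factor keeps gradient magnitudes comparable across temperatures.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return temperature ** 2 * sum(
        pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0
    )

# A student that already matches the teacher incurs (near-)zero loss;
# a mismatched student incurs a positive loss.
teacher = [3.0, 1.0, -2.0]
print(distillation_loss(teacher, teacher))
print(distillation_loss(teacher, [0.1, 0.1, 0.1]))
```

In practice this soft-target loss is usually combined with the ordinary cross-entropy loss on the true labels, weighted by a mixing coefficient.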
Updated 2025-09-16
Tags
Data Science