Learn Before
Concept

Multi-Modal PTMs

Most multi-modal PTMs are designed for general visual and linguistic feature encoding. They are pre-trained on large corpora of cross-modal data, such as videos paired with spoken transcripts or images paired with captions, using extended pre-training tasks to fully exploit the multi-modal signal.

Typically, tasks such as visually grounded masked language modeling (MLM), masked visual-feature modeling, and visual-linguistic matching are widely used in multi-modal pre-training.
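As a rough illustration of the first and third tasks, the data preparation for the masking and matching objectives can be sketched in plain Python (the helper names `mask_tokens` and `matching_pairs` are hypothetical, not taken from any specific model):

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Visually grounded MLM: randomly replace tokens with [MASK];
    the model must recover them using both text and image context."""
    rng = rng or random.Random(0)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)      # target to predict
        else:
            masked.append(tok)
            labels.append(None)     # position not scored
    return masked, labels

def matching_pairs(images, captions, rng=None):
    """Visual-linguistic matching: build (image, caption, label) examples,
    half aligned (label 1) and half with a mismatched caption (label 0)."""
    rng = rng or random.Random(0)
    pairs = []
    for img, cap in zip(images, captions):
        pairs.append((img, cap, 1))
        wrong = rng.choice([c for c in captions if c is not cap])
        pairs.append((img, wrong, 0))
    return pairs
```

In real PTMs the `[MASK]` positions are predicted from both the language and the visual context, which is what makes the MLM "visually grounded" rather than text-only.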

  • Video-Text PTMs: To obtain the visual and linguistic token sequences used for pre-training, the video frames are pre-processed with CNN-based encoders and the audio track is transcribed with off-the-shelf speech recognition. A single Transformer encoder is then trained on the processed data to learn joint vision-language representations for downstream tasks such as video captioning.
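A minimal sketch of how the two token streams might be merged into one input for the shared Transformer encoder; the BERT-style `[CLS]`/`[SEP]` markers and segment-id scheme here are an assumption about the input format, not a specification of any particular model:

```python
def build_joint_sequence(visual_feats, asr_tokens):
    """Concatenate visual tokens (from a CNN encoder) and ASR word tokens
    into one sequence, with segment ids so the Transformer can tell the
    two modalities apart: 0 = special token, 1 = visual, 2 = language."""
    seq = [("[CLS]", 0)]
    seq += [(v, 1) for v in visual_feats]   # visual segment
    seq += [("[SEP]", 0)]
    seq += [(t, 2) for t in asr_tokens]     # language (ASR) segment
    seq += [("[SEP]", 0)]
    return seq
```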

  • Image-Text PTMs: Several works introduce PTMs pre-trained on image-text pairs, targeting downstream tasks such as visual question answering (VQA) and visual commonsense reasoning (VCR).
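For intuition, a toy VQA head could score candidate answers from a fused image+question vector; the `vqa_logits` helper and its dot-product scoring are illustrative stand-ins for the classification head used in practice:

```python
def vqa_logits(fused_vec, answer_vocab, weights):
    """Toy VQA head: score each candidate answer by the dot product
    between the fused image+question vector and that answer's weight
    vector; the highest-scoring answer is the prediction."""
    return {
        ans: sum(x * w for x, w in zip(fused_vec, weights[ans]))
        for ans in answer_vocab
    }
```

Real models learn the answer weight vectors during fine-tuning; here they are supplied directly just to show the scoring step.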

  • Audio-Text PTMs: Several methods have explored PTMs on audio-text pairs, such as SpeechBERT. This work builds an end-to-end Speech Question Answering (SQA) model by encoding audio and text with a single Transformer encoder, which is pre-trained with MLM on speech and text corpora and then fine-tuned on question answering.


Updated 2022-05-20

Tags

Data Science