Multi-Modal PTMs
A great majority of these models are designed for general visual and linguistic feature encoding. They are pre-trained on huge corpora of cross-modal data, such as videos with spoken words or images with captions, using extended pre-training tasks designed to fully exploit the multi-modal features.
Tasks such as visual-based masked language modeling (MLM), masked visual-feature modeling, and visual-linguistic matching are widely used in multi-modal pre-training.
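To make these three objectives concrete, here is a minimal PyTorch sketch of a single-stream encoder with one loss head per task. All module names, dimensions, the 15% masking rate, and the use of token 103 as `[MASK]` are illustrative assumptions, not any specific published model.

```python
import torch
import torch.nn as nn

class CrossModalEncoder(nn.Module):
    """Toy single-stream encoder: text tokens and visual features share one Transformer."""
    def __init__(self, vocab_size=30522, d_model=256, n_layers=4, n_heads=4,
                 visual_dim=2048):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.visual_proj = nn.Linear(visual_dim, d_model)   # project CNN features
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.mlm_head = nn.Linear(d_model, vocab_size)      # visual-based MLM
        self.mvm_head = nn.Linear(d_model, visual_dim)      # masked visual-feature modeling
        self.match_head = nn.Linear(d_model, 2)             # visual-linguistic matching

    def forward(self, text_ids, visual_feats):
        # Concatenate both modalities into one sequence, single-stream style.
        x = torch.cat([self.text_embed(text_ids),
                       self.visual_proj(visual_feats)], dim=1)
        return self.encoder(x)

def pretraining_losses(model, text_ids, visual_feats, is_match):
    B, T = text_ids.shape
    V = visual_feats.shape[1]
    # 1) Visual-based MLM: mask ~15% of text tokens and predict them from
    #    the joint (text + visual) context.
    mask = torch.rand(B, T, device=text_ids.device) < 0.15
    masked_ids = text_ids.masked_fill(mask, 103)  # 103 = [MASK] (assumed vocab)
    h = model(masked_ids, visual_feats)
    mlm_loss = nn.functional.cross_entropy(
        model.mlm_head(h[:, :T])[mask], text_ids[mask])
    # 2) Masked visual-feature modeling: zero out some visual features and
    #    regress the originals.
    vmask = torch.rand(B, V, device=visual_feats.device) < 0.15
    h = model(text_ids, visual_feats.masked_fill(vmask.unsqueeze(-1), 0.0))
    mvm_loss = nn.functional.mse_loss(
        model.mvm_head(h[:, T:])[vmask], visual_feats[vmask])
    # 3) Visual-linguistic matching: classify whether the text and visual
    #    inputs actually belong together (is_match in {0, 1}).
    h = model(text_ids, visual_feats)
    match_loss = nn.functional.cross_entropy(model.match_head(h[:, 0]), is_match)
    return mlm_loss + mvm_loss + match_loss

# Usage with dummy data:
model = CrossModalEncoder()
loss = pretraining_losses(model,
                          torch.randint(0, 30522, (2, 16)),  # text token ids
                          torch.randn(2, 10, 2048),          # region/frame features
                          torch.tensor([1, 0]))              # match labels
```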
- Video-Text PTMs: To obtain the sequences of visual and linguistic tokens used for pre-training, the video frames and the audio track are pre-processed by CNN-based encoders and off-the-shelf speech recognition techniques, respectively. A single Transformer encoder is then trained on the processed data to learn vision-language representations for downstream tasks such as video captioning (a minimal sketch of this pipeline follows the list).
- Image-Text PTMs: Several works introduce PTMs trained on image-text pairs, targeting downstream tasks such as visual question answering (VQA) and visual commonsense reasoning (VCR).
- Audio-Text PTMs: Several methods have explored PTMs on audio-text pairs, such as SpeechBERT. This work builds an end-to-end Speech Question Answering (SQA) model by encoding audio and text with a single Transformer encoder, which is pre-trained with MLM on speech and text corpora and fine-tuned on question answering (sketched after the list).
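For the Video-Text setting, the following sketch shows how the two token sequences might be assembled. It assumes torchvision's ResNet-50 as the CNN frame encoder and stubs the ASR transcript as a fixed string, since the text only says "off-the-shelf speech recognition techniques"; the dummy frames and the example sentence are hypothetical.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

# 1) Visual tokens: encode each sampled frame with a CNN and keep the pooled
#    2048-d feature as one "visual token" (pass weights="DEFAULT" in practice
#    to load pretrained parameters; omitted here to keep the sketch offline).
resnet = resnet50(weights=None)
frame_encoder = nn.Sequential(*list(resnet.children())[:-1])  # drop the fc head
frame_encoder.eval()

frames = torch.randn(8, 3, 224, 224)  # 8 sampled frames (dummy data)
with torch.no_grad():
    visual_tokens = frame_encoder(frames).flatten(1)  # -> (8, 2048)

# 2) Linguistic tokens: in practice these come from running off-the-shelf ASR
#    on the audio track and tokenizing the transcript; stubbed here.
transcript = "a person slices an onion on a cutting board"  # hypothetical ASR output
text_tokens = transcript.split()  # stand-in for a real subword tokenizer

# 3) Both sequences are concatenated and fed to a single Transformer encoder
#    (e.g., the CrossModalEncoder sketch above) for pre-training.
print(visual_tokens.shape, len(text_tokens))
```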
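For the Audio-Text setting, here is a SpeechBERT-style SQA sketch: one Transformer encoder over question tokens and audio-frame features, with a span head pointing at the answer inside the spoken passage. All names and dimensions are assumptions, and the linear audio projection is a stand-in for SpeechBERT's more elaborate learned audio embeddings.

```python
import torch
import torch.nn as nn

class SpeechQAModel(nn.Module):
    def __init__(self, vocab_size=30522, d_model=256, audio_dim=80):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.audio_proj = nn.Linear(audio_dim, d_model)   # audio frame features
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        # After MLM pre-training on speech and text corpora, this extra head
        # is fine-tuned to locate the answer span in the spoken passage.
        self.span_head = nn.Linear(d_model, 2)            # start / end logits

    def forward(self, question_ids, audio_feats):
        q = self.text_embed(question_ids)                 # (B, Tq, d)
        a = self.audio_proj(audio_feats)                  # (B, Ta, d)
        h = self.encoder(torch.cat([q, a], dim=1))        # joint encoding
        audio_h = h[:, q.shape[1]:]                       # positions over audio
        start_logits, end_logits = self.span_head(audio_h).unbind(-1)
        return start_logits, end_logits                   # each (B, Ta)

# Usage: the predicted (start, end) pair indexes a segment of the audio.
model = SpeechQAModel()
question = torch.randint(0, 30522, (1, 12))  # tokenized question (dummy)
audio = torch.randn(1, 300, 80)              # 300 audio frames (dummy)
start, end = model(question, audio)
print(start.argmax(-1).item(), end.argmax(-1).item())
```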