A curated list of Multimodal Pretrained Models and related areas.
#### Surveys (arXiv 2022)
- VLP: A Survey on Vision-Language Pre-training. [paper]
- A Survey of Vision-Language Pre-Trained Models. [paper]
#### arXiv 2022
- Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents. [paper]
- data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language. [paper]
- CLIP-TD: CLIP Targeted Distillation for Vision-Language Tasks. [paper]
- BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. [paper] [code]
- MVP: Multi-Stage Vision-Language Pre-Training via Multi-Level Semantic Alignment. [paper]
- VC-GPT: Visual Conditioned GPT for End-to-End Generative Vision-and-Language Pre-training. [paper]
- Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework. [paper]
- Wukong: 100 Million Large-scale Chinese Cross-modal Pre-training Dataset and A Foundation Framework. [paper]
- HighMMT: Towards Modality and Task Generalization for High-Modality Representation Learning. [paper] [code]
- Mind the Gap: Understanding the Modality Gap in Multi-modal Contrastive Representation Learning. [paper] [code]
- Making the Most of Text Semantics to Improve Biomedical Vision-Language Processing. [paper]
- iCAR: Bridging Image Classification and Image-text Alignment for Visual Recognition. [paper]
- Multimodal Adaptive Distillation for Leveraging Unimodal Encoders for Vision-Language Tasks. [paper]
- PreTraM: Self-Supervised Pre-training via Connecting Trajectory and Map. [paper]
- A Multi-level Alignment Training Scheme for Video-and-Language Grounding. [paper]
- Contrastive Language-Action Pre-training for Temporal Localization. [paper]
- MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval. [paper]
- CoCa: Contrastive Captioners are Image-Text Foundation Models. [paper]
- i-Code: An Integrative and Composable Multimodal Learning Framework. [paper]
- Language Models Can See: Plugging Visual Controls in Text Generation. [paper] [code]
- Data Determines Distributional Robustness in Contrastive Language Image Pre-training (CLIP). [paper] [code]
- PyramidCLIP: Hierarchical Feature Alignment for Vision-language Model Pretraining. [paper]
- Flamingo: a Visual Language Model for Few-Shot Learning. [paper]
- Zero and R2D2: A Large-scale Chinese Cross-modal Benchmark and A Vision-Language Framework. [paper]
- One Model, Multiple Modalities: A Sparsely Activated Approach for Text, Sound, Image, Video and Code. [paper]
- Unsupervised Prompt Learning for Vision-Language Models. [paper] [code]
- Vision-Language Pre-Training with Triple Contrastive Learning. [paper] [code]
- Multi-modal Alignment using Representation Codebook. [paper]
- Unsupervised Vision-and-Language Pre-training via Retrieval-based Multi-Granular Alignment. [paper] [code]
- Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions. [paper] [code]
- Towards General Purpose Vision Systems. [paper] [code]
- Are Multimodal Transformers Robust to Missing Modality? [paper]
- How Much Can CLIP Benefit Vision-and-Language Tasks? [paper] [code]
- Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm. [paper] [code]
- Evaluating language-biased image classification based on semantic representations. [paper]
- SimVLM: Simple Visual Language Model Pretraining with Weak Supervision. [paper]
- FILIP: Fine-grained Interactive Language-Image Pre-Training. [paper]
- Learning to Prompt for Vision-Language Models. [paper] [code]
- CPT: Colorful Prompt Tuning for Pre-trained Vision-Language Models. [paper]
- NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion. [paper]
- Prompting Visual-Language Models for Efficient Video Understanding. [paper]
- A Fistful of Words: Learning Transferable Visual Models from Bag-of-Words Supervision. [paper]
- ERNIE-ViLG: Unified Generative Pre-training for Bidirectional Vision-Language Generation. [paper]
- Sound and Visual Representation Learning with Multiple Pretraining Tasks. [paper]
- Self-Training Vision Language BERTs with a Unified Conditional Model. [paper]
- Distilled Dual-Encoder Model for Vision-Language Understanding. [paper]
- Multimodal Few-Shot Learning with Frozen Language Models. [paper] [code]
- VidLanKD: Improving Language Understanding via Video-Distilled Knowledge Transfer. [paper] [code]
- TriBERT: Full-body Human-centric Audio-visual Representation Learning for Visual Sound Separation. [paper] [code]
- Data Efficient Masked Language Modeling for Vision and Language. [paper] (Findings)
- Vision-and-Language or Vision-for-Language? On Cross-Modal Influence in Multimodal Transformers. [paper]
- LocTex: Learning Data-Efficient Visual Representations from Localized Textual Supervision. [paper]
- UniT: Multimodal Multitask Learning with a Unified Transformer. [paper] [code]
- ROSITA: Enhancing Vision-and-Language Semantic Alignments via Cross- and Intra-modal Knowledge Integration. [paper]
- Product-oriented Machine Translation with Cross-modal Cross-lingual Pre-training. [paper] [code]
- Knowledge Perceived Multi-modal Pretraining in E-commerce. [paper]
- Learning Transferable Visual Models From Natural Language Supervision. [paper] [code]
- Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision. [paper]
- Unifying Vision-and-Language Tasks via Text Generation. [paper] [code]
- ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision. [paper] [code]
- LayoutLMv2: Multi-modal Pre-training for Visually-rich Document Understanding. [paper]
- KM-BART: Knowledge Enhanced Multimodal BART for Visual Commonsense Generation. [paper]
- E2E-VLP: End-to-End Vision-Language Pre-training Enhanced by Visual Learning. [paper]
- Multi-stage Pre-training over Simplified Multimodal Pre-training Models.
- UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning. [paper] [code]
- TCIC: Theme Concepts Learning Cross Language and Vision for Image Captioning. [paper]
- UIBert: Learning Generic Multimodal Representations for UI Understanding. [paper]
- LightningDOT: Pre-training Visual-Semantic Embeddings for Real-Time Image-Text Retrieval. [paper]
- Unsupervised Vision-and-Language Pre-training Without Parallel Images and Captions. [paper]
- Cross-lingual Cross-modal Pretraining for Multimodal Retrieval. [paper]
- Multilingual Multimodal Pre-training for Zero-Shot Cross-Lingual Transfer of Vision-Language Models. [paper]
- DeCEMBERT: Learning from Noisy Instructional Videos via Dense Captions and Entropy Minimization. [paper]
- M3P: Learning Universal Representations via Multitask Multilingual Multimodal Pre-training. [paper] [code]
- VinVL: Revisiting Visual Representations in Vision-Language Models. [paper] [code]
- ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph. [paper]
- Uni-EDEN: Universal Encoder-Decoder Network by Multi-Granular Vision-Language Pre-training. [paper]
- Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision. [paper] [code]
- RATT: Recurrent Attention to Transient Tasks for Continual Image Captioning. [paper]
- Diverse Image Captioning with Context-Object Split Latent Spaces. [paper]
- Prophet Attention: Predicting Attention with Future Attention for Improved Image Captioning. [paper]
- Structural Semantic Adversarial Active Learning for Image Captioning. [paper]
- Iterative Back Modification for Faster Image Captioning. [paper]
- Bridging the Gap between Vision and Language Domains for Improved Image Captioning. [paper]
- Hierarchical Scene Graph Encoder-Decoder for Image Paragraph Captioning. [paper]
- Improving Intra- and Inter-Modality Visual Relation for Image Captioning. [paper]
- ICECAP: Information Concentrated Entity-aware Image Captioning. [paper]
- Attacking Image Captioning Towards Accuracy-Preserving Target Words Removal. [paper]
- Multimodal Attention with Image Text Spatial Relationship for OCR-Based Image Captioning. [paper]
- Controllable Video Captioning with an Exemplar Sentence. [paper]
- Poet: Product-oriented Video Captioner for E-commerce. [paper] [code]
- Learning Semantic Concepts and Temporal Alignment for Narrated Video Procedural Captioning. [paper]
- Relational Graph Learning for Grounded Video Description Generation. [paper]
- Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks. [paper] [code]
- UNITER: UNiversal Image-TExt Representation Learning. [paper] [code]
- Human Consensus-Oriented Image Captioning. [paper]
- Non-Autoregressive Image Captioning with Counterfactuals-Critical Multi-Agent Learning. [paper]
- Recurrent Relational Memory Network for Unsupervised Image Captioning. [paper]
- Learning to Discretely Compose Reasoning Module Networks for Video Captioning. [paper] [code]
- SBAT: Video Captioning with Sparse Boundary-Aware Transformer. [paper]
- Hierarchical Attention Based Spatial-Temporal Graph-to-Sequence Learning for Grounded Video Description. [paper]
- Clue: Cross-modal Coherence Modeling for Caption Generation. [paper]
- Improving Image Captioning Evaluation by Considering Inter References Variance. [paper]
- Improving Image Captioning with Better Use of Caption. [paper] [code]
- MART: Memory-Augmented Recurrent Transformer for Coherent Video Paragraph Captioning. [paper] [code]
- 12-in-1: Multi-Task Vision and Language Representation Learning. [paper] [code]
- Visual Commonsense R-CNN. [paper] [code]
- ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data. [paper]
- Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers. [paper]
- Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training. [paper]
- ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks. [paper] [code]
- Answer-Me: Multi-Task Open-Vocabulary Visual Question Answering. [paper]
- SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning. [paper]
- RegionCLIP: Region-based Language-Image Pretraining. [paper]
- BridgeFormer: Bridging Video-text Retrieval with Multiple Choice Questions. [paper]
- ActionCLIP: A New Paradigm for Video Action Recognition. [paper] [code]
- CLIP4Caption: CLIP for Video Caption. [paper]
- CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval. [paper] [code]
We really appreciate the authors' contributions to this area.