A curated list of Multimodal Pretrained Models and related areas.
#### Surveys (arXiv 2022)
- VLP: A Survey on Vision-Language Pre-training. [paper]
- A Survey of Vision-Language Pre-Trained Models. [paper]
#### arXiv 2022
- Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents. [paper]
- data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language. [paper]
- CLIP-TD: CLIP Targeted Distillation for Vision-Language Tasks. [paper]
- BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. [paper] [code]
- MVP: Multi-Stage Vision-Language Pre-Training via Multi-Level Semantic Alignment. [paper]
- VC-GPT: Visual Conditioned GPT for End-to-End Generative Vision-and-Language Pre-training. [paper]
- Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework. [paper]
- Wukong: 100 Million Large-scale Chinese Cross-modal Pre-training Dataset and A Foundation Framework. [paper]
- HighMMT: Towards Modality and Task Generalization for High-Modality Representation Learning. [paper] [code]
- Mind the Gap: Understanding the Modality Gap in Multi-modal Contrastive Representation Learning. [paper] [code]
- Making the Most of Text Semantics to Improve Biomedical Vision-Language Processing. [paper]
- iCAR: Bridging Image Classification and Image-text Alignment for Visual Recognition. [paper]
- Multimodal Adaptive Distillation for Leveraging Unimodal Encoders for Vision-Language Tasks. [paper]
- PreTraM: Self-Supervised Pre-training via Connecting Trajectory and Map. [paper]
- A Multi-level Alignment Training Scheme for Video-and-Language Grounding. [paper]
- Contrastive Language-Action Pre-training for Temporal Localization. [paper]
- MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval. [paper]
- CoCa: Contrastive Captioners are Image-Text Foundation Models. [paper]
- i-Code: An Integrative and Composable Multimodal Learning Framework. [paper]
- Language Models Can See: Plugging Visual Controls in Text Generation. [paper] [code]
- Data Determines Distributional Robustness in Contrastive Language Image Pre-training (CLIP). [paper] [code]
- PyramidCLIP: Hierarchical Feature Alignment for Vision-language Model Pretraining. [paper]
- Flamingo: a Visual Language Model for Few-Shot Learning. [paper]
- Zero and R2D2: A Large-scale Chinese Cross-modal Benchmark and A Vision-Language Framework. [paper]
- One Model, Multiple Modalities: A Sparsely Activated Approach for Text, Sound, Image, Video and Code. [paper]
- Unsupervised Prompt Learning for Vision-Language Models. [paper] [code]
- Vision-Language Pre-Training with Triple Contrastive Learning. [paper] [code]
- Multi-modal Alignment using Representation Codebook. [paper]
- Unsupervised Vision-and-Language Pre-training via Retrieval-based Multi-Granular Alignment. [paper] [code]
- Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions. [paper] [code]
- Towards General Purpose Vision Systems. [paper] [code]
- Are Multimodal Transformers Robust to Missing Modality? [paper]
- How Much Can CLIP Benefit Vision-and-Language Tasks? [paper] [code]
- Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm. [paper] [code]
- Evaluating language-biased image classification based on semantic representations. [paper]
- SimVLM: Simple Visual Language Model Pretraining with Weak Supervision. [paper]
- FILIP: Fine-grained Interactive Language-Image Pre-Training. [paper]
- Learning to Prompt for Vision-Language Models. [paper] [code]
- CPT: Colorful Prompt Tuning for Pre-trained Vision-Language Models. [paper]
- NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion. [paper]
- Prompting Visual-Language Models for Efficient Video Understanding. [paper]
- A Fistful of Words: Learning Transferable Visual Models from Bag-of-Words Supervision. [paper]
- ERNIE-ViLG: Unified Generative Pre-training for Bidirectional Vision-Language Generation. [paper]
- Sound and Visual Representation Learning with Multiple Pretraining Tasks. [paper]
- Self-Training Vision Language BERTs with a Unified Conditional Model. [paper]
- Distilled Dual-Encoder Model for Vision-Language Understanding. [paper]
- Multimodal Few-Shot Learning with Frozen Language Models. [paper] [code]
- VidLanKD: Improving Language Understanding via Video-Distilled Knowledge Transfer. [paper] [code]
- TriBERT: Full-body Human-centric Audio-visual Representation Learning for Visual Sound Separation. [paper] [code]
- Data Efficient Masked Language Modeling for Vision and Language. [paper] (Findings)
- Vision-and-Language or Vision-for-Language? On Cross-Modal Influence in Multimodal Transformers. [paper]
- LocTex: Learning Data-Efficient Visual Representations from Localized Textual Supervision. [paper]
- UniT: Multimodal Multitask Learning with a Unified Transformer. [paper] [code]
- ROSITA: Enhancing Vision-and-Language Semantic Alignments via Cross- and Intra-modal Knowledge Integration. [paper]
- Product-oriented Machine Translation with Cross-modal Cross-lingual Pre-training. [paper] [code]
- Knowledge Perceived Multi-modal Pretraining in E-commerce. [paper]
- Learning Transferable Visual Models From Natural Language Supervision. [paper] [code]
- Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision. [paper]
- Unifying Vision-and-Language Tasks via Text Generation. [paper] [code]
- ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision. [paper] [code]
- LayoutLMv2: Multi-modal Pre-training for Visually-rich Document Understanding. [paper]
- KM-BART: Knowledge Enhanced Multimodal BART for Visual Commonsense Generation. [paper]
- E2E-VLP: End-to-End Vision-Language Pre-training Enhanced by Visual Learning. [paper]
- Multi-stage Pre-training over Simplified Multimodal Pre-training Models.
- UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning. [paper] [code]
- TCIC: Theme Concepts Learning Cross Language and Vision for Image Captioning. [paper]
- UIBert: Learning Generic Multimodal Representations for UI Understanding. [paper]
- LightningDOT: Pre-training Visual-Semantic Embeddings for Real-Time Image-Text Retrieval. [paper]
- Unsupervised Vision-and-Language Pre-training Without Parallel Images and Captions. [paper]
- Cross-lingual Cross-modal Pretraining for Multimodal Retrieval. [paper]
- Multilingual Multimodal Pre-training for Zero-Shot Cross-Lingual Transfer of Vision-Language Models. [paper]
- DeCEMBERT: Learning from Noisy Instructional Videos via Dense Captions and Entropy Minimization. [paper]
- M3P: Learning Universal Representations via Multitask Multilingual Multimodal Pre-training. [paper] [code]
- VinVL: Revisiting Visual Representations in Vision-Language Models. [paper] [code]
- ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph. [paper]
- Uni-EDEN: Universal Encoder-Decoder Network by Multi-Granular Vision-Language Pre-training. [paper]
- Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision. [paper] [code]
- RATT: Recurrent Attention to Transient Tasks for Continual Image Captioning. [paper]
- Diverse Image Captioning with Context-Object Split Latent Spaces. [paper]
- Prophet Attention: Predicting Attention with Future Attention for Improved Image Captioning. [paper]
- Structural Semantic Adversarial Active Learning for Image Captioning. [paper]
- Iterative Back Modification for Faster Image Captioning. [paper]
- Bridging the Gap between Vision and Language Domains for Improved Image Captioning. [paper]
- Hierarchical Scene Graph Encoder-Decoder for Image Paragraph Captioning. [paper]
- Improving Intra- and Inter-Modality Visual Relation for Image Captioning. [paper]
- ICECAP: Information Concentrated Entity-aware Image Captioning. [paper]
- Attacking Image Captioning Towards Accuracy-Preserving Target Words Removal. [paper]
- Multimodal Attention with Image Text Spatial Relationship for OCR-Based Image Captioning. [paper]
- Controllable Video Captioning with an Exemplar Sentence. [paper]
- Poet: Product-oriented Video Captioner for E-commerce. [paper] [code]
- Learning Semantic Concepts and Temporal Alignment for Narrated Video Procedural Captioning. [paper]
- Relational Graph Learning for Grounded Video Description Generation. [paper]
- Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks. [paper] [code]
- UNITER: UNiversal Image-TExt Representation Learning. [paper] [code]
- Human Consensus-Oriented Image Captioning. [paper]
- Non-Autoregressive Image Captioning with Counterfactuals-Critical Multi-Agent Learning. [paper]
- Recurrent Relational Memory Network for Unsupervised Image Captioning. [paper]
- Learning to Discretely Compose Reasoning Module Networks for Video Captioning. [paper] [code]
- SBAT: Video Captioning with Sparse Boundary-Aware Transformer. [paper]
- Hierarchical Attention Based Spatial-Temporal Graph-to-Sequence Learning for Grounded Video Description. [paper]
- Clue: Cross-modal Coherence Modeling for Caption Generation. [paper]
- Improving Image Captioning Evaluation by Considering Inter References Variance. [paper]
- Improving Image Captioning with Better Use of Caption. [paper] [code]
- MART: Memory-Augmented Recurrent Transformer for Coherent Video Paragraph Captioning. [paper] [code]
- 12-in-1: Multi-Task Vision and Language Representation Learning. [paper] [code]
- Visual Commonsense R-CNN. [paper] [code]
- ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data. [paper]
- Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers. [paper]
- Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training. [paper]
- ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks. [paper] [code]
- Answer-Me: Multi-Task Open-Vocabulary Visual Question Answering. [paper]
- SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning. [paper]
- RegionCLIP: Region-based Language-Image Pretraining. [paper]
- BridgeFormer: Bridging Video-text Retrieval with Multiple Choice Questions. [paper]
- ActionCLIP: A New Paradigm for Video Action Recognition. [paper] [code]
- CLIP4Caption: CLIP for Video Caption. [paper]
- CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval. [paper] [code]
We really appreciate the authors' contributions to this area.