
Awesome Multimodal Pre-training

A curated list of multimodal pre-trained models and related research areas.

Table of Contents

  • Survey Papers
  • Papers
  • VQA
  • VCR
  • Detection
  • Retrieval
  • Reference and Acknowledgement

Survey Papers

2022

arXiv 2022

  • VLP: A Survey on Vision-Language Pre-training. [paper]
  • A Survey of Vision-Language Pre-Trained Models. [paper]

Papers

2022

arXiv 2022

  • Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents. [paper]
  • data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language. [paper]
  • CLIP-TD: CLIP Targeted Distillation for Vision-Language Tasks. [paper]
  • BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. [paper] [code]
  • MVP: Multi-Stage Vision-Language Pre-Training via Multi-Level Semantic Alignment. [paper]
  • VC-GPT: Visual Conditioned GPT for End-to-End Generative Vision-and-Language Pre-training. [paper]
  • OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework. [paper]
  • Wukong: 100 Million Large-scale Chinese Cross-modal Pre-training Dataset and A Foundation Framework. [paper]
  • HighMMT: Towards Modality and Task Generalization for High-Modality Representation Learning. [paper] [code]
  • Mind the Gap: Understanding the Modality Gap in Multi-modal Contrastive Representation Learning. [paper] [code]
  • Making the Most of Text Semantics to Improve Biomedical Vision-Language Processing. [paper]
  • iCAR: Bridging Image Classification and Image-text Alignment for Visual Recognition. [paper]
  • Multimodal Adaptive Distillation for Leveraging Unimodal Encoders for Vision-Language Tasks. [paper]
  • PreTraM: Self-Supervised Pre-training via Connecting Trajectory and Map. [paper]
  • A Multi-level Alignment Training Scheme for Video-and-Language Grounding. [paper]
  • Contrastive Language-Action Pre-training for Temporal Localization. [paper]
  • MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval. [paper]
  • CoCa: Contrastive Captioners are Image-Text Foundation Models. [paper]
  • i-Code: An Integrative and Composable Multimodal Learning Framework. [paper]
  • Language Models Can See: Plugging Visual Controls in Text Generation. [paper] [code]
  • Data Determines Distributional Robustness in Contrastive Language Image Pre-training (CLIP). [paper] [code]
  • PyramidCLIP: Hierarchical Feature Alignment for Vision-language Model Pretraining. [paper]
  • Flamingo: a Visual Language Model for Few-Shot Learning. [paper]
  • Zero and R2D2: A Large-scale Chinese Cross-modal Benchmark and A Vision-Language Framework. [paper]
  • One Model, Multiple Modalities: A Sparsely Activated Approach for Text, Sound, Image, Video and Code. [paper]
  • Unsupervised Prompt Learning for Vision-Language Models. [paper] [code]

CVPR 2022

  • Vision-Language Pre-Training with Triple Contrastive Learning. [paper] [code]
  • Multi-modal Alignment using Representation Codebook. [paper]
  • Unsupervised Vision-and-Language Pre-training via Retrieval-based Multi-Granular Alignment. [paper] [code]
  • Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions. [paper] [code]
  • Towards General Purpose Vision Systems. [paper] [code]
  • Are Multimodal Transformers Robust to Missing Modality? [paper]

ICLR 2022

  • How Much Can CLIP Benefit Vision-and-Language Tasks? [paper] [code]
  • Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm. [paper] [code]
  • Evaluating language-biased image classification based on semantic representations. [paper]
  • SimVLM: Simple Visual Language Model Pretraining with Weak Supervision. [paper]
  • FILIP: Fine-grained Interactive Language-Image Pre-Training. [paper]

2021

arXiv 2021

  • Learning to Prompt for Vision-Language Models. [paper] [code]
  • CPT: Colorful Prompt Tuning for Pre-trained Vision-Language Models. [paper]
  • NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion.
  • Prompting Visual-Language Models for Efficient Video Understanding. [paper]
  • A Fistful of Words: Learning Transferable Visual Models from Bag-of-Words Supervision. [paper]
  • ERNIE-ViLG: Unified Generative Pre-training for Bidirectional Vision-Language Generation. [paper]
  • Sound and Visual Representation Learning with Multiple Pretraining Tasks. [paper]
  • Self-Training Vision Language BERTs with a Unified Conditional Model. [paper]
  • Distilled Dual-Encoder Model for Vision-Language Understanding. [paper]

NeurIPS 2021

  • Multimodal Few-Shot Learning with Frozen Language Models. [paper] [code]
  • VidLanKD: Improving Language Understanding via Video-Distilled Knowledge Transfer. [paper] [code]
  • TriBERT: Full-body Human-centric Audio-visual Representation Learning for Visual Sound Separation. [paper] [code]

EMNLP 2021

  • Data Efficient Masked Language Modeling for Vision and Language. [paper] (Findings)
  • Vision-and-Language or Vision-for-Language? On Cross-Modal Influence in Multimodal Transformers. [paper]

ICCV 2021

  • LocTex: Learning Data-Efficient Visual Representations from Localized Textual Supervision. [paper]
  • UniT: Multimodal Multitask Learning with a Unified Transformer. [paper] [code]

ACMMM 2021

  • ROSITA: Enhancing Vision-and-Language Semantic Alignments via Cross- and Intra-modal Knowledge Integration. [paper]
  • Product-oriented Machine Translation with Cross-modal Cross-lingual Pre-training. [paper] [code]
  • Knowledge Perceived Multi-modal Pretraining in E-commerce. [paper]

ICML 2021

  • Learning Transferable Visual Models From Natural Language Supervision. [paper] [code] (a minimal sketch of the CLIP-style contrastive objective follows this list)
  • Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision. [paper]
  • Unifying Vision-and-Language Tasks via Text Generation. [paper] [code]
  • ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision. [paper] [code]
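
Several entries in this list (CLIP, ALIGN, FILIP, PyramidCLIP, DeCLIP) share the same dual-encoder contrastive recipe. Below is a minimal sketch of that symmetric image-text contrastive loss, assuming PyTorch, batched encoder outputs, and an illustrative temperature of 0.07; it is a simplified illustration, not any single paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb: torch.Tensor,
                     text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """image_emb, text_emb: (batch, dim) outputs of the two encoders;
    row i of each tensor comes from the same image-text pair."""
    # L2-normalize so dot products are cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix; the diagonal holds matched pairs.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions: image-to-text and text-to-image.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```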

ACL 2021

  • LayoutLMv2: Multi-modal Pre-training for Visually-rich Document Understanding. [paper]
  • KM-BART: Knowledge Enhanced Multimodal BART for Visual Commonsense Generation. [paper]
  • E2E-VLP: End-to-End Vision-Language Pre-training Enhanced by Visual Learning. [paper]
  • Multi-stage Pre-training over Simplified Multimodal Pre-training Models.
  • UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning. [paper] [code]

IJCAI 2021

Image Captioning
  • TCIC: Theme Concepts Learning Cross Language and Vision for Image Captioning. [paper]
  • UIBert: Learning Generic Multimodal Representations for UI Understanding. [paper]

NAACL 2021

  • LightningDOT: Pre-training Visual-Semantic Embeddings for Real-Time Image-Text Retrieval. [paper]
  • Unsupervised Vision-and-Language Pre-training Without Parallel Images and Captions. [paper]
  • Cross-lingual Cross-modal Pretraining for Multimodal Retrieval. [paper]
  • Multilingual Multimodal Pre-training for Zero-Shot Cross-Lingual Transfer of Vision-Language Models. [paper]
  • DeCEMBERT: Learning from Noisy Instructional Videos via Dense Captions and Entropy Minimization. [paper]

CVPR 2021

  • M3P: Learning Universal Representations via Multitask Multilingual Multimodal Pre-training. [paper] [code]
  • VinVL: Revisiting Visual Representations in Vision-Language Models. [paper] [code]

AAAI 2021

Single-Stream
  • ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph. [paper]

ToMM 2021

  • Uni-EDEN: Universal Encoder-Decoder Network by Multi-Granular Vision-Language Pre-training. [paper]

2020

EMNLP 2020

  • Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision. [paper] [code]

NeurIPS 2020

Image Captioning
  • RATT: Recurrent Attention to Transient Tasks for Continual Image Captioning. [paper]
  • Diverse Image Captioning with Context-Object Split Latent Spaces. [paper]
  • Prophet Attention: Predicting Attention with Future Attention for Improved Image Captioning. [paper]

ACMMM 2020

Image Captioning
  • Structural Semantic Adversarial Active Learning for Image Captioning. [paper]
  • Iterative Back Modification for Faster Image Captioning. [paper]
  • Bridging the Gap between Vision and Language Domains for Improved Image Captioning. [paper]
  • Hierarchical Scene Graph Encoder-Decoder for Image Paragraph Captioning. [paper]
  • Improving Intra- and Inter-Modality Visual Relation for Image Captioning. [paper]
  • ICECAP: Information Concentrated Entity-aware Image Captioning. [paper]
  • Attacking Image Captioning Towards Accuracy-Preserving Target Words Removal. [paper]
Text Captioning
  • Multimodal Attention with Image Text Spatial Relationship for OCR-Based Image Captioning. [paper]
Video Captioning
  • Controllable Video Captioning with an Exemplar Sentence. [paper]
  • Poet: Product-oriented Video Captioner for E-commerce. [paper] [code]
  • Learning Semantic Concepts and Temporal Alignment for Narrated Video Procedural Captioning. [paper]
  • Relational Graph Learning for Grounded Video Description Generation. [paper]

ECCV 2020

Single-Stream
  • Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks. [paper] [code]
  • UNITER: UNiversal Image-TExt Representation Learning. [paper] [code]

IJCAI 2020

Image Captioning
  • Human Consensus-Oriented Image Captioning. [paper]
  • Non-Autoregressive Image Captioning with Counterfactuals-Critical Multi-Agent Learning. [paper]
  • Recurrent Relational Memory Network for Unsupervised Image Captioning. [paper]
Video Captioning
  • Learning to Discretely Compose Reasoning Module Networks for Video Captioning. [paper] [code]
  • SBAT: Video Captioning with Sparse Boundary-Aware Transformer. [paper]
  • Hierarchical Attention Based Spatial-Temporal Graph-to-Sequence Learning for Grounded Video Description. [paper]

ACL 2020

Image Captioning
  • Clue: Cross-modal Coherence Modeling for Caption Generation. [paper]
  • Improving Image Captioning Evaluation by Considering Inter References Variance. [paper]
  • Improving Image Captioning with Better Use of Caption. [paper] [code]
Video Captioning
  • MART: Memory-Augmented Recurrent Transformer for Coherent Video Paragraph Captioning. [paper] [code]

CVPR 2020

arXiv 2020

Single-Stream
  • ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data. [paper]
  • Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers. [paper]

ICLR 2020

Single-Stream
  • VL-BERT: Pre-training of Generic Visual-Linguistic Representations. [paper] [code]

AAAI 2020

Single-Stream
  • Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training. [paper]


2019

arXiv 2019

Single-Stream
  • VisualBERT: A Simple and Performant Baseline for Vision and Language. [paper] [code]

EMNLP 2019

Cross-Stream
  • LXMERT: Learning Cross-Modality Encoder Representations from Transformers. [paper] [code]

NeurIPS 2019

Cross-Stream
  • ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks. [paper] [code]

VQA

arXiv 2022

  • Answer-Me: Multi-Task Open-Vocabulary Visual Question Answering. [paper]

VCR

AAAI 2022

  • SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning. [paper]

Detection

arXiv 2021

  • RegionCLIP: Region-based Language-Image Pretraining. [paper]

Retrieval

arXiv 2022

  • BridgeFormer: Bridging Video-text Retrieval with Multiple Choice Questions. [paper]
  • ActionCLIP: A New Paradigm for Video Action Recognition. [paper] [code]
  • CLIP4Caption: CLIP for Video Caption. [paper]
  • CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval. [paper] [code] (a minimal retrieval sketch follows this list)
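
At inference time, dual-encoder models of the kind CLIP4Clip studies perform retrieval by embedding the query and all candidates independently, then ranking candidates by cosine similarity. Below is a minimal sketch of that ranking step, assuming PyTorch; `query_emb` and `video_emb` are hypothetical placeholders for pooled text and video encoder outputs.

```python
import torch
import torch.nn.functional as F

def rank_videos(query_emb: torch.Tensor,
                video_emb: torch.Tensor,
                top_k: int = 5) -> torch.Tensor:
    """query_emb: (dim,); video_emb: (num_videos, dim).
    Returns the indices of the top-k most similar candidates."""
    # Normalize both sides so the dot product is cosine similarity.
    q = F.normalize(query_emb, dim=-1)
    v = F.normalize(video_emb, dim=-1)
    sims = v @ q  # (num_videos,) similarity scores
    return sims.topk(min(top_k, sims.numel())).indices
```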

Reference and Acknowledgement

We really appreciate their contributions to this area.
