[Paperlist] Awesome paper list of multimodal dialog, including methods, datasets and metrics

Yuco-Z/Awesome-Multi-Modal-Dialog

Awesome Paper List of Multi-Modal Dialogue

Papers, codes and resources about multi-modal dialogue, including methods, datasets and related metrics.

We split the multi-modal dialogue task into four subtasks: Visual-Grounded Dialogue (VGD, including Visual Question Answering, or VQA), Visual Question Generation, Multimodal Conversation (MMC) and Visual Navigation (VN).

Datasets

| Dataset | Task | Publisher | Author |
| --- | --- | --- | --- |
| VQA: Visual Question Answering | visual QA (VQA) | ICCV 2015 | Virginia Tech |
| Visual Dialog | visual QA (VQA) | CVPR 2017 | VisualDialog Org. |
| GuessWhat?! Visual object discovery through multi-modal dialogue | visual QA (VQA) | CVPR 2017 | Montreal Univ. |
| Visual Reference Resolution using Attention Memory for Visual Dialog | visual QA (VQA) | NIPS 2017 | Postech & Disney |
| CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning | visual QA (VQA) | CVPR 2017 | Stanford |
| Image-grounded conversations: Multimodal context for natural question and response generation (IGC) | visual QA (VQA) | IJCNLP 2017 | Rochester & Microsoft |
| Towards Building Large Scale Multimodal Domain-Aware Conversation Systems (MMD) | multimodal conv. (MMC) | AAAI 2018 | IBM |
| Embodied Question Answering (EQA) | visual QA (VQA) | CVPR 2018 | Facebook |
| Talk the walk: Navigating new york city through grounded dialogue | visual navigation (VN) | ICLR 2019 | MILA |
| Vision-and-Dialog Navigation | visual navigation (VN) | CoRL 2019 | UoW |
| CLEVR-Dialog: A Diagnostic Dataset for Multi-Round Reasoning in Visual Dialog | visual-grounded dialog (VGD) | NAACL 2019 | CMU |
| Image-Chat: Engaging Grounded Conversations | visual-grounded dialog (VGD) | ACL 2020 | Facebook |
| OpenViDial | visual-grounded dialog (VGD) | arXiv 2020 | ShannonAI |
| Situated and Interactive Multimodal Conversations (SIMMC) | multimodal conv./visual navigation | COLING 2020 | Facebook |
| PhotoChat: A human-human dialogue dataset with photo sharing behavior for joint image-text modeling | multimodal conv. (MMC) | ACL 2021 | Google |
| MMConv: An Environment for Multimodal Conversational Search across Multiple Domains | multimodal conv. (MMC) | SIGIR 2021 | NUS |
| Constructing Multi-Modal Dialogue Dataset by Replacing Text with Semantically Relevant Images | multimodal conv. (MMC) | ACL 2021 | KAIST |
| OpenViDial 2.0: A Larger-Scale, Open-Domain Dialogue Generation Dataset with Visual Contexts | visual-grounded dialog (VGD) | arXiv 2021 | ShannonAI |
| MMChat: Multi-Modal Chat Dataset on Social Media | visual-grounded dialog (VGD) | LREC 2022 | Alibaba |
| MSCTD: A Multimodal Sentiment Chat Translation Dataset | visual-grounded dialog (VGD) | arXiv 2022 | Tencent |

Methods

We roughly classify the methods by learning paradigm (where applicable) as Fusion-Based (FB) or Attention-Based (AB).

  • Fusion-Based (FB): Simple concatenation of multi-modal information at the model input.
  • Attention-Based (AB): Co-attention between different modalities to learn their relations.
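
To make the FB/AB distinction concrete, here is a minimal sketch in plain Python. It is not taken from any paper in this list: the function names (`concat_fusion`, `attend`, `co_attention`) are hypothetical, and scaled dot-product attention is assumed as one common way to realize co-attention. Real models use learned projections and tensor libraries.

```python
import math

def concat_fusion(text_vec, image_vec):
    """Fusion-based (FB): join the two modality vectors into one input vector."""
    return list(text_vec) + list(image_vec)

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(queries, keys_values):
    """For each query vector, return an attention-weighted average of keys_values.

    Assumption: scaled dot-product scores, no learned projections.
    """
    dim = len(keys_values[0])
    attended = []
    for q in queries:
        weights = softmax([dot(q, k) / math.sqrt(dim) for k in keys_values])
        attended.append(
            [sum(w * k[i] for w, k in zip(weights, keys_values)) for i in range(dim)]
        )
    return attended

def co_attention(text_feats, image_feats):
    """Attention-based (AB): each modality attends to the other."""
    text_to_image = attend(text_feats, image_feats)   # one vector per text token
    image_to_text = attend(image_feats, text_feats)   # one vector per image region
    return text_to_image, image_to_text

if __name__ == "__main__":
    text = [[1.0, 0.0]]                 # one toy token feature
    image = [[1.0, 0.0], [0.0, 1.0]]    # two toy region features
    print(concat_fusion(text[0], image[0]))
    print(co_attention(text, image))
```

In the FB case the model sees one long vector and must learn cross-modal relations implicitly; in the AB case the relation between each token and each region is computed explicitly as an attention weight.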

Visual Grounded Dialogue

Visual grounded dialogue considers only one image per dialogue session, and the whole session is constrained to that image. It is also known as the Visual Dialog task.

| Title | Dataset Used | Publisher | Code | Class |
| --- | --- | --- | --- | --- |
| Visual Dialog | VisDial v0.9 | ICCV 2017 | CODE | FB |
| Open Domain Dialogue Generation with Latent Images | Image-Chat; Reddit | AAAI 2021 | CODE | FB |
| Maria: A Visual Experience Powered Conversational Agent | Reddit; Conceptual Captions | ACL 2021 | CODE | FB |
| Learning to Ground Visual Objects for Visual Dialog | VisDial v0.9, v1.0; MS-COCO | ACL 2021 | CODE | FB |
| Multi-Modal Open-Domain Dialogue | Image-Chat; ConvAI2; EmpatheticDialogues; Wizard of Wikipedia; BlendedSkillTalk | EMNLP 2021 | CODE | FB |
| Iterative Context-Aware Graph Inference for Visual Dialog | VisDial v0.9, v1.0 | CVPR 2020 | CODE | FB |
| Reasoning Visual Dialog with Sparse Graph Learning and Knowledge Transfer | VisDial v1.0 | EMNLP 2021 | CODE | FB |
| VD-BERT: A Unified Vision and Dialog Transformer with BERT | VisDial v0.9, v1.0 | EMNLP 2020 | CODE | FB |
| GuessWhat?! Visual object discovery through multi-modal dialogue | Guessing; MNIST Dialog | CVPR 2017 | CODE | FB |
| Ask No More: Deciding when to guess in referential visual dialogue | Guessing | COLING 2018 | CODE | FB |
| Visual Reference Resolution using Attention Memory for Visual Dialog | MNIST Dialog; VisDial v1.0 | NIPS 2017 | CODE | AB |
| Visual Coreference Resolution in Visual Dialog using Neural Module Networks | MNIST Dialog; VisDial | ECCV 2018 | CODE | AB |
| Dual Attention Networks for Visual Reference Resolution in Visual Dialog | VisDial v1.0, v0.9 | EMNLP 2019 | CODE | AB |
| Efficient Attention Mechanism for Visual Dialog that can Handle All the Interactions between Multiple Inputs (LTMI) | VisDial v1.0 | ECCV 2020 | CODE | AB |
| Large-scale Pretraining for Visual Dialog: A Simple State-of-the-Art Baseline | Wikipedia; BooksCorpus; Conceptual Captions; VQA; VisDial v1.0 | ECCV 2019 | CODE | AB |
| Multi-step Reasoning via Recurrent Dual Attention for Visual Dialog | VisDial v1.0; MS-COCO | ACL 2019 | CODE | AB |
| Are You Talking to Me? Reasoned Visual Dialog Generation through Adversarial Learning | VisDial v0.9 | CVPR 2018 | CODE | AB |
| Best of Both Worlds: Transferring Knowledge from Discriminative Learning to a Generative Visual Dialog Model | VisDial v0.9 | NeurIPS 2017 | CODE | AB |
| Multi-View Attention Network for Visual Dialog | VisDial v1.0, v0.9 | arXiv 2020 | CODE | AB |
| Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning | VisDial | ICCV 2017 | CODE | FB |
| Multimodal Hierarchical Reinforcement Learning Policy for Task-Oriented Visual Dialog | VisDial | SIGDIAL 2018 | CODE | FB |
| Beyond task success: A closer look at jointly learning to see, ask, and GuessWhat | GuessWhat?! | NAACL 2019 | CODE | FB |

Multi-modal Conversation

Multi-modal conversation (MMC) aims at conducting conversations with multiple images. Models should understand multiple images and/or generate multi-modal responses during conversation.

| Title | Dataset Used | Publisher | Code | Class |
| --- | --- | --- | --- | --- |
| Multimodal Dialogue Response Generation | PhotoChat; Reddit; YFCC100M | ACL 2022 | CODE | FB |
| Constructing Multi-Modal Dialogue Dataset by Replacing Text with Semantically Relevant Images | DailyDialog; Empathetic Dialog; Persona Chat; MS-COCO; Flickr30K | ACL 2021 | CODE | FB |
| Towards Enriching Responses with Crowd-sourced Knowledge for Task-oriented Dialogue | MMConv | MuCAI 2021 | CODE | FB |
| Multimodal Dialog System: Generating Responses via Adaptive Decoders | MMD | MM 2019 | CODE | FB |
| Multimodal Dialog Systems via Capturing Context-aware Dependencies of Semantic Elements | MMD | MM 2020 | CODE | FB |
| Multimodal Dialog System: Relational Graph-based Context-aware Question Understanding | MMD | MM 2021 | CODE | FB |
| User Attention-guided Multimodal Dialog Systems | MMD | SIGIR 2019 | CODE | FB |
| Text is NOT Enough: Integrating Visual Impressions into Open-domain Dialogue Generation | DailyDialog; Flickr30K; PersonaChat | MM 2021 | CODE | AB |

Question Generation

The question generation task is to generate questions, rather than responses, based on given images. It is similar to visual-grounded dialogue.

| Title | Dataset Used | Publisher | Code |
| --- | --- | --- | --- |
| Category-Based Strategy-Driven Question Generator for Visual Dialogue | GuessWhat?! | CCL 2021 | CODE |
| Visual Dialogue State Tracking for Question Generation | GuessWhat?! | AAAI 2020 | CODE |
| Answer-Driven Visual State Estimator for Goal-Oriented Visual Dialogue | GuessWhat?!; MS-COCO | MM 2020 | CODE |
| Goal-Oriented Visual Question Generation via Intermediate Rewards | GuessWhat?! | ECCV 2018 | CODE |
| Learning goal-oriented visual dialog via tempered policy gradient | GuessWhat?! | SLT 2018 | CODE |
| Information maximizing visual question generation | VQG | ICCV 2019 | CODE |

Visual Navigation

Visual navigation focuses on guiding users from their starting point to their destination, given information about the surroundings. Here we mainly collect methods that involve conversational guidance in natural language.

| Title | Dataset Used | Publisher | Code |
| --- | --- | --- | --- |
| Learning to interpret natural language navigation instructions from observations | Walk the Talk | AAAI 2011 | CODE |
| Talk the walk: Navigating new york city through grounded dialogue | Talk the Walk; GuessWhat?! | arXiv 2018 | CODE |
| Mapping Navigation Instructions to Continuous Control Actions with Position-Visitation Prediction | LANI | CoRL 2018 | CODE |
| Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments | R2R; VQA; Matterport3D | CVPR 2018 | CODE |
| Embodied Question Answering | EQA; VQA | CVPR 2018 | CODE |
| IQA: Visual Question Answering in Interactive Environments | IQUAD; VQA; AI2-THOR | CVPR 2018 | CODE |
| Touchdown: Natural language navigation and spatial reasoning in visual street environments | Touchdown; VQA; ReferItGame; Google Refexp; Talk the Walk | CVPR 2019 | CODE |
| Stay on the Path: Instruction Fidelity in Vision-and-Language Navigation | R2R; R4R | ACL 2019 | CODE |
| Learning to navigate unseen environments: Back translation with environmental dropout | Matterport3D | NAACL 2019 | CODE |

Metrics

A summary paper of visual dialogue metrics: A Revised Generative Evaluation of Visual Dialogue, CODE.

We split the related metrics into rank-based and generate-based metrics.

  • Rank-based: measures the quality of responses retrieved from a set of response candidates.
  • Generate-based: measures the quality of responses generated by the model.

Rank-based Metrics

| Metrics | Better indicator | Explanation |
| --- | --- | --- |
| Mean | $\downarrow$ | mean rank of the ground-truth response among the candidates |
| R@k | $\uparrow$ | fraction of cases in which the ground-truth response is ranked in the top k |
| Mean Reciprocal Rank (MRR) | $\uparrow$ | mean reciprocal rank of the ground-truth response in the ranked responses |
| Normalized Discounted Cumulative Gain@k (NDCG@k) | $\uparrow$ | computed from a relevance score list that assigns each of the 100 candidate responses a score in [0, 1] based on its semantic similarity to the ground-truth response |
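
The rank-based metrics above can be sketched in a few lines of Python. This is an illustrative sketch, not the evaluation code of any benchmark: the function names are hypothetical, ranks are assumed to be 1-indexed, and NDCG@k is written in its textbook form (linear gain, log2 discount), which may differ in detail from a specific benchmark's implementation.

```python
import math

def mean_rank(ranks):
    """Mean rank of the ground-truth response (lower is better)."""
    return sum(ranks) / len(ranks)

def recall_at_k(ranks, k):
    """R@k: fraction of cases where the ground truth is ranked in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

def mean_reciprocal_rank(ranks):
    """MRR: average of 1/rank of the ground-truth response."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def ndcg_at_k(relevances_in_ranked_order, k):
    """NDCG@k for one case.

    relevances_in_ranked_order[i] is the relevance (in [0, 1]) of the
    candidate that the model placed at position i (0-indexed).
    """
    def dcg(rels):
        return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels[:k]))

    ideal = sorted(relevances_in_ranked_order, reverse=True)
    ideal_dcg = dcg(ideal)
    return dcg(relevances_in_ranked_order) / ideal_dcg if ideal_dcg > 0 else 0.0

if __name__ == "__main__":
    ranks = [1, 2, 4]  # toy ranks of the ground truth in three dialogs
    print(mean_rank(ranks), recall_at_k(ranks, 2), mean_reciprocal_rank(ranks))
    print(ndcg_at_k([1.0, 0.0, 0.5], k=3))
```

Mean, R@k and MRR only need the rank of the single ground-truth response, while NDCG@k needs a relevance score for every candidate, so it can reward a model that ranks several acceptable responses highly.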

Generate-based Metrics

| Metrics | Better indicator |
| --- | --- |
| BLEU-k | $\uparrow$ |
| ROUGE-k/L | $\uparrow$ |
| METEOR | $\uparrow$ |
| CIDEr | $\uparrow$ |
| Embedding Average | $\uparrow$ |
| Embedding Extrema | $\uparrow$ |
| Embedding Greedy | $\uparrow$ |
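
The three embedding-based metrics are less self-explanatory than the n-gram ones, so here is a minimal sketch of their usual definitions. It assumes each sentence has already been mapped to a list of word vectors; the toy vectors and function names below are hypothetical, and a real evaluation would use pretrained word embeddings.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)

def average_vec(word_vecs):
    """Sentence vector = per-dimension mean of the word vectors."""
    return [sum(col) / len(word_vecs) for col in zip(*word_vecs)]

def extrema_vec(word_vecs):
    """Sentence vector = per-dimension value with the largest magnitude."""
    return [max(col, key=abs) for col in zip(*word_vecs)]

def embedding_average(hyp, ref):
    return cosine(average_vec(hyp), average_vec(ref))

def embedding_extrema(hyp, ref):
    return cosine(extrema_vec(hyp), extrema_vec(ref))

def _greedy_one_way(src, tgt):
    # Match each source word to its most similar target word, then average.
    return sum(max(cosine(s, t) for t in tgt) for s in src) / len(src)

def embedding_greedy(hyp, ref):
    """Greedy matching, averaged over both directions."""
    return (_greedy_one_way(hyp, ref) + _greedy_one_way(ref, hyp)) / 2

if __name__ == "__main__":
    # Hypothetical 2-d word vectors, for illustration only.
    toy = {"a": [1.0, 0.0], "dog": [0.0, 1.0], "cat": [0.2, 0.9]}
    hyp = [toy[w] for w in "a cat".split()]
    ref = [toy[w] for w in "a dog".split()]
    print(embedding_average(hyp, ref))
    print(embedding_extrema(hyp, ref))
    print(embedding_greedy(hyp, ref))
```

All three return a cosine-style similarity, so higher is better, matching the arrows in the table.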

Model-based Metrics

Model-based metrics are scored by off-the-shelf language models or multimodal language models. Responses are typically rated along several dimensions:

| Metrics | Better indicator |
| --- | --- |
| Fluency | $\uparrow$ |
| Relevance | $\uparrow$ |
| Knowledge | $\uparrow$ |
| Correctness | $\uparrow$ |

Related Project

TBD
