We introduce MaXM, a test-only multilingual visual question answering benchmark in 7 diverse languages: English (en), French (fr), Hindi (hi), Hebrew (iw), Romanian (ro), Thai (th), and Chinese (zh). The dataset is based on the images and captions from the Crossmodal-3600 dataset (XM3600). See our paper for further details.
Our approach to data generation is similar to VQ^2A, which was used to generate MAVERICS.
MaXM v1 (157KB, released on Feb 18, 2023)
Format (.json)
- dataset: str. Dataset name.
- version: str. Dataset version.
- split: str. Language ID.
- annotations: List of image-question-answers triplets, each of which has
  - image_id: str. Image ID.
  - image_url: str. Image URL.
  - qa_pairs: List of question-answer pairs, each of which has
    - question_id: str. Question ID.
    - question: str. Raw question.
    - answers: List of str. Ground-truth answers.
    - processed_answers: List of str. Processed ground-truth answers (i.e., tokenized answers).
    - is_collection: bool. "true" if the question is of the "Collection" type; "false" otherwise.
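As an illustration, here is a minimal sketch of reading one language split with Python's standard json module and walking the fields listed above. The filename is a placeholder for whichever split file you download; the field access simply follows the format description and is not an official loader.

```python
import json

# Placeholder path; substitute the MaXM JSON file you downloaded.
with open("maxm_v1.json", "r", encoding="utf-8") as f:
    data = json.load(f)

print(data["dataset"], data["version"], data["split"])

# Iterate over image-question-answers triplets.
for ann in data["annotations"]:
    image_id = ann["image_id"]
    image_url = ann["image_url"]
    for qa in ann["qa_pairs"]:
        print(qa["question_id"], qa["question"])
        print("  answers:", qa["answers"])
        print("  processed answers:", qa["processed_answers"])
        print("  is_collection:", qa["is_collection"])
```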
If you use this dataset in your research, please cite the original Crossmodal-3600 dataset and our paper:
Soravit Changpinyo, Linting Xue, Michal Yarom, Ashish V. Thapliyal, Idan Szpektor, Julien Amelot, Xi Chen, Radu Soricut. MaXM: Towards Multilingual Visual Question Answering. Findings of the Association for Computational Linguistics: EMNLP, 2023.
@inproceedings{changpinyo2023maxm,
  title = {{MaXM}: Towards Multilingual Visual Question Answering},
  author = {Changpinyo, Soravit and Xue, Linting and Yarom, Michal and Thapliyal, Ashish V. and Szpektor, Idan and Amelot, Julien and Chen, Xi and Soricut, Radu},
  booktitle = {Findings of the Association for Computational Linguistics: EMNLP},
  year = {2023},
}
For questions about the dataset, please create an issue in this repository. If you would like to share feedback or report concerns, please email schangpi@google.com.