This tutorial is based on the original ViLBERT repository: https://github.com/facebookresearch/vilbert-multi-task
Link to the tutorial: https://medium.com/@naserian.elahe/vilbert-a-model-for-learning-joint-representations-of-image-content-and-natural-language-47f56a313a79