Several deep learning models have been reviewed for captioning tasks, improving the possibilities of textual understanding and captioning of both simple and complex images. In this project, a supervised deep learning model is proposed for textually describing the differences between a pair of images. The model combines deep learning techniques with natural language processing to explore the feasibility of capturing, aligning, and computing the differences between the features of two images and producing a textual description of those differences. First, a Siamese encoder consisting of convolutional neural networks extracts a feature vector that captures the differences between the pair of images. An attention model then selects the salient features of this vector. The resulting feature vector is processed by a long short-term memory (LSTM) decoder to generate the textual description. The model is evaluated on the Spot-the-Diff dataset, which contains pairs of images with appropriate annotations.
Keywords: Siamese encoder, convolutional neural networks, attention model, long short-term memory
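To make the encoder-attention-decoder pipeline concrete, below is a minimal PyTorch sketch of the described architecture. All module names, layer sizes, and the vocabulary size are illustrative assumptions, not the authors' actual configuration; the shared-weight CNN, the soft attention over the difference features, and the LSTM decoder follow the structure stated in the abstract.

```python
# A minimal, self-contained sketch of the pipeline described above.
# Dimensions and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

class DiffCaptioner(nn.Module):
    def __init__(self, vocab_size=10000, feat_dim=256, hidden_dim=512):
        super().__init__()
        # Siamese encoder: one CNN applied to both images (shared weights).
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Attention scores one weight per spatial location of the
        # difference feature map.
        self.attn = nn.Linear(feat_dim, 1)
        self.embed = nn.Embedding(vocab_size, feat_dim)
        # LSTM decoder conditioned on the attended difference vector.
        self.lstm = nn.LSTM(feat_dim * 2, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, img_a, img_b, captions):
        # Shared-weight encoding of both images, then elementwise difference.
        fa, fb = self.cnn(img_a), self.cnn(img_b)          # (B, C, H, W)
        diff = (fb - fa).flatten(2).transpose(1, 2)        # (B, H*W, C)
        # Soft attention over spatial locations of the difference map.
        weights = torch.softmax(self.attn(diff), dim=1)    # (B, H*W, 1)
        context = (weights * diff).sum(dim=1)              # (B, C)
        # Concatenate the attended context to every word embedding and
        # decode token logits with the LSTM.
        emb = self.embed(captions)                         # (B, T, C)
        ctx = context.unsqueeze(1).expand(-1, emb.size(1), -1)
        hidden, _ = self.lstm(torch.cat([emb, ctx], dim=-1))
        return self.out(hidden)                            # (B, T, vocab)
```

In this sketch, training would minimize cross-entropy between the predicted token logits and the ground-truth difference annotations; at inference, captions would be generated token by token from the attended difference vector.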