Duplicate-Finder

A Siamese network model trained on Quora question dataset which detects duplicates with an Accuracy and F1 score of ~80%.

Trained using TRAX: An end-to-end library for Deep Learning that focuses on clear code and speed. It is actively used and maintained in the Google Brain team.

Dataset

The dataset consists of more than 400k question pairs from Quora, where each pair is labeled if the pair is duplicate or not. As per the need of this model only the duplicate pairs are used to train the model, which consists of around 150k pairs.

Dataset is hosted here on Kaggle.

Model Overview

A parallel model architecture is used for combining two similar networks for each questions. These networks consists of an Embedding layer and multiple LSTM layers. The normalized mean vector from each model is used to calculate loss and train the model.

Loss Function

We used triplet loss to train the model. Only the positive (duplicate) pairs are used from the dataset. Negative (non-duplicate) pairs are achieved by pairing one sentence with another in the same batch. Then the most similar sentence pairs from non-duplicate was weighted more to train a stronger model. By trial and error, we found a margin of 0.5 (difference between positive and negative pair's cosine similarity) to be suitable for an unbiased model training.

Results

Model was tested on 10000 unseen data. With a threshold of 0.7 our model achieves an accuracy of 79.44% and an F1 score of 81.26%.

Name		Name	Last commit message	Last commit date
Latest commit History 15 Commits
Duplicate_Finder.ipynb		Duplicate_Finder.ipynb
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Duplicate-Finder

Dataset

Model Overview

Loss Function

Results

About

Releases

Packages

Contributors 2

Languages

License

snehadevi/Duplicate-Finder

Folders and files

Latest commit

History

Repository files navigation

Duplicate-Finder

Dataset

Model Overview

Loss Function

Results

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages