This project implements a plagiarism detection system for the Amharic language using the Doc2Vec model. It provides a pipeline for data preprocessing, model training, and similarity computation, which serves as the foundation for a FastAPI server.
- Raw text is cleaned to prepare it for training and inference.
- Stopwords are removed, and unnecessary characters are filtered out.
- Text data is tokenized and transformed into a format suitable for the Doc2Vec model.
- The Doc2Vec model is trained on the preprocessed text data using Gensim.
- Trained embeddings are saved for use in inference tasks.
- The trained Doc2Vec model is used to calculate document similarities.
- Cosine similarity is computed between the vectors of input documents.
- The system identifies plagiarized sections by comparing sentences or text segments.
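The preprocessing and comparison steps above can be sketched in plain Python. This is a minimal illustration, not the project's actual code: the `STOPWORDS` set, the function names, the Ethiopic character range used for filtering, and the `0.8` threshold are all assumptions made for the example. In the real pipeline, the vectors would presumably come from the trained Doc2Vec model (for instance via Gensim's `infer_vector`) rather than being written by hand.

```python
import math
import re

# Hypothetical stopword list for illustration; the project's real list is not shown.
STOPWORDS = {"እና", "ነው", "ላይ"}


def split_sentences(text):
    """Split text into sentences on the Ethiopic full stop and common end marks."""
    return [s.strip() for s in re.split(r"[።፧?!]+", text) if s.strip()]


def tokenize(text):
    """Keep runs of Ethiopic letters (assumed range U+1200 to U+135F), drop stopwords."""
    tokens = re.findall(r"[\u1200-\u135F]+", text)
    return [t for t in tokens if t not in STOPWORDS]


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def flag_matches(candidate_vecs, source_vecs, threshold=0.8):
    """For each candidate segment, report its best-matching source segment
    as (candidate_index, source_index, score) when the score reaches the
    threshold. The threshold value is an assumption, not a project setting."""
    matches = []
    for i, cv in enumerate(candidate_vecs):
        best_j, best_score = -1, 0.0
        for j, sv in enumerate(source_vecs):
            score = cosine_similarity(cv, sv)
            if score > best_score:
                best_j, best_score = j, score
        if best_j >= 0 and best_score >= threshold:
            matches.append((i, best_j, best_score))
    return matches
```

Comparing at the sentence level, as in `flag_matches`, is what lets the system point at specific suspicious segments instead of returning a single document-wide score.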
The model weights are available here.
- Clone the repository:

      git clone https://github.com/Isa1asN/plagiarism-detector.git
      cd plagiarism-detector
- Create a new conda environment and activate it:

  Tip: Install Miniconda if you don't have it already.

      conda create --name plagiarism-detector python=3.10
      conda activate plagiarism-detector
- Install dependencies:

      pip install -r reqs.txt
- Download the model files zip archive, unzip it, and place its contents in the 'models' folder at the root of the project. You can download it here.
- Run the server:

      cd app
      python -m main
- Access the UI at http://localhost:8008