Plagiarism Detection for Amharic text

This project implements a plagiarism detection system for the Amharic language using the Doc2Vec model. It provides a pipeline for data preprocessing, model training, and similarity computation, which serves as the foundation for a FastAPI server.

Workflow

1. Data Preprocessing

Raw text is cleaned to prepare it for training and inference.
Stopwords are removed, and unnecessary characters are filtered out.
Text data is tokenized and transformed into a format suitable for the Doc2Vec model.
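
The preprocessing steps above can be sketched as follows. This is a minimal illustration, not the repository's actual code: the function name `preprocess`, the two-word stopword set, and the choice to keep only characters from the Ethiopic Unicode block (U+1200 to U+137F) are all assumptions made for the example.

```python
import re

# Placeholder stopword set (assumption); a real run would load a full Amharic list.
STOPWORDS = {"ነው", "እና"}

# Anything outside the Ethiopic Unicode block (U+1200-U+137F) or whitespace.
NON_ETHIOPIC = re.compile(r"[^\u1200-\u137F\s]")
# Ethiopic punctuation marks (U+1360-U+1368), such as ። and ፣.
ETHIOPIC_PUNCT = re.compile(r"[\u1360-\u1368]")


def preprocess(text, stopwords=STOPWORDS):
    """Clean raw text and return a list of tokens with stopwords removed."""
    text = NON_ETHIOPIC.sub(" ", text)    # drop Latin letters, digits, symbols
    text = ETHIOPIC_PUNCT.sub(" ", text)  # drop Ethiopic punctuation
    # Whitespace tokenization, then stopword filtering.
    return [token for token in text.split() if token not in stopwords]


tokens = preprocess("የኢትዮጵያ ዋና ከተማ አዲስ አበባ ነው። Hello 123!")
```

The resulting token list is the shape of input that Gensim's Doc2Vec expects for each document.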

2. Model Training

The Doc2Vec model is trained on the preprocessed text data using Gensim.
Trained embeddings are saved for use in inference tasks.

3. Similarity Computation

The trained Doc2Vec model is used to calculate document similarities.
Cosine similarity is computed between the vectors of input documents.
The system identifies plagiarized sections by comparing sentences or text segments.
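
As a rough illustration of the comparison step, the sketch below computes cosine similarity, cos(u, v) = (u · v) / (‖u‖ ‖v‖), and flags segments of a suspect document whose best match in a source document meets a threshold. The function names, the 0.8 threshold, and the best-match strategy are assumptions for the example; in practice the vectors would presumably come from the trained model (for instance via Gensim's `infer_vector`).

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def flag_matches(suspect_vecs, source_vecs, threshold=0.8):
    """Return (suspect_index, source_index, score) for each suspect segment
    whose most similar source segment scores at or above the threshold."""
    if not source_vecs:
        return []
    matches = []
    for i, suspect in enumerate(suspect_vecs):
        # Find the most similar source segment for this suspect segment.
        best_j, best_score = max(
            ((j, cosine_similarity(suspect, source))
             for j, source in enumerate(source_vecs)),
            key=lambda pair: pair[1],
        )
        if best_score >= threshold:
            matches.append((i, best_j, best_score))
    return matches
```

The threshold trades recall against false positives and would need tuning on real data.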

You can access the model weights here.

Running the server

Clone the repository:

```
git clone https://github.com/Isa1asN/plagiarism-detector.git
cd plagiarism-detector
```

Create a new conda environment and activate it:

Tip: Install Miniconda first if you don't have it already.

```
conda create --name plagiarism-detector python=3.10
conda activate plagiarism-detector
```

Install dependencies:
```
pip install -r reqs.txt
```
Download the zip archive of model files, unzip it, and place its contents in a 'models' folder at the root of the project. You can download it here.
Run the server:
```
cd app
python -m main
```
Access the UI at http://localhost:8008
