This project investigates how effectively an N-Gram model classifier can distinguish human-generated text from AI-generated text. The primary goal is to develop and evaluate simplified N-gram language models, a Bigram Language Model and a Trigram Language Model, and to analyze how accurately they classify documents. The project also covers Laplacian Smoothing for handling out-of-vocabulary (OOV) words and a system for sentence generation.
N-grams:
- An N-gram is a sequence of N consecutive words.
- In this project, Bigram (N=2) and Trigram (N=3) models are implemented and used for classification.
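As an illustrative sketch (not the repository's actual code), extracting and counting N-grams from a tokenized document can be done in a few lines:

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every n-gram (tuple of n consecutive tokens) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

tokens = "the cat sat on the mat".split()
bigrams = ngram_counts(tokens, 2)   # e.g. ("the", "cat") occurs once
trigrams = ngram_counts(tokens, 3)  # e.g. ("the", "cat", "sat") occurs once
```

A 6-token sentence yields 5 bigrams and 4 trigrams, since an n-token text contains `len(tokens) - n + 1` N-grams.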
Model Formulation (for Bi-gram):
- The probability of a document ( D ) given a class ( y ) is calculated as:
  $P(D | y) = P(w_{1:n} | y) = \prod_{i=1}^{n} P(w_i | w_{i-1}, y)$
- The conditional probability is estimated as:
  $P(w_i | w_{i-1}, y) = \dfrac{C(w_{i-1}w_i | y)}{C(w_{i-1} | y)}$
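The maximum-likelihood estimate above can be sketched as follows (a hypothetical helper for illustration, not the repository's API):

```python
from collections import Counter

def bigram_prob(w_prev, w, bigram_counts, unigram_counts):
    """Unsmoothed MLE: P(w | w_prev) = C(w_prev, w) / C(w_prev)."""
    if unigram_counts[w_prev] == 0:
        return 0.0
    return bigram_counts[(w_prev, w)] / unigram_counts[w_prev]

tokens = "the cat sat on the mat".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
p = bigram_prob("the", "cat", bigrams, unigrams)  # C(the,cat)=1, C(the)=2 -> 0.5
```

In a classifier, the counts are collected separately per class ( y ), so each class has its own conditional distribution.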
Classification:
- A document is labeled as belonging to the class ( y ) with the highest probability:
  $f(w_{1:n}) = \arg \max_y P(y)P(w_{1:n} | y)$
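Because a product of many small probabilities underflows floating point, the argmax is computed over summed log-probabilities in practice. A minimal sketch of this decision rule, with illustrative toy models (the function and model names are assumptions, not the project's API):

```python
import math

def classify(tokens, class_models, class_priors):
    """Return the class y maximizing log P(y) + sum of log P(w_i | w_{i-1}, y).

    class_models maps each class label to a function (w_prev, w) -> probability.
    """
    best_class, best_score = None, -math.inf
    for y, prob_fn in class_models.items():
        # Sum log-probabilities over consecutive word pairs to avoid underflow.
        score = math.log(class_priors[y]) + sum(
            math.log(prob_fn(w_prev, w)) for w_prev, w in zip(tokens, tokens[1:])
        )
        if score > best_score:
            best_class, best_score = y, score
    return best_class

# Toy models: the "human" model strongly favors the bigram ("the", "cat").
models = {
    "human": lambda w_prev, w: 0.9 if (w_prev, w) == ("the", "cat") else 0.1,
    "ai": lambda w_prev, w: 0.1,
}
priors = {"human": 0.5, "ai": 0.5}
label = classify("the cat".split(), models, priors)  # "human"
```

This only works if every probability is strictly positive, which is exactly what the smoothing step below guarantees.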
Smoothing:
- Laplacian Smoothing is applied to address Out-Of-Vocabulary (OOV) words by ensuring all probabilities are non-zero:
  $P(w_i | w_{i-1}) \approx \dfrac{C(w_{i-1}w_i) + 1}{C(w_{i-1}) + |V|}$
  where ( |V| ) is the size of the vocabulary.
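A minimal sketch of the smoothed estimate (illustrative code, not the project's implementation):

```python
from collections import Counter

def laplace_bigram_prob(w_prev, w, bigram_counts, unigram_counts, vocab_size):
    """Laplace-smoothed P(w | w_prev) = (C(w_prev, w) + 1) / (C(w_prev) + |V|)."""
    return (bigram_counts[(w_prev, w)] + 1) / (unigram_counts[w_prev] + vocab_size)

tokens = "the cat sat on the mat".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
V = len(unigrams)  # vocabulary size: 5 distinct words

p_seen = laplace_bigram_prob("the", "cat", bigrams, unigrams, V)    # (1+1)/(2+5) = 2/7
p_unseen = laplace_bigram_prob("cat", "mat", bigrams, unigrams, V)  # (0+1)/(1+5) = 1/6
```

Note that an unseen bigram now receives a small but non-zero probability, so the log-probability sum used during classification never hits `log(0)`.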
Key Features:
- Implements Bigram and Trigram models for text classification.
- Handles Out-of-Vocabulary (OOV) words using Laplacian Smoothing.
- Calculates log-probabilities and classification accuracy for evaluation.
Project Structure:
NGram-TextClassification-SentenceGeneration/
├── data/                 # Folder for datasets
├── src/                  # Source code for the project
│   ├── models/           # Classes and logic for models
│   ├── utils/            # Utility functions and helpers
│   ├── main.py           # Entry point for running the project
│   └── __init__.py       # Makes src a package
├── requirements.txt      # Python dependencies
├── README.md             # Project overview and usage instructions
└── .gitignore            # Ignored files and folders (e.g., data, logs)
Getting Started:
Ensure you have Python 3.8+ installed.

Clone the repository:
git clone https://github.com/VarunAgarwal10/NGram-TextClassification-SentenceGeneration
cd NGram-TextClassification-SentenceGeneration

Install the dependencies:
pip install -r requirements.txt

Run the project:
python ./src/main.py