This project investigates how effectively an N-Gram model classifier can distinguish human-generated text from AI-generated text. The primary goal is to develop and evaluate simplified N-gram language models, a Bigram Language Model and a Trigram Language Model, and to analyze how accurately they classify documents. The project also covers Laplacian Smoothing for handling out-of-vocabulary (OOV) words and a system for sentence generation.
N-grams:
- An N-gram is a sequence of N consecutive words.
- In this project, Bigram (N=2) and Trigram (N=3) models are implemented and used for classification.
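As an illustrative sketch (not the repository's actual code), extracting and counting N-grams from a tokenized document can be done in a few lines:

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every n-gram (tuple of n consecutive tokens) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

tokens = "the cat sat on the mat".split()
bigrams = ngram_counts(tokens, 2)   # e.g. ("the", "cat") occurs once
trigrams = ngram_counts(tokens, 3)  # e.g. ("the", "cat", "sat") occurs once
```

A 6-token sentence yields 5 bigrams and 4 trigrams, since an n-token text contains `len(tokens) - n + 1` N-grams.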
Model Formulation (for Bi-gram):
- The probability of a document ( D ) given a class ( y ) is calculated as:
  $P(D | y) = P(w_{1:n} | y) = \prod_{i=1}^{n} P(w_i | w_{i-1}, y)$
- The conditional probability is estimated as:
  $P(w_i | w_{i-1}, y) = \dfrac{C(w_{i-1}w_i | y)}{C(w_{i-1} | y)}$
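The maximum-likelihood estimate above can be sketched as follows (a hypothetical helper for illustration, not the repository's API):

```python
from collections import Counter

def bigram_prob(w_prev, w, bigram_counts, unigram_counts):
    """Unsmoothed MLE: P(w | w_prev) = C(w_prev, w) / C(w_prev)."""
    if unigram_counts[w_prev] == 0:
        return 0.0
    return bigram_counts[(w_prev, w)] / unigram_counts[w_prev]

tokens = "the cat sat on the mat".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
p = bigram_prob("the", "cat", bigrams, unigrams)  # C(the,cat)=1, C(the)=2 -> 0.5
```

In a classifier, the counts are collected separately per class ( y ), so each class has its own conditional distribution.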
Classification:
- A document is labeled as belonging to the class ( y ) with the highest probability:
  $f(w_{1:n}) = \arg \max_y P(y)P(w_{1:n} | y)$
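Because a product of many small probabilities underflows floating point, the argmax is computed over summed log-probabilities in practice. A minimal sketch of this decision rule, with illustrative toy models (the function and model names are assumptions, not the project's API):

```python
import math

def classify(tokens, class_models, class_priors):
    """Return the class y maximizing log P(y) + sum of log P(w_i | w_{i-1}, y).

    class_models maps each class label to a function (w_prev, w) -> probability.
    """
    best_class, best_score = None, -math.inf
    for y, prob_fn in class_models.items():
        # Sum log-probabilities over consecutive word pairs to avoid underflow.
        score = math.log(class_priors[y]) + sum(
            math.log(prob_fn(w_prev, w)) for w_prev, w in zip(tokens, tokens[1:])
        )
        if score > best_score:
            best_class, best_score = y, score
    return best_class

# Toy models: the "human" model strongly favors the bigram ("the", "cat").
models = {
    "human": lambda w_prev, w: 0.9 if (w_prev, w) == ("the", "cat") else 0.1,
    "ai": lambda w_prev, w: 0.1,
}
priors = {"human": 0.5, "ai": 0.5}
label = classify("the cat".split(), models, priors)  # "human"
```

This only works if every probability is strictly positive, which is exactly what the smoothing step below guarantees.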
Smoothing:
- Laplacian Smoothing is applied to address Out-Of-Vocabulary (OOV) words by ensuring all probabilities are non-zero:
  $P(w_i | w_{i-1}) \approx \dfrac{C(w_{i-1}w_i) + 1}{C(w_{i-1}) + |V|}$
  where ( |V| ) is the size of the vocabulary.
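A minimal sketch of the smoothed estimate (illustrative code, not the project's implementation):

```python
from collections import Counter

def laplace_bigram_prob(w_prev, w, bigram_counts, unigram_counts, vocab_size):
    """Laplace-smoothed P(w | w_prev) = (C(w_prev, w) + 1) / (C(w_prev) + |V|)."""
    return (bigram_counts[(w_prev, w)] + 1) / (unigram_counts[w_prev] + vocab_size)

tokens = "the cat sat on the mat".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
V = len(unigrams)  # vocabulary size: 5 distinct words

p_seen = laplace_bigram_prob("the", "cat", bigrams, unigrams, V)    # (1+1)/(2+5) = 2/7
p_unseen = laplace_bigram_prob("cat", "mat", bigrams, unigrams, V)  # (0+1)/(1+5) = 1/6
```

Note that an unseen bigram now receives a small but non-zero probability, so the log-probability sum used during classification never hits `log(0)`.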
Key Features:
- Implements Bigram and Trigram models for text classification.
- Handles Out-of-Vocabulary (OOV) words using Laplacian Smoothing.
- Calculates log-probabilities and classification accuracy for evaluation.
Project Structure:
NGram-TextClassification-SentenceGeneration/
├── data/                 # Folder for datasets
├── src/                  # Source code for the project
│   ├── models/           # Classes and logic for models
│   ├── utils/            # Utility functions and helpers
│   ├── main.py           # Entry point for running the project
│   └── __init__.py       # Makes src a package
├── requirements.txt      # Python dependencies
├── README.md             # Project overview and usage instructions
└── .gitignore            # Ignored files and folders (e.g., data, logs)
Getting Started:
Ensure you have Python 3.8+ installed.

Clone the repository:
git clone https://github.com/VarunAgarwal10/NGram-TextClassification-SentenceGeneration
cd NGram-TextClassification-SentenceGeneration

Install the dependencies:
pip install -r requirements.txt

Run the project:
python ./src/main.py