N-Gram Based Text Classification and Sentence Generation System

Project Overview

This project investigates how effectively an N-Gram Model Classifier can distinguish human-written text from AI-generated text.

The primary goal is to develop and evaluate two simplified N-gram language models, a Bigram Language Model and a Trigram Language Model, and to analyze how accurately they classify documents. The project also applies Laplacian Smoothing to handle out-of-vocabulary (OOV) words and includes a system for sentence generation (sketched after the Key Features section).


Bigram and Trigram Model Classifier for Human- vs AI-Generated Text

How It Works

  1. N-grams:

    • An N-gram is a sequence of N consecutive words.
    • This project implements a general N-Gram model, but only the Bigram (N = 2) and Trigram (N = 3) variants are used for classification.
  2. Model Formulation (for the Bigram model):

    • The probability of a document $D$ given a class $y$ is calculated as:
      $P(D | y) = P(w_{1:n} | y) = \prod_{i=1}^{n} P(w_i | w_{i-1}, y)$

    • The conditional probability is estimated from training counts as:
      $P(w_i | w_{i-1}, y) = \dfrac{C(w_{i-1}w_i | y)}{C(w_{i-1} | y)}$

  3. Classification:

    • A document is assigned to the class $y$ with the highest probability:
      $f(w_{1:n}) = \arg \max_y P(y)P(w_{1:n} | y)$
  4. Smoothing:

    • Laplacian Smoothing addresses Out-Of-Vocabulary (OOV) words by ensuring all probabilities are non-zero:
      $P(w_i | w_{i-1}, y) \approx \dfrac{C(w_{i-1}w_i | y) + 1}{C(w_{i-1} | y) + |V|}$, where $|V|$ is the size of the vocabulary. A runnable sketch of this full pipeline appears after this list.
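The following Python sketch ties the four steps above together. It is a minimal illustration under stated assumptions, not the repository's actual code: the whitespace tokenization, the <s> start marker, and all function names here are hypothetical.

import math
from collections import defaultdict

START = "<s>"  # hypothetical start-of-sentence marker

def train_bigram(tokenized_docs):
    # Collect bigram counts C(w_{i-1} w_i) and history counts C(w_{i-1}) for one class.
    bigrams, histories, vocab = defaultdict(int), defaultdict(int), set()
    for tokens in tokenized_docs:
        padded = [START] + tokens
        for prev, cur in zip(padded, padded[1:]):
            bigrams[(prev, cur)] += 1
            histories[prev] += 1
            vocab.add(cur)
    return bigrams, histories, vocab

def log_likelihood(tokens, bigrams, histories, vocab):
    # Laplace-smoothed log P(w_{1:n} | y): (C(w_{i-1} w_i) + 1) / (C(w_{i-1}) + |V|).
    V = len(vocab)
    total, prev = 0.0, START
    for cur in tokens:
        total += math.log((bigrams[(prev, cur)] + 1) / (histories[prev] + V))
        prev = cur
    return total

def classify(tokens, class_models, priors):
    # f(w_{1:n}) = argmax_y  log P(y) + log P(w_{1:n} | y).
    return max(class_models, key=lambda y: math.log(priors[y])
               + log_likelihood(tokens, *class_models[y]))

# Example: two toy classes with uniform priors.
models = {"human": train_bigram([["the", "cat", "sat", "down"]]),
          "ai": train_bigram([["as", "an", "ai", "model"]])}
print(classify(["the", "cat", "sat"], models, {"human": 0.5, "ai": 0.5}))

Summing log-probabilities instead of multiplying raw probabilities avoids numerical underflow on long documents, which is why the evaluation reports log-probabilities.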

Key Features

  • Implements Bigram and Trigram models for text classification.
  • Handles Out-of-Vocabulary (OOV) words using Laplacian Smoothing.
  • Calculates log-probabilities and classification accuracy for evaluation.
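
Sentence generation, the other half of the project, is not detailed above. The sketch below shows one standard way a bigram model generates text, sampling each next word from the add-one-smoothed distribution P(w_i | w_{i-1}); it reuses the hypothetical train_bigram counts from the classification sketch and is an assumed approach, not the repository's exact method.

import random

def generate(bigrams, histories, vocab, max_len=15, seed=None):
    # Grow a sentence word by word, sampling w_i from P(w_i | w_{i-1}).
    rng = random.Random(seed)
    words, prev = [], START
    candidates = sorted(vocab)
    for _ in range(max_len):
        # Add-one weights mirror the smoothing used for classification,
        # so generation never stalls on an unseen history.
        weights = [bigrams[(prev, w)] + 1 for w in candidates]
        prev = rng.choices(candidates, weights=weights, k=1)[0]
        words.append(prev)
    return " ".join(words)

A fixed length cap stands in for an end-of-sentence token here; training with an explicit </s> marker and stopping when it is sampled is the usual refinement.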

Repository Structure

NGram-TextClassification-SentenceGeneration/
├── data/                     # Folder for datasets
├── src/                      # Source code for the project
│   ├── models/               # Classes and logic for models
│   ├── utils/                # Utility functions and helpers 
│   ├── main.py               # Entry point for running the project
│   └── __init__.py           # Make src a package
├── requirements.txt          # Python dependencies
├── README.md                 # Project overview and usage instructions
├── .gitignore                # Ignored files and folders (e.g., data, logs)

Usage

1. Prerequisites

Ensure you have Python 3.8+ installed.

2. Clone the Project Repository

git clone https://github.com/VarunAgarwal10/NGram-TextClassification-SentenceGeneration 
cd NGram-TextClassification-SentenceGeneration

3. Install Dependencies

pip install -r requirements.txt

4. Run the Model and Explore Results

python ./src/main.py
