Bot or Human? Detection of DeepFake Text with Semantic, Emoji, Sentiment and Linguistic Features

A Final Year Capstone Research Project 🤖 👩🏻‍💻

Project Overview

Developed a classifier to distinguish between machine-generated text (MGT), also known as deepfake text, and human-written text (HWT) on Twitter. Fine-tuned a pre-trained transformer model, BERT, and combined its embeddings with emoji and linguistic features into a single feature vector for robust classification. Achieved an accuracy of 88.3% and identified distinct characteristics of MGT, contributing to the field of machine-generated text detection on social media platforms.
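As a rough illustration of how such a combined feature vector can be assembled, the sketch below concatenates a text embedding with a handful of hand-crafted features. It is not the project's actual pipeline: the feature list, the function names, the 768-dimensional placeholder embedding (the hidden size of BERT-base, assumed here), and the use of a simple emoji count in place of emoji embeddings are all assumptions made for the example.

```python
import re
from typing import List

# Hypothetical feature set; the project's real linguistic features may differ.
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
URL_RE = re.compile(r"https?://\S+")


def is_emoji(ch: str) -> bool:
    """Approximate emoji check covering the most common Unicode emoji blocks."""
    cp = ord(ch)
    return (0x1F300 <= cp <= 0x1FAFF) or (0x2600 <= cp <= 0x27BF)


def linguistic_features(tweet: str) -> List[float]:
    """Return a small vector of surface-level features for one tweet."""
    return [
        float(len(tweet.split())),                  # whitespace-separated tokens
        float(len(MENTION_RE.findall(tweet))),      # @mentions
        float(len(HASHTAG_RE.findall(tweet))),      # #hashtags
        float(len(URL_RE.findall(tweet))),          # links
        float(sum(is_emoji(ch) for ch in tweet)),   # emoji count (stand-in for emoji embeddings)
        sum(ch.isupper() for ch in tweet) / max(len(tweet), 1),  # uppercase ratio
    ]


def build_feature_vector(text_embedding: List[float], tweet: str) -> List[float]:
    """Concatenate a text embedding with the hand-crafted features."""
    return list(text_embedding) + linguistic_features(tweet)


# Placeholder embedding; in practice this would come from the fine-tuned model.
placeholder_embedding = [0.0] * 768
example_tweet = "Hello @user, check #AI 🤖 https://example.com"
vector = build_feature_vector(placeholder_embedding, example_tweet)
print(len(vector))
```

The resulting vector would then be passed to a downstream classifier such as a multi-layer perceptron.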

Author: Alicia Chong Tsui Ying - alicia.chong.data@gmail.com

Supervisors:

Hui Na Chua - huinac@sunway.edu.my
Muhammed Basheer Jasser - basheerj@sunway.edu.my
Richard T. K. Wong - richardwtk@sunway.edu.my

Institution: Department of Computing and Information Systems, Sunway University, Malaysia

Abstract

Detecting machine-generated text (MGT), also known as deepfake text, has become increasingly important in the age of Artificial Intelligence (AI) and social media platforms. With the proliferation of MGT and the potential consequences of its dissemination, there is a pressing need to develop effective methods for distinguishing between MGT and human-written text (HWT). Our research aim is two-fold: first, to examine the inherent differences between MGT and HWT on Twitter, and second, to develop a classifier specifically designed for MGT detection on the platform. This classifier uses contextualized text embeddings as its foundation and adds linguistic features, sentiment features, and emoji embeddings. Our experimental results demonstrate that incorporating these additional features enhances the model's ability to detect MGT. Combining fine-tuned BERT embeddings with emoji and linguistic features using a multi-layer perceptron classifier achieves the highest accuracy, 88.3%. Our analysis reveals distinct characteristics of MGT compared to HWT, including differences in engagement behavior, linguistic patterns, named entities, sentiment expressions, and text perplexity. Our research contributes to the field of MGT detection by offering a comprehensive approach that combines semantic text embeddings with supplementary features. The proposed model is a step forward in addressing the challenge of deepfake text detection.
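For readers unfamiliar with text perplexity, one of the characteristics compared above, the sketch below computes it from per-token probabilities as the exponential of the average negative log-probability. The abstract does not state which language model was used to score the tweets, so the probabilities here are made-up inputs and the function name is hypothetical.

```python
import math
from typing import Sequence


def perplexity(token_probs: Sequence[float]) -> float:
    """Perplexity = exp of the mean negative log-probability of the tokens."""
    if not token_probs:
        raise ValueError("token_probs must contain at least one probability")
    avg_neg_log_prob = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log_prob)


# Illustrative inputs only: a model that assigns every token probability 0.25
# is as uncertain as a uniform choice among four options, so the perplexity
# comes out at approximately 4.
uniform_probs = [0.25] * 8
confident_probs = [0.9] * 8
print(perplexity(uniform_probs))
print(perplexity(confident_probs))
```

Lower perplexity means the scoring model finds the text more predictable, which is why it is a common signal when comparing machine-generated and human-written text.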

Model Schema

Enhanced TweepFake dataset

TweepFake dataset (updated in 2023)

This dataset is based on TweepFake, a Twitter deepfake text dataset created by Fagni et al. For more information about the original dataset, we refer the reader to their paper.

