Sentiment Analysis Project with Text Preprocessing Techniques

Project Overview

This project explores various text preprocessing techniques for sentiment analysis using the Amazon Customer Reviews dataset. The goal is to evaluate the performance of different approaches in classifying the sentiment of review text. The techniques evaluated include:

  1. TF-IDF and Bag-of-Words (BoW): These traditional text representation methods convert text into numerical features. BoW represents each review as raw word counts, while TF-IDF weights those counts by how distinctive each word is across the corpus.
  2. Word2Vec and Average Word2Vec: Word2Vec generates dense vector representations of words, capturing their semantic relationships. Average Word2Vec aggregates word vectors to represent an entire review.
  3. Fine-Tuning with DistilBert: A pre-trained transformer model, DistilBert, is fine-tuned for sentiment classification to leverage powerful contextual representations of text.

Findings

TF-IDF and BoW:

  • Accuracy: Lower than the other methods evaluated.
  • Analysis: While simple to implement, these techniques fail to capture the semantic meaning of words, which limits their performance on sentiment analysis tasks.
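
A minimal sketch of the kind of BoW / TF-IDF pipeline this baseline refers to, assuming scikit-learn; the toy reviews, the LogisticRegression classifier, and all hyperparameters are illustrative placeholders rather than the repo's exact setup:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Toy stand-ins for the preprocessed Amazon review texts and sentiment labels.
train_texts = ["great product, works perfectly", "awful quality, broke in a day",
               "loved it, highly recommend", "waste of money, very disappointed"]
train_labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative
test_texts = ["really great purchase", "very disappointed with the quality"]
test_labels = [1, 0]

for name, vec in [("BoW", CountVectorizer()), ("TF-IDF", TfidfVectorizer())]:
    X_train = vec.fit_transform(train_texts)  # learn the vocabulary on training data only
    X_test = vec.transform(test_texts)
    clf = LogisticRegression(max_iter=1000).fit(X_train, train_labels)
    print(name, "accuracy:", accuracy_score(test_labels, clf.predict(X_test)))
```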

Word2Vec and Average Word2Vec:

  • Accuracy: Significantly better performance than TF-IDF and BoW, demonstrating the benefit of capturing the semantic relationships between words.
  • Analysis: This approach performs better because word embeddings (and similarity search over them) capture semantic relationships between words, giving the classifier richer representations of the text than sparse counts.
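
A minimal sketch of the Average Word2Vec approach, assuming gensim and scikit-learn; the tokenized toy reviews and the Word2Vec hyperparameters are illustrative assumptions, not the project's actual configuration:

```python
import numpy as np
from gensim.models import Word2Vec
from sklearn.linear_model import LogisticRegression

# Toy tokenized reviews standing in for the preprocessed Amazon review corpus.
tokenized = [["great", "product", "works", "perfectly"],
             ["awful", "quality", "broke", "fast"],
             ["loved", "it", "highly", "recommend"],
             ["waste", "of", "money", "disappointed"]]
labels = [1, 0, 1, 0]

# Train Word2Vec on the review corpus itself (vector_size/window are illustrative).
w2v = Word2Vec(sentences=tokenized, vector_size=50, window=3, min_count=1, seed=42)

def avg_vector(tokens, model):
    """Average the embeddings of in-vocabulary tokens; zeros if none are found."""
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

# Each review becomes one dense vector; a simple classifier is trained on top.
X = np.vstack([avg_vector(t, w2v) for t in tokenized])
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.predict(X))  # sanity check on the training reviews themselves
```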

Fine-Tuning with DistilBert:

  • Accuracy: Achieved the highest accuracy, outperforming TF-IDF, BoW, and Word2Vec.
  • Analysis: This approach showcases the power of pre-trained language models like DistilBert, whose contextual understanding of text yields superior performance.
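
A minimal sketch of fine-tuning DistilBert for binary sentiment classification with the Hugging Face transformers Trainer; the toy dataset, checkpoint name, and training arguments are assumptions for illustration and may differ from the repo's notebook:

```python
import torch
from torch.utils.data import Dataset
from transformers import (DistilBertForSequenceClassification,
                          DistilBertTokenizerFast, Trainer, TrainingArguments)

# Toy stand-ins for the review texts and sentiment labels.
texts = ["great product, works perfectly", "waste of money, very disappointed"]
labels = [1, 0]

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

class ReviewDataset(Dataset):
    """Wraps tokenized reviews so the Trainer can iterate over them."""
    def __init__(self, texts, labels):
        self.encodings = tokenizer(texts, truncation=True, padding=True, max_length=128)
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

args = TrainingArguments(output_dir="distilbert-sentiment", num_train_epochs=1,
                         per_device_train_batch_size=8, logging_steps=10)
trainer = Trainer(model=model, args=args, train_dataset=ReviewDataset(texts, labels))
trainer.train()  # fine-tunes all DistilBert weights plus the classification head
```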

Recommendations

Based on the findings, we recommend the following:

  • Word2Vec or Similar Word Embedding Techniques: For sentiment analysis tasks on similar datasets, word embeddings like Word2Vec offer a good balance of performance and simplicity, capturing semantic meaning effectively.
  • Fine-Tuning a Pre-Trained Language Model: Fine-tuning models like DistilBert is likely to yield the best results, especially for larger datasets or more complex sentiment analysis tasks, due to their ability to capture contextual relationships in text.

Notes

  • The specific performance of each technique may vary depending on the dataset and machine learning model used.
  • Experimentation with different techniques, hyperparameters, and fine-tuning strategies is essential to achieving optimal results in sentiment analysis tasks.
