Yelp Business Stars’ Rating Prediction

https://colab.research.google.com/drive/1q5rvPOO8DvD8DV5DNLMVc8UDY7ntWHah

Traditional (Standard) AI Models: KNN | SVM | Logistic Regression | Multinomial Naive Bayes | Linear Regression

Deep Learning Models: Neural Network (Regression & Classification)

Problem statement

Predict the star rating (1 to 5 stars) from the text of the reviews written by users.

Machine Learning project aims

Learn text vectorization (TF-IDF)
Handle and preprocess big data
Merge two big datasets
Treat the problem as both regression and classification and observe the difference
Apply traditional AI models and compare them with a deep learning neural network

Tools and Libraries used

sklearn
TensorFlow
Numpy
Pandas

Dataset

https://www.yelp.com/dataset/download

Load dataset

The JSON data files were converted to a format that can be loaded into a pandas data frame. The business.json and review.json files were used to understand the dataset. Multiple reviews were grouped on business_id so that all user reviews of a business are combined into one text.
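The grouping step could look roughly like the sketch below. The toy frame stands in for review.json, and the column name all_reviews is an assumption made for illustration, not a name taken from the notebook.

```python
import pandas as pd

# Toy stand-in for review.json. The real file holds one JSON object per line
# and could be loaded with pd.read_json(path, lines=True).
reviews = pd.DataFrame({
    "business_id": ["b1", "b1", "b2"],
    "text": ["Great tacos.", "Slow service.", "Lovely coffee."],
})

# Join every review of a business into a single document.
# "all_reviews" is a hypothetical column name.
reviews_by_business = (
    reviews.groupby("business_id")["text"]
    .apply(" ".join)
    .reset_index(name="all_reviews")
)
print(reviews_by_business)
```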

Merged the two datasets on business_id to get the final dataset.
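A minimal sketch of the merge, assuming an inner join and using small hypothetical frames in place of the real business and grouped-review data:

```python
import pandas as pd

# Hypothetical miniature versions of the two datasets.
business = pd.DataFrame({
    "business_id": ["b1", "b2", "b3"],
    "stars": [4.0, 3.5, 5.0],
    "review_count": [2, 1, 0],
    "categories": ["Mexican", "Cafe", None],
})
reviews_by_business = pd.DataFrame({
    "business_id": ["b1", "b2"],
    "all_reviews": ["Great tacos. Slow service.", "Lovely coffee."],
})

# Inner join (an assumption): keep only businesses that have review text.
merged = pd.merge(business, reviews_by_business, on="business_id", how="inner")
print(merged.shape)
```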

Data Pre-Processing/ Cleaning

Dropped the rows whose categories value is null
Filtered the data frame further by removing businesses whose review count is below a certain threshold
Cleaned the review text by removing stop words, punctuation and extra white space
Used TF-IDF vectorization for feature extraction and tuned its parameters
Performed label encoding on the “stars” column (output feature)
Normalized the “Review_count” column with min-max normalization to make it comparable
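The filtering, label encoding and normalization steps above can be sketched as follows. The threshold value and the new column names (stars_label, review_count_scaled) are assumptions for illustration; the README does not state them.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# Hypothetical toy data with the columns named in the steps above.
df = pd.DataFrame({
    "categories": ["Mexican", None, "Cafe", "Bar"],
    "review_count": [50, 40, 10, 3],
    "stars": [4.0, 3.5, 5.0, 4.0],
})

MIN_REVIEWS = 5  # assumed threshold, not taken from the project

# Drop rows with null categories, then rows below the review threshold.
df = df.dropna(subset=["categories"])
df = df[df["review_count"] >= MIN_REVIEWS].copy()

# Label-encode the star values into integer class ids.
encoder = LabelEncoder()
df["stars_label"] = encoder.fit_transform(df["stars"])

# Min-max normalize the review count into the range [0, 1].
scaler = MinMaxScaler()
df["review_count_scaled"] = scaler.fit_transform(df[["review_count"]]).ravel()
print(df)
```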

# TF-IDF Vectorization - Feature Extraction
import sklearn.feature_extraction.text as sk_text

# Keep at most 1000 unigram features; ignore terms that appear in fewer
# than 5% or more than 85% of the documents.
Tfidf_vectorizer = sk_text.TfidfVectorizer(
    max_features=1000, lowercase=True, analyzer='word',
    stop_words='english', ngram_range=(1, 1), min_df=.05, max_df=.85)
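As a usage sketch, a vectorizer with the same parameters can be fitted on a handful of documents to produce the feature matrix. The four-sentence corpus and the lowercase variable name are assumptions for illustration only.

```python
import sklearn.feature_extraction.text as sk_text

# Hypothetical miniature corpus; the project uses the grouped Yelp reviews.
corpus = [
    "great tacos and friendly staff",
    "cold coffee and slow service",
    "tacos were fine",
    "coffee was excellent",
]

tfidf_vectorizer = sk_text.TfidfVectorizer(
    max_features=1000, lowercase=True, analyzer='word',
    stop_words='english', ngram_range=(1, 1), min_df=.05, max_df=.85)

# Learn the vocabulary and return a sparse document-term matrix.
X = tfidf_vectorizer.fit_transform(corpus)
print(X.shape)
print(sorted(tfidf_vectorizer.vocabulary_))
```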

Splitting the data

Split the data into 80% train and 20% test
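A minimal sketch of the 80/20 split using scikit-learn's train_test_split. The arrays are placeholders and the random_state value is an assumption added for reproducibility.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder features and targets standing in for the TF-IDF matrix and stars.
X = np.arange(100).reshape(50, 2)
y = np.arange(50)

# test_size=0.2 gives the 80% train / 20% test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)
```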

Regression Model

Linear Regression
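A linear regression baseline might be fitted as in the sketch below. The synthetic data and the choice of RMSE as the metric are assumptions; the README does not say which regression metric the notebook reports.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 5 features, target roughly in the 1-5 star range.
rng = np.random.default_rng(0)
X = rng.random((200, 5))
y = np.clip(1 + 4 * X[:, 0] + rng.normal(0, 0.1, 200), 1, 5)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Root mean squared error on the held-out 20%.
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"RMSE: {rmse:.3f}")
```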

Neural Network Using Tensorflow

Used early stopping to prevent overfitting and a checkpointer to save the best model. Training was run in a loop several times to escape local minima.

Applied parameter tuning by changing the following:

Activation function: relu, sigmoid, tanh
Number of Dense layers
Number of neurons in each layer
Learning rate of the optimizer
Optimizer: SGD, Adamax, Adam, Adagrad
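The training loop described above can be sketched with Keras callbacks. Everything concrete here is an assumption for illustration: the synthetic data, the layer sizes, the patience value, the number of runs and the helper build_model are not taken from the notebook.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf

# Synthetic regression data standing in for the TF-IDF features and stars.
rng = np.random.default_rng(0)
X = rng.random((200, 20)).astype("float32")
y = (1 + 4 * X[:, 0]).astype("float32")


def build_model(activation="relu", units=16, learning_rate=0.01):
    """Hypothetical helper: one hidden layer and a single linear output."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(X.shape[1],)),
        tf.keras.layers.Dense(units, activation=activation),
        tf.keras.layers.Dense(1),
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
        loss="mse")
    return model


workdir = tempfile.mkdtemp()
best_loss, best_path = float("inf"), None

# Train from several random initializations and remember the best run.
for run in range(2):
    path = os.path.join(workdir, f"run_{run}.keras")
    callbacks = [
        tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=3),
        tf.keras.callbacks.ModelCheckpoint(
            path, monitor="val_loss", save_best_only=True),
    ]
    history = build_model().fit(
        X, y, validation_split=0.2, epochs=20,
        callbacks=callbacks, verbose=0)
    run_loss = min(history.history["val_loss"])
    if run_loss < best_loss:
        best_loss, best_path = run_loss, path

# Reload the checkpoint of the best run and predict with it.
best_model = tf.keras.models.load_model(best_path)
predictions = best_model.predict(X[:5], verbose=0)
print(best_loss, predictions.shape)
```

Restarting from fresh weights on each run is what gives the loop a chance to land in a better minimum; the checkpoint keeps the best epoch of each run on disk.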

Comparison

Classification Model

Logistic Regression

SVM

KNN

MNB
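The four classifiers in this section could be compared along the lines of the sketch below. The tiny labelled corpus, the default hyperparameters and the weighted F1 average are all assumptions; the notebook's own settings are not given in this README.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Hypothetical toy corpus: 1 = positive review, 0 = negative review.
good = ["great food friendly staff", "delicious tacos amazing service",
        "loved the coffee", "excellent pizza great place"]
bad = ["terrible food rude staff", "awful tacos slow service",
       "hated the coffee", "horrible pizza dirty place"]
texts = (good + bad) * 10
labels = ([1] * 4 + [0] * 4) * 10

# TF-IDF features are non-negative, which MultinomialNB requires.
X = TfidfVectorizer().fit_transform(texts)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=42, stratify=labels)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(),
    "MNB": MultinomialNB(),
}

# Fit each model and record its weighted F1 score on the test split.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = f1_score(y_test, model.predict(X_test), average="weighted")

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```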

Boost up Performances

Output feature: the review ratings were grouped into three categories (low, medium and high) to boost the performance of the models applied above. This significantly improves their performance.
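One way to group the ratings is pandas.cut, as sketched here. The bin edges (up to 2.5 is low, up to 3.5 is medium, above that is high) are an assumption; the README does not state the cut-offs that were used.

```python
import pandas as pd

stars = pd.Series([1.0, 2.5, 3.0, 3.5, 4.0, 5.0])

# Assumed cut-offs: (0, 2.5] -> low, (2.5, 3.5] -> medium, (3.5, 5] -> high.
rating_band = pd.cut(
    stars, bins=[0, 2.5, 3.5, 5], labels=["low", "medium", "high"])
print(rating_band.tolist())
```

With three classes instead of nine half-star values, each class has more examples and neighbouring ratings are no longer counted as errors, which is a plausible reason for the improvement.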

KNN

Logistic Regression

SVM

Neural Network Using Tensorflow

Used early stopping to prevent overfitting and a checkpointer to save the best model. Training was run in a loop several times to escape local minima.

Applied parameter tuning by changing the following:

Activation function: relu, sigmoid, tanh
Number of Dense layers
Number of neurons in each layer
Learning rate of the optimizer
Optimizer: SGD, Adamax, Adam, Adagrad

Boost up Performances

Output feature: the review ratings were grouped into three categories (low, medium and high) to boost the performance of the model applied above. This significantly improves its performance.

Also applied grid search, using the Keras wrappers library, to select the best optimizer from a given list for the best-performing model so far, scored by accuracy. Together, these steps boost performance, and the neural network beats the standard AI classification models.
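The optimizer search can be illustrated with a plain loop, which is swapped in here for the Keras wrapper plus grid search because the wrapper module may not be available in every TensorFlow version. The synthetic three-class data, the network shape, the epoch count and the helper build_classifier are assumptions.

```python
import numpy as np
import tensorflow as tf

# Synthetic three-class data standing in for the low/medium/high labels.
rng = np.random.default_rng(0)
X = rng.random((300, 10)).astype("float32")
y = (X[:, 0] * 3).astype("int32").clip(0, 2)


def build_classifier(optimizer):
    """Hypothetical helper: small softmax classifier for three classes."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(X.shape[1],)),
        tf.keras.layers.Dense(16, activation="relu"),
        tf.keras.layers.Dense(3, activation="softmax"),
    ])
    model.compile(optimizer=optimizer,
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model


# Try each optimizer from the list and record its final validation accuracy.
results = {}
for name in ["sgd", "adamax", "adam", "adagrad"]:
    history = build_classifier(name).fit(
        X, y, validation_split=0.2, epochs=5, verbose=0)
    results[name] = history.history["val_accuracy"][-1]

best_optimizer = max(results, key=results.get)
print(results, best_optimizer)
```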

Comparison

Comparing the NN with the previously best-performing Logistic Regression model

Comparing all classification models

Judging by the F1 scores, the NN performs better than all the other models: Logistic Regression, SVM, KNN and MNB.

Mini Project 1 & 2

Mansi Patel

February 13, 2019

Prof : H. Chen

Class : CSC 215-01


mansipatel2508/Yelp-Review-Stars-Prediction-with-Machine-Learning
