Predicting Movie Revenue and Number of Audience with Linear Regression

Regression_Project

Following the exploratory data analysis (EDA) of the Korean movie data, we went on to predict the sales and audience numbers of a film. In this regression project, a linear regression model was used throughout, with various transformations applied to the column data, and the RMSE (Root Mean Squared Error) was calculated to measure the difference between actual and predicted values.
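
As a reminder of what the metric measures, here is a minimal sketch of RMSE in NumPy. The function name `rmse` and the toy values are illustrative assumptions, not taken from the project code.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy values, not taken from the movie data.
actual = [3.0, 5.0, 2.0]
predicted = [2.0, 5.0, 4.0]
print(rmse(actual, predicted))
```

RMSE is expressed in the same units as the label, so a value for Sales and a value for Audience are not directly comparable.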

Getting Started

Packages to install

This project was built on Python 3 with the following packages installed:

pip install numpy
pip install pandas
pip install matplotlib
pip install seaborn
pip install scikit-learn

Dataset

(811 rows x 10 columns)
Features
- Number of Screen
- Genre
- Distributor
- AgeRate
- Release Date (month, year, season)
- Actor
Labels
- Sales
- Audience

Procedure:

I. Data Cleansing

Korean movies released from 2008 to 2020 were used.
Movies rated 'Adult' were excluded.
'Audience' and 'Sales' were converted to units of one million.
Actors who share a name with another actor, or whose name is a single syllable, were removed from the list.
One Distributor value was missing and was filled in with the mode of Distributor.
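
Three of the cleansing steps above can be sketched in pandas as follows. This is only an illustration on a toy DataFrame: the column names, the 'Adult' label, and the values are assumptions, and the actor-name filtering is not covered.

```python
import pandas as pd

# Toy stand-in for the KOBIS data; column names and values are assumptions.
movie = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "AgeRate": ["12", "Adult", "15", "All"],
    "Distributor": ["X", "Y", None, "X"],
    "Sales": [2_500_000_000, 900_000_000, 400_000, 7_000_000_000],
    "Audience": [310_000, 120_000, 60, 850_000],
})

# Drop movies rated 'Adult'.
movie = movie[movie["AgeRate"] != "Adult"].copy()

# Express Sales and Audience in millions.
movie[["Sales", "Audience"]] = movie[["Sales", "Audience"]] / 1_000_000

# Fill the missing Distributor with the most frequent value (the mode).
most_common = movie["Distributor"].mode()[0]
movie["Distributor"] = movie["Distributor"].fillna(most_common)
```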

II. Data Visualization

The histograms of Sales and Audience are both right-skewed, which means most movies struggle to succeed while a few earn far more. This imbalance also made both labels hard to predict.
Number of Screen is closely correlated with Sales and Audience, and its histogram is right-skewed as well.
The top 5 distributors dominate the industry.
detailed visualization outcome
detailed data exploration
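
The two observations above (right skew and distributor concentration) can also be checked numerically. The following is a small sketch on toy values, with assumed column names; it is not the project's visualization code.

```python
import pandas as pd

# Toy values; the real columns come from the KOBIS data.
movie = pd.DataFrame({
    "Distributor": ["X", "X", "Y", "Z", "X", "Y"],
    "Sales": [1.0, 2.0, 2.5, 3.0, 4.0, 60.0],
})

# Positive sample skewness indicates a long right tail.
sales_skew = movie["Sales"].skew()

# Share of total Sales held by each distributor, largest first.
share = movie.groupby("Distributor")["Sales"].sum().sort_values(ascending=False)
share = share / share.sum()

print(sales_skew > 0)
print(share.head(5))  # the five largest distributors by share
```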

III. Testing model

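RMSE was chosen as the indicator of how much better the model performs than one fitted without adequate data cleansing.
Testing proceeded in 4 different ways (the code is based on predicting Sales, but the same process was used for predicting Audience):
- 1st_label_encoding.py
- 2nd_onehot_encoding.py
- 3rd_onehot_encoding+remove_outliers.py
- 4th_onehot_encoding+remove_outliers_twice.py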

Label : Audience

The Audience data has outliers beyond the upper fence, so these outliers were removed step by step, as needed.
The RMSE on the test data decreased from 1.73 to 0.75 over successive rounds of data cleansing.
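
Removing values above the upper fence could look like the sketch below. It assumes the usual box-plot definition of the fence, Q3 + 1.5 x IQR; the README does not state which multiplier was used, and the helper name `drop_above_upper_fence` is hypothetical.

```python
import pandas as pd

def drop_above_upper_fence(df, column, k=1.5):
    """Keep rows whose `column` value is at or below Q3 + k * IQR."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    upper_fence = q3 + k * (q3 - q1)
    return df[df[column] <= upper_fence]

# Toy Audience values in millions (not real data).
movie = pd.DataFrame(
    {"Audience": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 12.0]}
)

once = drop_above_upper_fence(movie, "Audience")   # first pass
twice = drop_above_upper_fence(once, "Audience")   # second pass on the result
```

Because the quartiles are recomputed on the reduced data, a second pass can remove further rows that the first pass kept; on this toy data it removes none.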

Label : Sales

The Sales values are heavily skewed, which led us to remove outliers.
The RMSE on the test data decreased from 12,893.35 to 3,892.27 over successive rounds of data cleansing.
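
To illustrate how the choice of categorical encoding can change the test RMSE of a linear regression, here is a sketch with scikit-learn on synthetic data. The column names, the synthetic Sales formula, and the helper `evaluate_rmse` are assumptions for illustration; the project's own scripts may be organized differently.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Synthetic stand-in data; names and values are assumptions.
rng = np.random.default_rng(0)
n = 200
movie = pd.DataFrame({
    "Screen": rng.integers(50, 1500, size=n),
    "Genre": rng.choice(["Drama", "Action", "Comedy"], size=n),
    "Distributor": rng.choice(["X", "Y", "Z"], size=n),
})
# Genre effects are deliberately not ordered like the category codes.
genre_effect = movie["Genre"].map({"Drama": 0.0, "Action": 10.0, "Comedy": 1.0})
movie["Sales"] = 0.01 * movie["Screen"] + genre_effect + rng.normal(0, 1, size=n)

X = movie[["Screen", "Genre", "Distributor"]]
y = movie["Sales"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

def evaluate_rmse(encoder):
    """Fit linear regression with the given categorical encoder; return test RMSE."""
    preprocess = ColumnTransformer(
        [("cat", encoder, ["Genre", "Distributor"])],
        remainder="passthrough",  # keep the numeric Screen column as-is
    )
    model = make_pipeline(preprocess, LinearRegression())
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    return float(np.sqrt(mean_squared_error(y_test, pred)))

rmse_ordinal = evaluate_rmse(OrdinalEncoder())
rmse_onehot = evaluate_rmse(OneHotEncoder(handle_unknown="ignore"))
print(rmse_ordinal, rmse_onehot)
```

An ordinal encoder forces a linear model to treat category codes as evenly spaced numbers, while one-hot encoding gives each category its own coefficient. The synthetic data here is built to show that difference and says nothing about the size of the effect on the real data.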

Conclusion

One-hot encoding combined with outlier removal performed well for predicting the number of 'Audience'. For 'Sales', although the RMSE decreased greatly, we cannot say the same process predicted well. First, the values are too large to judge intuitively whether the final RMSE is small enough to count as a good prediction. Second, there was no cost data for each movie, which we think is an important feature for estimating revenue. (Detailed procedures and graph outcomes can be viewed in the Jupyter Notebook in the Sales_analysis folder.)

The prediction results of the first two approaches differed according to whether an ordinal encoder or one-hot encoding was used.
Removing outliers greatly helped decrease the RMSE.
There could have been other ways to handle the values less than 1 and the skewed data:
- removing data that lies below a threshold we choose (e.g. drop rows with sales below 1,000,000)
```
# Sales is in millions, so Sales < 1 means below 1,000,000.
dele = movie[movie['Sales'] < 1].index
movie.drop(dele, inplace=True)
```
  - through this, the RMSE decreased from 13,174.05 to 4,128.56 (detailed procedure)
- calculating the log value of sales to normalize skewed data
```
import numpy as np
movie['log_sales'] = np.log1p(movie['Sales'])  # log(1 + Sales), defined at 0
```
  - RMSE decreased from 1.5 to 1.1 (detailed procedure)
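
When the model is trained on `np.log1p` of the label, predictions can be mapped back to the original units with `np.expm1`. The sketch below uses synthetic data with assumed variable names, and shows that an RMSE computed on the log scale is not in the same units as one computed on raw Sales.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic right-skewed target; values are illustrative, not real Sales.
rng = np.random.default_rng(0)
screens = rng.uniform(50, 1500, size=300)
sales = np.exp(0.004 * screens + rng.normal(0, 0.3, size=300))

X = screens.reshape(-1, 1)
log_sales = np.log1p(sales)

# Fit on the log-transformed label.
model = LinearRegression().fit(X, log_sales)
pred_log = model.predict(X)

# expm1 is the inverse of log1p: back to the original Sales units.
pred_sales = np.expm1(pred_log)

rmse_log = float(np.sqrt(np.mean((log_sales - pred_log) ** 2)))
rmse_sales = float(np.sqrt(np.mean((sales - pred_sales) ** 2)))
print(rmse_log, rmse_sales)
```

Comparing a log-scale RMSE against an RMSE in raw units would overstate the improvement, so both values should be reported on the same scale.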

Built with

김예지
- Data cleansing, visualization, and model testing on Sales.
- Uploaded the Jupyter notebook and all code files; ran additional trials of removing data and converting to log values after the presentation and conclusion
- Github : https://github.com/yeji0701
방희란
- Data cleansing, visualization, and model testing on Audience.
- Github : https://github.com/Heeran-cloud

Acknowledgements

KOBIS
