yeji0701/Regression_Project

Regression_Project

Predicting Movie Revenue and Number of Audience with Linear Regression

After exploratory data analysis (EDA) on Korean movie data, we went on to predict the sales and audience size of a film. This project uses a Linear Regression model throughout, with various manipulations of the column data, and evaluates each variant by RMSE (Root Mean Squared Error), which measures the difference between actual and predicted values.
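For reference, RMSE is the square root of the mean squared difference between actual and predicted values. A minimal sketch with NumPy follows; the function name `rmse` and the sample numbers are illustrative and are not taken from the project notebooks.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Squared Error between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up audience figures (in millions), for illustration only
actual = [1.0, 2.0, 4.0]
predicted = [1.0, 3.0, 2.0]
print(rmse(actual, predicted))
```

Because RMSE is expressed in the same unit as the label, values for 'Audience' and 'Sales' are not directly comparable with each other.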

Getting Started

Packages to install
  • This project was built on Python 3 with the following packages installed:
pip install numpy
pip install pandas
pip install matplotlib
pip install seaborn
pip install scikit-learn
Dataset
  • 811 rows × 10 columns
  • Feature
    • Number of Screen
    • Genre
    • Distributor
    • AgeRate
    • Release Date (month,year,season)
    • Actor
  • Label
    • Sales
    • Audience

Procedure:

I. Data Cleansing

  • Korean movies released from 2008 to 2020 were used.
  • Movies rated 'Adult' were excluded.
  • 'Audience' and 'Sales' were converted to units of one million.
  • Actors sharing the same name, or with a one-syllable name, were removed from the list.
  • One Distributor value was missing and was filled in with the mode of Distributor.
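The cleansing steps above could be sketched with pandas roughly as follows. The column names (`Year`, `AgeRate`, `Distributor`, `Audience`, `Sales`) and the toy rows are assumptions made for illustration; the actual notebooks may name and order things differently.

```python
import pandas as pd

# Toy stand-in for the movie table; column names and values are assumptions
movie = pd.DataFrame({
    "Year": [2007, 2010, 2015, 2019],
    "AgeRate": ["12", "Adult", "15", "All"],
    "Distributor": ["A", "B", None, "A"],
    "Audience": [500_000, 1_200_000, 3_000_000, 8_000_000],
    "Sales": [4_000_000_000, 9_000_000_000, 24_000_000_000, 70_000_000_000],
})

# Keep releases from 2008 through 2020 and drop 'Adult'-rated titles
keep = movie["Year"].between(2008, 2020) & (movie["AgeRate"] != "Adult")
movie = movie[keep].copy()

# Express Audience and Sales in millions
for col in ["Audience", "Sales"]:
    movie[col] = movie[col] / 1_000_000

# Fill the missing Distributor with the most frequent value (the mode)
movie["Distributor"] = movie["Distributor"].fillna(movie["Distributor"].mode()[0])
```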

II. Data Visualization

  • The histograms of Sales and Audience are both right-skewed, which means most movies struggle to succeed. This imbalance also made the two labels hard to predict.
  • Because Number of Screen is closely correlated with Sales and Audience, its histogram is right-skewed as well.
  • We could see that the top 5 Distributors dominate the industry.
  • detailed visualization outcome
  • detailed data exploration
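The skew and correlation described above can also be checked numerically. The sketch below uses made-up numbers and assumed column names (`Screen`, `Audience`), so it only shows the kind of check involved, not the project's actual figures.

```python
import pandas as pd

# Made-up values: a few large releases dominate, as in a right-skewed distribution
movie = pd.DataFrame({
    "Screen": [50, 80, 120, 200, 400, 1500],
    "Audience": [0.1, 0.2, 0.3, 0.6, 1.5, 9.0],  # in millions
})

# A positive skewness value indicates a right-skewed distribution
audience_skew = movie["Audience"].skew()
screen_skew = movie["Screen"].skew()

# Pearson correlation between number of screens and audience
screen_audience_corr = movie["Screen"].corr(movie["Audience"])

print(audience_skew, screen_skew, screen_audience_corr)
```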

III. Testing model

  1. Label: Audience
  • Because the Audience data has outliers beyond the upper fence, I decided to eliminate those outliers one by one as needed.
  • The RMSE on the test data decreased from 1.73 to 0.75 after repeated data cleansing.
  2. Label: Sales
  • The Sales values are heavily skewed, which led us to remove outliers.
  • The RMSE on the test data decreased from 12,893.35 to 3,892.27 after repeated data cleansing.
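As a rough illustration of removing outliers beyond the upper fence, the sketch below assumes the usual box-plot definition, Q3 + 1.5 × IQR, and drops every row above it in one pass. The multiplier, the helper name `drop_above_upper_fence`, and the sample data are assumptions; the project removed outliers step by step, which this sketch does not reproduce.

```python
import pandas as pd

def drop_above_upper_fence(df, column, k=1.5):
    """Return the rows whose `column` value is at or below Q3 + k * IQR."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    upper_fence = q3 + k * (q3 - q1)
    return df[df[column] <= upper_fence]

# Made-up audience figures in millions; the last row is an extreme hit
movie = pd.DataFrame({"Audience": [0.1, 0.2, 0.3, 0.4, 0.5, 12.0]})
trimmed = drop_above_upper_fence(movie, "Audience")
print(len(movie), len(trimmed))
```

Refitting the model on the trimmed data and comparing test RMSE before and after is one way to see the effect of each removal.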

Conclusion

One-hot encoding and outlier removal worked well for predicting 'Audience'. For 'Sales', although the RMSE decreased greatly, we cannot say that the same process predicted well. First, the values are too large to judge intuitively whether the final RMSE is small enough to count as a good prediction. Second, there was no cost data for each movie, which we think is an important feature for estimating revenue. (Detailed procedures and graph outcomes can be viewed in the Jupyter Notebook in the Sales_analysis folder.)

  • The prediction results of the first two approaches differed according to whether an ordinal encoder or one-hot encoding was used.
  • Removing outliers greatly helped decrease the RMSE.
  • There could have been other ways to handle values less than 1 and the skewed data:
    • removing rows that fall below a threshold we choose (e.g. drop rows with sales below 1,000,000; since Sales is in millions, the threshold is 1)
      dele = movie[movie['Sales'] < 1].index
      movie.drop(dele, inplace=True)
      
    • taking the log of Sales to reduce the skew of the data
      movie['log_sales'] = np.log1p(movie['Sales'])
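To make the contrast between ordinal and one-hot encoding mentioned in the list above concrete, here is a small pandas sketch. It uses `astype("category").cat.codes` as a stand-in for an ordinal encoder and `pd.get_dummies` for one-hot encoding; the `Genre` values are made up, and the notebooks may use different tools.

```python
import pandas as pd

genres = pd.DataFrame({"Genre": ["Drama", "Action", "Comedy", "Drama"]})

# Ordinal-style encoding: one integer code per category (alphabetical order here)
ordinal = genres["Genre"].astype("category").cat.codes

# One-hot encoding: one 0/1 indicator column per category
one_hot = pd.get_dummies(genres["Genre"], prefix="Genre", dtype=int)

print(ordinal.tolist())
print(one_hot)
```

With integer codes, a linear model treats the categories as if they were ordered magnitudes, while one-hot columns give each category its own coefficient.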
      

Built with

  • 김예지
    • Data cleansing, visualization, and model testing on Sales.
    • Uploaded all Jupyter notebook code files; ran additional trials of removing data and converting to log values after the presentation and conclusion.
    • Github : https://github.com/yeji0701
  • 방희란

Acknowledgements