A dataset of records on short-term property rentals of entire homes was provided for analysis. SAS Studio was used to apply statistical modelling techniques to the dataset, and a detailed report of the analysis and results was produced.
The following table gives a description of the variables in the dataset.
For this project, a data set containing records on short-term property rentals of entire homes was given for critical analysis. As a basic overview, the dataset has 30 columns and 2095 rows covering host details, property details, review information and review scores. Among the 30 columns, there are 4 nominal, 2 ordinal, 14 discrete and 8 continuous variables, plus 2 observation identifiers (id, host_id). The nominal variables are host_is_superhost, host_has_profile_pic, host_identity_verified and property_type; the ordinal variables are host_response_time and bathrooms_text; the discrete variables are host_since, host_listings_count, accommodates, bedrooms, beds, minimum_nights, maximum_nights, availability_30, availability_60, availability_90, availability_365, number_of_reviews, number_of_reviews_ltm and number_of_reviews_l30d; the continuous variables are price, review_scores_rating, review_scores_accuracy, review_scores_cleanliness, review_scores_communication, review_scores_location, review_scores_value and reviews_per_month.
The analysis objectives of this project are as follows:
- To estimate the relationship between the daily price of property rentals (price) and other variables related to property details and review scores in this dataset.
- To estimate the relationship between host_is_superhost and other variables related to the host details and review scores predictors.
- To test whether the rating score for ease of communication (review_scores_communication) is affected by the host's response time (host_response_time_num).
To achieve objective 1, linear regression analysis will be conducted as the response variable (price) is a numerical variable. To achieve objective 2, binary logistic regression analysis will be performed as the response variable (host_is_superhost) is a categorical variable. To reach objective 3, analysis of variance (ANOVA) will be conducted to test the relationship between the categorical variable (host_response_time_num) and numeric variable (review_scores_communication) by testing the difference between the population means of review_scores_communication grouped by host_response_time_num. SAS Studio is used as the SAS programming interface to perform analysis on our data set for this project.
Before performing statistical modelling and analysis, descriptive analysis techniques are deployed to summarize and explore the behaviour of the data involved in the study. Statistical techniques such as frequency distribution, measures of central tendency and measures of dispersion were used. Furthermore, distribution plots and box plots are generated to visualize the distribution of values for numeric variables. Appropriate data pre-processing techniques were also deployed during the descriptive analysis procedure.
To get an overview of the data set, we first examined the PROC CONTENTS output, which reports the metadata of each variable as interpreted by SAS Studio (see Figure 1).
From Figure 1, it was identified that the categorical variable bathrooms_text should be cleaned and converted into a numerical variable for further analysis. Figure 2 shows the observed values of bathrooms_text alongside a new variable named bathrooms, which holds the converted numerical values.
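The conversion can be sketched as follows. This is a minimal Python illustration rather than the SAS code used in the report; the function name and the sample strings (such as "1.5 baths" and "Half-bath") are assumptions about how the text values look, since the raw values are not reproduced here.

```python
import re


def bathrooms_from_text(text):
    """Return the number of bathrooms found in a free-text description, or None.

    Hypothetical helper: assumes the count appears as a leading number
    (e.g. "1.5 baths") or as the word "half" (e.g. "Half-bath").
    """
    if text is None:
        return None
    lowered = text.strip().lower()
    match = re.search(r"\d+(?:\.\d+)?", lowered)
    if match:
        return float(match.group())
    if "half" in lowered:
        return 0.5
    return None  # unrecognised text is treated as missing


# Assumed example strings, for illustration only.
for sample in ["1 bath", "1.5 baths", "Half-bath"]:
    print(sample, "->", bathrooms_from_text(sample))
```

Returning None for unrecognised text keeps such rows visible as missing values instead of silently coding them as zero.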
A frequency table is generated for each categorical variable, namely host_is_superhost, host_has_profile_pic, host_identity_verified and property_type (see Figures 3 to 7).
It is observed that the levels of the host_response_time variable can be sorted into a natural order, with "within an hour" being the shortest response time and "a few days or more" being the longest. Therefore, the host_response_time variable is encoded as a numeric variable. The values "within an hour", "within a few hours", "within a day" and "a few days or more" are encoded as the numbers 1 to 4 respectively. The encoded values are stored in a new variable named host_response_time_num (see Appendix Figure 4 for code).
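For illustration only, the same ordinal encoding can be sketched in Python. The report's own implementation is the SAS code in the appendix; the dictionary and function names below are hypothetical.

```python
# Hypothetical mapping that mirrors the ordinal encoding described in the text.
RESPONSE_TIME_ORDER = {
    "within an hour": 1,
    "within a few hours": 2,
    "within a day": 3,
    "a few days or more": 4,
}


def encode_response_time(value):
    """Return the ordinal code for a response-time label, or None if missing or unrecognised."""
    if value is None:
        return None
    return RESPONSE_TIME_ORDER.get(value.strip().lower())


print(encode_response_time("within a day"))
```

Labels outside the four known levels map to None, so they remain missing rather than receiving an arbitrary code.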
After pre-processing our data, summary statistics are generated for each numeric variable. In Figure 8, the summary statistics table shows basic statistical measures such as the mean, median, range, standard deviation, minimum, maximum, number of observations and number of missing values of the variables. It is observed that there are quite a number of missing values for the variables bedrooms, beds, review_scores_rating, review_scores_accuracy, review_scores_cleanliness, review_scores_communication, review_scores_location, review_scores_value and reviews_per_month. From the mean, median, range, standard deviation, minimum and maximum of the variables, we do not identify any data anomaly.
To visualize the distribution of values for each numeric variable and to detect outliers in our data, a distribution plot and a box plot are generated for each numeric variable (see Figure 9). From the box plots, it is apparent that all variables except availability_30, availability_60, availability_90 and availability_365 have some potential outliers. Therefore, the outliers have to be taken into consideration, and further investigation is needed to identify whether they are true outliers or the result of faulty data. Furthermore, it is observed that the variables host_listings_count, bathrooms, bedrooms, beds, price, minimum_nights, number_of_reviews, number_of_reviews_ltm, number_of_reviews_l30d, review_scores_rating, review_scores_accuracy, review_scores_cleanliness, review_scores_communication, review_scores_location, review_scores_value and reviews_per_month have highly skewed distributions.
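As a hedged illustration of how a box plot flags potential outliers, the sketch below applies the conventional 1.5 × IQR fence rule in Python. The fence multiplier, the function name and the toy prices are assumptions made for this example; quartile definitions also vary slightly between software packages, so the flagged points may not match the SAS plots exactly.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying outside the box plot fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Toy prices invented for illustration only (not taken from the dataset).
toy_prices = [50, 60, 65, 70, 75, 80, 90, 400]
print(iqr_outliers(toy_prices))
```

A point flagged this way is only a candidate outlier; whether it reflects faulty data still has to be checked against the record itself.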
After performing descriptive analysis on our data set, statistical modelling and analysis is conducted to meet the objectives of this study. The following content in this section will be divided into 3 parts for 3 different statistical techniques:
- Linear Regression: Explanatory Analysis on the Price of Property Rentals (price) and other variables related to property details and review scores
- Logistic Regression: Explanatory Analysis on host_is_superhost and other variables related to the host details and review scores predictors.
- ANOVA: Compare the means of review_scores_communication with different host_response_time_num
To achieve objective 1, linear regression analysis will be conducted as the response variable (price) is a numerical variable. This section aims to estimate the relationship between the price of property rentals and the potential predictor variables, namely host_listings_count, accommodates, bathrooms, bedrooms, beds, availability_30, availability_60, availability_90, availability_365, number_of_reviews_l30d, review_scores_rating, review_scores_accuracy, review_scores_communication, review_scores_location, review_scores_value, number_of_reviews_ltm, minimum_nights, maximum_nights, host_response_time_num, reviews_per_month, number_of_reviews and review_scores_cleanliness.
Before performing statistical modelling to investigate the relationship between the price of property rentals and other variables, a scatter plot matrix is constructed to investigate the linear relationships between variables and to check for outliers. As seen in Figure 10, the variable price and another 21 continuous variables are plotted against each other. The plots suggest that accommodates, bedrooms, bathrooms and beds have a moderate linear correlation with price. Other variables such as host_listings_count, availability_30, availability_60, availability_90, availability_365, minimum_nights, maximum_nights, number_of_reviews_ltm, reviews_per_month, number_of_reviews_l30d, review_scores_rating, review_scores_accuracy, review_scores_communication, review_scores_location, review_scores_value, number_of_reviews and review_scores_cleanliness do not appear to have a clear linear relationship with price.
Model selection techniques are then deployed to select the most suitable variables before constructing the linear regression model. The model selection procedures deployed are backward elimination and stepwise selection. As seen in Figure 11 and Figure 12, out of the 22 variables entered into the linear regression model, only 14 are selected by the variable selection algorithms. The 14 variables suggested by both backward elimination and stepwise selection as the most important variables for fitting the observed data are host_listings_count, accommodates, bathrooms, bedrooms, beds, availability_30, availability_60, availability_365, number_of_reviews_l30d, review_scores_rating, review_scores_accuracy, review_scores_communication, review_scores_location and review_scores_value. The variables suggested for removal from the linear regression model are number_of_reviews_ltm, minimum_nights, host_response_time_num, reviews_per_month, availability_90, number_of_reviews, maximum_nights and review_scores_cleanliness.
The output of the regression model in Figure 13 is interpreted and analysed. The model has an R-Square value of 0.6320, so 63.2% of the variation in property rental price is explained by the variation in host_listings_count, accommodates, bathrooms, bedrooms, beds, availability_30, availability_60, availability_365, number_of_reviews_l30d, review_scores_rating, review_scores_accuracy, review_scores_communication, review_scores_location and review_scores_value. The Adjusted R-Square value is 0.6291, so 62.91% of the variation in property rental price is explained by the regression model after adjusting for the number of independent variables and the sample size. The coefficient of variation is 47.71, meaning that the root mean square error is about 47.7% of the mean price, which suggests a moderate model fit. Furthermore, the variance inflation factor (VIF) values suggest that there is no collinearity problem in the model, since none of the VIF values is larger than 10.
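For reference, the fit statistics quoted above follow the standard definitions below. The symbols n (number of observations used), k (number of predictors), SSE, SST, MSE and R_j^2 (the R-Square obtained by regressing predictor j on the remaining predictors) are notation introduced here for illustration and do not appear in the report.

```latex
\begin{aligned}
R^2 &= 1 - \frac{SSE}{SST}, \\
R^2_{\mathrm{adj}} &= 1 - \left(1 - R^2\right)\frac{n - 1}{n - k - 1}, \\
CV &= 100 \times \frac{\sqrt{MSE}}{\bar{y}}, \\
VIF_j &= \frac{1}{1 - R_j^2}.
\end{aligned}
```

Under these definitions a VIF of 10 corresponds to R_j^2 = 0.9, which is the usual rule-of-thumb threshold for collinearity.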
The sample regression equation for the model is
Inference on Collective Influence
H0: There is no linear relationship between the response variable and the explanatory variables.
H1: There is a linear relationship between the response variable and at least one of the explanatory variables.
To determine the collective influence of the explanatory variables in this dataset, it is required to perform an overall F-test for the hypothesis testing procedure. Based on Figure 13, the F-value is 215.76 and the corresponding p-value is < 0.0001; therefore, the null hypothesis is rejected at the 0.05 level of significance (α = 0.05). There is sufficient evidence to conclude that at least one of the explanatory variables has a significant effect on the response variable. Next, the test for the significance of the individual regression coefficients is conducted to determine which explanatory variables have a significant effect on the response variable.
Inference for Individual Regression Coefficients & Confidence Interval Estimate for the Slope
H0: β1 = 0
H1: β1 ≠ 0
where β1 is the partial regression coefficient for X1 (host_listings_count). The test statistic (t-value) for host_listings_count is -8.15 with a corresponding p-value < 0.0001, so H0 is rejected at significance level α = 0.05. There is sufficient evidence to conclude that host_listings_count has a significant relationship with price, controlling for the other variables. Controlling for the other explanatory variables in the model, the 95% confidence interval for β1 is (-0.2083, -0.1275). We are 95% confident that for every unit increase in host_listings_count, the predicted daily rental price decreases by between $0.1275 and $0.2083.
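As a rough consistency check, the reported t-value can be recovered from the confidence interval alone: the midpoint of the interval is the coefficient estimate, and the half-width divided by the critical value is its standard error. The Python sketch below assumes the sample is large enough for the normal critical value (about 1.96) to approximate the t critical value; the function name is hypothetical.

```python
from statistics import NormalDist


def t_from_interval(lower, upper, level=0.95):
    """Recover the estimate, standard error and t ratio from a symmetric confidence interval.

    Uses the normal critical value as a large-sample approximation to the t critical value.
    """
    critical = NormalDist().inv_cdf(0.5 + level / 2)
    estimate = (lower + upper) / 2
    std_error = (upper - lower) / (2 * critical)
    return estimate, std_error, estimate / std_error


# 95% confidence interval for host_listings_count as reported in the text.
estimate, std_error, t_ratio = t_from_interval(-0.2083, -0.1275)
print(round(estimate, 4), round(std_error, 4), round(t_ratio, 2))
```

The recovered ratio agrees closely with the reported t-value of -8.15, and the same check can be applied to the other intervals in this section.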
H0: β2 = 0
H1: β2 ≠ 0
where β2 is the partial regression coefficient for X2 (accommodates). The test statistic (t-value) for accommodates is 4.35 with a corresponding p-value < 0.0001, so H0 is rejected at significance level α = 0.05. There is sufficient evidence to conclude that accommodates has a significant relationship with price, controlling for the other variables. Controlling for the other explanatory variables in the model, the 95% confidence interval for β2 is (5.2025, 13.7543). We are 95% confident that for every unit increase in accommodates, the predicted daily rental price increases by between $5.2025 and $13.7543.
H0: β3 = 0
H1: β3 ≠ 0
where β3 is the partial regression coefficient for X3 (bathrooms). The test statistic (t-value) for bathrooms is 16.65 with a corresponding p-value < 0.0001, so H0 is rejected at significance level α = 0.05. There is sufficient evidence to conclude that bathrooms has a significant relationship with price, controlling for the other variables. Controlling for the other explanatory variables in the model, the 95% confidence interval for β3 is (81.0634, 102.7092). We are 95% confident that for every unit increase in bathrooms, the predicted daily rental price increases by between $81.0634 and $102.7092.
H0: β4 = 0
H1: β4 ≠ 0
where β4 is the partial regression coefficient for X4 (bedrooms). The test statistic (t-value) for bedrooms is 3.45 with a corresponding p-value of 0.0006, which is smaller than α = 0.05, so H0 is rejected at significance level α = 0.05. There is sufficient evidence to conclude that bedrooms has a significant relationship with price, controlling for the other variables.
H0: β5 = 0
H1: β5 ≠ 0
where β5 is the partial regression coefficient for X5 (beds). The test statistic (t-value) for beds is 3.47 with a corresponding p-value of 0.0006, which is smaller than α = 0.05, so H0 is rejected at significance level α = 0.05. There is sufficient evidence to conclude that beds has a significant relationship with price, controlling for the other variables.
H0: β6 = 0
H1: β6 ≠ 0
where β6 is the partial regression coefficient for X6 (availability_30). The test statistic (t-value) for availability_30 is 5.52 with a corresponding p-value < 0.0001, so H0 is rejected at significance level α = 0.05. There is sufficient evidence to conclude that availability_30 has a significant relationship with price, controlling for the other variables. Controlling for the other explanatory variables in the model, the 95% confidence interval for β6 is (2.823, 5.9304). We are 95% confident that for every unit increase in availability_30, the predicted daily rental price increases by between $2.823 and $5.9304.
H0: β7 = 0
H1: β7 ≠ 0
where β7 is the partial regression coefficient for X7 (availability_60). The test statistic (t-value) for availability_60 is -4.4 with a corresponding p-value < 0.0001, so H0 is rejected at significance level α = 0.05. There is sufficient evidence to conclude that availability_60 has a significant relationship with price, controlling for the other variables. Controlling for the other explanatory variables in the model, the 95% confidence interval for β7 is (-2.6268, -1.008). We are 95% confident that for every unit increase in availability_60, the predicted daily rental price decreases by between $1.008 and $2.6268.
H0: β8 = 0
H1: β8 ≠ 0
where β8 is the partial regression coefficient for X8 (availability_365). The test statistic (t-value) for availability_365 is 6.01 with a corresponding p-value < 0.0001, so H0 is rejected at significance level α = 0.05. There is sufficient evidence to conclude that availability_365 has a significant relationship with price, controlling for the other variables. Controlling for the other explanatory variables in the model, the 95% confidence interval for β8 is (0.1247, 0.2455). We are 95% confident that for every unit increase in availability_365, the predicted daily rental price increases by between $0.1247 and $0.2455.
H0: β9 = 0
H1: β9 ≠ 0
where β9 is the partial regression coefficient for X9 (number_of_reviews_l30d). The test statistic (t-value) for number_of_reviews_l30d is -3.97 with a corresponding p-value < 0.0001, so H0 is rejected at significance level α = 0.05. There is sufficient evidence to conclude that number_of_reviews_l30d has a significant relationship with price, controlling for the other variables. Controlling for the other explanatory variables in the model, the 95% confidence interval for β9 is (-9.8218, -3.3282). We are 95% confident that for every unit increase in number_of_reviews_l30d, the predicted daily rental price decreases by between $3.3282 and $9.8218.
H0: β10 = 0
H1: β10 ≠ 0
where β10 is the partial regression coefficient for X10 (review_scores_rating). The test statistic (t-value) for review_scores_rating is 4.81 with a corresponding p-value < 0.0001, so H0 is rejected at significance level α = 0.05. There is sufficient evidence to conclude that review_scores_rating has a significant relationship with price, controlling for the other variables. Controlling for the other explanatory variables in the model, the 95% confidence interval for β10 is (66.2314, 157.4406). We are 95% confident that for every unit increase in review_scores_rating, the predicted daily rental price increases by between $66.2314 and $157.4406.
H0: β11 = 0
H1: β11 ≠ 0
where β11 is the partial regression coefficient for X11 (review_scores_accuracy). The test statistic (t-value) for review_scores_accuracy is -2.77 with a corresponding p-value of 0.0057, which is smaller than α = 0.05, so H0 is rejected at significance level α = 0.05. There is sufficient evidence to conclude that review_scores_accuracy has a significant relationship with price, controlling for the other variables.
H0: β12 = 0
H1: β12 ≠ 0
where β12 is the partial regression coefficient for X12 (review_scores_communication). The test statistic (t-value) for review_scores_communication is -4.85 with a corresponding p-value < 0.0001, so H0 is rejected at significance level α = 0.05. There is sufficient evidence to conclude that review_scores_communication has a significant relationship with price, controlling for the other variables. Controlling for the other explanatory variables in the model, the 95% confidence interval for β12 is (-106.783, -45.3133). We are 95% confident that for every unit increase in review_scores_communication, the predicted daily rental price decreases by between $45.3133 and $106.783.
H0: β13 = 0
H1: β13 ≠ 0
where β13 is the partial regression coefficient for X13 (review_scores_location). The test statistic (t-value) for review_scores_location is 9.55 with a corresponding p-value < 0.0001, so H0 is rejected at significance level α = 0.05. There is sufficient evidence to conclude that review_scores_location has a significant relationship with price, controlling for the other variables. Controlling for the other explanatory variables in the model, the 95% confidence interval for β13 is (91.2376, 138.3913). We are 95% confident that for every unit increase in review_scores_location, the predicted daily rental price increases by between $91.2376 and $138.3913.
H0: β14 = 0
H1: β14 ≠ 0
where β14 is the partial regression coefficient for X14 (review_scores_value). The test statistic (t-value) for review_scores_value is -4.87 with a corresponding p-value < 0.0001, so H0 is rejected at significance level α = 0.05. There is sufficient evidence to conclude that review_scores_value has a significant relationship with price, controlling for the other variables. Controlling for the other explanatory variables in the model, the 95% confidence interval for β14 is (-139.0239, -59.1508). We are 95% confident that for every unit increase in review_scores_value, the predicted daily rental price decreases by between $59.1508 and $139.0239.
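The decision rule used in each test above compares the p-value with the chosen significance level α = 0.05, not with the display threshold 0.0001. A minimal Python sketch of that rule, applied to the p-values reported in this section, is given below; p-values printed as "< 0.0001" are entered as the upper bound 0.0001, which is an assumption made only so that the comparison can be run.

```python
ALPHA = 0.05

# p-values as reported in the text; "< 0.0001" is entered as its upper bound 0.0001.
reported_p_values = {
    "host_listings_count": 0.0001,
    "accommodates": 0.0001,
    "bathrooms": 0.0001,
    "bedrooms": 0.0006,
    "beds": 0.0006,
    "availability_30": 0.0001,
    "availability_60": 0.0001,
    "availability_365": 0.0001,
    "number_of_reviews_l30d": 0.0001,
    "review_scores_rating": 0.0001,
    "review_scores_accuracy": 0.0057,
    "review_scores_communication": 0.0001,
    "review_scores_location": 0.0001,
    "review_scores_value": 0.0001,
}


def reject_null(p_value, alpha=ALPHA):
    """Reject H0 when the p-value is smaller than the significance level."""
    return p_value < alpha


for name, p_value in reported_p_values.items():
    decision = "reject H0" if reject_null(p_value) else "do not reject H0"
    print(f"{name}: p = {p_value} -> {decision}")
```

Under this rule every p-value listed here leads to rejection of H0 at the 0.05 level, including 0.0006 and 0.0057.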
To verify that the F-test and t-tests used in hypothesis testing for our linear regression model are reliable, it is necessary to deploy regression diagnostics to ensure that the standard regression assumptions are satisfied. Regression diagnostic plots, namely the Normal Quantile-Quantile (Q-Q) plot, Studentized Deleted Residuals (RStudent) plot, Cook's Distance (Cook's D) plot, Difference in Fit (DFFit) plot and Difference in Beta (DFBeta) plot, are generated to check the normality of the residuals and to identify high leverage points and outliers that are potentially influential.
Based on the plot of the residuals against the normal quantiles (Q-Q plot) in Figure 14, there is no serious violation of the normality assumption, although there is a slight deviation at the tails of the data. Based on the kernel density plot in Figure 14, the density curve is slightly skewed to the right, but not severely enough to violate the normality assumption. The Q-Q plot is consistent with this conclusion.
To get a closer look at the RStudent plot and Cook's D plot in Figure 14, larger versions of the plots are generated in Figure 15. In addition to the RStudent plot and Cook's D plot, the DFFit plot and DFBeta plot are also generated to identify high leverage points and outliers that are potentially influential. In Figure 15, the RStudent plot shows a considerable number of observations beyond two standard errors from the mean of 0. The Cook's D plot and DFFit plot show that there are several potentially influential observations in the dataset, particularly observations #52053631, #50916991 and #47581743. To see which parameters these influential points might influence the most, the DFBeta plot is examined. Based on the DFBeta plot, observation #52053631 is influential because of its effects on review_scores_communication, review_scores_accuracy and review_scores_rating; observation #50916991 is influential because of its effects on review_scores_location; observation #47581743 is influential because of its effects on bathrooms. These observations were analysed to ensure that they are not faulty data. After inspection of the suspicious influential points, no faulty data was found; therefore, no observations were removed.
The second objective of our study is to estimate the relationship between host_is_superhost and other variables related to the host details and review scores predictors. As such, binary logistic regression analysis is performed with the variable host_is_superhost as the response variable and the variables host_since, host_response_time_num, host_listings_count, host_has_profile_pic, host_identity_verifi and review_scores_value as the predictor variables.
Prior to moving on to the fully specified model, bivariate summaries of the host_is_superhost variable and the individual predictors are examined to understand the associations between them. Figure 16 shows a bar chart which compares host_is_superhost and host_response_time_num. It is observed that the count of true (t) values is slightly higher than the count of false (f) values for host_is_superhost when grouped by host_response_time_num. In Figure 17, the bar chart of host_is_superhost versus host_has_profile_pic shows that the majority of hosts have a profile picture and that every superhost has a profile picture. Based on the bar chart of host_is_superhost versus host_identity_verified in Figure 18, the count of false (f) values is slightly higher than the count of true (t) values for host_is_superhost when grouped by host_identity_verifi. Figure 19 illustrates a bar chart of host_is_superhost versus host_listings_count. It is observed that the majority of superhosts have relatively few property listings, whereas the majority of non-superhosts have relatively many. Figure 20 shows a histogram of host_is_superhost versus host_since. The count of superhosts peaks higher than the count of non-superhosts when host_since is before 2017, whereas the count of non-superhosts peaks higher than the count of superhosts when host_since is after 2017. This suggests that a host is more likely to be a superhost when host_since is before 2017 and less likely to be a superhost when host_since is after 2017. It may also suggest that the earlier a host starts hosting, the higher the likelihood that the host is a superhost.
Figure 21 provides information on the model, the data set, the response variable, the number of response levels, the type of model, the algorithm used to obtain the parameter estimates, and the number of observations read and used in this model. The variable host_is_superhost has two response levels, true (t) and false (f); therefore, the model type is "binary logit".
The Model Fit Statistics table in Figure 22 provides three goodness-of-fit measures, namely Akaike's Information Criterion (AIC), the Schwarz Criterion (SC) and -2 Log L. Comparing the values in the "Intercept Only" column with those in the "Intercept and Covariates" column, the "Intercept and Covariates" column has the smaller values, which implies that the model with covariates fits the data set better than the intercept-only model.
**Inference on Collective Influence**
Based on the output results of the Testing Global Null Hypothesis table in Figure 22, H0 is rejected since the p-values for all three tests, namely the likelihood ratio test, score test and Wald test, are < 0.0001. At the 0.05 significance level, collectively the predictor variables are significant, indicating that at least one of the predictors in the model is useful in predicting whether a host is a superhost.
From the Analysis of Maximum Likelihood Estimates table in Figure 23, we obtain the parameter estimates β0 = -10.5013, β1 = -0.00021, β2 = -1.2599, β3 = -0.0142, β4 = -10.7784, β5 = -0.3387 and β6 = 3.5449. Given that reference cell coding was used in this analysis, each effect is measured against the reference level.
Inference for Individual Regression Coefficients
Based on the Type 3 Analysis of Effect Table in Figure 23, let
H0: β1 = 0
H1: β1 ≠ 0
where β1 is the partial regression coefficient for X1 (host_since). The Wald Chi-Square test statistic for host_since is 10.8199 with a corresponding p-value of 0.0010, which is smaller than α = 0.05, so the null hypothesis is rejected at significance level α = 0.05. host_since is significant in predicting whether a host is a superhost, controlling for the other variables.
H0: β2 = 0
H1: β2 ≠ 0
where β2 is the partial regression coefficient for X2 (host_response_time_num). The Wald Chi-Square test statistic for host_response_time_num is 39.5837 with a corresponding p-value < 0.0001, so the null hypothesis is rejected at significance level α = 0.05. host_response_time_num is significant in predicting whether a host is a superhost, controlling for the other variables.
H0: β3 = 0
H1: β3 ≠ 0
where β3 is the partial regression coefficient for X3 (host_listings_count). The Wald Chi-Square test statistic for host_listings_count is 59.6846 with a corresponding p-value < 0.0001, so the null hypothesis is rejected at significance level α = 0.05. host_listings_count is significant in predicting whether a host is a superhost, controlling for the other variables.
H0: β4 = 0
H1: β4 ≠ 0
where β4 is the partial regression coefficient for X4 (host_has_profile_pic). The Wald Chi-Square test statistic for host_has_profile_pic is 0.0004 with a corresponding p-value of 0.9832, which is greater than α = 0.05, so the null hypothesis is not rejected at significance level α = 0.05. host_has_profile_pic is not significant in predicting whether a host is a superhost, controlling for the other variables.
H0: β5 = 0
H1: β5 ≠ 0
where β5 is the partial regression coefficient for X5 (host_identity_verifi). The Wald Chi-Square test statistic for host_identity_verifi is 5.6249 with a corresponding p-value of 0.0177, which is smaller than α = 0.05, so the null hypothesis is rejected at significance level α = 0.05. host_identity_verifi is significant in predicting whether a host is a superhost, controlling for the other variables.
H0: β6 = 0
H1: β6 ≠ 0
where β6 is the partial regression coefficient for X6 (review_scores_value). The Wald Chi-Square test statistic for review_scores_value is 124.6312 with a corresponding p-value < 0.0001, so the null hypothesis is rejected at significance level α = 0.05. review_scores_value is significant in predicting whether a host is a superhost, controlling for the other variables.
Based on the Association of Predicted Probabilities and Observed Responses table in Figure 24, the c (concordance) statistic has a value of 0.809, indicating that 80.9% of the positive and negative response pairs (host_is_superhost) are correctly sorted using host_since, host_response_time_num, host_listings_count, host_has_profile_pic, host_identity_verifi and review_scores_value. This shows that the model built on these predictors has a strong ability to discriminate between hosts who are superhosts and hosts who are not.
The Odds Ratios table in Figure 24 shows that an increase of 10 in host_listings_count has an odds ratio of 0.868, which corresponds to a (1 - 0.868) × 100% = 13.2% decrease in the odds of a host being a superhost. This suggests that the larger the host_listings_count, the less likely a host is to be a superhost.
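The 10-unit odds ratio can be reproduced from the reported coefficient for host_listings_count, since the odds ratio for a change of c units is exp(c × β). The short Python sketch below is illustrative only; the constant and function names are hypothetical, and the coefficient is the rounded value quoted in the text.

```python
import math

# Coefficient for host_listings_count as reported in the text (rounded).
BETA_LISTINGS = -0.0142


def odds_ratio(beta, units=1):
    """Odds ratio associated with an increase of `units` in a predictor with coefficient beta."""
    return math.exp(beta * units)


or_10 = odds_ratio(BETA_LISTINGS, units=10)
percent_change = (or_10 - 1) * 100
print(f"odds ratio for +10 listings: {or_10:.3f}")
print(f"change in odds: {percent_change:.1f}%")
```

The same function with units=1 gives the per-listing odds ratio, which lies inside the 0.982 to 0.989 interval reported for host_listings_count.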
Figure 25 shows the odds ratio plot with the Wald confidence limits of our model. Based on the Odds Ratio Estimates table in Figure 24, we are 95% confident that the true odds ratio of host_since falls between 1.000 and 1.000; the true odds ratio of host_response_time_num falls between 0.192 and 0.420; the true odds ratio of host_listings_count falls between 0.982 and 0.989; the true odds ratio of host_has_profile_pic falls between <0.001 and >999.999; the true odds ratio of host_identity_verifi falls between 0.539 and 0.943; and the true odds ratio of review_scores_value falls between 18.589 and 64.538. In Figure 25, it is observed that the estimates for host_response_time_num, host_listings_count and host_identity_verifi are less than 1, whereas the estimate for review_scores_value is greater than 1. The interval for host_has_profile_pic spans the reference line at odds ratio = 1, so its effect is not significant at the 0.05 significance level. The interval for host_since is displayed as 1.000 to 1.000 because its per-unit effect is very small; at three decimal places the plot cannot show whether the interval excludes 1, so its Wald test p-value of 0.0010 is the more reliable guide to its significance.
The effects plot in Figure 26 shows the predicted probability that a host is a superhost across the combinations of levels of three predictor variables (host_since, host_has_profile_pic and host_identity_verifi). The probability that host_is_superhost is true decreases as the year of host_since increases, which suggests that the earlier a host starts hosting, the higher the probability that the host is a superhost. Furthermore, the plot suggests that a host who has a profile picture and a verified identity has the highest probability of being a superhost, followed by a host who has a profile picture but an unverified identity. A host without a profile picture has little to no probability of being a superhost, whether or not the host's identity is verified.
The third objective of this study is to test whether the rating score for ease of communication (review_scores_communication) is affected by the host's response time (host_response_time_num). To reach this objective, analysis of variance (ANOVA) will be conducted to test the relationship between the categorical variable (host_response_time_num) and the numeric variable (review_scores_communication) by testing the difference between the population means of review_scores_communication grouped by host_response_time_num.
Figure 27 shows the box-and-whisker plot of review_scores_communication grouped by host_response_time_num. The plot shows no visible difference between the boxes: all of them sit near the value 5 of review_scores_communication. This suggests that the four host_response_time_num values may have the same mean review_scores_communication. However, the values of review_scores_communication for host_response_time_num = 1 are more scattered, ranging from 1 to 5.
Based on the analysis of variance table in Figure 28, the reported F-value is 0.77 and the corresponding p-value is 0.5090, which is greater than 0.05; therefore, we do not reject H0 at the 0.05 level of significance (α = 0.05). There is insufficient evidence to conclude that there is a statistically significant difference between the means of review_scores_communication. In other words, the data are consistent with the four host_response_time_num values having the same mean review_scores_communication. Furthermore, the R-Square value of our model is 0.0012, so host_response_time_num explains only about 0.12% of the variability of review_scores_communication. The overall mean of review_scores_communication is 4.8407 and the root mean square error (RMSE) is 0.0665.
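To make the quantities in the ANOVA table concrete, the sketch below computes the F-value, R-Square and root mean square error for a one-way layout by hand in Python. It is an illustration only, not the SAS procedure used in this report: the function name and the toy scores are hypothetical and are not values from the dataset. The p-value is omitted because it requires the F distribution, which the Python standard library does not provide.

```python
import math


def one_way_anova(groups):
    """Return F-value, R-Square, root MSE and degrees of freedom for a one-way ANOVA."""
    n_total = sum(len(g) for g in groups)
    k = len(groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Between-group sum of squares: spread of group means around the grand mean.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Within-group sum of squares: spread of observations around their group mean.
    ss_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    )

    df_between = k - 1
    df_within = n_total - k
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within

    return {
        "f_value": ms_between / ms_within,
        "r_square": ss_between / (ss_between + ss_within),
        "root_mse": math.sqrt(ms_within),
        "df_between": df_between,
        "df_within": df_within,
    }


# Hypothetical scores for four response-time groups (NOT taken from the dataset).
toy_scores = [[1, 2, 3], [2, 3, 4], [3, 4, 5], [2, 3, 4]]
result = one_way_anova(toy_scores)
print(result)
```

A small F-value means the variation between the group means is small relative to the variation within the groups, which is the situation reported in Figure 28.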
Figure 29 shows the diffogram of the review_scores_communication comparisons for host_response_time_num. All of the confidence limits for the pairwise differences cross the diagonal equivalence line; therefore, there is no significant difference between any pair of host_response_time_num levels 1 to 4.
In summary, the objectives of this study are to estimate the relationship between the daily price of property rentals and other variables related to property details and review scores; to estimate the relationship between host_is_superhost and other variables related to the host details and review scores predictors; and to test whether the ratings score for ease of communication is affected by the host's response time. For the first objective, linear regression analysis was conducted and it was found that 63.2% of the variation in property rental price is explained by the variation in host_listings_count, accommodates, bathrooms, bedrooms, beds, availability_30, availability_60, availability_365, number_of_reviews_130d, review_scores_rating, review_scores_accuracy, review_scores_communication, review_scores_location and review_scores_value. Controlling for the other variables, the variables that have a significant relationship with price are host_listings_count, accommodates, bathrooms, availability_30, availability_60, availability_365, number_of_reviews_130d, review_scores_rating, review_scores_communication, review_scores_location and review_scores_value. For the second objective, logistic regression analysis was conducted and it was found that 80.9% of the positive and negative response pairs (host_is_superhost) are correctly sorted using host_since, host_response_time_num, host_listings_count, host_has_profile_pic, host_identity_verifi and review_scores_value. Controlling for the other variables, the variables that have a significant relationship with host_is_superhost are host_response_time_num, host_listings_count and review_scores_value. For the third objective, analysis of variance (ANOVA) was performed and it was found that there is insufficient evidence to conclude that there is a statistically significant difference between the means of review_scores_communication across the different levels of host_response_time_num.
Therefore, there is no evidence in this dataset that the ratings score for ease of communication is affected by the host's response time.