- Abstract
- Introduction
- Data Description
- Empirical Models
- Results and Diagnostics
- Conclusion
- References
Is there a correlation between wine’s price and wine’s signifiers such as rating, number of ratings, and country? To answer the question, 10,000 data points on wines were web scraped from Vivino. The data was cleaned and prepared for linear regression. A significant relationship between the explanatory variables and price was found. Additionally, a logistic regression on whether the wine’s rating is above the mean wine rating was conducted. The logistic regression had significant predictive power. A simple rule of thumb is proposed from the research for buying wine in California.
Purchasing a bottle of wine is a heavy responsibility. The quality of the event depends on this small decision. But what if you do not know anything about wine? Is there a way to make a safe decision? Is more expensive wine better? I have web scraped over 10,000 red wines in $4 to $30 price range from Vivino.com to answer these questions. The first section will contain a description of the data collection process. The second section will discuss and visualize the data used in the analysis. The empirical model will be outlined in the third section, followed by a discussion of results and diagnostics in the fourth. The conclusion will contain possible future developments and my opinion on the project.
A wine rating is a readily accessible measure of consumer sentiment and often the only way to suppose if the wine is good. Vivino is an online wine marketplace with over 12.5 million different wines and an active community. Wine cards have geographical information such as country, region and winery as well as information about the number of ratings, rating, vintage, and price. It must be noted that data were collected in San Francisco, California, and is specific to this location. Data was collected using docker, Rselenium, and rvest combination, so every listing would be exactly what you would find yourself. Each one-dollar price range (4-30) was scraped individually. For more information on the web scraping, please, refer to the datasource.r file.
The resulting dataset contains observations of red wine in the $4 to $30 price range. Wine year was replaced with years old as of the current year (2022). Countries with less than ten wines were removed. In the data used for regression, only wines younger than ten years and with more than 30 reviews were used.
## wine_name wine_country wine_rating n_ratings
## Length:10285 Length:10285 Min. :2.500 Min. : 25.0
## Class :character Class :character 1st Qu.:3.600 1st Qu.: 68.0
## Mode :character Mode :character Median :3.800 Median : 165.0
## Mean :3.759 Mean : 518.5
## 3rd Qu.:3.900 3rd Qu.: 464.0
## Max. :4.600 Max. :24713.0
## wine_price y_old winery
## Min. : 4.25 Min. : 1.000 Length:10285
## 1st Qu.:12.99 1st Qu.: 3.000 Class :character
## Median :18.98 Median : 5.000 Mode :character
## Mean :18.58 Mean : 5.257
## 3rd Qu.:24.00 3rd Qu.: 6.000
## Max. :30.00 Max. :38.000
The relationship between the wine price and explanatory variables is summarized in the following graphs. There is an observable correlation between rating and price. Cheaper wines appear to have more reviews. There seems to be some relationship between age and price, with older wines costing more. Countries have different distributions of prices.
The following tables present distributions of explanatory variables. Wine ratings appear to be normally distributed, which might reflect Vivino’s recommendation algorithm. Low wine ratings appear to be prevalent, with a large portion settled before 500 reviews. Most of the dataset’s wines lay between the 1-5 age range, which can be expected for the price range. Lastly, products of some countries are more present in the sample due to the geographic position of the collection or simply the larger variety produced by the government. The distribution on the number of ratings appears to follow log distribution, so it will be used as a log in the regression.
The range of age variable might come from non-available for purchase or bad stock wines. The following graph examines if age leads to higher rating for each country. For some countries there is not enough data points to make a definitive answer. Nonetheless, in most cases wines younger than 5 years have higher ratings.
To estimate the relationships between variables I will be using linear
model. The regression price prediction:
Price ~ age + age^2 + country + rating + log(n_ratings) + rating:log(n_ratings)
.
Age and rating are expected to have a positive correlation, while
number_of_ratings is suspected of having a negative correlation
because affordable wines have more reviews. The variable country will be
converted into a series of dummy variables, so some countries are
generally expected to have better or worse wine, which will also be
examined using logistic regression.
The regression results suggest that age and wine_rating
positively
correlate with price, while n_ratings
have negative. Each extra year
increases the price by $1.05. A 1% increase in the number of ratings
coincides with a 8.11% decrease in price. The interaction variable
between n_ratings
and wine_rating
suggests that 1 unit increase in
rating increases marginal effect of 1% change of number of ratings on
price by 1.91%. Three of these variables have a significant relationship
at the 1% level. Countries have a variable impact on the price, with
some being better than others. Wald Test on the significance of the
country variable resulted in 352.64, which is enough to accept that the
country variable is significant at 1% significance. Level R2
of the regression is 37.65%, meaning the variability of explanatory
variables explains 37.65% of the price variation. F statistic for the
regression is 282.11, which is enough to confirm that the regression has
substantial explanatory power at 1% significance level. As for
heteroskedasticity Breusch-Pagan test was used yielding 1.9825018^{-20},
which allows us to conclude that there is heteroskedasticity at
significance level below 1%. To test for multicollinearity, VIF is used.
All explanatory variables before introduction of interaction term had
VIF less than 1.1, meaning multicollinearity was not a concern for the
model. Nonetheless, introduction of interaction term reduced intercept
and wine country Chile significance below threshold, which was expected.
wine price | |||
---|---|---|---|
Predictors | Estimates | CI | p |
(Intercept) | -2.66 | -9.78 – 4.46 | 0.464 |
y old | 1.05 | 0.77 – 1.33 | <0.001 |
y old^2 | -0.05 | -0.08 – -0.03 | <0.001 |
wine country \[Australia\] | 1.29 | 0.63 – 1.96 | <0.001 |
wine country \[Austria\] | 5.22 | 3.44 – 6.99 | <0.001 |
wine country \[Chile\] | -0.35 | -0.98 – 0.28 | 0.280 |
wine country \[France\] | 3.21 | 2.73 – 3.69 | <0.001 |
wine country \[Germany\] | 5.92 | 3.57 – 8.27 | <0.001 |
wine country \[Greece\] | 2.72 | 1.38 – 4.06 | <0.001 |
wine country \[Israel\] | 1.91 | 0.58 – 3.24 | 0.005 |
wine country \[Italy\] | 2.86 | 2.40 – 3.33 | <0.001 |
wine country \[Mexico\] | 5.52 | 3.17 – 7.87 | <0.001 |
wine country \[New Zealand\] |
6.25 | 5.04 – 7.46 | <0.001 |
wine country \[Portugal\] | -1.78 | -2.54 – -1.01 | <0.001 |
wine country \[South Africa\] |
1.58 | 0.72 – 2.43 | <0.001 |
wine country \[Spain\] | 0.44 | -0.09 – 0.97 | 0.105 |
wine country \[United States\] |
0.80 | 0.35 – 1.24 | <0.001 |
wine country \[Uruguay\] | 2.27 | -0.02 – 4.56 | 0.052 |
n ratings \[log\] | -8.11 | -9.45 – -6.77 | <0.001 |
wine rating | 5.55 | 3.67 – 7.42 | <0.001 |
n ratings \[log\] \* wine rating |
1.91 | 1.56 – 2.26 | <0.001 |
Observations | 9365 | ||
R2 / R2 adjusted | 0.376 / 0.375 |
## [1] "Wald test statistic: 352.64"
##
## studentized Breusch-Pagan test
##
## data: lm_price_model
## BP = 142.14, df = 20, p-value < 2.2e-16
## GVIF Df GVIF^(1/(2*Df))
## y_old 25.284233 1 5.028343
## I(y_old^2) 25.173723 1 5.017342
## wine_country 1.111022 15 1.003516
## log(n_ratings) 255.205102 1 15.975140
## wine_rating 16.814468 1 4.100545
## log(n_ratings):wine_rating 280.916366 1 16.760560
To address the heteroskedasticity a robust regression was performed. The t-statistics were not significantly different from the original OLS regression, so we can conclude heteroskedasticity is not a major concern.
coefficients | Robust | OLS |
---|---|---|
(Intercept) | -0.786 | -0.732 |
y_old | 7.248 | 7.432 |
I(y_old^2) | -4.036 | -4.184 |
wine_countryAustralia | 3.883 | 3.838 |
wine_countryAustria | 6.440 | 5.768 |
wine_countryChile | -1.236 | -1.081 |
wine_countryFrance | 13.986 | 13.064 |
wine_countryGermany | 5.513 | 4.941 |
wine_countryGreece | 3.545 | 3.987 |
wine_countryIsrael | 3.000 | 2.808 |
wine_countryItaly | 12.569 | 11.973 |
wine_countryMexico | 5.981 | 4.605 |
wine_countryNew Zealand | 12.879 | 10.099 |
wine_countryPortugal | -4.501 | -4.554 |
wine_countrySouth Africa | 3.759 | 3.604 |
wine_countrySpain | 1.684 | 1.621 |
wine_countryUnited States | 3.798 | 3.493 |
wine_countryUruguay | 2.317 | 1.940 |
log(n_ratings) | -13.106 | -11.877 |
wine_rating | 6.198 | 5.795 |
log(n_ratings):wine_rating | 11.661 | 10.587 |
Another question is, would it possible to guess whether a wine is good?
Let us divide wines into bad, below mean rating, and good, above mean
rating. Logistic regression was run using price, age, and number of
ratings to answer the question. The data was partitioned into 80% train
and 20% test data. The resulting regression
good ~ y_old + y_old^2 + wine_country + log(n_ratings) + wine_price
found that at 1% significance level there is negative correlation with
age and positive correlation with log of number of ratings and price.
Only a few countries pass the 5% significance level. Nonetheless, the
Wald Test statistic of 420.92 suggests that the variable country is
significant to be included in the regression. The R2 is
23.6%, meaning 23.6% of the deviance in good/bad is explained by
variance of explanatory variables. F statistic is 420.92, which is
enough to confirm that the regression has substantial explanatory power
at a 1% significance level. The area under the ROC curve is 78.1%,
indicating that the model can distinguish between good and bad wines.
good | |||
---|---|---|---|
Predictors | Odds Ratios | CI | p |
(Intercept) | 0.03 | 0.02 – 0.04 | <0.001 |
y old | 0.88 | 0.85 – 0.90 | <0.001 |
wine country \[Australia\] | 0.52 | 0.38 – 0.71 | <0.001 |
wine country \[Austria\] | 0.33 | 0.13 – 0.82 | 0.017 |
wine country \[Chile\] | 0.75 | 0.56 – 1.01 | 0.063 |
wine country \[France\] | 0.77 | 0.61 – 0.96 | 0.022 |
wine country \[Germany\] | 0.67 | 0.23 – 2.22 | 0.481 |
wine country \[Greece\] | 0.80 | 0.42 – 1.54 | 0.502 |
wine country \[Israel\] | 0.78 | 0.41 – 1.47 | 0.444 |
wine country \[Italy\] | 0.72 | 0.58 – 0.90 | 0.004 |
wine country \[Mexico\] | 0.62 | 0.21 – 1.96 | 0.386 |
wine country \[New Zealand\] |
0.16 | 0.09 – 0.28 | <0.001 |
wine country \[Portugal\] | 1.56 | 1.08 – 2.26 | 0.017 |
wine country \[South Africa\] |
1.05 | 0.69 – 1.62 | 0.809 |
wine country \[Spain\] | 0.97 | 0.76 – 1.25 | 0.838 |
wine country \[United States\] |
1.28 | 1.04 – 1.58 | 0.021 |
wine country \[Uruguay\] | 1.18 | 0.42 – 3.49 | 0.758 |
n ratings \[log\] | 1.28 | 1.23 – 1.34 | <0.001 |
wine price | 1.19 | 1.18 – 1.20 | <0.001 |
Observations | 7492 | ||
R2 Tjur | 0.235 |
## [1] "Wald test statistic: 414.09"
## Area under the curve: 0.772
If you live in California, according to the t-test results, wines made in the U.S. have a significantly better rating on Vivino.
Searching for a good bottle of wine is not easy and might feel like rolling a dice. Luckily, there appears to be some correlation between ratings, prices, and countries. While it is impossible to make a 100% accurate prediction, devising a rule of thumb can be reasonable. If you live in California look for the U.S.-made wines and try to buy the most expensive you can reasonably afford, or download Vivino and check the reviews before buying.
There is a more efficient way to web scrape data from Vivino using their API. So, if someone wants to reproduce data collection, I would recommend going this route rather than the one I elected. Additionally, future development would involve incorporating weather data for the regions of the wines.