Skip to content

How to pick red wine in California? Exploratory analysis and regressions of wines webscraped from Vivino.com

Notifications You must be signed in to change notification settings

nikitoshina/Wine-Arbitrage

Repository files navigation

Abstract

Is there a correlation between wine’s price and wine’s signifiers such as rating, number of ratings, and country? To answer the question, 10,000 data points on wines were web scraped from Vivino. The data was cleaned and prepared for linear regression. A significant relationship between the explanatory variables and price was found. Additionally, a logistic regression on whether the wine’s rating is above the mean wine rating was conducted. The logistic regression had significant predictive power. A simple rule of thumb is proposed from the research for buying wine in California.

Introduction

Purchasing a bottle of wine is a heavy responsibility. The quality of the event depends on this small decision. But what if you do not know anything about wine? Is there a way to make a safe decision? Is more expensive wine better? I have web scraped over 10,000 red wines in $4 to $30 price range from Vivino.com to answer these questions. The first section will contain a description of the data collection process. The second section will discuss and visualize the data used in the analysis. The empirical model will be outlined in the third section, followed by a discussion of results and diagnostics in the fourth. The conclusion will contain possible future developments and my opinion on the project.

Data Description

Vivino Web Scraping

A wine rating is a readily accessible measure of consumer sentiment and often the only way to suppose if the wine is good. Vivino is an online wine marketplace with over 12.5 million different wines and an active community. Wine cards have geographical information such as country, region and winery as well as information about the number of ratings, rating, vintage, and price. It must be noted that data were collected in San Francisco, California, and is specific to this location. Data was collected using docker, Rselenium, and rvest combination, so every listing would be exactly what you would find yourself. Each one-dollar price range (4-30) was scraped individually. For more information on the web scraping, please, refer to the datasource.r file.

Cleaning

The resulting dataset contains observations of red wine in the $4 to $30 price range. Wine year was replaced with years old as of the current year (2022). Countries with less than ten wines were removed. In the data used for regression, only wines younger than ten years and with more than 30 reviews were used.

##   wine_name         wine_country        wine_rating      n_ratings      
##  Length:10285       Length:10285       Min.   :2.500   Min.   :   25.0  
##  Class :character   Class :character   1st Qu.:3.600   1st Qu.:   68.0  
##  Mode  :character   Mode  :character   Median :3.800   Median :  165.0  
##                                        Mean   :3.759   Mean   :  518.5  
##                                        3rd Qu.:3.900   3rd Qu.:  464.0  
##                                        Max.   :4.600   Max.   :24713.0  
##    wine_price        y_old           winery         
##  Min.   : 4.25   Min.   : 1.000   Length:10285      
##  1st Qu.:12.99   1st Qu.: 3.000   Class :character  
##  Median :18.98   Median : 5.000   Mode  :character  
##  Mean   :18.58   Mean   : 5.257                     
##  3rd Qu.:24.00   3rd Qu.: 6.000                     
##  Max.   :30.00   Max.   :38.000

Data Exploration

The relationship between the wine price and explanatory variables is summarized in the following graphs. There is an observable correlation between rating and price. Cheaper wines appear to have more reviews. There seems to be some relationship between age and price, with older wines costing more. Countries have different distributions of prices.

The following tables present distributions of explanatory variables. Wine ratings appear to be normally distributed, which might reflect Vivino’s recommendation algorithm. Low wine ratings appear to be prevalent, with a large portion settled before 500 reviews. Most of the dataset’s wines lay between the 1-5 age range, which can be expected for the price range. Lastly, products of some countries are more present in the sample due to the geographic position of the collection or simply the larger variety produced by the government. The distribution on the number of ratings appears to follow log distribution, so it will be used as a log in the regression.

The range of age variable might come from non-available for purchase or bad stock wines. The following graph examines if age leads to higher rating for each country. For some countries there is not enough data points to make a definitive answer. Nonetheless, in most cases wines younger than 5 years have higher ratings.

Empirical Models

To estimate the relationships between variables I will be using linear model. The regression price prediction: Price ~ age + age^2 + country + rating + log(n_ratings) + rating:log(n_ratings). Age and rating are expected to have a positive correlation, while number_of_ratings is suspected of having a negative correlation because affordable wines have more reviews. The variable country will be converted into a series of dummy variables, so some countries are generally expected to have better or worse wine, which will also be examined using logistic regression.

Results and Diagnostics

The regression results suggest that age and wine_rating positively correlate with price, while n_ratings have negative. Each extra year increases the price by $1.05. A 1% increase in the number of ratings coincides with a 8.11% decrease in price. The interaction variable between n_ratings and wine_rating suggests that 1 unit increase in rating increases marginal effect of 1% change of number of ratings on price by 1.91%. Three of these variables have a significant relationship at the 1% level. Countries have a variable impact on the price, with some being better than others. Wald Test on the significance of the country variable resulted in 352.64, which is enough to accept that the country variable is significant at 1% significance. Level R2 of the regression is 37.65%, meaning the variability of explanatory variables explains 37.65% of the price variation. F statistic for the regression is 282.11, which is enough to confirm that the regression has substantial explanatory power at 1% significance level. As for heteroskedasticity Breusch-Pagan test was used yielding 1.9825018^{-20}, which allows us to conclude that there is heteroskedasticity at significance level below 1%. To test for multicollinearity, VIF is used. All explanatory variables before introduction of interaction term had VIF less than 1.1, meaning multicollinearity was not a concern for the model. Nonetheless, introduction of interaction term reduced intercept and wine country Chile significance below threshold, which was expected.

  wine price
Predictors Estimates CI p
(Intercept) -2.66 -9.78 – 4.46 0.464
y old 1.05 0.77 – 1.33 <0.001
y old^2 -0.05 -0.08 – -0.03 <0.001
wine country \[Australia\] 1.29 0.63 – 1.96 <0.001
wine country \[Austria\] 5.22 3.44 – 6.99 <0.001
wine country \[Chile\] -0.35 -0.98 – 0.28 0.280
wine country \[France\] 3.21 2.73 – 3.69 <0.001
wine country \[Germany\] 5.92 3.57 – 8.27 <0.001
wine country \[Greece\] 2.72 1.38 – 4.06 <0.001
wine country \[Israel\] 1.91 0.58 – 3.24 0.005
wine country \[Italy\] 2.86 2.40 – 3.33 <0.001
wine country \[Mexico\] 5.52 3.17 – 7.87 <0.001
wine country \[New
Zealand\]
6.25 5.04 – 7.46 <0.001
wine country \[Portugal\] -1.78 -2.54 – -1.01 <0.001
wine country \[South
Africa\]
1.58 0.72 – 2.43 <0.001
wine country \[Spain\] 0.44 -0.09 – 0.97 0.105
wine country \[United
States\]
0.80 0.35 – 1.24 <0.001
wine country \[Uruguay\] 2.27 -0.02 – 4.56 0.052
n ratings \[log\] -8.11 -9.45 – -6.77 <0.001
wine rating 5.55 3.67 – 7.42 <0.001
n ratings \[log\] \* wine
rating
1.91 1.56 – 2.26 <0.001
Observations 9365
R2 / R2 adjusted 0.376 / 0.375
## [1] "Wald test statistic: 352.64"

## 
##  studentized Breusch-Pagan test
## 
## data:  lm_price_model
## BP = 142.14, df = 20, p-value < 2.2e-16

##                                  GVIF Df GVIF^(1/(2*Df))
## y_old                       25.284233  1        5.028343
## I(y_old^2)                  25.173723  1        5.017342
## wine_country                 1.111022 15        1.003516
## log(n_ratings)             255.205102  1       15.975140
## wine_rating                 16.814468  1        4.100545
## log(n_ratings):wine_rating 280.916366  1       16.760560

To address the heteroskedasticity a robust regression was performed. The t-statistics were not significantly different from the original OLS regression, so we can conclude heteroskedasticity is not a major concern.

coefficients Robust OLS
(Intercept) -0.786 -0.732
y_old 7.248 7.432
I(y_old^2) -4.036 -4.184
wine_countryAustralia 3.883 3.838
wine_countryAustria 6.440 5.768
wine_countryChile -1.236 -1.081
wine_countryFrance 13.986 13.064
wine_countryGermany 5.513 4.941
wine_countryGreece 3.545 3.987
wine_countryIsrael 3.000 2.808
wine_countryItaly 12.569 11.973
wine_countryMexico 5.981 4.605
wine_countryNew Zealand 12.879 10.099
wine_countryPortugal -4.501 -4.554
wine_countrySouth Africa 3.759 3.604
wine_countrySpain 1.684 1.621
wine_countryUnited States 3.798 3.493
wine_countryUruguay 2.317 1.940
log(n_ratings) -13.106 -11.877
wine_rating 6.198 5.795
log(n_ratings):wine_rating 11.661 10.587

Is the wine good?

Another question is, would it possible to guess whether a wine is good? Let us divide wines into bad, below mean rating, and good, above mean rating. Logistic regression was run using price, age, and number of ratings to answer the question. The data was partitioned into 80% train and 20% test data. The resulting regression good ~ y_old + y_old^2 + wine_country + log(n_ratings) + wine_price found that at 1% significance level there is negative correlation with age and positive correlation with log of number of ratings and price. Only a few countries pass the 5% significance level. Nonetheless, the Wald Test statistic of 420.92 suggests that the variable country is significant to be included in the regression. The R2 is 23.6%, meaning 23.6% of the deviance in good/bad is explained by variance of explanatory variables. F statistic is 420.92, which is enough to confirm that the regression has substantial explanatory power at a 1% significance level. The area under the ROC curve is 78.1%, indicating that the model can distinguish between good and bad wines.

  good
Predictors Odds Ratios CI p
(Intercept) 0.03 0.02 – 0.04 <0.001
y old 0.88 0.85 – 0.90 <0.001
wine country \[Australia\] 0.52 0.38 – 0.71 <0.001
wine country \[Austria\] 0.33 0.13 – 0.82 0.017
wine country \[Chile\] 0.75 0.56 – 1.01 0.063
wine country \[France\] 0.77 0.61 – 0.96 0.022
wine country \[Germany\] 0.67 0.23 – 2.22 0.481
wine country \[Greece\] 0.80 0.42 – 1.54 0.502
wine country \[Israel\] 0.78 0.41 – 1.47 0.444
wine country \[Italy\] 0.72 0.58 – 0.90 0.004
wine country \[Mexico\] 0.62 0.21 – 1.96 0.386
wine country \[New
Zealand\]
0.16 0.09 – 0.28 <0.001
wine country \[Portugal\] 1.56 1.08 – 2.26 0.017
wine country \[South
Africa\]
1.05 0.69 – 1.62 0.809
wine country \[Spain\] 0.97 0.76 – 1.25 0.838
wine country \[United
States\]
1.28 1.04 – 1.58 0.021
wine country \[Uruguay\] 1.18 0.42 – 3.49 0.758
n ratings \[log\] 1.28 1.23 – 1.34 <0.001
wine price 1.19 1.18 – 1.20 <0.001
Observations 7492
R2 Tjur 0.235
## [1] "Wald test statistic: 414.09"

## Area under the curve: 0.772

So which country should I buy?

If you live in California, according to the t-test results, wines made in the U.S. have a significantly better rating on Vivino.

Conclusion

Searching for a good bottle of wine is not easy and might feel like rolling a dice. Luckily, there appears to be some correlation between ratings, prices, and countries. While it is impossible to make a 100% accurate prediction, devising a rule of thumb can be reasonable. If you live in California look for the U.S.-made wines and try to buy the most expensive you can reasonably afford, or download Vivino and check the reviews before buying.

There is a more efficient way to web scrape data from Vivino using their API. So, if someone wants to reproduce data collection, I would recommend going this route rather than the one I elected. Additionally, future development would involve incorporating weather data for the regions of the wines.

References

About

How to pick red wine in California? Exploratory analysis and regressions of wines webscraped from Vivino.com

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published