GitHub - nikitoshina/Wine-Arbitrage: How to pick red wine in California? Exploratory analysis and regressions of wines webscraped from Vivino.com

Abstract
Introduction
Data Description
Empirical Models
Results and Diagnostics
- Is the wine good?
- So which country should I buy?
Conclusion
References

Abstract

Is there a correlation between wine’s price and wine’s signifiers such as rating, number of ratings, and country? To answer the question, 10,000 data points on wines were web scraped from Vivino. The data was cleaned and prepared for linear regression. A significant relationship between the explanatory variables and price was found. Additionally, a logistic regression on whether the wine’s rating is above the mean wine rating was conducted. The logistic regression had significant predictive power. A simple rule of thumb is proposed from the research for buying wine in California.

Introduction

Purchasing a bottle of wine is a heavy responsibility. The quality of the event depends on this small decision. But what if you do not know anything about wine? Is there a way to make a safe decision? Is more expensive wine better? I have web scraped over 10,000 red wines in $4 to $30 price range from Vivino.com to answer these questions. The first section will contain a description of the data collection process. The second section will discuss and visualize the data used in the analysis. The empirical model will be outlined in the third section, followed by a discussion of results and diagnostics in the fourth. The conclusion will contain possible future developments and my opinion on the project.

Data Description

Vivino Web Scraping

A wine rating is a readily accessible measure of consumer sentiment and often the only way to suppose if the wine is good. Vivino is an online wine marketplace with over 12.5 million different wines and an active community. Wine cards have geographical information such as country, region and winery as well as information about the number of ratings, rating, vintage, and price. It must be noted that data were collected in San Francisco, California, and is specific to this location. Data was collected using docker, Rselenium, and rvest combination, so every listing would be exactly what you would find yourself. Each one-dollar price range (4-30) was scraped individually. For more information on the web scraping, please, refer to the datasource.r file.

Cleaning

The resulting dataset contains observations of red wine in the $4 to $30 price range. Wine year was replaced with years old as of the current year (2022). Countries with less than ten wines were removed. In the data used for regression, only wines younger than ten years and with more than 30 reviews were used.

##   wine_name         wine_country        wine_rating      n_ratings      
##  Length:10285       Length:10285       Min.   :2.500   Min.   :   25.0  
##  Class :character   Class :character   1st Qu.:3.600   1st Qu.:   68.0  
##  Mode  :character   Mode  :character   Median :3.800   Median :  165.0  
##                                        Mean   :3.759   Mean   :  518.5  
##                                        3rd Qu.:3.900   3rd Qu.:  464.0  
##                                        Max.   :4.600   Max.   :24713.0  
##    wine_price        y_old           winery         
##  Min.   : 4.25   Min.   : 1.000   Length:10285      
##  1st Qu.:12.99   1st Qu.: 3.000   Class :character  
##  Median :18.98   Median : 5.000   Mode  :character  
##  Mean   :18.58   Mean   : 5.257                     
##  3rd Qu.:24.00   3rd Qu.: 6.000                     
##  Max.   :30.00   Max.   :38.000

Data Exploration

The relationship between the wine price and explanatory variables is summarized in the following graphs. There is an observable correlation between rating and price. Cheaper wines appear to have more reviews. There seems to be some relationship between age and price, with older wines costing more. Countries have different distributions of prices.

The following tables present distributions of explanatory variables. Wine ratings appear to be normally distributed, which might reflect Vivino’s recommendation algorithm. Low wine ratings appear to be prevalent, with a large portion settled before 500 reviews. Most of the dataset’s wines lay between the 1-5 age range, which can be expected for the price range. Lastly, products of some countries are more present in the sample due to the geographic position of the collection or simply the larger variety produced by the government. The distribution on the number of ratings appears to follow log distribution, so it will be used as a log in the regression.

The range of age variable might come from non-available for purchase or bad stock wines. The following graph examines if age leads to higher rating for each country. For some countries there is not enough data points to make a definitive answer. Nonetheless, in most cases wines younger than 5 years have higher ratings.

Empirical Models

To estimate the relationships between variables I will be using linear model. The regression price prediction: Price ~ age + age^2 + country + rating + log(n_ratings) + rating:log(n_ratings). Age and rating are expected to have a positive correlation, while number_of_ratings is suspected of having a negative correlation because affordable wines have more reviews. The variable country will be converted into a series of dummy variables, so some countries are generally expected to have better or worse wine, which will also be examined using logistic regression.

Results and Diagnostics

The regression results suggest that age and wine_rating positively correlate with price, while n_ratings have negative. Each extra year increases the price by $1.05. A 1% increase in the number of ratings coincides with a 8.11% decrease in price. The interaction variable between n_ratings and wine_rating suggests that 1 unit increase in rating increases marginal effect of 1% change of number of ratings on price by 1.91%. Three of these variables have a significant relationship at the 1% level. Countries have a variable impact on the price, with some being better than others. Wald Test on the significance of the country variable resulted in 352.64, which is enough to accept that the country variable is significant at 1% significance. Level R² of the regression is 37.65%, meaning the variability of explanatory variables explains 37.65% of the price variation. F statistic for the regression is 282.11, which is enough to confirm that the regression has substantial explanatory power at 1% significance level. As for heteroskedasticity Breusch-Pagan test was used yielding 1.9825018^{-20}, which allows us to conclude that there is heteroskedasticity at significance level below 1%. To test for multicollinearity, VIF is used. All explanatory variables before introduction of interaction term had VIF less than 1.1, meaning multicollinearity was not a concern for the model. Nonetheless, introduction of interaction term reduced intercept and wine country Chile significance below threshold, which was expected.

	wine price
Predictors	Estimates	CI	p
(Intercept)	-2.66	-9.78 – 4.46	0.464
y old	1.05	0.77 – 1.33	<0.001
y old^2	-0.05	-0.08 – -0.03	<0.001
wine country \[Australia\]	1.29	0.63 – 1.96	<0.001
wine country \[Austria\]	5.22	3.44 – 6.99	<0.001
wine country \[Chile\]	-0.35	-0.98 – 0.28	0.280
wine country \[France\]	3.21	2.73 – 3.69	<0.001
wine country \[Germany\]	5.92	3.57 – 8.27	<0.001
wine country \[Greece\]	2.72	1.38 – 4.06	<0.001
wine country \[Israel\]	1.91	0.58 – 3.24	0.005
wine country \[Italy\]	2.86	2.40 – 3.33	<0.001
wine country \[Mexico\]	5.52	3.17 – 7.87	<0.001
wine country \[New Zealand\]	6.25	5.04 – 7.46	<0.001
wine country \[Portugal\]	-1.78	-2.54 – -1.01	<0.001
wine country \[South Africa\]	1.58	0.72 – 2.43	<0.001
wine country \[Spain\]	0.44	-0.09 – 0.97	0.105
wine country \[United States\]	0.80	0.35 – 1.24	<0.001
wine country \[Uruguay\]	2.27	-0.02 – 4.56	0.052
n ratings \[log\]	-8.11	-9.45 – -6.77	<0.001
wine rating	5.55	3.67 – 7.42	<0.001
n ratings \[log\] \* wine rating	1.91	1.56 – 2.26	<0.001
Observations	9365
R² / R² adjusted	0.376 / 0.375

## [1] "Wald test statistic: 352.64"

## 
##  studentized Breusch-Pagan test
## 
## data:  lm_price_model
## BP = 142.14, df = 20, p-value < 2.2e-16

##                                  GVIF Df GVIF^(1/(2*Df))
## y_old                       25.284233  1        5.028343
## I(y_old^2)                  25.173723  1        5.017342
## wine_country                 1.111022 15        1.003516
## log(n_ratings)             255.205102  1       15.975140
## wine_rating                 16.814468  1        4.100545
## log(n_ratings):wine_rating 280.916366  1       16.760560

To address the heteroskedasticity a robust regression was performed. The t-statistics were not significantly different from the original OLS regression, so we can conclude heteroskedasticity is not a major concern.

coefficients	Robust	OLS
(Intercept)	-0.786	-0.732
y_old	7.248	7.432
I(y_old^2)	-4.036	-4.184
wine_countryAustralia	3.883	3.838
wine_countryAustria	6.440	5.768
wine_countryChile	-1.236	-1.081
wine_countryFrance	13.986	13.064
wine_countryGermany	5.513	4.941
wine_countryGreece	3.545	3.987
wine_countryIsrael	3.000	2.808
wine_countryItaly	12.569	11.973
wine_countryMexico	5.981	4.605
wine_countryNew Zealand	12.879	10.099
wine_countryPortugal	-4.501	-4.554
wine_countrySouth Africa	3.759	3.604
wine_countrySpain	1.684	1.621
wine_countryUnited States	3.798	3.493
wine_countryUruguay	2.317	1.940
log(n_ratings)	-13.106	-11.877
wine_rating	6.198	5.795
log(n_ratings):wine_rating	11.661	10.587

Is the wine good?

Another question is, would it possible to guess whether a wine is good? Let us divide wines into bad, below mean rating, and good, above mean rating. Logistic regression was run using price, age, and number of ratings to answer the question. The data was partitioned into 80% train and 20% test data. The resulting regression good ~ y_old + y_old^2 + wine_country + log(n_ratings) + wine_price found that at 1% significance level there is negative correlation with age and positive correlation with log of number of ratings and price. Only a few countries pass the 5% significance level. Nonetheless, the Wald Test statistic of 420.92 suggests that the variable country is significant to be included in the regression. The R² is 23.6%, meaning 23.6% of the deviance in good/bad is explained by variance of explanatory variables. F statistic is 420.92, which is enough to confirm that the regression has substantial explanatory power at a 1% significance level. The area under the ROC curve is 78.1%, indicating that the model can distinguish between good and bad wines.

	good
Predictors	Odds Ratios	CI	p
(Intercept)	0.03	0.02 – 0.04	<0.001
y old	0.88	0.85 – 0.90	<0.001
wine country \[Australia\]	0.52	0.38 – 0.71	<0.001
wine country \[Austria\]	0.33	0.13 – 0.82	0.017
wine country \[Chile\]	0.75	0.56 – 1.01	0.063
wine country \[France\]	0.77	0.61 – 0.96	0.022
wine country \[Germany\]	0.67	0.23 – 2.22	0.481
wine country \[Greece\]	0.80	0.42 – 1.54	0.502
wine country \[Israel\]	0.78	0.41 – 1.47	0.444
wine country \[Italy\]	0.72	0.58 – 0.90	0.004
wine country \[Mexico\]	0.62	0.21 – 1.96	0.386
wine country \[New Zealand\]	0.16	0.09 – 0.28	<0.001
wine country \[Portugal\]	1.56	1.08 – 2.26	0.017
wine country \[South Africa\]	1.05	0.69 – 1.62	0.809
wine country \[Spain\]	0.97	0.76 – 1.25	0.838
wine country \[United States\]	1.28	1.04 – 1.58	0.021
wine country \[Uruguay\]	1.18	0.42 – 3.49	0.758
n ratings \[log\]	1.28	1.23 – 1.34	<0.001
wine price	1.19	1.18 – 1.20	<0.001
Observations	7492
R² Tjur	0.235

## [1] "Wald test statistic: 414.09"

## Area under the curve: 0.772

So which country should I buy?

If you live in California, according to the t-test results, wines made in the U.S. have a significantly better rating on Vivino.

Conclusion

Searching for a good bottle of wine is not easy and might feel like rolling a dice. Luckily, there appears to be some correlation between ratings, prices, and countries. While it is impossible to make a 100% accurate prediction, devising a rule of thumb can be reasonable. If you live in California look for the U.S.-made wines and try to buy the most expensive you can reasonably afford, or download Vivino and check the reviews before buying.

There is a more efficient way to web scrape data from Vivino using their API. So, if someone wants to reproduce data collection, I would recommend going this route rather than the one I elected. Additionally, future development would involve incorporating weather data for the regions of the wines.

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
source/figure-markdown_strict		source/figure-markdown_strict
source_files/figure-markdown_strict		source_files/figure-markdown_strict
README.md		README.md
source.Rmd		source.Rmd

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Abstract

Introduction

Data Description

Vivino Web Scraping

Cleaning

Data Exploration

Empirical Models

Results and Diagnostics

Is the wine good?

So which country should I buy?

Conclusion

References

About

Releases

Packages

nikitoshina/Wine-Arbitrage

Folders and files

Latest commit

History

Repository files navigation

Abstract

Introduction

Data Description

Vivino Web Scraping

Cleaning

Data Exploration

Empirical Models

Results and Diagnostics

Is the wine good?

So which country should I buy?

Conclusion

References

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Packages