Group Project for INFO 201
Kyle Simpson, Sojin Park, Taehyun Kwon, Mitesh Goyal
###Data Description
From our first day as a group, we realized that all four of us love coffee (hence our fantastic group name). It became immediately apparent that whatever data we decided to analyze had to be related to coffee in some way. A few days after our initial discussion about our basic topic, we found the International Coffee Organization’s (ICO) website, which contains an incredible amount of historical data about the coffee industry. While the ICO does have data going all the way back to 1965, it is currently only releasing data for free from 1990 to present, which is still more than enough data for us to work with. The ICO’s data originally comes in the form of Excel spreadsheets describing annual data on everything from total crop production per country to retail price per country, all of which was collected and compiled by the ICO.

An analysis of this data could be valuable to a couple of different audiences. First, it could be useful to companies working in the coffee industry. Two of the data tables we will be incorporating are annual consumption and retail price, which could yield interesting connections when considered together. Companies could use this information to predict future retail prices and adjust their current prices accordingly. Second, an analysis of this data could be useful to consumers of coffee. Given that coffee companies could use this data to increase or decrease their prices, consumers could also use this data to find the mean price for a cup of coffee and base their purchasing decisions on it. While coffee companies are the ones who set the prices, it is consumers who ultimately decide where they spend their dollars, which sets the consumption rates. We think both companies and consumers could gain a good amount of knowledge from an analysis of the coffee industry.

Our data set could answer a number of questions for a company, including:

1. What are the growers being paid for the beans, and how does that impact how much we are being charged by the manufacturer?
2. What are country-wide mean coffee prices, and how can we alter our prices to get more business?

Our data could also answer some questions the consumer may have, including:

1. What is the median price for a cup of coffee in my country, and how does that compare to where I usually purchase coffee?
2. Is there a reason coffee prices are what they are, and how does this price relate back to the growers of the beans?
3. How do I contribute to the total consumption of coffee in my country?

In our analysis, we will specifically be trying to answer these questions, and others, in order to create a detailed report that is useful to a wide audience of coffee lovers.

EDIT: Plot of prices paid to growers (x-axis: calendar year, y-axis: price), overlaid with retail prices and the consumption rate. This will answer questions such as “How do the prices paid to growers affect retailers, and how does that lead to skewness, increases, or decreases in the consumption rate?”
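As a rough sketch of the overlay plot described in the EDIT note, the ggplot2 code below draws three series against calendar year on one panel. The data frame, its column names (`year`, `measure`, `value`), and every number in it are placeholders invented for illustration, not ICO figures; the real tables would first need to be reshaped into this long format.

```r
library(ggplot2)

# Placeholder data in long format: one row per (year, measure) pair.
# All values are made up for illustration and are NOT ICO figures.
series <- data.frame(
  year    = rep(1990:1992, times = 3),
  measure = rep(c("Price paid to growers", "Retail price", "Consumption"),
                each = 3),
  value   = c(1.0, 1.2, 1.1,   # hypothetical grower prices
              3.0, 3.4, 3.3,   # hypothetical retail prices
              10, 11, 12)      # hypothetical consumption
)

# One line per measure, distinguished by colour, with year on the x-axis.
p <- ggplot(series, aes(x = year, y = value, colour = measure)) +
  geom_line() +
  labs(x = "Calendar year", y = "Value", colour = "Series")

print(p)
```

Since consumption and price are measured in different units, a single shared y-axis may flatten one of the series. Adding `facet_wrap(~ measure, ncol = 1, scales = "free_y")` would give each series its own panel and scale while keeping the years aligned.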
###Presentation Description
The format for our presentation will be a knitted HTML webpage. We believe that an HTML page will be the best presentation of the data we plan on analyzing because we expect much of our analysis to consist of data tables and graphs. We will be using the tidyverse for most of this report since it contains dplyr, ggplot2, and other useful packages for data manipulation and presentation, and we will use knitr (a separate package, not part of the tidyverse) to knit the report to HTML. We decided against using the Shiny package because we did not feel that a highly interactive visualization would be an effective representation of our data. We will also be using the Plotly package because, although we don’t want a completely interactive page, some graph interaction may become useful for our analysis.
While we are not sure of the exact statistical analysis we will present in the final product, our assumption is that the primary analysis will be means, medians, maxima, and minima for various countries and years. We do not expect any incredibly involved analysis; however, we do not doubt our group-mates’ ability to critically assess what kinds of statistical analysis would best present the data we are analyzing. Using our analyses, we will use ggplot2 to make graphs of the data in order to compare various countries against each other, as well as to compare the various data sets provided by the ICO against each other and notice patterns, which will begin answering the questions our audience may have.
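To make the planned summary statistics concrete, here is a minimal dplyr sketch that computes the mean, median, maximum, and minimum per country. The table `prices`, its columns (`country`, `year`, `retail_price`), and all of its values are hypothetical placeholders; the actual ICO tables will likely need reshaping before they look like this.

```r
library(dplyr)

# Placeholder long-format table; countries and prices are made up,
# NOT taken from the ICO data.
prices <- data.frame(
  country      = rep(c("Country A", "Country B"), each = 3),
  year         = rep(1990:1992, times = 2),
  retail_price = c(2, 4, 3, 5, NA, 7)
)

# Summarise each country across years, ignoring missing values.
price_summary <- prices %>%
  group_by(country) %>%
  summarise(
    mean_price   = mean(retail_price, na.rm = TRUE),
    median_price = median(retail_price, na.rm = TRUE),
    max_price    = max(retail_price, na.rm = TRUE),
    min_price    = min(retail_price, na.rm = TRUE)
  )

print(price_summary)
```

Grouping by `year` instead of `country` would give the same statistics per year across countries. The `na.rm = TRUE` argument matters here because blank cells in the source spreadsheets become `NA` and would otherwise turn every summary into `NA`.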
As for challenges we may face, one issue we foresee is the initial wrangling of each data set, since we are given highly formatted Excel spreadsheets which we need to convert into CSV files. We converted one of these files to CSV to see what kinds of initial wrangling we need to do, and it already looks pretty complex. To name a few complications: columns are not named according to what they were in Excel, there are a lot of blank values where we would typically see an “NA” value, and there are a lot of completely blank rows which were meant to separate tables in Excel. We think one potential way to solve this problem is editing the data in Excel before converting it into a CSV file, but we will troubleshoot all of these issues as we go. We think that after the primary data editing, things should be relatively smooth since the whole group is well versed in R and in handling merge conflicts.
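As an alternative to cleaning everything by hand in Excel, the three complications listed above could also be handled in R when the CSV is read. The sketch below uses base R only and a tiny inline CSV whose layout, country names, and numbers are invented for illustration; the real ICO exports will differ and may need extra steps (for example, skipping title rows).

```r
# Placeholder CSV text standing in for an exported spreadsheet.
# The layout and values are made up and are NOT ICO data.
csv_text <- "country,1990,1991
Country A,10,
,,
Country B,20,30"

coffee <- read.csv(
  text = csv_text,
  na.strings = c("", "NA"),   # treat blank cells as missing values
  check.names = FALSE,        # keep headers like "1990" instead of "X1990"
  stringsAsFactors = FALSE
)

# Drop rows in which every cell is missing (the blank separator rows).
is_blank_row <- rowSums(!is.na(coffee)) == 0
coffee <- coffee[!is_blank_row, , drop = FALSE]
rownames(coffee) <- NULL

print(coffee)
```

Doing the cleanup in code rather than in Excel keeps the steps repeatable if the ICO releases updated spreadsheets, at the cost of some up-front work to discover each file's quirks.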
To summarize, our group will be creating a report analyzing data provided by the International Coffee Organization that we hope will answer various questions that both coffee consumers and coffee businesses may have about the industry as a whole. We will be utilizing two prominent R libraries (Tidyverse and Plotly) to perform our statistical analysis and to present our findings. Both bring their own complications and challenges, which the group is well equipped to handle.