The beer industry has experienced explosive growth recently, fueled largely by craft breweries. The number of breweries in the United States tripled between 2008 and 2016 and now exceeds 7,000 (Brewers Association, 2019). This rapid growth can be seen in Figure 1.
With the rise of craft breweries, beer rating platforms such as Beer Advocate and Untappd were developed, providing a medium for beer enthusiasts to “check-in” and rate their drink, among other features. How can these data sources be used for guidance when opening a new craft brewery?
2.1. Process Tools
The following software was used for data cleansing, analysis, and visualization.
● Google Sheets
2.2. Data Sets
2.2.1. Individual Beer Advocate Ratings
This data set contained 1.5 million individual Beer Advocate reviews from 1996 to 2011, covering 37,770 unique beers, with 13 variables: beer name, brewery name, ABV, rating (overall, aroma, appearance, taste, palate), style, review timestamp, and more (“BeerAdvocate: Beer Reviews from the Beeradvocate,” 2016).
2.2.2. Overall Beer Advocate Beer, Untappd Beer, and Untappd Brewery Information
These data sets were obtained in March 2019 via a web scrape of the Beer Advocate and Untappd websites using RStudio. The code for these scrapes was based on Kunert (2017) but required modification to account for updates to the website layouts, to obtain additional data, and to scrape breweries from Untappd.
The Beer Advocate beer data set contained 6 variables (brewery name, beer name, style, ABV, average rating, and total ratings) for 80,373 unique beers. The data set included all beers on Beer Advocate with 5 or more total ratings.
The Untappd beer data set contained 37 variables for the 5,109 most frequently rated beers from the scraped Beer Advocate beer data set. Variables included beer name, brewery name, ABV, IBU, average rating, style, check-ins, date added, description, and more. Some variables in this data set were joined from the Beer Advocate data set, and some variables were joined from the next data set, Untappd breweries.
The Untappd brewery data set contained 21 variables for the 883 breweries included in the Untappd and Beer Advocate scraped data sets. Variables included brewery name, parent company, class, average rating, check-ins, location, and more. Using Google Sheets, the location information obtained from Untappd was geocoded into geographic coordinates, country, state, and city.
The project was approached with the goal of producing valuable market research to consider when opening a brewery. This research focused on four main questions:
- How do the Beer Advocate and Untappd platforms compare?
- Where is the optimal location for a brewery?
- What factors contribute to an attractive and successful beer menu?
- What are the optimal times of the day, week, and year to host an event?
3.1. Rating Platforms
To begin the analysis, the two beer rating platforms were compared using common inferential statistics tools. The purpose of the comparison was to determine which site’s beer rating data should be used for further analysis.
The plots in Figure 2 show a discrepancy in average beer ratings between the two sites. Histograms of both data sources show similar overall distributions, with Beer Advocate ratings shifted toward higher values. The Q-Q plot shows that Beer Advocate ratings were more extreme than Untappd ratings: beers averaging above 3.0 rated higher on Beer Advocate than on Untappd, while beers averaging below 3.0 rated lower on Beer Advocate. The box plot shows differences in quantiles, with each quantile being higher for Beer Advocate than for Untappd.
To test whether the beer information provided by each platform was equivalent, a paired t-test was conducted to compare the mean beer alcohol by volume (ABV). The p-value was 0.95, so at the 0.01 significance level there was no evidence that mean ABV differed between the two platforms. A correlation test on the two data sets produced a correlation coefficient of 0.99. From this, it was concluded that the beer information from both sites was effectively equivalent.
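The comparison above can be sketched in a few lines of Python. This is only an illustration, not the code used in the study: the ABV values are made up, the function names are hypothetical, and the p-value uses a normal approximation to the t distribution, which is an assumption that is only reasonable for large samples such as the thousands of beers compared here.

```python
import math
from statistics import NormalDist, mean, stdev


def paired_t_test(a, b):
    """Return (t statistic, approximate two-sided p-value) for paired samples.

    Assumption of this sketch: the p-value comes from a normal approximation
    to the t distribution, which is adequate only for large samples.
    """
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    se = stdev(diffs) / math.sqrt(n)  # standard error of the mean difference
    t = mean(diffs) / se
    p = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, p


def pearson_r(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Made-up ABV values for the same five beers on each platform.
abv_beer_advocate = [5.0, 6.5, 7.2, 4.8, 9.0]
abv_untappd = [5.0, 6.4, 7.2, 4.9, 9.0]

t_stat, p_value = paired_t_test(abv_beer_advocate, abv_untappd)
r = pearson_r(abv_beer_advocate, abv_untappd)
print(f"t = {t_stat:.3f}, approx p = {p_value:.3f}, r = {r:.4f}")
```

A large p-value together with a correlation near 1 is the pattern reported for ABV; the same two functions applied to average ratings would correspond to the rating comparison.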
The histograms and box plots suggested that Beer Advocate ratings trended higher than Untappd ratings. As seen in Figure 3, the mean Beer Advocate beer rating was 0.14 higher than the Untappd mean. A paired t-test confirmed this observation: the p-value for the difference in mean beer rating rounded to 0.00, below the 0.01 significance level, indicating that the difference in means was unlikely to be zero. Although the means differed, the ratings were highly correlated, with a correlation coefficient of 0.92. Based on this, it was concluded that the two sites rate similarly, but not identically. Because Untappd had a significantly greater number of ratings, its rating data was used for the remaining analysis.
3.2. Geographical Trends
The first geographical trend analyzed (Figure 4) showed the total number of brewery check-ins by country and state. The United States led the way for countries, with 357 million check-ins. Belgium (26.7M) and the United Kingdom (15.8M) were the next highest. For states, California (68.1M), Colorado (29.3M), and Michigan (25.5M) had the most total check-ins.
The average brewery ratings by country and state are shown in Figure 5. Europe and North America led again, with Switzerland (3.98), the United States (3.76), and Sweden (3.71) topping the average brewery ratings. By state or region, Brussels (4.14), Oklahoma (4.06), and Iowa (4.04) rated the highest. Interestingly, the two states home to large macro breweries, Wisconsin and Missouri, had noticeably lower ratings than their surrounding states.
Breweries were then clustered using DBSCAN. A nearest-neighbor distance plot with 20 neighbors was used to determine the optimal epsilon value. As seen in Figure 6, the inflection point was located around 5, which corresponded to about 345 miles. The brewery cluster map in Figure 6 shows four large clusters worldwide: three in the United States (East Coast, West Coast, and Southwest) and one containing the majority of breweries in Europe.
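The epsilon selection step can be illustrated with a small Python sketch. It assumes, as the 5-to-345-mile conversion suggests, that distances were measured in degrees of latitude and longitude and that one degree is roughly 69 miles. The coordinates and the function name are hypothetical, and straight-line distance in degrees is itself an approximation because a degree of longitude shrinks away from the equator.

```python
import math

MILES_PER_DEGREE = 69.0  # approximate length of one degree of latitude


def k_distances(points, k):
    """Sorted distance from each point to its k-th nearest neighbor.

    Plotting these values in order gives the nearest-neighbor distance curve;
    the bend ("knee") of that curve is a common choice for DBSCAN's epsilon.
    """
    result = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return sorted(result)


# Hypothetical (latitude, longitude) pairs; not data from the study.
points = [(42.3, -85.6), (42.9, -85.7), (42.2, -83.7), (41.9, -87.6), (50.8, 4.4)]
kd = k_distances(points, k=2)

# Converting an epsilon read off the plot (in degrees) into miles.
eps_degrees = 5
eps_miles = eps_degrees * MILES_PER_DEGREE
print(kd, eps_miles)
```

With the study's 20 neighbors, `k` would be 20 and the input would be the full brewery list.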
Breweries were then grouped by class to visualize total check-ins and brewery rating, as seen in Figure 7. From the total check-in box and whisker plot, it is apparent that regional breweries gathered the greatest total number of check-ins, followed by macro breweries, micro breweries, and brewpubs. In the rating box and whisker plot, micro breweries had the highest average rating. Macro breweries had the largest range of ratings, and contract breweries and brewpubs had the smallest. Regional breweries, micro breweries, and brewpubs all had similar medians of around 3.7, while the median for macro breweries was noticeably lower, at around 3.2. It is worth noting that the highest rated macro breweries are likely takeovers of former micro or regional breweries.
Based on these results, the scope was narrowed to craft breweries in the continental United States. A DBSCAN clustering with 5 neighbors and an epsilon corresponding to 52 miles was used to group these breweries into clusters, which were then sorted by average cluster rank. The results, visualized with Tableau in Figure 8, show 27 clusters that contain 393 of the 564 craft breweries in the continental United States. The top-ranked clusters were located near Miami, Florida; Tampa, Florida; and San Diego, California.
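For readers unfamiliar with DBSCAN, the following Python sketch shows the algorithm with the parameters described above. It is an illustration under stated assumptions rather than the study's code: "5 neighbors" is interpreted as a minimum of 5 points (counting the point itself) within epsilon, distance is computed as great-circle miles with the haversine formula instead of converting degrees, and the coordinates are hypothetical.

```python
import math

EARTH_RADIUS_MILES = 3958.8


def haversine_miles(p, q):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def dbscan(points, eps, min_pts, dist=haversine_miles):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    labels = [None] * len(points)

    def neighbors(i):
        # A point counts as its own neighbor (assumption of this sketch).
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:
                queue.extend(nbrs)  # j is a core point, so keep expanding
    return labels


# Hypothetical brewery coordinates: five close together, one far away.
breweries = [(42.29, -85.59), (42.30, -85.60), (42.27, -85.58),
             (42.31, -85.55), (42.25, -85.61), (25.76, -80.19)]
labels = dbscan(breweries, eps=52, min_pts=5)
print(labels)
```

Breweries labeled -1 belong to no cluster, which is how 171 of the 564 craft breweries could fall outside the 27 clusters.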
After considering these results, Southwest Michigan was chosen as the optimal location for a new brewery. The Southwest Michigan cluster ranked first among clusters for average total check-ins, at 2.2 million per brewery, and fourth among clusters for average rating, at 3.88. Michigan ranked third among states worldwide for total check-ins, with 25.5 million, and fourteenth for average rating, at 3.85.
Additionally, the surrounding area contained other clusters nearby, as seen in Figure 9, including Southeast Michigan, Chicago, and Madison, which suggested this broad region is a brewery hotspot.
3.3. Beer Menu
Beer Advocate allows users to rate individual components of a beer (taste, palate, aroma, and appearance) in addition to giving the beer an overall rating. Using these four components, a model to predict overall rating was produced. A summary of this model is shown in Figure 10. All four predictor variables had p-values of 0.00, indicating that they were significant. In the model, taste and palate carried the greatest weight in predicting the overall beer rating.
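The kind of model described above can be sketched as an ordinary least squares fit. The text does not state the exact model form, so a linear model with an intercept is an assumption here, as are the function name and the data. The weights used to generate the synthetic ratings are invented for illustration and are not the coefficients reported in Figure 10.

```python
def fit_linear_model(rows, targets):
    """Ordinary least squares via the normal equations (X'X) b = X'y.

    Returns [intercept, coef_1, ..., coef_k].
    """
    X = [[1.0, *row] for row in rows]  # prepend an intercept column
    k = len(X[0])
    # Build the augmented matrix [X'X | X'y].
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * y for x, y in zip(X, targets))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


# Synthetic reviews: (taste, palate, aroma, appearance). Not real data.
rows = [
    (4.0, 3.5, 4.0, 3.0),
    (3.0, 3.0, 3.5, 4.0),
    (4.5, 4.0, 4.5, 4.5),
    (2.5, 3.5, 2.0, 3.0),
    (3.5, 2.5, 3.0, 3.5),
    (5.0, 4.5, 4.0, 4.0),
    (2.0, 2.0, 3.0, 2.5),
]
# Invented weights (intercept, taste, palate, aroma, appearance).
TRUE_WEIGHTS = (0.2, 0.55, 0.25, 0.10, 0.05)
targets = [TRUE_WEIGHTS[0] + sum(w * v for w, v in zip(TRUE_WEIGHTS[1:], row))
           for row in rows]

coefs = fit_linear_model(rows, targets)
print([round(c, 3) for c in coefs])
```

Because the synthetic ratings contain no noise, the fit recovers the invented weights; on real reviews, the relative size of the fitted coefficients is what indicates which components carry the most weight.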
Taste, palate and overall reviews were then used to produce an interaction plot as seen in Figure 11. Two interesting interactions can be seen in the plot. First, a high palate review can make up for a low taste review. Second, a high taste review cannot make up for a low palate review.
Two common beer statistics, ABV and International Bitterness Units (IBU), were plotted against each other and colored by average rating (Figure 12). No significant trends emerged between the two variables, but in general, as ABV and IBU increased, average beer rating also increased.
Beer styles were then filtered to the ten with the highest average rating and the ten most frequently checked in. As seen in Figure 13, three styles were common to both graphs: New England IPA (highest rated, seventh most frequent), American Imperial Stout (second highest rated, fourth most frequent), and American IPA (tenth highest rated, most frequent).
The most common beer styles were then clustered by brewery location (for the continental United States) using DBSCAN, as shown in Figure 14. The most common style, American IPA, is produced nationwide, with most breweries included in a cluster. The next most common style, American Imperial IPA, shows a decrease in the number of breweries in a cluster. As the styles become less common, fewer breweries fit into clusters. An exception is the American Wild Ale style, for which most breweries exist in concentrated regions throughout the country, including a cluster in the Michigan area. The highest rated style, New England IPA, is the most regionally tied style, with only one cluster, located in the New England region.
The top ten most frequent words appearing in the Untappd beer descriptions were filtered and colored by average rating as seen in Figure 15. The most frequent descriptive words were hop and malt, while the highest rated words were chocolate and sour.
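A word-frequency summary of this kind can be sketched as follows. The study does not describe its text processing, so several choices here are assumptions: each word is counted once per beer, a small hand-written stop-word list is removed, and no stemming is applied. The descriptions, ratings, and function name are hypothetical.

```python
import re
from collections import defaultdict

# Minimal stop-word list for this illustration only.
STOP_WORDS = {"a", "an", "and", "with", "of", "the", "this", "is", "in"}


def word_stats(beers, top_n=10):
    """Return the top_n most frequent description words as
    (word, number of beers, mean rating), most frequent first."""
    ratings_by_word = defaultdict(list)
    for description, rating in beers:
        # Count each word once per beer so long descriptions are not overweighted.
        words = set(re.findall(r"[a-z]+", description.lower())) - STOP_WORDS
        for word in words:
            ratings_by_word[word].append(rating)
    # Sort by frequency (descending), breaking ties alphabetically.
    ranked = sorted(ratings_by_word.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    return [(w, len(r), sum(r) / len(r)) for w, r in ranked[:top_n]]


# Hypothetical descriptions and ratings, not taken from the scraped data.
beers = [
    ("A hop forward ale with citrus", 3.9),
    ("Malt and hop in balance", 3.6),
    ("Chocolate stout with roasted malt", 4.1),
    ("Sour ale with cherry", 4.0),
]
top_words = word_stats(beers, top_n=3)
for word, count, avg in top_words:
    print(word, count, round(avg, 2))
```

The mean rating per word is what would drive the coloring in a chart like Figure 15, while the count determines which words make the top ten.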
Several concluding generalizations were made to aid in the planning of a potential beer menu. First, the taste and palate characteristics of a beer carry the greatest weight in its overall rating. Second, the words hop, chocolate, and sour should be emphasized in beer descriptions and flavor profiles. Third, beers with higher ABV and IBU tend to rate higher. Fourth, the four most frequently checked-in styles (American IPA, American Imperial IPA, American Imperial Stout, and American Pale Ale) are considered menu necessities. Finally, the brewery can stand out by producing niche styles such as New England IPA and American Wild Ale. New England IPA was the highest rated style in the analysis as well as the seventh most frequently checked in. Southwest Michigan offers a potential emerging market for this popular style, as its production is currently concentrated mainly in the New England region. American Wild Ale was the sixth highest rated style and is already commonly produced in the Michigan area, with its own cluster present.
Using the timestamps in the individual Beer Advocate review data, heat maps were produced to visualize the optimal times to host events at the brewery. As seen in Figure 16, evenings from five to eight were the most popular times to check in beers. On weekends, this time period extends earlier, beginning around noon. Figure 16 also shows that weekends in November, December, and January had the highest check-in frequency. Additionally, every day of the week in December had a high frequency of beer check-ins.
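The counts behind such a heat map can be sketched by bucketing timestamps by weekday and hour. This sketch assumes the review times are available as Unix timestamps and interprets them in UTC; how the study handled time zones is not stated, and local time would matter for a conclusion such as "five to eight in the evening". The timestamps and function name are hypothetical.

```python
from collections import Counter
from datetime import datetime, timezone


def checkin_grid(timestamps):
    """Count events per (weekday, hour); weekday 0 is Monday, 6 is Sunday."""
    counts = Counter()
    for ts in timestamps:
        t = datetime.fromtimestamp(ts, tz=timezone.utc)  # UTC is an assumption
        counts[(t.weekday(), t.hour)] += 1
    return counts


# Hypothetical review times, built from dates so the example is easy to read.
timestamps = [
    datetime(2010, 12, 18, 18, 30, tzinfo=timezone.utc).timestamp(),  # Saturday
    datetime(2010, 12, 18, 19, 5, tzinfo=timezone.utc).timestamp(),   # Saturday
    datetime(2010, 12, 25, 18, 45, tzinfo=timezone.utc).timestamp(),  # Saturday
    datetime(2010, 12, 17, 20, 15, tzinfo=timezone.utc).timestamp(),  # Friday
]

grid = checkin_grid(timestamps)
busiest_slot, busiest_count = grid.most_common(1)[0]
print(busiest_slot, busiest_count)
```

Replacing the key with `(t.month, t.weekday())` would give the month-by-weekday counts for the second heat map.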
From this data, it is concluded that the optimal time to host a brewery event is during the evening from five to eight, on a weekend in November, December or January, or any day of the week during December. In addition, December offers a prime time to open a brewery, with consistently high check-ins every day of the week.
The research and analysis performed in this project led to several key conclusions. The comparison of the Beer Advocate and Untappd data sets showed that the two sites contain the same beer information and rate similarly, but not identically. However, Untappd is a far more widely used rating medium. The optimal location to open a craft brewery was concluded to be Southwest Michigan, due to the high frequency of beer check-ins and the high average brewery rating in this region. When developing a beer menu, it is critical to produce beers with both good taste and good palate. When deciding which beer styles to offer, focus on styles that are highly rated, frequently checked in, or both: American IPA, American Imperial IPA, American Imperial Stout, American Pale Ale, New England IPA, and American Wild Ale. Optimal times to plan events are evenings, weekends, and the month of December.
The data sets used for this analysis have some limitations that should be acknowledged. First, the scraped data sets do not include all beers and breweries in the United States or the world, so the data was limited to only the most popular. Second, the individual Beer Advocate ratings data set ends in 2011 and is outdated given the growth of the craft beer industry since then. Third, the correlation of beer check-ins and rating data to actual production or sales volume is unknown.