Author: Parker McColl, Jamie McColl
The beer industry has experienced explosive growth recently, fueled largely by growth in craft breweries. In fact, the number of breweries in the United States tripled between 2008 and 2016 and now tops 7,000 (Brewers Association, 2019). The rapid growth in breweries can be seen in Figure 1.
With the rise of craft breweries, beer rating platforms such as Beer Advocate and Untappd were developed, providing a medium for beer enthusiasts to “check-in” and rate their drink, among other features. How can these data sources be used for guidance when opening a new craft brewery?
2.1. Process Tools
The following software was used for data cleansing, analysis, and visualization.
● Google Sheets
2.2. Data Sets
2.2.1. Individual Beer Advocate Ratings
This data set contained 1.5 million individual Beer Advocate reviews from 1996 to 2011, covering 37,770 unique beers, with 13 variables: beer name, brewery name, ABV, rating (overall, aroma, appearance, taste, palate), style, review timestamp, and more (“BeerAdvocate: Beer Reviews from the Beeradvocate,” 2016).
2.2.2. Overall Beer Advocate Beer, Untappd Beer, and Untappd Brewery Information
These data sets were obtained in March 2019 via a web scrape of the Beer Advocate and Untappd websites using RStudio. The code for these scrapes was based on Kunert (2017) but required modification to account for updates to the website layouts, to obtain additional data, and to scrape breweries from Untappd.
The Beer Advocate beer data set contained 6 variables (brewery name, beer name, style, ABV, average rating, and total ratings) for 80,373 unique beers. The data set included all beers on Beer Advocate with 5 or more total ratings.
The Untappd beer data set contained 37 variables for the 5,109 most frequently rated beers from the scraped Beer Advocate beer data set. Variables included beer name, brewery name, ABV, IBU, average rating, style, check-ins, date added, description, and more. Some variables in this data set were joined from the Beer Advocate data set, and some variables were joined from the next data set, Untappd breweries.
The Untappd brewery data set contained 21 variables for the 883 breweries included in the Untappd and Beer Advocate scraped data sets. Variables included brewery name, parent company, class, average rating, check-ins, location, and more. Using Google Sheets, the location information obtained from Untappd was geocoded into geographic coordinates, country, state, and city.
The project was approached with the goal of producing valuable market research to consider when opening a brewery. This research focused on four main questions:
- How do the Beer Advocate and Untappd platforms compare?
- Where is the optimal location for a brewery?
- What factors contribute to an attractive and successful beer menu?
- What are the optimal times of the day, week, and year to host an event?
3.1. Rating Platforms
To begin the analysis, the two beer rating platforms were compared using common inferential statistics tools. The purpose of the comparison was to determine which site’s beer rating data should be used for further analysis.
The plots displayed in Figure 2 show a discrepancy in average beer ratings between the two sites. Histograms of both data sources show similar overall distributions, with Beer Advocate ratings skewed more towards the right. The QQ Plot shows that Beer Advocate beer ratings were more extreme than Untappd ratings. For beers rated over 3.0, Beer Advocate averaged higher than Untappd. However, for beers averaging less than a 3.0, Beer Advocate averaged lower than Untappd. The box plot shows differences in quantiles, with each quantile being higher for Beer Advocate than for Untappd.
To test whether the beer data provided by each platform was equivalent, a paired t-test was conducted to compare the average beer Alcohol By Volume (ABV). The p-value was 0.95, well above the 0.01 significance level, so the test found no evidence that the difference in average beer ABV between the two platforms differed from zero. A correlation test performed on the datasets produced a correlation coefficient of 0.99. From this, it was concluded that the beer information from both sites was equivalent.
From the histograms and box plots, Beer Advocate ratings trended higher than Untappd ratings. As seen in Figure 3, the Beer Advocate mean beer rating was 0.14 higher than Untappd. This observation was further confirmed with a paired t-test. The p-value for comparing the average beer rating of the two sites was reported as 0.00, below the 0.01 significance level, signifying that the difference in means of the two data sets was likely not equal to zero. While the means were not the same, the ratings were highly correlated, with a correlation coefficient of 0.92. Based on this, it was concluded that the two sites rate similarly, but not the same. Because Untappd had a far greater number of ratings, it was decided to use Untappd rating data for analysis.
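The paired comparisons in this section can be sketched in a few lines of Python. This is only an illustration of the technique, not the code used for the analysis (the article does not say which tool ran the tests), and the ABV values below are made-up placeholders for the same five beers on each platform.

```python
import math

def paired_t_statistic(a, b):
    """Return (t, degrees_of_freedom) for a paired t-test of mean(a - b) = 0."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_d = sum(diffs) / n
    var_d = sum((d - mean_d) ** 2 for d in diffs) / (n - 1)  # sample variance
    if var_d == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean_d / math.sqrt(var_d / n), n - 1

def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    ss_a = sum((x - mean_a) ** 2 for x in a)
    ss_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(ss_a * ss_b)

# Hypothetical ABV values for the same beers on each platform (not real data).
ba_abv = [5.0, 6.5, 7.2, 9.0, 4.8]
ut_abv = [5.0, 6.6, 7.2, 8.9, 4.8]

t_stat, dof = paired_t_statistic(ba_abv, ut_abv)
r = pearson_r(ba_abv, ut_abv)
print(f"t = {t_stat:.3f} on {dof} degrees of freedom, r = {r:.3f}")
```

A t statistic near zero with a high correlation is the pattern described for ABV; the p-value itself would come from the t distribution with the returned degrees of freedom, which a statistics package (for example `scipy.stats.ttest_rel`) computes directly.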
3.2. Geographical Trends
The first geographical trend analyzed (Figure 4) showed the total number of brewery check-ins by country and state. The United States led the way for countries, with 357 million check-ins. Belgium (26.7M) and the United Kingdom (15.8M) were the next highest. For states, California (68.1M), Colorado (29.3M), and Michigan (25.5M) had the most total check-ins.
The average brewery ratings by country and state are shown in Figure 5. Europe and North America led again, with Switzerland (3.98), United States (3.76), and Sweden (3.71) topping the average brewery ratings. By state, Brussels (4.14), Oklahoma (4.06), and Iowa (4.04) rated the highest. Interestingly, the two states home to large macro breweries, Wisconsin and Missouri, have noticeably lower ratings than their surrounding states.
Breweries were then clustered using DBSCAN. A Nearest Neighbors Distance plot was used to determine the optimal epsilon value with 20 neighbors. As seen in Figure 6, the inflection point was located around 5, which corresponded to about 345 miles. The brewery cluster map presented in Figure 6 shows four large clusters worldwide. Three are located in the United States (East Coast, West Coast, and Southwest), and the majority of European breweries fall into a single cluster.
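The nearest-neighbor distance curve used to pick epsilon can be sketched as follows. This is a minimal illustration under stated assumptions: distances are plain Euclidean distances on (latitude, longitude) pairs measured in degrees, and one degree is taken as roughly 69 miles, which is consistent with an epsilon of 5 corresponding to about 345 miles. The article does not state the metric actually used, and the coordinates below are invented.

```python
import math

MILES_PER_DEGREE = 69.0  # rough miles per degree of latitude (assumption)

def degrees_to_miles(eps_degrees):
    """Convert an epsilon expressed in degrees to approximate miles."""
    return eps_degrees * MILES_PER_DEGREE

def k_distances(points, k):
    """Sorted distance from each point to its k-th nearest neighbor."""
    if k < 1 or k >= len(points):
        raise ValueError("k must be between 1 and len(points) - 1")
    result = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return sorted(result)

# Hypothetical coordinates: four tightly grouped points and one outlier.
pts = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
curve = k_distances(pts, k=2)
print(curve)                  # plot this curve and look for the sharp bend
print(degrees_to_miles(5))    # epsilon of 5 degrees in approximate miles
```

Plotting the sorted curve and reading off the bend is the visual step described above; points to the right of the bend are the ones DBSCAN will tend to treat as noise.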
Breweries were then grouped by class to visualize total check-ins and brewery rating, as seen in Figure 7. From the total check-in box and whisker plot, it is apparent that regional breweries gather the greatest total number of check-ins, followed by macro breweries, micro breweries, and brewpubs. In the rating box and whisker plot, micro breweries had the highest average ratings. Macro breweries had the largest range of ratings, and contract breweries and brewpubs had the smallest. Regional breweries, micro breweries, and brewpubs all had similar medians around 3.7, while the median of macro breweries was noticeably lower at around 3.2. It is worth noting that the highest rated macro breweries are likely takeovers of previously micro or regional breweries.
Based on these results, the scope was narrowed to craft breweries in the continental United States. A DBSCAN clustering with 5 neighbors and epsilon corresponding to 52 miles was used to group these breweries into clusters, which were then sorted by average cluster rank. The data, visualized in Tableau in Figure 8, shows 27 clusters that contain 393 of the 564 craft breweries in the continental United States. The top-ranked clusters were located near Miami, Florida; Tampa, Florida; and San Diego, California.
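For readers unfamiliar with DBSCAN, the grouping step can be illustrated with a small pure-Python version. This is a teaching sketch, not the implementation used for the analysis. It assumes Euclidean distance on coordinates in degrees, uses an epsilon of 0.75 degrees (about 52 miles at roughly 69 miles per degree), and counts a point as one of its own neighbors when checking `min_pts`. The coordinates are invented.

```python
import math

NOISE = -1

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE (-1)."""
    labels = [None] * len(points)

    def neighbors(i):
        # Indices of all points within eps of point i, including i itself.
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE          # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point: joins, does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reach = neighbors(j)
            if len(reach) >= min_pts:  # core point: keep growing the cluster
                queue.extend(reach)
    return labels

# Hypothetical brewery coordinates in degrees (not real locations).
breweries = [(0, 0), (0, 0.5), (0.5, 0), (0.5, 0.5),
             (5, 5), (5, 5.5), (5.5, 5),
             (20, 20)]
labels = dbscan(breweries, eps=0.75, min_pts=3)
print(labels)
```

Breweries labeled `-1` belong to no cluster, which mirrors the craft breweries in the article that fell outside the 27 clusters.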
After considering this data, Southwest Michigan was chosen as the optimal location for a new brewery. The Southwest Michigan cluster ranked first among clusters for average total check-ins, at 2.2 million per brewery, and fourth among clusters for average rating, at 3.88. Michigan ranked third among states worldwide for total check-ins, with 25.5 million, and fourteenth for average rating, at 3.85.
Additionally, the surrounding area contained other clusters nearby, as seen in Figure 9, including Southeast Michigan, Chicago, and Madison, which suggested this broad region is a brewery hotspot.
3.3. Beer Menu
Beer Advocate provides the option for users to rate individual components of a beer (taste, palate, aroma, and appearance) in addition to giving the beer an overall rating. Using these four components, a model to predict overall rating was produced. A summary of this model is shown in Figure 10. All four predictor variables had p-values of 0.00, indicating that they were significant. In the prediction model, taste and palate carry the greatest weight in predicting the overall beer rating.
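A model of this kind can be sketched as an ordinary least-squares fit of overall rating on the four component ratings. The article does not give the model form or its coefficients here (they are in Figure 10), so the linear form, the coefficient values, and the ratings below are all assumptions chosen only to show the mechanics.

```python
def fit_ols(X, y):
    """Least-squares coefficients [intercept, b1, ..., bp] via normal equations."""
    rows = [[1.0] + list(r) for r in X]          # prepend intercept column
    p = len(rows[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))]
         for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("predictors are collinear; no unique fit")
        A[col], A[pivot] = A[pivot], A[col]
        pivot_val = A[col][col]
        A[col] = [v / pivot_val for v in A[col]]
        for r in range(p):
            if r != col:
                factor = A[r][col]
                A[r] = [vr - factor * vc for vr, vc in zip(A[r], A[col])]
    return [A[i][p] for i in range(p)]

# Made-up coefficients: intercept, taste, palate, aroma, appearance.
TRUE = [0.1, 0.5, 0.3, 0.1, 0.05]
# Made-up component ratings: (taste, palate, aroma, appearance).
X = [(3.0, 3.0, 3.0, 3.0), (4.0, 3.0, 3.0, 3.0), (3.0, 4.0, 3.0, 3.0),
     (3.0, 3.0, 4.0, 3.0), (3.0, 3.0, 3.0, 4.0), (4.5, 4.0, 4.0, 4.5),
     (2.0, 2.5, 3.5, 5.0)]
# Noise-free overall ratings generated from the made-up coefficients.
y = [TRUE[0] + sum(b * x for b, x in zip(TRUE[1:], row)) for row in X]

coefs = fit_ols(X, y)
print([round(c, 3) for c in coefs])
```

Because the synthetic ratings contain no noise, the fit recovers the made-up coefficients; with real reviews the relative size of the fitted coefficients is what indicates how much weight each component carries.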
Taste, palate and overall reviews were then used to produce an interaction plot as seen in Figure 11. Two interesting interactions can be seen in the plot. First, a high palate review can make up for a low taste review. Second, a high taste review cannot make up for a low palate review.
Two common beer statistics, ABV and International Bitterness Units (IBU), were plotted against each other and colored by average rating (Figure 12). No significant trends emerged between the variables, but in general, as ABV and IBU increased, average beer rating also increased.
Beer styles were then filtered to the ten with the highest average rating and the ten most frequently checked in. As seen in Figure 13, three styles were common to both graphs: New England IPA (highest rated, seventh most frequent), American Imperial Stout (second highest rated, fourth most frequent), and American IPA (tenth highest rated, most frequent).
The most common beer styles were then clustered by brewery location (for the continental United States) using DBSCAN, as shown in Figure 14. The most common style, American IPA, is produced nationwide, with most breweries included in a cluster. The next most common style, American Imperial IPA, shows a decrease in the number of breweries in a cluster. As the styles become less common, fewer breweries fit into clusters. An exception is the American Wild Ale style, where most breweries producing the style exist in concentrated regions throughout the country, including a cluster in the Michigan area. The highest rated style, New England IPA, is the most regionally tied style, with only one cluster, located in the New England region.
The top ten most frequent words appearing in the Untappd beer descriptions were filtered and colored by average rating as seen in Figure 15. The most frequent descriptive words were hop and malt, while the highest rated words were chocolate and sour.
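The word tally can be sketched as follows. This is a minimal illustration, assuming each word is counted once per beer and that a short hand-written stop-word list is enough; the article does not describe its tokenization or whether variants such as "hops" and "hoppy" were merged, and this sketch does not merge them. The descriptions and ratings are invented.

```python
import re
from collections import Counter

# Small illustrative stop-word list (assumption; a real list would be longer).
STOP_WORDS = {"a", "an", "and", "the", "with", "of", "in", "this", "is", "to"}

def top_words(beers, n=10):
    """Return [(word, beer_count, mean_rating)] for the n most frequent words.

    beers: iterable of (description, average_rating) pairs.
    """
    counts = Counter()
    rating_sums = Counter()
    for description, rating in beers:
        words = set(re.findall(r"[a-z]+", description.lower())) - STOP_WORDS
        for word in sorted(words):       # sorted for a deterministic order
            counts[word] += 1
            rating_sums[word] += rating
    return [(w, c, rating_sums[w] / c) for w, c in counts.most_common(n)]

# Hypothetical descriptions and average ratings (not real Untappd data).
sample = [
    ("Hop forward IPA with citrus hop aroma", 4.0),
    ("Chocolate stout with roasted malt", 4.4),
    ("Pale ale with malt and hop balance", 3.6),
    ("Sour ale aged with cherries", 4.2),
    ("Crisp lager with a light hop finish", 3.5),
]
ranked = top_words(sample, n=3)
for word, count, mean_rating in ranked:
    print(word, count, round(mean_rating, 2))
```

Sorting the same tuples by mean rating instead of count gives the "highest rated words" view described above.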
Several concluding generalizations were made to aid in the planning of a potential beer menu. First, the taste and palate characteristics of a beer carry the greatest weight in its overall rating. Second, the words hop, chocolate, and sour should be emphasized in beer descriptions and flavor profiles. Third, beer with higher ABV and IBU tends to rate higher. Fourth, the four most frequently checked-in styles (American IPA, American Imperial IPA, American Imperial Stout, American Pale Ale) are considered menu necessities. Finally, the brewery can stand out by producing niche styles such as New England IPA and American Wild Ale. New England IPA was the highest rated style in the analysis as well as the seventh most frequently checked in. Southwest Michigan offers a potential emerging market for this popular style, as its production is currently concentrated mainly in the New England region. American Wild Ale was the sixth highest rated style and is already a commonly produced style in the Michigan area, with its own cluster present.
Using the Beer Advocate individual check-in data, heat maps were produced to visualize the optimal times to host events at the brewery. As seen in Figure 16, evenings from five to eight were the most popular times to check in beers. On weekends, this period extends earlier, starting around noon. Figure 16 also shows that weekends in November, December, and January provided the highest check-in frequency. Additionally, any day of the week in December had a high frequency of beer check-ins.
From this data, it is concluded that the optimal time to host a brewery event is during the evening from five to eight, on a weekend in November, December or January, or any day of the week during December. In addition, December offers a prime time to open a brewery, with consistently high check-ins every day of the week.
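The counts behind a heat map like this can be sketched by bucketing review timestamps by weekday and hour. This is an illustration only: it assumes the review timestamps are Unix seconds and interprets them in UTC, whereas a real analysis would need each reviewer's local time, which the article does not discuss. The timestamps are invented.

```python
from collections import Counter
from datetime import datetime, timezone

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

def checkin_grid(timestamps):
    """Count reviews per (weekday name, hour) from Unix timestamps in seconds."""
    grid = Counter()
    for ts in timestamps:
        dt = datetime.fromtimestamp(ts, tz=timezone.utc)  # UTC is an assumption
        grid[(WEEKDAYS[dt.weekday()], dt.hour)] += 1
    return grid

def to_ts(year, month, day, hour, minute):
    """Helper for building example timestamps (hypothetical data only)."""
    return datetime(year, month, day, hour, minute, tzinfo=timezone.utc).timestamp()

# Hypothetical review times: three on a Saturday evening, one on a Wednesday.
reviews = [
    to_ts(2011, 12, 3, 18, 30),
    to_ts(2011, 12, 3, 18, 45),
    to_ts(2011, 12, 3, 19, 5),
    to_ts(2011, 12, 7, 18, 10),
]
grid = checkin_grid(reviews)
for (day, hour), count in sorted(grid.items()):
    print(day, hour, count)
```

Replacing the hour with `dt.month` in the key gives the weekday-by-month grid that the second heat map summarizes.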
The research and analysis performed in this project led to several key concluding points. The comparison of Beer Advocate and Untappd datasets shows that the two sites contain the same beer information and rate similarly, but not the same. However, Untappd is a far more utilized rating medium. The optimal location to open a craft brewery is concluded to be Southwest Michigan due to the high frequency of beer check-ins and high average brewery rating from breweries in this region. When developing a beer menu, it is critical to produce beers with both good taste and palate. When deciding on which beer styles to choose, focus on styles that are both highly rated and frequently checked-in: American IPA, American Imperial IPA, American Imperial Stout, American Pale Ale, New England IPA, and American Wild Ale. Optimal times to plan events are evenings, weekends and the month of December.
The data sets used for this analysis have some limitations that should be acknowledged. First, the scraped data sets do not include all beers and breweries in the United States or the world, so the data was limited to only the most popular. Second, the Beer Advocate individual ratings data set is outdated given the growth of the craft beer industry. Third, the correlation of beer check-ins and rating data to actual production or sales volume is unknown.
I’ve always wondered why New Glarus chooses to only sell their beer in Wisconsin, and recently I found out that it is because they do not have the capacity to meet demand if they expand to other cities. Interesting that after over 25 years they haven’t expanded yet.
Very interesting information I have known about New Glarus my whole life and since it is so popular in Wisconsin I just never thought about it only being here.
Yeah this is interesting! It is so popular here and I know people back in Minnesota love it too! Wonder if they will ever expand
I really liked the visuals in this article. The data that was analyzed is unique and provided an interesting view on breweries.
I found this research to be a unique approach in determining where to start up a brewery. One concern I have with Southwest Michigan is the high average brewery rating. The high average rating concerns me because I view this as a large barrier to entry in the market. However, the high number of check-ins is promising. It would be interesting to investigate other regions in the U.S. with a similar number of check ins but a lower average rating because this could present an even better opportunity for a start up brewery to take over the local market.
I found this research very intriguing. I liked the visuals and thought that the dataset was very vast.
I thought what was most interesting was the heat map you provided in Figure 16 on the number of individual Beer Advocate check-ins (1996 – 2011) by day of week and month. It shows that Sundays in November, December, and January are the most popular times to visit breweries. Initially I was surprised by this because I assumed people would be more willing to go to breweries in the summertime but maybe it has to do with people wanting to watch football games on Sundays in the wintertime.
I think it is intriguing that Milwaukee has some red area for the brewery quality. I have not been to a lot of breweries myself but I guess I always just assumed we have nice ones, although perhaps I would guess most of the ones in Milwaukee are older, so perhaps that has something to do with it.
For your first graph, the data starts in 1873, which is a historical peak of 4,131 breweries. I am curious as to what the data for number of breweries before 1873 would show and what it would look like. I do not know if data before 1873 exists but I was quite surprised at how many active breweries were in America this early on in our history. I guess beer has really always been a large part of America’s culture. Overall, great work and great visuals!
Interesting how number of check-ins is significantly low even on Sundays in April and June, wonder why that is!
This analysis is pretty awesome. I do not find it surprising at all given the increase in popularity of IPAs and the hipster culture that has been arising for the past 10 years.
I found it interesting that an optimal time to have a brewery would be between 5-8. This makes sense since that’s when most people get off of work and are more likely to “throw a couple back”.
Thank you for your data analysis on breweries! I am very curious as to why Wisconsin has consistently low ratings, despite the fact that it is home to several large macro breweries, which you indicated as a positive factor. Could it be because the ratings are not suited to Wisconsin? For example, the taste of beer makes up the majority of the weighted rating, but not everyone likes the taste of Wisconsin beer. Even though it is carefully prepared, the taste is suited for Wisconsinites. This could open up further questions as to how different geographical regions rate their beer (possibly different weights).
A very interesting article about beer indeed. I am surprised that the USA ranks so highly in the beer ratings in comparison to some of the classical European counterparts. I also liked your approach towards DBSCAN clustering and clustering them based on their beer styles. Did you try Agglomerative or K-medoids clustering? I’d reckon much better clusters could be formed if the “ward” or “complete” linkage hyperparameters were taken into consideration. This was just a side thought about the clustering though. Overall, I am thoroughly impressed by this article.
A very comprehensive market outlook for the craft beer industry. One thing I would question is if the number of check-ins is actually representative of the number of customers going to the breweries. Additionally, looking at which types of beer and areas have the largest tab (which beers make people want to buy more) would be insightful for the market outlook.
This is very intriguing. If you don’t mind me asking, what inspired you to write on this topic?
This article was extremely interesting and offered a lot of insight into all the criteria behind starting a brewery. I thought that the heat map of brewery ratings was really unique. The pattern of highly rated breweries around southern Michigan and Chicago surprised me because they outranked Wisconsin, and Wisconsin is known for its drinking culture.
Very interesting topic! Love how you made it so relatable to many different areas around the world. Cool to see that US had better ratings than places where you would expect otherwise.