Authors: Michael Hickey, Ben Razidlo

## I. Introduction

New information processing and data analytics techniques will be explored to augment insights traditionally provided by the census process. Census information has long been used as a mechanism for acquiring and recording information about the members of a given population. Data-age information processing technologies, image recognition, and prediction algorithms can provide new opportunities to build models and make inferences regarding a population’s current situation and future trends.

The overwhelming scale of available data in today’s world often outweighs the ability of stakeholders to utilize and understand the insight it can provide. The overarching goal of the analysis is to explore what a city’s buildings can tell us about its inhabitants, and from this to provide a recommendation for real estate investment. Specifically, its two objectives are:

- **Neighborhood Real Estate Trend Analysis**: Explore the use of this data set and data analytics tools to inform a potential business decision.
- **Building Geometry Correlation Analysis**: Identify which building geometry parameters are the strongest predictors of the demographic, economic, and health and well-being characteristics of the neighborhood’s inhabitants.

## II. Analysis

### 2.1. Dataset Description

#### 2.1.1. Microsoft Building Data

In 2018, Microsoft released computer-generated building footprint data for all 50 US states [GitHub: USBuildingFootprints]. This effort leverages a deep convolutional neural network (DNN) architecture, ResNet34, originally developed at Microsoft Research for object recognition and dense classification problems. Satellite imagery is first processed by the ResNet34-based network and then interpreted as polygons using a rule-based prediction algorithm.

This data is freely available for download and use. In October 2018, the New York Times published an interactive article visualizing the data across different locales in the US [link]. Specific geometric parameters that can be derived from this data include building area, perimeter, roundness, number of corners, people per building, and percentage of area covered by buildings.

#### 2.1.2. Washington DC Neighborhood Data

An expansive data set from a wide variety of sources has been stitched together by Urban-Greater DC [Data Explorer, Methodology and Sources, Shapefiles], an organization funded by the Urban Institute that curates, cleans, and stores data grouped by small geographic areas corresponding to 39 distinct neighborhood groups in the city. Among the data are shapefiles of the neighborhoods and corresponding statistics on demographics, income, education, health, and well-being. In all, there are 348 metrics for each neighborhood group, including individual metrics extracted at different time periods (e.g. Population 2000, Population 2010) and those derived by the authors. The joined building and neighborhood data are depicted below in Figure 2.

### 2.2. Software Used

To clean, simplify, and join the data from USBuildingFootprints and Urban-Greater DC, the authors used R, including the libraries “sf” for reading shapefiles and “geojsonio” for reading GeoJSON files. The “ggplot2” and “ggmap” libraries were used to produce some of the maps and plots in this report and the associated slides. R was also used to apply clustering methods to group the data and to mine the data for correlations. Once the data were cleaned and joined, the authors used Tableau 2018.3 to visualize the data.

### 2.3. Approach

#### 2.3.1. Data Conditioning

The workflow for reading, cleaning, and joining the datasets (using R) is as follows:

- Read in building shapefiles
- Group points (building corners) by building
- Calculate geometric parameters (e.g. area, perimeter)

- Read in neighborhood shapefiles
- Derive area of each neighborhood

- Determine the neighborhood each building is located in
- Map building data with neighborhood data by neighborhood identifier
- Derive additional parameters that require metrics from both datasets (e.g. people per building)

Using this process, the authors produced visualizations of the buildings within each specified neighborhood group. Figure 3 depicts the neighborhoods of Downtown, Adams Morgan, and Anacostia, each of which consists of visually different building styles, shapes, and patterns.
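The geometric steps of the workflow above can be sketched in a few lines. The authors performed this work in R; the Python below is only an illustrative sketch, not their code. It assumes footprints are already expressed in a planar (projected) coordinate system, that a building is assigned to a neighborhood by the average of its corner points, and that roundness means 4πA/P², a common compactness measure that the report does not define. All function names are hypothetical.

```python
import math


def polygon_metrics(corners):
    """Return area, perimeter, corner count, and roundness for one footprint.

    `corners` is a list of (x, y) vertices in planar coordinates, listed in
    order around the outline without repeating the first point.
    """
    n = len(corners)
    twice_area = 0.0
    perimeter = 0.0
    for i in range(n):
        x1, y1 = corners[i]
        x2, y2 = corners[(i + 1) % n]
        twice_area += x1 * y2 - x2 * y1            # shoelace formula term
        perimeter += math.hypot(x2 - x1, y2 - y1)  # edge length
    area = abs(twice_area) / 2.0
    # Assumed definition of roundness: 4*pi*A / P^2 (equals 1.0 for a circle).
    roundness = 4.0 * math.pi * area / perimeter ** 2
    return {"area": area, "perimeter": perimeter,
            "corners": n, "roundness": roundness}


def point_in_polygon(point, polygon):
    """Ray-casting test: True if `point` lies inside `polygon`."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def assign_neighborhood(building, neighborhoods):
    """Return the key of the first neighborhood containing the building.

    The building is represented by the average of its corner points, which is
    an assumption of this sketch. `neighborhoods` maps an identifier to an
    outline given as a list of (x, y) vertices.
    """
    cx = sum(x for x, _ in building) / len(building)
    cy = sum(y for _, y in building) / len(building)
    for name, outline in neighborhoods.items():
        if point_in_polygon((cx, cy), outline):
            return name
    return None
```

Footprints stored as longitude and latitude would need to be projected before the area and perimeter values are meaningful in units such as square meters.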

#### 2.3.2. Clustering Neighborhoods by Wealth

To assess investment opportunities, neighborhoods were clustered as a function of wealth for independent analysis. Wealth is one of the strongest attributes that define a neighborhood’s characteristics and provides a strong discriminator for analysis of investment opportunities. The most recent demographic data available, Average Family Income from 2007-2011, was taken from the Urban-Greater DC data site and grouped into wealthiest, average, and lowest-income categories by k-means clustering, then verified with a hierarchical analysis.

The clustering is visually satisfying and was confirmed using an agglomerative hierarchical clustering algorithm, AGNES, again in R.

The two clustering methods produce nearly identical results when the neighborhoods are clustered into three wealth categories; the only difference is the assignment of the 36th-ranked neighborhood. For the rest of this project, the 36th-ranked neighborhood is categorized in the wealthiest bracket.
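To illustrate how a single income metric can be split into three groups, the following is a minimal sketch of one-dimensional k-means (Lloyd's algorithm) in Python. It is not the authors' R code and does not cover the AGNES verification. The quantile-based seeding is an assumption made here so the result is deterministic, and the function name is hypothetical.

```python
def kmeans_1d(values, k=3, max_iter=100):
    """Cluster one-dimensional data with Lloyd's algorithm.

    Returns (labels, centers). Centers are seeded in ascending order from
    evenly spaced positions in the sorted data (an assumption of this sketch),
    so label 0 corresponds to the lowest group and k - 1 to the highest.
    """
    ordered = sorted(values)
    n = len(ordered)
    centers = [ordered[int((i + 0.5) * n / k)] for i in range(k)]
    labels = [0] * n
    for _ in range(max_iter):
        # Assignment step: each value joins its nearest center.
        labels = [min(range(k), key=lambda j: abs(v - centers[j]))
                  for v in values]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            new_centers.append(sum(members) / len(members)
                               if members else centers[j])
        if new_centers == centers:  # converged
            break
        centers = new_centers
    return labels, centers
```

Applied to a list of average family incomes, one per neighborhood, the returned labels would correspond to the lowest, average, and wealthiest categories.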

#### 2.3.3. Correlation Methods

To further simplify the data, nearly all of which was represented as continuous numeric data, the k-means clustering method described in the previous subsection was applied to all metrics with a specification of 𝑘 = 3. This converted the continuous data into three discrete ranges. Correlating the 6 building geometry parameters with the 348 neighborhood demographic parameters required an assessment of each of the 6 × 348 = 2088 potential correlations. The chi-squared method was applied to each pair of metrics, and those with a p-value under 0.05 were considered statistically significant. The approach is depicted below in Figure 6.
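For one pair of discretized metrics, the test described above amounts to a chi-squared test of independence on a 3 × 3 contingency table. The Python sketch below is illustrative only and is not the authors' R code. It assumes every row and column of the table contains at least one observation, and it uses the closed-form chi-squared tail probability that holds for an even number of degrees of freedom (4 for a 3 × 3 table). Function names are hypothetical.

```python
import math


def contingency_table(labels_a, labels_b, k=3):
    """Count co-occurrences of two label lists with values in 0..k-1."""
    table = [[0] * k for _ in range(k)]
    for a, b in zip(labels_a, labels_b):
        table[a][b] += 1
    return table


def chi_squared_test(table):
    """Pearson chi-squared test of independence.

    Returns (statistic, degrees_of_freedom, p_value). Assumes no empty row or
    column. The closed-form p-value is valid only for even degrees of freedom.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    if df % 2 != 0:
        raise ValueError("closed-form p-value requires even degrees of freedom")
    half = stat / 2.0
    # Tail probability for df = 2m: exp(-x/2) * sum_{i < m} (x/2)^i / i!
    p_value = math.exp(-half) * sum(half ** i / math.factorial(i)
                                    for i in range(df // 2))
    return stat, df, p_value
```

A pair of metrics would be flagged as significant when the returned p-value is below 0.05, and repeating the call over all building and neighborhood metric pairs gives the full set of tests.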

## III. Results

### 3.1. Neighborhood Real Estate Trend Analysis

To inform potential investment decisions, the growth history of the neighborhoods in each wealth category was evaluated. Average family income for each neighborhood from 1980, 1990, 2000, and the most recent period, 2007-2011, was binned and averaged as a function of the populations. A trend line was fit to each category and extrapolated into the future.

The trend lines provided in Figure 6 are second-order polynomial regression fits. Forecasting real estate demand is very risky, particularly this far into the future, and additional research with professional realtors in the DC area is advised before acting on this data. Nevertheless, the available data suggest that the strongest growth is expected in the middle-class neighborhoods.
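A second-order polynomial trend line of this kind can be sketched as an ordinary least-squares fit. The Python below is illustrative only and is not the authors' code. It solves the normal equations directly, measures time in decades since 1980 to keep the system well conditioned, and represents the 2007-2011 period by its midpoint, 2009; these are all assumptions of the sketch. The sample values are made up and are not the report's data.

```python
def fit_quadratic(xs, ys):
    """Least-squares fit of y = a*x^2 + b*x + c; returns (a, b, c)."""
    # Normal equations (A^T A) beta = A^T y with columns [x^2, x, 1].
    rows = [[x * x, x, 1.0] for x in xs]
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)]
           for i in range(3)]
    aty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(3)]
    # Gaussian elimination with partial pivoting on the augmented matrix.
    m = [ata[i] + [aty[i]] for i in range(3)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, 3):
            factor = m[r][col] / m[col][col]
            for c in range(col, 4):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    beta = [0.0] * 3
    for i in (2, 1, 0):
        tail = sum(m[i][j] * beta[j] for j in range(i + 1, 3))
        beta[i] = (m[i][3] - tail) / m[i][i]
    return tuple(beta)


def predict(coeffs, x):
    """Evaluate the fitted quadratic at x."""
    a, b, c = coeffs
    return a * x * x + b * x + c


# Made-up series for illustration only (not the report's data).
decades = [0.0, 1.0, 2.0, 2.9]        # 1980, 1990, 2000, and 2009
values = [50.0, 58.0, 71.0, 88.0]     # hypothetical category averages
coeffs = fit_quadratic(decades, values)
forecast_2030 = predict(coeffs, 5.0)  # five decades after 1980
```

With only four observations per category, a three-parameter fit leaves a single degree of freedom, which is one reason extrapolations of this kind should be treated with caution.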

The expected neighborhood growth can be assessed together with the neighborhood’s current building density (Sum of Building Areas / Total Neighborhood Area) and building inhabitant density (Neighborhood Population / Number of Buildings).
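The two density measures defined above are simple ratios. As a minimal Python sketch with hypothetical function names, assuming building and neighborhood areas are expressed in the same units:

```python
def building_density(building_areas, neighborhood_area):
    """Fraction of the neighborhood's area covered by building footprints."""
    return sum(building_areas) / neighborhood_area


def inhabitant_density(population, building_count):
    """Average number of inhabitants per building in the neighborhood."""
    return population / building_count
```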

All income categories were evaluated, as shown in Figure 7, but the middle-income category is highlighted because of its strong trend in population growth and because it shows the highest building density and inhabitant density of the three groupings. These neighborhoods are tightly packed (buildings touching each other), and there is a higher density of high-occupancy residential structures such as apartment buildings. This is of interest because, if the populations of these neighborhoods continue to grow as the trends predict, living space will become scarce.

A drawback of the analysis is that the data does not distinguish residential buildings from commercial or government buildings. It is also unclear how well the image recognition algorithm can discern buildings that are so closely adjacent. Nevertheless, visual inspection of the satellite imagery corroborates the results well.

### 3.2. Building Geometry Correlation Analysis

Of the 2088 possible relationships, 399 were determined by the chi-squared test to be statistically significant. Figure 5 below plots the number of statistically significant correlations for each of the six aggregated building geometry metrics. This shows that, in many cases, the geometry of buildings in a neighborhood offers insight into its inhabitants and their societal traits. Figure 9 depicts three of these relationships between the neighborhood’s buildings and its demographic, economic, and well-being characteristics. The upper left plot shows that neighborhoods with a higher percentage of building area tend to have higher rates of property crime. The upper right plot shows that neighborhoods with higher average building areas experienced greater economic growth between 1990 and 2010. Lastly, the bottom plot shows an inverse relationship between mean building corners and percentage of children: neighborhoods with simpler buildings tend to have more children.

## IV. Conclusion

It has been demonstrated that data from many sources can be incorporated to provide meaningful insight into complex associations. This project has explored how satellite imagery can be coupled with census-like data so that we might learn what a city’s buildings tell us about its inhabitants.

This data has been conditioned and analyzed to show that middle-income neighborhoods have a growing population, and that the people currently living in them are the most tightly packed of the groups evaluated. From this analysis, a recommendation can be made that middle-income neighborhoods are the strongest candidates for real estate investment.