Location, Location, Location: Real Estate Data Analysis

For my most recent data science project, I was tasked with creating a business case around a provided dataset and then solving it. The data described a few years of home sales in King County, Washington, along with plenty of variables describing each home sold. I knew right away that I wanted to use price as my outcome variable, since most business cases that came to mind would use it as the deciding factor. Examining the rest of the variables in the dataset, I found that most had a fairly obvious relationship to price (at least from a high-level overview). For example, I figured that more bedrooms, bathrooms, or square footage would all lead to higher prices, so these didn't really interest me for a business case. What caught my eye instead was the location data associated with each entry. Every house had fairly precise latitude and longitude coordinates as variables, as well as a ZIP code. These were two components that I thought had no obvious relationship with price, so I was interested in seeing whether that held true in practice. With that I had my business case: I would create a model to predict home prices for potential buyers, with an emphasis on using geographic location to maximize value. There were three specific questions that I planned on answering.

1.  What is the most cost-effective location in King County?
2.  How much money can be saved based solely on location?
3.  Are there any other savings opportunities hidden in the data?

My first order of business was to create some new variables. There were 70 different ZIP codes included in the dataset, which made visualizing them somewhat messy. To remedy this, I ranked them in order from lowest average price to highest. I then split this ranking into ten groups of seven and called the resulting variable ZIP group. This resulted in a much more legible visualization. Because I already had latitude and longitude, viewing the data in map format was fairly straightforward. With the ZIP groups providing some clarity on pricing, I was able to estimate an "epicenter" of sorts where pricing seemed highest. Below is one of my initial charts, which essentially displays a map of the county with a darker hue assigned to higher ZIP groups. The epicenter is marked with a red X.
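For reference, here is a minimal sketch of how that grouping step could be done in pandas. The file name and the 'zipcode' and 'price' column names are assumptions about the dataset, not necessarily what I used verbatim:

```python
import pandas as pd

df = pd.read_csv("kc_house_data.csv")  # assumed file name

# Rank the 70 ZIP codes by average sale price, lowest to highest
zip_rank = df.groupby("zipcode")["price"].mean().rank(method="first")

# Split the ranking into ten groups of seven ZIP codes each (0 = cheapest group)
zip_group = ((zip_rank - 1) // 7).astype(int)

# Map each sale's ZIP code to its group
df["zip_group"] = df["zipcode"].map(zip_group)
```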
Satisfied with my estimate, I decided to create a new variable for each entry called 'distance'. Here, I simply used the Pythagorean theorem to calculate the distance between the epicenter and each entry based on their latitude and longitude. The thought was that this should have a fairly linear relationship with price, since the further from the epicenter I looked on the map, the lower prices tended to be.
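A quick sketch of that calculation, assuming 'lat' and 'long' columns and using purely illustrative epicenter coordinates rather than the exact point I marked on the map:

```python
# Estimated "epicenter" of the priciest area (illustrative coordinates only)
EPI_LAT, EPI_LONG = 47.63, -122.23

# Straight-line (Pythagorean) distance, in degrees, from each home to the epicenter
df["distance"] = ((df["lat"] - EPI_LAT) ** 2 + (df["long"] - EPI_LONG) ** 2) ** 0.5
```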

Time is always an interesting aspect of data to explore, and I had a hunch that there might be some hidden savings associated with it, but the issue with this dataset was that it only spanned a couple of years. I doubted that there would be any meaningful time series relationships over such a short span. Instead, I decided to examine time as a categorical variable. Since every entry had a sale date associated with it, I was able to assign a 'season' value to each entry based on the month. For reference, December through February was winter, March through May was spring, June through August was summer, and September through November was autumn. Once each season was assigned a value from zero to three, I used this column to create three dummy variables. I created three (for summer, winter, and autumn) instead of all four in order to avoid perfect multicollinearity, leaving spring as the baseline.
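Here is roughly how the season column and its dummies could be built, assuming the sale date lives in a 'date' column:

```python
# Map each sale month to a season: 0 = winter, 1 = spring, 2 = summer, 3 = autumn
def month_to_season(month):
    if month in (12, 1, 2):
        return 0  # winter
    if month in (3, 4, 5):
        return 1  # spring
    if month in (6, 7, 8):
        return 2  # summer
    return 3      # autumn

df["season"] = pd.to_datetime(df["date"]).dt.month.map(month_to_season)

# Dummies for summer, winter, and autumn only; spring is the omitted baseline,
# which avoids perfect multicollinearity
df["summer"] = (df["season"] == 2).astype(int)
df["winter"] = (df["season"] == 0).astype(int)
df["autumn"] = (df["season"] == 3).astype(int)
```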

So with my ZIP group, distance, and season columns, I felt that I was ready to create the model using an OLS regression. In order to ensure a well-fitted model, I tested each of the other variables with linear regression to see which would make the best predictors. I included a few of the best-fitting variables along with my own new ones, and the resulting model had an adjusted R-squared value of 0.762. I was comfortable with this score and model, so the next step was to see what I could learn from it.
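The model-fitting step could look something like this with statsmodels; the extra predictors listed here are guesses at which original variables fit best, not the exact set I settled on:

```python
import statsmodels.api as sm

# Engineered features plus a few of the stronger original variables (assumed list)
features = ["zip_group", "distance", "summer", "winter", "autumn",
            "sqft_living", "grade", "bathrooms"]

X = sm.add_constant(df[features])
y = df["price"]

model = sm.OLS(y, X).fit()
print(model.summary())  # reports the adjusted R-squared among other statistics
```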

The first issue to address was finding the most cost-effective location in the county. I now had coefficients for ZIP group and distance, so I decided to group the data by ZIP code and find the one with the lowest result when these two features were combined. The members of each ZIP code all had the same ZIP group, and for the distance I simply used the mean value. I then multiplied the ZIP group by its coefficient, and did the same with the mean distance value for each ZIP code. What this resulted in, essentially, was a "discount" or "premium" compared to the base model, based solely on location. From here, all I needed to do was find the minimum and maximum values in order to see which ZIP code was the most or least cost-effective. I've mapped out the "winner" and "loser" below: ZIP code 98023 had the lowest cost associated with it, and 98039 had the highest.
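In code, that per-ZIP comparison might look like this, continuing from the fitted model sketched above:

```python
coef = model.params

# Per-ZIP location effect: ZIP group times its coefficient, plus the mean
# distance times the distance coefficient
zip_summary = df.groupby("zipcode").agg(
    zip_group=("zip_group", "first"),    # every sale in a ZIP shares one group
    mean_distance=("distance", "mean"),
)
zip_summary["location_effect"] = (
    zip_summary["zip_group"] * coef["zip_group"]
    + zip_summary["mean_distance"] * coef["distance"]
)

# Lowest effect = biggest "discount"; highest effect = biggest "premium"
cheapest_zip = zip_summary["location_effect"].idxmin()
priciest_zip = zip_summary["location_effect"].idxmax()
```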
With all of this complete, I was ready to answer the three questions from earlier.

1.  What is the most cost-effective location in King County?

As previously mentioned, this would be ZIP code 98023.  The predicted discount for this area came out to be $141,602.  This is the ZIP code I would recommend to home buyers looking to maximize cost effectiveness.  

2.  How much money can be saved based solely on location?

In order to answer this, I also needed to examine the least cost-effective ZIP code, 98039. Rather than a discount, this area had a predicted premium compared to the base model, which came out to $325,254. This means that the difference between the minimum and maximum predicted values is $466,857, and this is the largest amount that the model predicts can be saved solely by choosing one location over another. The recommendation here would be to avoid ZIP code 98039 if the buyer is aiming to maximize cost-effectiveness.

3.  Are there any other savings opportunities hidden in the data?

Yes, the hunch about time data turned out to be correct!  According to the model, spring is the most expensive season to buy a home.  Choosing any season besides spring will result in saving ~$20,000, with some slight variation between the remaining seasons.  I would recommend purchasing in autumn, as this had the highest predicted savings at $24,048.  Below is a graph showing the savings for purchasing in each season instead of spring.  
As can be seen, there is a lot of useful information that can be gathered solely from geographic location.  With a little creativity, it can be used as an effective predictor of home prices, and can definitely make a large difference in cost-effectiveness when purchasing a home.  While this specific model only applies to King County, I hope this post has shown that it can always be helpful to examine trends and explore data.  Who knows, it could even save you some money!



