Predicting Housing Prices with ARIMA
For my most recent project, I was assigned the task of analyzing a set of housing prices from various ZIP codes. The goal was to find the top five ZIP codes for investment using ARIMA modeling. I'd like to use this space to walk through my thought process, the methods I used, and in particular the exploratory data analysis step, which turned out to be especially interesting in this case.
The first point of interest with this problem is that the question is intentionally vague, so I had to make a few decisions before I could even attempt an answer. First, to evaluate the best investment I needed to choose a metric. I decided to use return on investment (ROI) to measure profitability, adjusted for risk. I chose five years as the time frame to target, since this is a fairly standard investment period and it's short enough that the model can still predict with decent accuracy. Next, I also needed a metric for evaluating my models, and the one I focused on is the Akaike information criterion, or AIC. AIC is nice here because it rewards goodness of fit while penalizing model complexity (AIC = 2k - 2ln(L), where k is the number of parameters and L is the maximized likelihood), so relying on it helps guard against both overfitting and underfitting.
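To make the risk adjustment concrete, here is a minimal sketch of the kind of calculation I have in mind. The specific adjustment shown, dividing the five-year ROI by the volatility of monthly returns, is just one reasonable choice for illustration rather than a definitive formula:

```python
import numpy as np
import pandas as pd

def risk_adjusted_roi(prices: pd.Series, horizon_months: int = 60) -> float:
    """Total return over the last `horizon_months`, divided by the
    volatility of monthly returns as a simple risk penalty (illustrative)."""
    window = prices.iloc[-horizon_months:]
    roi = (window.iloc[-1] - window.iloc[0]) / window.iloc[0]
    volatility = window.pct_change().dropna().std()
    return roi / volatility if volatility > 0 else roi

# Quick check on a synthetic monthly price series
rng = np.random.default_rng(0)
prices = pd.Series(200_000 * np.cumprod(1 + rng.normal(0.004, 0.01, 120)))
print(round(risk_adjusted_roi(prices), 2))
```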
Now that those metrics are in place, I'd like to give a brief overview of the data. This dataset from Zillow is pretty straightforward: it contains a monthly time series of prices for each ZIP code, spanning April 1996 to April 2018. I also pulled in a bit of additional data from the Census Bureau so that I could have the population of each state in both 1996 and 2018. To narrow down the dataset and spend more computation time on the relevant ZIP codes, I filtered out a large portion of the data based on ROI between 1996 and 2018. This is by far the largest dataset I've worked with so far, so it was important to cut out irrelevant data wherever possible.
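As a rough sketch of that filtering step, something like the pandas snippet below works. The file name, the column layout, and the quartile cutoff are all assumptions for illustration, not the exact values from my notebook:

```python
import pandas as pd

# Assumed layout: one row per ZIP code, one price column per month
# (e.g. '1996-04' through '2018-04'); the real Zillow export is similar.
df = pd.read_csv("zillow_median_prices.csv")  # hypothetical file name

df["roi_1996_2018"] = (df["2018-04"] - df["1996-04"]) / df["1996-04"]

# Keep only the top quartile of ZIP codes by historical ROI so the
# expensive ARIMA work later is spent on plausible candidates.
cutoff = df["roi_1996_2018"].quantile(0.75)
filtered = df[df["roi_1996_2018"] >= cutoff]
print(f"{len(df)} -> {len(filtered)} ZIP codes after filtering")
```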
At this point it was time to move into the most interesting part of this particular project: exploratory data analysis. My main goal here was to check for any obvious patterns or problems hidden in the data. Given that this was time series data, and my first time series project, there were some new techniques I was curious to try. First I took a general look at the filtered dataset by plotting the average of the time series in red below. I also plotted a few random time series from the filtered data in blue and the average of the unfiltered time series in green. This gave me a general idea of the trend in prices during this period: a consistent upward trend with a significant dip during the financial crisis.
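The plot itself is only a few matplotlib calls. The sketch below assumes the hypothetical `df` and `filtered` DataFrames from the filtering snippet above and recreates the red/blue/green lines:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Date columns are the ones that parse as a year-month (assumed layout from above).
date_cols = [c for c in df.columns if c[:2] in ("19", "20")]
dates = pd.to_datetime(date_cols)

fig, ax = plt.subplots(figsize=(10, 5))
for _, row in filtered.sample(5, random_state=0).iterrows():
    ax.plot(dates, row[date_cols], color="blue", alpha=0.3)  # random filtered ZIPs
ax.plot(dates, filtered[date_cols].mean(), color="red", label="filtered average")
ax.plot(dates, df[date_cols].mean(), color="green", label="unfiltered average")
ax.set_xlabel("Date")
ax.set_ylabel("Median price ($)")
ax.legend()
plt.show()
```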
Next, I wanted to see whether any states were disproportionately represented in the filtered dataset. Below, I made a scatterplot with percent population change on the x-axis and percent change in home price on the y-axis. The dots are randomly colored, and their size depends on each state's adjusted share of the dataset. The goal is to see whether there is any obvious strong relationship between share size and either axis. The graph seems to show a weak relationship, but nothing disruptive to the analysis.
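For reference, the encoding of that scatterplot (dot size for dataset share, random colors) looks roughly like this. The per-state numbers here are randomly generated placeholders, not the real Census or Zillow figures:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Placeholder per-state summary: random values standing in for the real
# population change, price change, and share-of-dataset figures.
rng = np.random.default_rng(1)
by_state = pd.DataFrame({
    "pop_pct_change": rng.uniform(0, 50, 20),
    "price_pct_change": rng.uniform(50, 250, 20),
    "share": rng.dirichlet(np.ones(20)),
})

fig, ax = plt.subplots()
ax.scatter(by_state["pop_pct_change"], by_state["price_pct_change"],
           s=5000 * by_state["share"],          # dot size tracks share of dataset
           c=rng.random((len(by_state), 3)))    # random colors, as in the post
ax.set_xlabel("Population change 1996-2018 (%)")
ax.set_ylabel("Home price change 1996-2018 (%)")
plt.show()
```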
For the next set of graphs below, I wanted to check for any strong seasonal trends. To do this, I restricted the x-axis to a single calendar year and plotted each year of data as its own line. Interestingly, this initial look didn't show any strong seasonal relationship. Rather, it showed that most years trended steadily upward, with a tangle of lines crossing each other between $600,000 and $800,000. Thinking back to the longer time series, it was fairly clear what was happening, so on the next graph I colored the lines for the years affected by the financial crisis red. This makes for a much clearer picture: every line outside of this period trends up, and the affected years all trend down. An even clearer picture appears when the beginning of each series is set to zero, as in the third graph.
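Here is a minimal, self-contained sketch of that third graph, using a synthetic monthly series in place of the real filtered average. Which years count as "crisis years" in the highlighting is also an assumption (roughly 2007 through 2011):

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the filtered-average series (monthly, 1996-2018).
idx = pd.date_range("1996-04-01", "2018-04-01", freq="MS")
rng = np.random.default_rng(2)
avg = pd.Series(300_000 * np.cumprod(1 + rng.normal(0.004, 0.01, len(idx))), index=idx)

fig, ax = plt.subplots(figsize=(10, 5))
for year, group in avg.groupby(avg.index.year):
    # Re-zero each calendar year so the within-year shapes are comparable;
    # crisis years (2007-2011 here, my assumption) are drawn in red.
    color = "red" if 2007 <= year <= 2011 else "steelblue"
    ax.plot(group.index.month, group.values - group.values[0], color=color, alpha=0.7)
ax.set_xlabel("Month of year")
ax.set_ylabel("Price change since start of year ($)")
plt.show()
```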
Now for a brief overview of the steps I took in modeling this problem. First, I tuned the parameters of an ARIMA model so that it was optimized for the filtered average (red) line from the first graph. The idea was that once I had a model that should fit most of the individual ZIP codes at least reasonably well, I could use it to make predictions for each ZIP. Next, I took those predictions, calculated the risk-adjusted ROI for each ZIP, and sorted the ZIP codes by that value. I then took the top few ZIP codes and tuned a separate model for each one so that they were all individually optimized. Comparing the risk-adjusted ROI for these ZIP codes gave me my final results.
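The tuning step is essentially a grid search over ARIMA orders scored by AIC. The sketch below shows the idea with statsmodels; the search ranges, the synthetic input series, and the helper function name are my own choices for illustration, not necessarily the exact setup I ran:

```python
import itertools
import warnings

import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

def best_arima_order(series: pd.Series, max_p=3, max_d=2, max_q=3):
    """Small grid search over (p, d, q), keeping the order with the lowest AIC."""
    best_order, best_aic = None, np.inf
    for p, d, q in itertools.product(range(max_p + 1), range(max_d + 1), range(max_q + 1)):
        try:
            with warnings.catch_warnings():
                warnings.simplefilter("ignore")
                fit = ARIMA(series, order=(p, d, q)).fit()
            if fit.aic < best_aic:
                best_order, best_aic = (p, d, q), fit.aic
        except Exception:
            continue  # some orders fail to converge; skip them
    return best_order, best_aic

# Synthetic monthly series standing in for one ZIP code's prices
idx = pd.date_range("1996-04-01", periods=265, freq="MS")
rng = np.random.default_rng(3)
prices = pd.Series(200_000 * np.cumprod(1 + rng.normal(0.004, 0.01, len(idx))), index=idx)

order, aic = best_arima_order(prices)
forecast = ARIMA(prices, order=order).fit().forecast(steps=60)  # 5-year forecast
print(order, round(aic, 1))
```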
Overall, I found working with time series to be a lot more interesting than classification. It opened the door to a lot of exploratory data analysis ideas that I hadn't really thought about before. It was also an interesting project because of how massive the dataset was compared to what I've worked with before, especially considering how long it can take to run an ARIMA model. This has definitely been a first for me in terms of having to hyperfocus on computation time and write the most efficient code possible due to how long everything would take if I didn't. It's actually a bit unfortunate that this is the project where this problem presented itself, because I think the ability to do more grid searches and tune more models really would have helped hone my results here. But regardless, I was very happy with how everything turned out, and this project has added some interesting exploratory data analysis techniques to my toolkit which I think will come in handy when working with time series data in the future.