Revisiting Regression

 After discussing classification algorithms in last week's blog post, I felt that it would only be fair to also dedicate a post to explaining regression in greater detail.  Although my last two projects dealt with this type of problem, I never got around to discussing the particulars of regression.  So in this post, I will explain the primary uses for regression as well as describe a few specific methods.  

The primary difference between classification and regression problems is the type of output they seek to calculate.  As discussed in my previous blog post, classification problems generally produce a discrete output, or in other words one of a few distinct values.  The example used last time dealt with the iris dataset that I'm currently working on: several plant measurements are taken as inputs, which are then used to determine which of three species the flower most likely belongs to.  This is known as classification because the predetermined output values can be seen as separate "classes".  Regression, on the other hand, seeks to find a continuous value.  To put it simply, rather than choosing between zero and one, a regression is capable of calculating an exact value anywhere between the two numbers.

As you might be able to guess, there are instances where one of these output types is more useful than the other.  Imagine that there were two iris species rather than three, represented by zero and one as in the previous example.  Classification would return either a zero or a one, giving a direct answer to which species is more likely.  Regression would return a number located between these two values, such as 0.65.  This number could give us a clue as to which species is more likely (it is closer to one than zero, after all), but divvying up these decimals based on which integer is closer will likely be less accurate than classification, and it arrives at the same result once the values are rounded off.  Ending up in the same place with less accurate results is certainly not desirable, especially when a few additional steps are required just to get there.  This is a clear-cut example of classification being more useful for a specific problem than regression.
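To make the rounding idea concrete, here is a minimal sketch (assuming scikit-learn and its built-in iris data) that fits both a classifier and a plain linear regression to just two of the species:

```python
# A minimal sketch of the rounding idea above: treating the labels 0 and 1
# as numbers and fitting a regression gives in-between values that must be
# rounded, while a classifier returns the class directly.
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression, LogisticRegression

X, y = load_iris(return_X_y=True)
X, y = X[y < 2], y[y < 2]  # keep only species 0 and 1

clf = LogisticRegression(max_iter=1000).fit(X, y)
reg = LinearRegression().fit(X, y)

sample = X[:1]
print(clf.predict(sample))          # a discrete class: 0 or 1
print(reg.predict(sample))          # a continuous value near 0 or 1
print(reg.predict(sample).round())  # rounding just recovers the class label
```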

Likewise, in many instances regression is much more useful for a particular dataset.  The most common case is a situation where the output you are seeking is continuous.  Housing prices, for example, can't simply be broken down into one category for each specific price.  Additionally, if you are attempting to make future predictions, there is no way of knowing where prices will end up, meaning you would not know the output values ahead of time.  You could of course split the house values into buckets, but this sacrifices more and more accuracy as the size of the buckets increases.  A house might fall into a bucket between $500k and $600k, but you would not have access to the exact predicted value.  Generally speaking, if you are seeking a price or some other continuous output, regression will be the more useful option.
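As a rough illustration, here is a small sketch using made-up square-footage and price numbers (the figures are invented purely for demonstration) showing what the bucketing step throws away:

```python
# The exact regression prediction is informative on its own; forcing it
# into $100k buckets discards that precision.
import numpy as np
from sklearn.linear_model import LinearRegression

# hypothetical training data: square footage -> sale price
sqft = np.array([[1100], [1500], [1800], [2400], [3000]])
price = np.array([310_000, 420_000, 500_000, 640_000, 790_000])

model = LinearRegression().fit(sqft, price)
pred = model.predict([[2000]])[0]
print(f"exact prediction: ${pred:,.0f}")  # an exact dollar figure

buckets = np.arange(0, 1_000_000, 100_000)
idx = np.digitize(pred, buckets)
print(f"bucket: ${buckets[idx - 1]:,} to ${buckets[idx]:,}")  # e.g. $500k-$600k
```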

Funnily enough, many of the algorithms I mentioned in my last blog post on classification can also be used for regression.  The only difference is that rather than giving a discrete output, they give a continuous one.  I won't go into detail on these as they essentially function in the same way, but the algorithms I'm referring to are decision trees, random forests, and k-nearest neighbors.  The two methods I will be discussing here are simple linear regression and polynomial regression.
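For reference, scikit-learn exposes regression counterparts of those same algorithms; a quick sketch on synthetic data might look like this:

```python
# The classification algorithms from the last post have direct regression
# counterparts; only the output type changes.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3 * X.ravel() + rng.normal(0, 1, size=100)  # noisy linear relationship

for model in (DecisionTreeRegressor(),
              RandomForestRegressor(),
              KNeighborsRegressor()):
    model.fit(X, y)
    # each predicts a continuous value (around 15 for this data at x = 5)
    print(type(model).__name__, model.predict([[5.0]]))
```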

Simple linear regression is the most straightforward and, as the name suggests, "simple" type of regression.  It involves a single explanatory variable and an output, or dependent variable.  Imagine each point plotted on a graph, with the input on the x-axis and the output on the y-axis.  Based on the correlation between the two variables, a line is then drawn to represent their relationship.  Specifically, the slope of the line equals the correlation multiplied by the ratio of the variables' standard deviations (slope = r · sy/sx).  The y-intercept is then set so that the line passes through the "center of mass" of the data, the point given by the mean x and y values.  The idea is that for any given input on the x-axis, you can simply find the y-value of the line at that point, and that value is the predicted output.
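Here is a short from-scratch sketch of that calculation using NumPy on a made-up set of points, checked against NumPy's own least-squares fit:

```python
# The slope is the correlation scaled by the ratio of standard deviations,
# and the intercept forces the line through the means (the "center of mass").
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

r = np.corrcoef(x, y)[0, 1]              # correlation between x and y
slope = r * (y.std() / x.std())          # slope = r * (sd_y / sd_x)
intercept = y.mean() - slope * x.mean()  # line passes through (mean x, mean y)

print(slope, intercept)
print(np.polyfit(x, y, 1))  # least-squares fit agrees: [slope, intercept]
```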

The other main type of regression I'll cover is polynomial regression.  This is essentially a more complicated extension of simple linear regression.  The main difference is that instead of fitting a straight line to a single explanatory variable, powers of that variable (x, x², x³, and so on) are added as additional explanatory variables, each with its own weight.  Rather than diving into the math that comes into play here, it is much more useful to examine how the interpretation differs from simple linear regression.  The short answer is that it does not differ all that much.  In a simple linear regression the explanatory variable is plotted on the x-axis, the output on the y-axis, and the fitted relationship is a straight line.  In a polynomial regression, the weights are chosen by measuring how much each power of x impacts the output, and the result is a curve rather than a straight line; in fact, polynomial regression is just a linear regression carried out on these derived variables.  The interpretation follows suit and functions in the same fashion as the simple regression: for any given input on the x-axis, you find the point on the fitted curve at that x-value, and its height gives the associated output (y-axis value).
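A minimal sketch of this idea, assuming scikit-learn and synthetic data generated from a known quadratic curve:

```python
# Powers of the single input x (here x and x^2) become the explanatory
# variables, and an ordinary linear regression is fit on those features.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 50).reshape(-1, 1)
y = 0.5 * x.ravel()**2 + x.ravel() + rng.normal(0, 0.3, size=50)

# derived features: [x, x^2]
X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(x)
model = LinearRegression().fit(X_poly, y)

# weights should roughly recover the true curve: coef_ ~ [1.0, 0.5]
print(model.coef_, model.intercept_)
```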

As can be seen here, these two methods of solving regression problems are quite similar.  The situations in which they are employed, however, are different.  Simple linear regression can only capture a straight-line relationship between the explanatory variable and the output; when that relationship is curved, the polynomial terms become necessary.  Hopefully this gives some useful insight into how regression is used, and in which cases it might be preferable to classification.
