Clarifying Classification

This week I have decided to take an in-depth look at classification problems in data science.  Since most of my recent projects have dealt with regression, it seems prudent to thoroughly examine this other, equally important aspect of the field.  I will give a brief overview of classification, explain a few important algorithms, and walk through my decision-making process when choosing a topic for my next project.  Finally, I will outline my plan for this new undertaking.  

Between classification and regression, the former strikes me as the simpler concept to wrap my head around.  An example of regression would be using historical data to predict the future price of something like a house.  Various inputs are taken into account and weighted based on the strength of their relationship to the output, which can then be used to estimate future data points.  Classification is very similar in that it also weights inputs in this manner, but rather than giving an output along a continuous axis, it returns a discrete value.  In other words, rather than predicting an exact price, a classification problem on a similar housing dataset might tell you whether a house will be worth more or less than five hundred thousand dollars based on square footage, number of rooms, lot size, location, etc.  The key feature of classification problems is that they have a set number of possible outcomes.  The previous example would be a binary classification problem because there are exactly two possible outputs.  
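To make that distinction concrete, here is a minimal sketch (the square footage, room counts, and prices below are made up purely for illustration) of how the same target could be framed either as a regression value or as a binary classification label:

```python
import numpy as np

# Hypothetical training data: [square footage, number of rooms] for a few homes,
# along with their sale prices in dollars (all numbers are made up).
X = np.array([[1400, 3], [2600, 4], [1100, 2], [3200, 5], [1800, 3]])
prices = np.array([350_000, 640_000, 275_000, 810_000, 420_000])

# Regression would try to predict the price itself (a continuous value).
# Classification instead predicts a discrete label, e.g. "over $500k" or not.
y = (prices > 500_000).astype(int)  # 1 = over $500k, 0 = under

print(y)  # [0 1 0 1 0]
```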

This leads to many situations where a data scientist might choose classification rather than regression.  Most importantly, it should be used when the output they seek to predict is a discrete value.  Another situation where classification can be useful is when time does not affect the outcome.  Take the housing price example, and imagine that I only had access to 2021 data for training.  Using classification in the manner I described works fine if all of the data being examined comes from the same year, but what if I tested the classifier on housing data from 1980, when prices were generally much lower?  The result would be wildly inaccurate classifications, since many homes worth over five hundred thousand dollars today would have been worth far less in 1980.  Classification is better suited to a goal like predicting which incoming emails are spam.  Here there is a binary outcome, and it remains largely unaffected by when the email was sent.  

There are many different types of classifiers with various strengths and weaknesses, so I would like to give a quick explanation of a few of the most popular examples, since this will be a topic covered in my upcoming project.  

Perhaps the most intuitive of these classifiers is the decision tree.  This model takes the shape of, as you may have guessed, a tree.  The structure is similar to that of a flow chart: each node represents a test performed on the data, with the first one known as the root node.  Each node has two branches representing the possible outcomes of the test.  Each branch can then lead to another node with another test, until eventually the path arrives at an output value.  Once every attribute has been tested, each path ends at an output value determined by the relationship between the inputs and their tendency to "pass" or "fail" each test.  An oversimplified example would be a root node called "coin flip" with two branches leading to two leaf nodes simply labelled heads and tails.  Expanding on this, imagine flipping two coins.  Instead of ending at the previous leaf nodes, each of those nodes would have two branches leading to one of the four possible outcomes (two heads, heads then tails, tails then heads, two tails), and those would now be the leaf nodes.
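As a rough illustration of the idea, here is a small sketch using scikit-learn's DecisionTreeClassifier on made-up housing-style data (the feature values, labels, and max_depth setting are arbitrary choices for demonstration); the printed output shows the flow-chart structure of the learned tree:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical features: [square footage, number of rooms]
X = [[1400, 3], [2600, 4], [1100, 2], [3200, 5], [1800, 3], [2900, 4]]
y = [0, 1, 0, 1, 0, 1]  # 1 = over $500k, 0 = under

# Fit a shallow tree so the structure stays easy to read.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Print the learned flow chart: each node is a test on one feature,
# each branch is one outcome of that test, and the leaves are the predicted classes.
print(export_text(tree, feature_names=["sqft", "rooms"]))
```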

An expansion on the decision tree is another classification algorithm known as the random forest.  As the name suggests, this classifier takes advantage of many trees rather than just one.  This is done primarily to counteract the main weakness of the decision tree: its tendency to overfit to the training set.  Essentially, when a tree becomes sufficiently deep by testing a large number of attributes, it can become overly specific to the training data, which leads to high variance once the test data is introduced.  The random forest addresses this problem by averaging the results of trees trained on different parts of the training data.  This helps remove overly specific patterns that an individual tree might pick up.  
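Along the same lines, a minimal sketch of a random forest (again on made-up housing-style data, with an arbitrary number of trees) might look like the following:

```python
from sklearn.ensemble import RandomForestClassifier

# Hypothetical features: [square footage, number of rooms]
X = [[1400, 3], [2600, 4], [1100, 2], [3200, 5], [1800, 3], [2900, 4]]
y = [0, 1, 0, 1, 0, 1]  # 1 = over $500k, 0 = under

# n_estimators controls how many trees are trained; each tree sees a different
# bootstrap sample of the training data, and their votes are combined, which
# smooths out the overly specific patterns a single deep tree might latch onto.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

print(forest.predict([[2000, 3]]))  # predicted class (0 or 1) for a new home
```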

Another popular type of classifier is k-nearest neighbors, also known as k-NN.  To picture how this one works, it is helpful to think of a simple graph.  The two axes of the graph represent features (or inputs) that the algorithm uses to predict the output.  All of the training data is plotted on this graph based on the values of these two features, with one outcome in blue and the other in red.  When the test data is introduced, each point can be thought of as being graphed in the same way.  The prediction is then made based on whichever outcomes are closest to the test point on the graph, hence the name nearest neighbors.  The "k" comes into effect here too, as this parameter determines how many neighbors are counted.  For example, let's say that the k value is five.  The test data point is plotted and its nearest neighbor is red, so tally one point for red.  The next closest point is blue, so one point is tallied for blue.  This continues until the five nearest neighbors have each cast a vote, and the test point is then predicted to be whichever color received the most votes.  This overview is a bit oversimplified, however, as there can be more than two features, but the general idea remains the same.  
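A quick sketch of k-NN with scikit-learn, using two made-up features and k = 5 (both the data and the choice of k are assumptions for illustration), could look like this:

```python
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical two-feature training data, with 0 standing in for "blue"
# and 1 for "red".
X = [[1.0, 1.2], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0], [1.1, 0.9], [5.5, 8.5]]
y = [0, 0, 1, 1, 0, 1]

# n_neighbors is the "k": the prediction is a majority vote among the
# k training points closest to the new point on the feature plane.
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)

print(knn.predict([[1.2, 1.5]]))  # nearest neighbors are mostly class 0
```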

The final type of algorithm that I will briefly touch on is the Naïve Bayes classifier.  Rather than being just one algorithm, this term generally refers to a family of similar classification models all operating under the same assumption: namely, that the value of a particular feature is independent of the value of any other feature, given the class.  So in the simplest usage of Naïve Bayes, a probability table is created for each feature.  Then, when test data is entered, the probability of each output is calculated from its features and the most likely one is chosen.  In theory this classifier seems oversimplified, since the assumption that all features are independent is often incorrect, yet it has still proven effective in many real-world applications.  
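For completeness, here is a small sketch using scikit-learn's GaussianNB, one common member of the Naïve Bayes family that models each feature with a per-class Gaussian distribution; the data below is made up purely for illustration:

```python
from sklearn.naive_bayes import GaussianNB

# Hypothetical two-feature data; GaussianNB treats the features as
# conditionally independent given the class and fits a separate Gaussian
# to each feature within each class.
X = [[1.0, 1.2], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0], [1.1, 0.9], [5.5, 8.5]]
y = [0, 0, 1, 1, 0, 1]

nb = GaussianNB().fit(X, y)

print(nb.predict([[1.2, 1.5]]))        # most likely class
print(nb.predict_proba([[1.2, 1.5]]))  # estimated probability of each class
```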

Now that I've covered the basics and explained a few examples of classification, I will discuss how it relates to my current project.  The dataset that I have decided to use is the R.A. Fisher Iris Data.  This dataset contains measurements (sepal and petal length and width) for 150 iris plants.  There are a couple of specific reasons that I chose this data.  First and foremost, the plants are divided into three classes, one per species, which makes this a perfect problem for classification.  Additionally, the dataset is quite small.  Usually I would see this as a negative, but for my specific purposes a smaller dataset might actually be more useful.   

The reason I prefer a small dataset in this case is that my plan for this project is to iterate through many different classifiers and test their performance.  Using a massive amount of data would make this far more time-consuming.  There are two questions I will set out to answer with the project.  First, what is the most accurate classifier that I can produce?  Second, how do the various classifiers stack up against each other?  While the first question is generally more practical, I'm much more interested in the second, where I plan on doing a much deeper analysis.  As someone who works with data and machine learning, I feel that it will provide useful insight into the differences between these classifiers.  
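As a rough starting point for that comparison, a sketch of the kind of loop I have in mind might look like the following; the specific models, their default settings, and the use of 5-fold cross-validation are placeholder choices at this stage, not the final methodology:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

# The Iris dataset ships with scikit-learn: 150 samples, 4 features, 3 species.
X, y = load_iris(return_X_y=True)

# Candidate classifiers to compare, using default settings for now.
models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(random_state=0),
    "k-NN (k=5)": KNeighborsClassifier(n_neighbors=5),
    "naive Bayes": GaussianNB(),
}

# 5-fold cross-validation gives a rough accuracy estimate for each model
# despite the small size of the dataset.
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name:15s} mean accuracy = {scores.mean():.3f}")
```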
