Iris Classification

In 1936, Ronald Fisher published his paper "The Use of Multiple Measurements in Taxonomic Problems," the basis of which was a small data set containing some attributes of iris flowers.  The data set is very straightforward, and describes 150 flowers with four measurements each:  sepal length, sepal width, petal length, and petal width.  Additionally, it lists the species of each iris; there are three species, Iris setosa, Iris virginica, and Iris versicolor, each represented by exactly 50 flowers.  Despite the simple nature of this data set, it would go on to become one of the most commonly used within machine learning, and it remains a standard test case for classification techniques to this day.  It is also the data set that I have decided to examine for my most recent project.
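For anyone who wants to poke at the data directly, here is a minimal sketch of loading it.  I'm assuming scikit-learn here since it bundles the data set; this isn't necessarily how the project code loads it.

from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

print(X.shape)             # (150, 4): 150 flowers, 4 measurements each
print(iris.feature_names)  # sepal/petal length and width, in cm
print(iris.target_names)   # ['setosa' 'versicolor' 'virginica']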

Now to explain a bit about this project.  The primary impetus for it was very basic:  I have worked almost exclusively on regression problems recently, so I felt that it would be useful if I returned to classification for one project.  The goal is for it to serve as both a refresher and a deeper dive into this type of problem, with a clear overarching focus on learning.  The structure and methodology that I decided on for the project reflect that same objective.  The easiest way to illustrate this is with the data set that I've chosen.  The iris data set is both small and relatively tidy.  These two factors bring two benefits that closely align with my stated goal.  Because the data set is small, computation time will be very short.  This will allow me to focus more on my classification models along with their hyperparameters, and less on making small optimizations and worrying that tests or grid searches will be too time-intensive.  The tidiness of the data set serves a very similar purpose.  Since the data is all accounted for and relatively straightforward, there is very little data cleaning required.  This leaves more time to focus on the more pertinent aspects of the project.

As I mentioned earlier, the methodology I've chosen also contributes to the larger purpose here, which is gaining a deeper understanding of classification techniques.  Since I have yet to give an overview of my strategy, I will explain that now, along with the reasoning behind it.  The basic idea is that I will be attempting to classify the flower samples according to their species as accurately as possible.  Rather than using a single model and focusing on tuning it to perfection, the small data set gives me the freedom to test several different models and tune each one to a high degree.  This lets me examine many types of algorithms to find out not only which ones outperform the others, but also how the individual parameters within each model affect performance.  Beyond the primary goal of classifying the flowers as accurately as possible, I would also like to place some emphasis on providing a detailed comparison of the various models.
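To keep that comparison fair, every model will be trained and evaluated on the same split of the data.  The sketch below shows roughly what that setup looks like; the 80/20 split and the fixed random seed are illustrative assumptions rather than the exact values used in the project.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out a test set that every model will be evaluated on.
# stratify=y keeps the three species equally represented in both halves.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)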

Now I will briefly delve into the methods I will use to evaluate and compare the various models.  There are two main techniques that I will employ.  The first is simply calculating the training and testing accuracy of each model.  Training accuracy describes how accurate the model is on the data it was trained on, as the name suggests, while test accuracy measures the same thing on data the model hasn't seen yet.  This test accuracy figure will be the most important factor in determining which model performs the best.  Training accuracy is less important, but it can be useful in determining whether a model might be overfitting to the data.  The second technique, a confusion matrix, is more useful for visualization than evaluation, but I think it can help illustrate the differences between the models' performances.  A confusion matrix is a grid showing the classifications predicted by the model matched against the correct classifications.  Oftentimes these are color coded according to frequency, either to show whether a model is making a certain type of misclassification more often than expected or simply to highlight accurate performance.
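In code, both checks are only a few lines.  Here is a rough sketch of the kind of helper I have in mind, assuming scikit-learn style models and the train/test split from the earlier sketch; the function name is just for illustration.

from sklearn.metrics import accuracy_score, confusion_matrix

def evaluate(model, X_train, X_test, y_train, y_test):
    """Fit a model, then report training accuracy, test accuracy,
    and the confusion matrix on the test set."""
    model.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, model.predict(X_train))
    test_acc = accuracy_score(y_test, model.predict(X_test))
    cm = confusion_matrix(y_test, model.predict(X_test))
    return train_acc, test_acc, cm

For the color-coded version, scikit-learn's ConfusionMatrixDisplay can plot the resulting matrix directly.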

There are many different types of classifiers with various strengths and weaknesses, so I would like to give a quick explanation of a few of the examples that I will be using in my project.

Perhaps the most intuitive of these classifiers is the decision tree.  This model takes the shape of, as you may have guessed, a tree.  The structure is similar to that of a flow chart, with each node representing a test that is performed on the data; the first node is known as the root node.  Each node has two branches which represent the outcomes of the test.  These can lead to further nodes with further tests, until a path eventually arrives at an output value.  Once every relevant attribute has been tested, each path ends in an output value based on how the inputs "pass" or "fail" each test along the way.  An oversimplified example would be a root node called "coin flip" with two branches leading to two leaf nodes simply labelled heads and tails.  Expanding on this, imagine that two coins were flipped.  Instead of ending at the previous leaf nodes, heads and tails would each branch again into one of the four possible outcomes (two heads, heads then tails, tails then heads, two tails), and these four outcomes would now be the leaf nodes.
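Here is a rough sketch of fitting a decision tree on the split from the earlier sketch; the max_depth value is an illustrative choice, not necessarily the one used in the project.

from sklearn.tree import DecisionTreeClassifier

# max_depth limits how many tests can be applied along any single path,
# which is the main lever for controlling overfitting.
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X_train, y_train)   # X_train/y_train from the split sketch above

print("train accuracy:", tree.score(X_train, y_train))
print("test accuracy: ", tree.score(X_test, y_test))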

An expansion on the decision tree is another classification algorithm known as the random forest.  As the name suggests, this classifier takes advantage of many trees rather than just one.  This is done primarily to counteract the main weakness of the decision tree, which is its tendency to overfit to the training set.  Essentially, when a tree becomes sufficiently deep by testing a large number of attributes, it can become overly specific to the training data.  This leads to a high level of variance once the test data is introduced.  The random forest attempts to solve this problem by averaging the results of many trees, each built from a different random sample of the training data.  This helps remove the overly specific patterns that an individual tree may pick up.
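A sketch of the random forest version, again using the same split and with illustrative parameter values:

from sklearn.ensemble import RandomForestClassifier

# n_estimators is the number of trees whose votes are combined; each tree is
# built from a different bootstrap sample of the training data.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)   # same split as the earlier sketch

print("train accuracy:", forest.score(X_train, y_train))
print("test accuracy: ", forest.score(X_test, y_test))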

Another popular type of classifier is k-nearest neighbors, also known as k-NN.  In order to picture how this one works, it is helpful to think of a simple graph.  The two axes of the graph represent features (or inputs) that the algorithm is using to predict the output.  All of the training data is plotted on this graph based on the values of these two features, with one outcome in blue and the other in red.  When the test data is introduced, each point can be thought of as being graphed in the same way.  The prediction is then based on whichever training points lie closest to the test point on the graph, hence the name nearest neighbors.  The "k" part comes into effect here, as this parameter determines how many neighbors are counted.  For example, let's say that the k value is five.  The test data point is plotted and its nearest neighbor is red, so tally one point for red.  The next closest point is blue, so one point is tallied for blue.  This continues until the five nearest neighbors have all been counted, and the test point is then predicted to be whichever color received the most votes.  This overview is a bit oversimplified, of course, since there can be more than two features and more than two classes, but the general idea remains the same.
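And a sketch of the k-NN version on the same split; k=5 here is purely illustrative.

from sklearn.neighbors import KNeighborsClassifier

# n_neighbors is the "k": each test point is assigned the majority class
# among its k closest training points (Euclidean distance by default).
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)   # same split as the earlier sketch

print("train accuracy:", knn.score(X_train, y_train))
print("test accuracy: ", knn.score(X_test, y_test))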

After running through each of the different classifiers and tests, I was ready to draw my conclusions.  My less important conclusion was that in my tests, the decision tree was the most accurate, followed by the k-NN classifier and finally the random forest.  The results would have been very promising if my only goal had been to make accurate classifications, as the test accuracies of these fairly simple models were 98.3%, 96.6%, and 95%.  The problem, and what leads me to my more important conclusion, is that there was no real challenge in this particular data set.  Very little tuning could be done in a useful manner, and the accuracy was nearly 100% regardless, with no glaringly obvious difference in performance between classifiers.  There were certainly differences, but they were very minor.  The conclusion this has led me to is that my next project should focus on a more complex data set, and that additional classifiers will need to be tested.

Project link: https://github.com/dvb2017/iris-classification

