Supervised vs. Unsupervised Learning

 

This week I have decided to take an in-depth look at the two main types of machine learning: supervised and unsupervised.  While this topic was covered in my recent data science course, I found myself particularly curious about it and wanted to dig a bit deeper.  There are a few angles from which I would like to approach the subject, so I will lay out my plan here.  First and most importantly, I will provide a working definition for each of the two terms.  I will then examine the key differences between the two types of machine learning.  Finally, I will look at several examples and use cases for each.  To provide a bit of context for the conversation, machine learning, both supervised and unsupervised, is essentially the use of an algorithm to detect patterns within a data set.  While it might sound like there isn't much leeway within that category for significant differences between the two types, the distinction is actually quite important and affects how these algorithms are implemented.

I will start off by looking at supervised learning.  To put it simply, this is a process which maps input values to output values based on examples consisting of other input-output pairs.  It generally follows a fairly simple pathway.  First, the algorithm is fed what is known as training data: randomly selected input-output pairs that represent a chosen percentage of the data set.  After being trained on these values, the algorithm produces something known as an inferred function.  This function has essentially "learned" the pattern or relationship between the input and output values, and can use what it now knows to operate on the remaining data.  The input-output pairs that were not included in the training data are known as the test data, and as you may have guessed, they are used to test the accuracy of the trained model.  As a side note, the very intuitive name for the technique of separating the data into these groups is a train-test split.  At this point, the test inputs are fed to the inferred function, which predicts the output values based on what it learned from the training data.  These predictions can then be compared to the actual output values from the test set to determine the accuracy of the model.
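To make that pathway concrete, here is a minimal sketch in plain Python.  The toy data, the 75/25 split, and the one-nearest-neighbour "inferred function" are all hypothetical stand-ins chosen for illustration; a real project would typically lean on a library like scikit-learn instead.

```python
import random

def train_test_split(pairs, test_fraction=0.25, seed=0):
    """Shuffle the input-output pairs and split them into train and test sets."""
    rng = random.Random(seed)
    shuffled = pairs[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def nearest_neighbour_predict(train, x):
    """A stand-in 'inferred function': predict the label of the closest training input."""
    return min(train, key=lambda pair: abs(pair[0] - x))[1]

# Hypothetical data: inputs below 5 are labelled "low", the rest "high".
data = [(i, "low" if i < 5 else "high") for i in range(10)]
train, test = train_test_split(data)

# Accuracy: the fraction of test pairs whose predicted label matches the true one.
correct = sum(nearest_neighbour_predict(train, x) == y for x, y in test)
accuracy = correct / len(test)
```

On this toy data the model predicts a held-out input's label from its closest training example, and the accuracy is simply the fraction of test pairs it gets right.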

One important side topic to touch on when discussing supervised machine learning is the bias-variance tradeoff.  First, I'll give a quick definition of each of these namesake terms.  Bias is simply an error caused by incorrect assumptions that the algorithm makes during the training process.  A high level of bias means that the algorithm has missed some key relations in the data, which leads to underfitting.  Variance, on the other hand, is an error caused by the algorithm picking up on very small movements in the data; a high level of variance will cause it to model random noise from the training set.  This can, in turn, lead to overfitting, where the training set is modeled too closely.  Because the model is so specific to the training data, it does a poor job of fitting the test data.  As you may have already observed, bias and variance are essentially opposites, with high-bias models failing to pick up on even large trends and high-variance models latching on to tiny, unimportant movements.  The bias-variance tradeoff, then, is the property of a supervised learning model wherein decreasing one of these errors tends to increase the other and vice versa.  This presents an important challenge to data scientists, who aim to minimize both error terms.
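The tradeoff is easiest to see with two deliberately extreme models.  In this sketch (all data and model choices are made up for illustration), the true relationship is y = 2x plus noise: a high-bias model ignores the input entirely, a high-variance model memorises the noisy training outputs, and an ordinary least-squares line sits between the two.

```python
import random

rng = random.Random(42)

def make_data(n):
    """Sample (x, y) points from the true relationship y = 2x plus noise."""
    return [(x, 2 * x + rng.gauss(0, 1.0))
            for x in (rng.uniform(0, 10) for _ in range(n))]

train, test = make_data(20), make_data(20)

# High bias: ignore x entirely and always predict the mean training output.
mean_y = sum(y for _, y in train) / len(train)
def high_bias(x):
    return mean_y

# High variance: memorise the training set and echo the noisy output of
# the nearest training input, noise and all.
def high_variance(x):
    return min(train, key=lambda p: abs(p[0] - x))[1]

# In between: an ordinary least-squares line fit to the training data.
n = len(train)
sx, sy = sum(x for x, _ in train), sum(y for _, y in train)
sxx, sxy = sum(x * x for x, _ in train), sum(x * y for x, y in train)
slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
intercept = (sy - slope * sx) / n
def line(x):
    return slope * x + intercept

def mse(model, data):
    """Mean squared error of a model's predictions over a data set."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

bias_error = mse(high_bias, test)          # underfits: misses the overall trend
variance_error = mse(high_variance, test)  # overfits: chases the training noise
line_error = mse(line, test)               # balances the two error sources
```

The underfit model's test error is dominated by bias (it never captures the trend), the memorising model's error is dominated by variance (it reproduces noise), and the straight-line fit, which matches the complexity of the true relationship, does better than the high-bias extreme.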

Next up is unsupervised machine learning.  This is the less intuitive type of machine learning, in my opinion.  With these algorithms, patterns are learned from untagged data.  Since testing using a train-test split or similar methods is not an option here, the machine is instead tasked with creating an internal representation of the world.  Note that I'm intentionally being a bit vague here; much more information will be provided when I discuss the key differences between the two types of learning.

Speaking of those differences, now seems like a perfect opportunity to segue into them.  Perhaps the biggest difference between supervised and unsupervised learning has already been mentioned, although it was done using a term that might be unfamiliar.  Untagged data is another way of saying that the data fed into these algorithms is not labelled.  While supervised learning is used to classify human-labelled data, unsupervised learning seeks to determine the inherent structure of data without relying on labels.  Because no labels are used, another difference between the two types of learning arises.  In most cases, it is quite easy to compare the performance of supervised learning models by examining their accuracy.  With unsupervised learning, however, the lack of labels prevents us from making these comparisons effectively.  The final difference will be explored in the following section, where I examine a few use cases for each type of learning.  As you will see, the types of tasks that each is best suited for vary wildly.

The use cases for supervised learning are much easier to understand, in my opinion.  I've actually discussed the two most common categories on my blog before.  The first is classification, where the model is trained to predict discrete output values.  An easy example is the iris data set, where various measurements are used to classify flowers into one of three species.  The other common use case is regression.  This operates in a similar way to classification, but instead returns a continuous output value: for example, using factors like lot size and number of rooms to predict the price of a home.
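As a rough sketch of the regression case, here is a tiny made-up housing example in plain Python.  The prices are generated from an assumed rule (50 price units per unit of lot size plus 10 per room, plus noise), and the model recovers those weights by gradient descent; in practice a library like scikit-learn would handle the fitting.

```python
import random

rng = random.Random(7)

# Hypothetical housing data: price = 50*lot_size + 10*rooms (made-up units) plus noise.
homes = [
    (lot, rooms, 50 * lot + 10 * rooms + rng.gauss(0, 5.0))
    for lot, rooms in ((rng.uniform(1, 5), rng.randint(1, 6)) for _ in range(40))
]

# Fit weights for price ≈ w_lot * lot + w_rooms * rooms by gradient descent
# on the mean squared error.
w_lot, w_rooms = 0.0, 0.0
lr = 0.005
for _ in range(2000):
    g_lot = g_rooms = 0.0
    for lot, rooms, price in homes:
        err = (w_lot * lot + w_rooms * rooms) - price
        g_lot += err * lot
        g_rooms += err * rooms
    # Step each weight against its averaged error gradient.
    w_lot -= lr * g_lot / len(homes)
    w_rooms -= lr * g_rooms / len(homes)
```

A classifier would look much the same from the outside, except the model's output would be a discrete label (such as a species name) rather than a continuous number like a price.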

For unsupervised learning, the use cases are more complicated, so my descriptions will be a bit more abstract.  The first of the two most common uses is cluster analysis.  Here, data is fed into the algorithm and objects are grouped so that those within a cluster are more similar to each other than to objects in other clusters.  This is a vital process in data mining, the search for patterns in massive, complicated data sets.  The second use case is dimensionality reduction.  This is, as the name suggests, the transformation of data from a high-dimensional space to a low-dimensional space while still retaining the important traits of the original data.  The easiest way to understand it is as a method of simplifying a very complex data set.
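The clustering idea can be sketched with a tiny k-means implementation in plain Python.  The one-dimensional data here is made up: two well-separated groups centred around 0 and 10, which the algorithm must discover without ever seeing a label.

```python
import random

rng = random.Random(1)

# Unlabelled 1-D data drawn from two well-separated groups (around 0 and 10).
points = [rng.gauss(0, 1) for _ in range(50)] + [rng.gauss(10, 1) for _ in range(50)]

# k-means with k=2: start the centres at the extremes, then alternately
# assign each point to its nearest centre and move each centre to the
# mean of the points assigned to it.
centres = [min(points), max(points)]
for _ in range(10):
    clusters = ([], [])
    for p in points:
        nearest = min(range(2), key=lambda i: abs(p - centres[i]))
        clusters[nearest].append(p)
    centres = [sum(c) / len(c) for c in clusters]
```

After a few passes the centres settle near the two true group means, so the algorithm recovers the grouping purely from the structure of the data.  Dimensionality reduction techniques such as principal component analysis are harder to sketch this briefly, but the spirit is the same: the structure comes from the data itself, not from labels.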

As you have seen, even though supervised and unsupervised machine learning are both methods of identifying patterns in data, they have some very significant differences in both how they function and the purposes for which they are used.  I hope this blog has given you a better understanding of what each of these is and how it works.  As a result of my reading on the subject, I have become quite interested in working on a project using unsupervised learning, so be on the lookout for more info on that in an upcoming blog post.


