Posts

Showing posts from January, 2021

Iris Classification

In 1936, Ronald Fisher published his paper "The use of multiple measurements in taxonomic problems", the basis of which was a small data set containing some attributes of iris flowers. The data set is straightforward and describes 150 flowers with four measurements each: sepal length, sepal width, petal length, and petal width. Additionally, it lists the species of each iris, with three species equally represented in the data. These species are Iris setosa, Iris virginica, and Iris versicolor. Despite its simplicity, this data set would go on to become one of the most commonly used in machine learning, and it remains a standard test case for classification techniques to this day. It is also the data set that I have decided to examine for my most recent project. Now to explain a bit about this project. The primary impetus for it was very basic: I have worked almost exclusively on regression problems recently, so I felt that it would b

A/B Testing and Statistical Significance

As a future data scientist currently looking for opportunities on places like LinkedIn, I read a large number of job postings related to the field. Occasionally something I see in these job posts surprises me, and this is one of those cases. The surprise in question was the sheer number of job postings that specifically require or recommend knowledge of A/B testing. During my highly anecdotal and biased research into the subject while looking for job listings, this topic seemed to pop up more frequently than any other technical skill. While I was initially surprised to see it, after giving it some thought I came to see that this topic is highly relevant and perhaps even under-appreciated by the average data scientist hopeful. I would like to use this space to give an overview of A/B testing, clarify where it came from and why it continues to be relevant, and discuss its more technical aspects (especially with regard to statistical significance). A/B testing

Revisiting Regression

After discussing classification algorithms in last week's blog post, I felt that it would only be fair to also dedicate a post to explaining regression in greater detail. Although my last two projects dealt with this type of problem, I never got around to discussing the particulars of regression. So in this post, I will explain the primary uses for regression as well as describe a few specific methods. The primary difference between classification and regression problems is the output they seek to calculate. As discussed in my previous blog post, classification problems generally seek to reach a discrete output, or in other words, one of a few distinct values. The example used last time dealt with the iris dataset that I'm currently working on. Here, several plant measurements are taken as inputs, which are then used to determine which of the three species is most likely to be the output. This is known as classification because the predetermined output values can be seen

Clarifying Classification

This week I have decided to take an in-depth look at classification problems in data science. Since most of my recent projects have dealt with regression, it seems prudent to thoroughly examine this other equally important aspect of data science. I will give a brief overview of classification, explain a few important algorithms, and walk through my decision-making process when choosing a topic for my next project. Finally, I will give a brief outline of my plan for this new undertaking. Between classification and regression, the former strikes me as a much simpler concept to wrap my head around. An example of regression would be using historical data to predict future prices of something like a house. Various inputs are taken into account and weighted based on the strength of their relationship to the output, which can then be used to estimate future data points. Classification is very similar in that it also weights inputs in this manner, but rather than giving an output along a c