A/B Testing and Statistical Significance

As a future data scientist currently looking for opportunities on places like LinkedIn, I read a large number of job postings related to the field.  Occasionally something I see in these postings surprises me, and this is one of those cases.  The surprise in question was the sheer number of postings that specifically require or recommend knowledge of A/B testing.  In my highly anecdotal and admittedly biased research while browsing listings, this topic seemed to pop up more frequently than any other technical skill.  While I was initially surprised to see it, after giving it some thought I came to see that the topic is highly relevant and perhaps even under-appreciated by the average data scientist hopeful.  I would like to use this space to give an overview of A/B testing, clarify where it came from and why it continues to be relevant, and discuss its more technical aspects (especially with regard to statistical significance).

A/B testing is a way to compare two versions of something to figure out which performs better.  While it is most often associated with websites and apps, the method is almost 100 years old, and it is one of the simplest forms of a randomized controlled experiment.  Below is a simple example of A/B testing.
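As a stand-in for that example, here is a minimal sketch in Python.  The visitor and conversion counts are invented purely for illustration; the point is just how the raw results of a two-version test get summarized.

```python
# Hypothetical results from showing two versions of a signup page.
# All counts below are made up for illustration.
visitors_a, conversions_a = 1000, 112   # version A: current page
visitors_b, conversions_b = 1000, 139   # version B: new headline

rate_a = conversions_a / visitors_a
rate_b = conversions_b / visitors_b

print(f"Version A conversion rate: {rate_a:.1%}")
print(f"Version B conversion rate: {rate_b:.1%}")
print(f"Observed lift of B over A: {(rate_b - rate_a) / rate_a:.1%}")
```

On the surface, version B looks better; whether that gap should be trusted is exactly the statistical significance question discussed later in this post.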

A/B tests are useful for understanding user engagement and satisfaction with online changes, such as a new feature or product.  Every large social media site uses A/B testing to make the user experience more successful and to streamline its services.  These tests are also being used to run increasingly complex experiments, such as how online services affect user actions and network effects when users are offline.  Many roles rely on the data from A/B tests, including data scientists, marketers, designers, and software engineers, because these tests help companies understand growth, increase revenue, and optimize customer satisfaction.  In short, companies use A/B testing to understand how strongly the results of an experiment, survey, or poll should influence the decisions they make.  The key feature of these problems is that they have a set number of possible outcomes; the example above has only two possible outputs, which makes it look a lot like a binary classification problem.

That framing leads to many instances where a data scientist might choose a classification approach.  Most importantly, classification should be used when the output to be predicted is a discrete value.  Another situation where classification is useful is when time does not affect the outcome.  Take a housing price example: suppose a classifier predicts whether a home will sell for more than five hundred thousand dollars, and imagine that I only had access to 2021 data for training.  Using classification in that manner works fine if all of the data being examined comes from the same year, but what if I tested the classifier on housing data from 1980, when prices were generally much lower?  The result would be very inaccurate classification, since many homes worth over five hundred thousand dollars today would likely have been worth much less in 1980.  Classification is better suited to a goal like predicting which incoming emails are spam: there is a binary outcome, and it remains largely unaffected by the time of the email.
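To make the housing example concrete, here is a rough sketch using scikit-learn.  The features and prices are entirely fabricated; the point is how a continuous price gets turned into the kind of discrete label a classifier predicts.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Fabricated 2021-style training data: [square_feet, bedrooms] -> sale price
X_train = np.array([[1200, 2], [2600, 4], [1800, 3], [3400, 5], [950, 1], [2200, 3]])
prices  = np.array([310_000, 720_000, 450_000, 980_000, 240_000, 610_000])

# Turn the continuous price into a discrete label: 1 if over $500k, else 0
y_train = (prices > 500_000).astype(int)

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)

# A 1980 home with the same square footage would get the same prediction,
# even though its actual price would have been far lower -- the time
# dimension is invisible to this classifier.
print(clf.predict([[2600, 4]]))  # prints [1], i.e. "over $500k"
```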

The reason I prefer a smaller dataset for this kind of project is that my plan is to iterate through many different classifiers in order to test their performance, and using a massive amount of data would make that far more time consuming.  There are two goals I will set out to achieve with the project.  First, what is the most accurate classifier that I can produce?  Second, how do the various classifiers stack up against each other?  While the first question is generally more practical, I am much more interested in the second, where I plan on doing a much deeper analysis.  As someone who works with data and machine learning, I feel that it will provide useful insight into the differences between these classifiers.
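A rough sketch of the comparison loop I have in mind might look like the following; it uses scikit-learn and its built-in iris dataset purely as a stand-in for the real data.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

# Placeholder data; the real project would load its own dataset here.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Candidate classifiers, all evaluated on the same train/test split.
candidates = {
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "Random forest": RandomForestClassifier(random_state=0),
    "k-NN (k=5)": KNeighborsClassifier(n_neighbors=5),
    "Naive Bayes": GaussianNB(),
}

for name, clf in candidates.items():
    clf.fit(X_train, y_train)
    print(f"{name}: test accuracy = {clf.score(X_test, y_test):.3f}")
```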

There are many different types of classifiers with various strengths and weaknesses, so I would like to give a quick explanation of a few of the most popular examples, since they will be covered in my upcoming project.

One of the most popular classifiers is the decision tree, which makes a prediction by testing a series of attributes one at a time.  An expansion on the decision tree is another algorithm known as the random forest.  As the name suggests, this classifier takes advantage of many trees rather than just one.  This is done primarily to counteract the main weakness of the decision tree, which is its tendency to overfit to the training set.  Essentially, when a tree becomes sufficiently deep by testing a large number of attributes, it can become overly specific to the training data, which leads to high variance once the test data is introduced.  The random forest attempts to solve this problem by averaging the results of trees built from different parts of the training data, which helps remove overly specific patterns that an individual tree may pick up.
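A quick sketch of that overfitting contrast, using scikit-learn on noisy synthetic data (the dataset and parameters are arbitrary choices for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Synthetic, deliberately noisy binary-classification data.
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# A single unpruned tree tends to score near-perfectly on its own training
# data but drop off on the test set; the forest's averaged trees usually
# hold up better on unseen data.
for name, model in [("Single tree", tree), ("Random forest", forest)]:
    print(f"{name}: train={model.score(X_train, y_train):.2f}, "
          f"test={model.score(X_test, y_test):.2f}")
```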

Another popular type is k-nearest neighbors, also known as k-NN.  To picture how this one works, it is helpful to think of a simple graph.  The two axes of the graph represent features (or inputs) that the algorithm uses to predict the output.  All of the training data is plotted on this graph based on the values of these two features, with one outcome in blue and the other in red.  When a test point is introduced, it is plotted in the same way, and the prediction is made based on the outcomes of the training points closest to it, hence the name nearest neighbors.  The "k" parameter determines how many neighbors get counted.  For example, say the k value is five.  The test point is plotted and its five nearest neighbors are found; perhaps the closest is red, so red gets one vote, the next closest is blue, so blue gets one vote, and so on through all five neighbors.  Whichever color holds the majority of those five votes becomes the prediction for the test point.  This overview is a bit oversimplified, since there can be more than two features, but the general idea remains the same.
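Here is a minimal from-scratch sketch of that majority vote with two features; the points and labels are made up for illustration.

```python
import math
from collections import Counter

# Tiny fabricated training set: (feature_1, feature_2) -> color
training_data = [
    ((1.0, 2.0), "red"), ((1.5, 1.8), "red"), ((2.0, 2.5), "red"),
    ((6.0, 6.5), "blue"), ((7.0, 6.0), "blue"), ((6.5, 7.2), "blue"),
]

def predict(point, k=5):
    # Sort training points by their Euclidean distance to the new point...
    by_distance = sorted(training_data,
                         key=lambda item: math.dist(point, item[0]))
    # ...then take a majority vote among the k nearest neighbors.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

print(predict((2.2, 2.1)))  # prints "red" (three of its five neighbors are red)
```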

Finally, there is Naïve Bayes.  Rather than being just one algorithm, this term generally refers to a family of similar classification models all operating under the same assumption: that the value of a particular feature is independent of the value of any other feature.  In the simplest usage of Naïve Bayes, a probability table is created for each feature.  Then, when test data is entered, the probability of each output is calculated from those tables and the most likely output is chosen.  In theory this classifier seems oversimplified, since the assumption that all features are independent is often incorrect, yet it has still proven effective in many real-world applications.
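The probability-table idea can be sketched in a few lines of plain Python.  The tiny "spam" training set below is invented purely to show the mechanics, and the sketch skips refinements such as smoothing.

```python
from collections import defaultdict

# Fabricated training set: does a message contain the word "free"? a link?
emails = [
    ({"free": 1, "link": 1}, "spam"),
    ({"free": 1, "link": 0}, "spam"),
    ({"free": 0, "link": 1}, "spam"),
    ({"free": 0, "link": 0}, "ham"),
    ({"free": 0, "link": 1}, "ham"),
    ({"free": 0, "link": 0}, "ham"),
]

def train(data):
    """Count classes and feature values to build the probability tables."""
    class_counts = defaultdict(int)
    feature_counts = defaultdict(lambda: defaultdict(int))
    for features, label in data:
        class_counts[label] += 1
        for name, value in features.items():
            feature_counts[label][(name, value)] += 1
    return class_counts, feature_counts

def predict(features, class_counts, feature_counts):
    total = sum(class_counts.values())
    best_label, best_score = None, 0.0
    for label, count in class_counts.items():
        score = count / total  # prior P(class)
        for name, value in features.items():
            # P(feature = value | class), multiplied together under the
            # "naive" independence assumption
            score *= feature_counts[label][(name, value)] / count
        if score > best_score:
            best_label, best_score = label, score
    return best_label

counts, tables = train(emails)
print(predict({"free": 1, "link": 1}, counts, tables))  # prints "spam"
```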

Statistical significance is a concept that is important throughout statistics and data science, and it is central to A/B testing.  To give a simplified overview, statistical significance is a measure of whether you can be confident that a result from a sample reflects the larger population, or whether the sample may simply have been a "lucky" or "unlucky" draw.  The main factors driving this determination are the sample size and the size of the observed difference relative to the natural variation in the data.  It is important to be sure that a result is statistically significant before implementing policies and procedures, in order to prevent an unlucky sample from aiming those measures in the wrong direction.
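As a rough sketch of how this check might look in practice, here is the invented A/B example from earlier run through one standard significance test (Fisher's exact test from SciPy); the 0.05 cutoff is just the usual convention, not a law of nature.

```python
from scipy.stats import fisher_exact

# The same invented counts as in the earlier sketch.
visitors_a, conversions_a = 1000, 112
visitors_b, conversions_b = 1000, 139

# 2x2 contingency table: [converted, did not convert] for each version.
table = [
    [conversions_a, visitors_a - conversions_a],
    [conversions_b, visitors_b - conversions_b],
]

_, p_value = fisher_exact(table)
print(f"p-value: {p_value:.3f}")

if p_value < 0.05:
    print("The gap between A and B is unlikely to be a lucky/unlucky draw.")
else:
    print("A gap this size could plausibly have arisen by chance.")
```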

