Movie (Data) Magic: Reviewing the Critics

    It certainly takes a specific set of skills to be a movie critic.  Between their knowledge of film history and theory, their ability to articulate their thoughts eloquently, and their keen eye for detail, there is a reason many people look to critics when deciding how to spend a Friday night at the movies.  Despite the occasional upset fan, there is clearly a demand for movie criticism, and the public has put its collective trust in this group to judge the quality of films.  What I would like to examine in this post, however, is not the critics' ability to grade a film on its merits.  Instead, I am interested in examining some data to see how well professional reviews can predict a movie's box office success.  I will also compare critic reviews to audience reviews to see if one group is a better indicator of a high-grossing movie.  Since this post is part of my first project in a data science course, I will also detail some of the steps involved in this type of analysis.
    To begin, I will briefly outline the data I'll be using.  The two datasets pertaining to critics are both pulled from Rotten Tomatoes.  One contains a list of reviews and the movies they cover, with multiple reviews for each movie.  Each review is marked as either "fresh" or "rotten", in keeping with the standard Rotten Tomatoes grading scale.  The other dataset contains a list of movies with an 'id' corresponding to the previous list, as well as an MPAA rating, release date, and box office gross figure for each film.  These two datasets are all that I will be using for the critics.
    For the audience data, there are three datasets.  The first two are from IMDB.  One of these contains the movie 'id', title, and release year.  The other contains the 'id' and user rating, but no title.  The final dataset is from TheMovieDB.  Each of its entries lists a movie's title, production budget, and worldwide gross.
    The first step towards analyzing these various datasets is to change them into a form that contains all of the necessary information, and also has that information in the correct format.  This is known in the data science world as "data cleaning", and is necessary for nearly every data set you will come across.  
    In order to clean the critic data, we actually need to squeeze a bit more information out of the reviews dataset first.  Namely, I would like each unique movie to have a score denoting what percentage of critics gave it a 'fresh' rating.  To accomplish this I create a new column containing a '1' for each fresh review and a '0' for each rotten one.  Numbers are generally much easier to work with in this context.  Having done that, I can create a new data-frame containing one entry for each unique movie.  I then create a column listing how many reviews appear for each film, add up the number of fresh reviews, and divide by the total number of reviews.  This gives me a percentage score for each film, similar to what Rotten Tomatoes presents on their site.  Then I join this dataset to the other one from Rotten Tomatoes using the 'id' column, dropping any movies that don't match between the lists (these would have incomplete information and not be useful in later calculations).
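    The steps above can be sketched in pandas roughly as follows.  This is a minimal illustration rather than my exact project code: the two small tables and the column names 'fresh', 'is_fresh', 'n_reviews', 'n_fresh', 'fresh_pct', and 'box_office' are assumptions made up for the example.

```python
import pandas as pd

# Hypothetical stand-ins for the two Rotten Tomatoes tables
reviews = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 3],
    "fresh": ["fresh", "rotten", "fresh", "rotten", "rotten", "fresh"],
})
movie_info = pd.DataFrame({
    "id": [1, 2, 4],
    "rating": ["PG", "R", "PG-13"],
    "box_office": [1_000_000, 250_000, 5_000_000],
})

# 1 for each fresh review, 0 for each rotten one
reviews["is_fresh"] = (reviews["fresh"] == "fresh").astype(int)

# One row per movie: review count and number of fresh reviews
scores = (
    reviews.groupby("id")["is_fresh"]
    .agg(n_reviews="count", n_fresh="sum")
    .reset_index()
)
scores["fresh_pct"] = 100 * scores["n_fresh"] / scores["n_reviews"]

# Inner join keeps only movies present in both tables
critics = scores.merge(movie_info, on="id", how="inner")
print(critics)
```

    In this toy example, movie 3 has reviews but no box office record and movie 4 has a record but no reviews, so the inner join drops both.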
    I am now able to plot the data as well as make calculations.  Namely, I can use the percentage score and the box office columns to find their correlation coefficient.  After some quick calculations, I find that the coefficient is ~0.06.  A quick refresher on correlation coefficients: the closer they are to 1 or -1, the stronger the linear relationship between the two sets of data.  Coefficients close to zero, like this one, indicate a weak or nonexistent relationship.  A quick significance test reveals that this correlation is not statistically significant at the 95% confidence level.
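    For readers who want to try this themselves, one way to get both the coefficient and a significance test is scipy's pearsonr function.  The sketch below assumes a Pearson correlation and uses made-up numbers and column names, so it will not reproduce the ~0.06 figure from my data.

```python
import pandas as pd
from scipy import stats

# Hypothetical toy values; the real table has one row per movie
critics = pd.DataFrame({
    "fresh_pct": [90.0, 35.0, 60.0, 75.0, 20.0, 50.0],
    "box_office": [1.2e6, 3.0e6, 0.4e6, 2.2e6, 1.9e6, 0.8e6],
})

# pearsonr returns the coefficient and a two-sided p-value
r, p_value = stats.pearsonr(critics["fresh_pct"], critics["box_office"])

# A 95% confidence level corresponds to a 0.05 cutoff on the p-value
significant = p_value < 0.05
print(f"r = {r:.3f}, p = {p_value:.3f}, significant at 95%: {significant}")
```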
    Going back to the audience datasets, I had to perform similar operations in order to analyze this data.  The IMDB sets are easy to join on the unique movie ID assigned to each entry.  Once the two IMDB datasets were combined, I could join them to the final one from TheMovieDB on the movie titles.  Since the two sites don't necessarily have the same formatting rules, I removed most symbols from the titles that might cause problems.  Once these sets were all combined, I ran into an issue that I'm sure most data scientists are very familiar with: all of the numbers representing box office results and production budgets were in string format.  Once these were converted into actual numbers, I was able to perform calculations on them, so I created a new column, 'profit'.  I set this equal to the worldwide gross minus the production budget so that I could do some further analysis later on.
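    Here is a rough sketch of that cleanup, again with made-up rows.  The movie titles, the 'title_key' column, and the clean_title helper are all hypothetical, and the exact rules (lowercasing and dropping everything except letters, digits, and spaces) are one reasonable choice rather than the precise rules I used.

```python
import pandas as pd

# Hypothetical rows standing in for the combined IMDB data and the budget data
imdb = pd.DataFrame({
    "title": ["The Big Heist!", "Moonbase: Alpha"],
    "rating": [7.1, 6.3],
})
budgets = pd.DataFrame({
    "title": ["The Big Heist", "Moonbase Alpha"],
    "production_budget": ["$150,000,000", "$20,000,000"],
    "worldwide_gross": ["$400,500,000", "$15,000,000"],
})

def clean_title(titles):
    """Lowercase, strip symbols, and collapse repeated spaces."""
    return (
        titles.str.lower()
        .str.replace(r"[^a-z0-9 ]", "", regex=True)
        .str.replace(r"\s+", " ", regex=True)
        .str.strip()
    )

imdb["title_key"] = clean_title(imdb["title"])
budgets["title_key"] = clean_title(budgets["title"])

# Convert strings like "$150,000,000" into integers
for col in ["production_budget", "worldwide_gross"]:
    budgets[col] = budgets[col].str.replace(r"[$,]", "", regex=True).astype("int64")

merged = imdb.merge(budgets, on="title_key", how="inner",
                    suffixes=("_imdb", "_budget"))
merged["profit"] = merged["worldwide_gross"] - merged["production_budget"]
print(merged[["title_key", "rating", "worldwide_gross", "profit"]])
```

    Matching on cleaned titles is imperfect: remakes and films that share a name can still collide, so spot-checking the merged rows is worthwhile.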
    With that, the audience data was also ready to be analyzed.  I performed essentially the same steps with this data as I did with that of the critics, and my result was very surprising.  The correlation coefficient between audience rating and worldwide gross came back as ~0.224.  Not only is this higher than the critic coefficient of ~0.06, but the audience dataset was also quite a bit larger than the one for critics, so a coefficient this far from zero is easily statistically significant at the 95% confidence level.  Since the audience data contained profit as well, I analyzed that in the same way.  The coefficient here was essentially the same, at ~0.222, and this result was also significant at the 95% confidence level.
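    To see why sample size matters so much here, the standard test for a Pearson coefficient converts r into a t-statistic that grows with the square root of the number of samples.  The sketch below applies that formula to the two coefficients from this post; the sample sizes are hypothetical round numbers chosen for illustration, not the actual sizes of my datasets, and pearson_p_value is a helper name I made up.

```python
from math import sqrt
from scipy import stats

def pearson_p_value(r, n):
    """Two-sided p-value for a Pearson r observed on n paired samples."""
    t_stat = r * sqrt((n - 2) / (1 - r**2))
    return 2 * stats.t.sf(abs(t_stat), df=n - 2)

# Sample sizes are hypothetical, NOT the sizes of the real datasets
for r, n in [(0.06, 300), (0.06, 3000), (0.224, 300), (0.224, 3000)]:
    print(f"r = {r:.3f}, n = {n}: p = {pearson_p_value(r, n):.4f}")
```

    The same small coefficient can fail the test on a few hundred rows and pass it on a few thousand, which is why a significant result says the relationship is unlikely to be zero, not that it is strong.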
    So without all of the technical talk, what does this mean?  Basically, the results from analyzing these datasets showed that using critic scores to predict box office success would not be very effective.  Audience reviews, on the other hand, seem to be a noticeably better, if still modest, indicator of a movie's gross and profit.  Specifically, the higher the audience reviews, the more money the movie tends to make.
    If you stop to think about the implications of these findings, they might not be such a surprise.  Critics, after all, aren't trying to predict a movie's success; they are only interested in judging its quality.  I would venture that the average critic has slightly different taste in film than the average non-critic due to their expanded knowledge of the medium.  Besides that, social media is empowering word-of-mouth recommendations like never before, increasing the reach of the average moviegoer and their review.  Whatever the reason, it would appear that if you want to predict the next smash hit at the box office, you might be better off asking your neighbor than reading a critic's review.
