Getting Started on Your Data Science Project

Today I’m writing a quick guide on how to get started on your first data science project.  I’ll assume the reader has very little experience in data science and give an overview of the entire project process from start to finish, with particular attention paid to the early stages.


The first two steps, choosing a topic and choosing a dataset, can be done in either order depending on your preference.  In some cases one order will suit you better than the other.  For example, choosing a topic first doesn’t do much good if there is no data available or it can’t be easily gathered.  Likewise, a dataset by itself isn’t very helpful if you aren’t able to think of any interesting topics to explore with it.


Choosing a Topic


My personal preference when starting a project is to choose a topic first and then look for a dataset that fits it.  For example, on my most recent project I decided on a topic related to the stock market.  With stock data so easily available online, finding data after I had chosen my topic was no problem.  If I had chosen a more niche topic, I likely would have had much more difficulty.  If you would like to choose your topic first, I would suggest picking something that will be simple to find data for online.


When thinking of a topic, it can be helpful to choose one that interests you first and foremost.  This will make the entire process easier, since you will be more invested in the results and it will be easier to think of relevant questions.  If the topic you chose is somewhat broad, be sure to narrow your focus by aiming to answer a specific question.  This question should be answerable given the data, and it is also helpful to think of a business case for answering it.  There is a wide range of potential questions that could work for any given dataset; below I’ll list some that could apply to the stock market example mentioned earlier, ranging from simple to more complex.

  • Find the stocks with the largest gain over the past month.

  • What is the sector that has grown the least in the past year?

  • Which features have the highest correlation with stock prices?

  • Using machine learning, predict the value of a stock in one week.

  • Which machine learning algorithms offer the best predictive value for stock prices?

While all of these questions would work with the dataset that I used, they vary greatly in difficulty.  This is also a good time to mention that when you are choosing a topic and deciding on a question to answer, it is important to pick a question appropriate to your skill level.  Some of these prompts require little to no deep analysis, while others call for multiple machine learning models.  None of these ideas are invalid, and you should choose a question that you feel comfortable solving.  Picking something far beyond your capabilities will make the process much longer and more confusing, while a question that doesn’t challenge you enough won’t teach you very much.  I often find that choosing a topic slightly above my comfort zone leads to the best results: I still need to learn and push myself, but the goal is still achievable.
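
To make the simple end of that range concrete, here is a minimal pandas sketch of the first question in the list.  The file prices.csv and its layout (a date column plus one closing-price column per ticker) are hypothetical placeholders.

    import pandas as pd

    # Hypothetical file: a date column plus one closing-price column per ticker.
    prices = pd.read_csv("prices.csv", index_col="date", parse_dates=True)

    # Restrict to roughly the past month and compute each ticker's percentage gain.
    last_month = prices.loc[prices.index >= prices.index.max() - pd.Timedelta(days=30)]
    gains = last_month.iloc[-1] / last_month.iloc[0] - 1

    # The five stocks with the largest gain over the past month.
    print(gains.sort_values(ascending=False).head(5))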


Finding a Dataset


There are two main ways to obtain data once you have chosen a topic: you can either find a dataset online or create your own.  When using a premade dataset, there are a few parameters worth paying attention to.  The first is size.  My recommendation here is generally the bigger the better.  The upper limit will be set by either storage or computation time, and if neither of these presents an issue there is no need to worry; if either becomes a problem, you can always take a random sample of the data.  Problems are much more common when a dataset is too small: results will often be insignificant, and machine learning models may not have enough data to train on, leading to inaccurate results.  Small datasets should be avoided wherever possible.

Another aspect to pay attention to is the quality of the dataset.  In some cases this can be difficult to discern, although certain sources have user ratings or a “usability” metric that can serve as a proxy for data quality.  There are also ways to check quality yourself, such as importing the data as a dataframe and viewing its summary (see the sketch at the end of this section).  This will reveal whether the data is in the correct format, and can also expose abnormalities in the summary statistics such as an implausible mean value.  As for where to find this type of data, there are a number of helpful sources.  I’ll list a few below.

These sources all contain a wide variety of datasets, and there is an even wider variety of analysis that can be done using them, from the very simple to the complex.  If no suitable dataset is available, a possible alternative is making your own.  The most common way of doing this is scraping the web for data, but that topic is complex enough to deserve its own blog post, so I won’t go into detail here.
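
Whichever source you use, the quality check described above is only a few lines of pandas.  Here is a minimal sketch, assuming a hypothetical stocks.csv download.

    import pandas as pd

    # Hypothetical dataset downloaded from one of the sources above.
    df = pd.read_csv("stocks.csv")

    df.info()               # column dtypes and non-null counts reveal wrongly formatted or missing data
    print(df.describe())    # summary statistics; an implausible mean, min, or max hints at bad values
    print(df.isna().sum())  # count of missing values per column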


The Project


Once you have decided on both a topic and a dataset, it’s time to work on the project itself.  My overview here will be briefer than for the earlier stages, since this part depends more on what you actually choose to examine.  At this point you should have formed a clear question that you would like to answer, and you will now begin analyzing the data to answer it.  There are several important steps that should always be taken.


First is data cleaning and feature engineering.  These can be handled separately, but I prefer to work on them at the same time.  Cleaning the data means putting it into a format that can be worked with easily and accounting for any missing data.  For example, this is where you should put dates into datetime format, rename any columns, and set the appropriate index.  This is also a good time to perform train/test splits and scale the data.  For feature engineering, you should add any columns that might be helpful in answering your question, such as moving averages, dummy variables, or log-transformed columns.
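
Below is a rough sketch of what those steps might look like with pandas and scikit-learn.  The file, column names, and features are hypothetical, and the ones you actually create will depend on your question.

    import numpy as np
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler

    # Hypothetical file with "Date", "Close", and "Sector" columns.
    df = pd.read_csv("prices.csv")

    # Cleaning: parse dates, tidy column names, set the index, fill missing prices.
    df["Date"] = pd.to_datetime(df["Date"])
    df = df.rename(columns=str.lower).set_index("date").sort_index()
    df["close"] = df["close"].ffill()

    # Feature engineering: moving average, log transform, and sector dummies.
    df["ma_20"] = df["close"].rolling(20).mean()
    df["log_close"] = np.log(df["close"])
    df = pd.get_dummies(df, columns=["sector"])
    df = df.dropna()  # drop the rows lost to the rolling window

    # Train/test split (no shuffling for time series) and scaling fit on training data only.
    X = df.drop(columns=["close"])
    y = df["close"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)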


Another important aspect of the project will be data visualization.  In my experience, this actually happens during at least three different stages of the project, with three different goals.  The first is during the initial data cleaning phase, when visualizations are useful for checking relationships within the data, spotting outliers, examining correlation between features, and so on.  Next is the modeling phase, when visualizations are useful for evaluating model performance.  Sometimes they can even be used for selecting model parameters; an example would be using the Box-Jenkins method to estimate ARIMA parameters, as in the graphs below.
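
As a sketch of those diagnostic plots with statsmodels (the file and column name are hypothetical), differencing the series once and plotting its autocorrelation and partial autocorrelation suggests candidate ARIMA orders.

    import pandas as pd
    import matplotlib.pyplot as plt
    from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

    # Hypothetical closing-price series with a datetime index.
    series = pd.read_csv("prices.csv", index_col="date", parse_dates=True)["close"]

    # Difference once to work with a (hopefully) stationary series, then plot the
    # ACF and PACF to suggest the ARIMA p and q orders.
    diffed = series.diff().dropna()
    fig, axes = plt.subplots(2, 1, figsize=(8, 6))
    plot_acf(diffed, ax=axes[0], lags=30)
    plot_pacf(diffed, ax=axes[1], lags=30)
    plt.tight_layout()
    plt.show()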

Finally, once modeling is complete, visualizations can be used to illustrate the actual answer to your question.  Below is a graph from a previous project where one of the goals was to illustrate how well various stock sectors had performed over the past year.  These tend to be the most important visualizations, and there are a few guidelines that should be followed.  Since these graphs will normally be the centerpiece of any presentation on the project, they should be labelled on both axes, include a title, and have a legend if necessary.  The graphs should also very clearly communicate the answer to your question.  In the graph below, for example, I’ve started each sector index at the same point so that any differences in performance are evident by the end of each line.
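
Here is a sketch of that normalization trick, assuming a hypothetical sector_indices.csv with a date index and one column per sector.

    import pandas as pd
    import matplotlib.pyplot as plt

    # Hypothetical file: one column of index values per sector, indexed by date.
    sectors = pd.read_csv("sector_indices.csv", index_col="date", parse_dates=True)

    # Rescale every sector to start at 100 so relative performance is easy to compare.
    normalized = sectors / sectors.iloc[0] * 100

    ax = normalized.plot(figsize=(10, 6))
    ax.set_xlabel("Date")
    ax.set_ylabel("Index value (start = 100)")
    ax.set_title("Sector performance over the past year")
    ax.legend(title="Sector")
    plt.show()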


Perhaps the most important stage is the modeling itself, since this determines your answer to the question.  I won’t go into much detail on this stage, however, since it will vary depending on what you are trying to answer and which modeling technique you are using.  As a rough overview, though, this is where you will do the “problem solving” part of the project, including any mathematical processes or machine learning models.  It is also best practice to test the validity of your model if possible.  This is especially important with machine learning.
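
As a minimal sketch of that validation step, here is what evaluating a model on held-out data might look like with scikit-learn, reusing the scaled split from the earlier feature engineering sketch.  The plain linear regression is just a stand-in for whatever model you actually choose.

    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_absolute_error
    from sklearn.model_selection import TimeSeriesSplit, cross_val_score

    # X_train_scaled, X_test_scaled, y_train, y_test come from the earlier sketch.
    model = LinearRegression()
    model.fit(X_train_scaled, y_train)
    preds = model.predict(X_test_scaled)
    print("Test MAE:", mean_absolute_error(y_test, preds))

    # For time series, cross-validation should respect chronological order.
    scores = cross_val_score(LinearRegression(), X_train_scaled, y_train,
                             cv=TimeSeriesSplit(n_splits=5),
                             scoring="neg_mean_absolute_error")
    print("CV MAE:", -scores.mean())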


At this stage, you should have all of the necessary information to answer the original question.  I usually include a “results” section at the end of my Jupyter Notebook where I restate the business case.  I will then provide the answer I arrived at through my modeling and make some recommendations based on that answer.  This is also where I explain any potential issues in the analysis, such as whether my model had a high error value, or if results were not significant.  As an aside, it is important to remember that insignificant results can be just as valid as significant ones, so don’t worry if this is the case.  One last detail that can be included here is any ideas you might have for future work on the topic, for example improvements that can be made to your model or even other questions you would like to answer.  And with that, your project is complete!
