The Data Science Method

Whether you are transitioning into a career in data analytics or have already been in the industry for a while, it is crucial to have a process on which any and all analyses can be built. Now, this does not mean that everything will perfectly follow this process. In fact, you may need to do several iterations or skip certain parts entirely depending on your (and your data’s) analytical maturity level. (Wait, what?! How dare you! I am mature!)

The Process

[Diagram: The Data Science Process. Steps, in order: Collection, Clean-up/Mutate, Exploration, Model, Report/Productionize.]

Now, admittedly, this is not a revolutionary diagram. You can find plenty of different images and explanations on the subject with a simple search for “data science process model” on your favorite search engine. However, I put this together as a quick, condensed view. So let’s break down each part of the process.

Collection

Collection and Clean-up will, unfortunately, take the majority of your time when doing any data analysis. With collection, it’s always important to ask some critical questions:

  • What question am I trying to answer? (This should be a concrete question before beginning! If you don’t have a clear question, you are going to burn a lot of time during the other steps.)
  • What types of data will I need? (E.g., Customer, Sales, Dates, Locations, Systems.)
  • What levels of data will I need? Do I need to go all the way down to a transactional level or can I stay high-level? This also dovetails into…
  • Who is my audience for the results? More on this in the Reporting section.
  • What do I need to know about the data before beginning? Do I need a Subject Matter Expert (SME) to help me?
  • What are some of the next questions that could be asked? (If I am reporting on a complex sales problem, will we want to dig into customer data after we get the first question’s results?)

Once you can answer these, you have a fantastic foundation to build upon and will have saved yourself hours (possibly days) of work. Very often, we get asked a question or have a problem explained to us, and we want to dig in right away. Using these questions and more as a guide, you’ll quickly become a go-to professional for analytics within your organization. On the opposite end, even if it’s not your fault, by not asking these questions you may appear incompetent or burn unnecessary hours on a time-sensitive project (and nobody wants any of that!).

Clean-up & Mutate Data

Let’s talk about a concept called tidy data real quick. According to the R Project’s website, tidy data “is a standard way of mapping the meaning of a dataset to its structure. A dataset is messy or tidy depending on how rows, columns and tables are matched up with observations, variables and types. In tidy data:

  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Each type of observational unit forms a table.”

Let me break down those three rules real quick. 1) Each of your variables ({Day, Month, Year}, {Store, Address, Manager}, {Sales, Tonnage, Price, Cost}) should have its own column, with no variable repeated across columns. 2) Each observation, or row, should be unique going down your data. 3) Variables and observations that describe the same kind of thing together form their own table. For instance, if I had a table of sales divisions and their division managers, Division Manager and Division would each be a column, and each observation would be a unique combination of values under those columns. (Note: If John Smith has two divisions, the observations would show {John Smith, 1} and {John Smith, 2}. While his name is repeated, the whole observation or row is not.) This table can then feed into a larger table of more variables, and that feeds another table, and so on. Tidying data is truly cleaning up your data set to prepare it for visualizations and modeling.
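To make the three rules a bit more concrete, here is a minimal Python sketch using only the standard library. The store names, months, and sales figures are made up for illustration, and to_tidy is a hypothetical helper written for this example, not a function from any library. It reshapes a table that has one column per month into one row per observation.

```python
# Hypothetical "messy" table: one row per store, one column per month.
# The month columns are really values of a single variable (Month).
messy = [
    {"Store": "North", "Jan": 100, "Feb": 120},
    {"Store": "South", "Jan": 80, "Feb": 95},
]


def to_tidy(rows, id_column, variable_name, value_name):
    """Reshape wide rows so each variable is a column and each observation a row."""
    tidy = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue
            tidy.append({
                id_column: row[id_column],
                variable_name: column,
                value_name: value,
            })
    return tidy


tidy_sales = to_tidy(messy, "Store", "Month", "Sales")

# Rule 2 check: no observation (row) appears twice.
unique_rows = {tuple(sorted(row.items())) for row in tidy_sales}
assert len(unique_rows) == len(tidy_sales)

for row in tidy_sales:
    print(row)
```

In practice you would more likely reach for a data-frame library in R or Python for this kind of reshaping; the plain-Python version is only meant to show what the rules ask for.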

Tied into tidying your data is mutation or calculation. I like the term mutating better (taken from Johns Hopkins’ Data Science course on Coursera) because if you imagine your dataset as something that can be changed (nay, living!) instead of just a spreadsheet or table, it changes how you treat your data. Either way, mutating your dataset means creating custom calculations off of your data. A simple example would be Sales - Cost = Profit. More complex examples can be If…Then… rules and weighting certain observations based on that If-Then criteria. This mutation does not come explicitly from your data sources, but it is derived from the variables and observations you have gathered.
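Here is a rough sketch of both kinds of mutation in plain Python. The rows are made up, and the specific If-Then rule (doubling the weight of loss-making rows) is only an assumption chosen for illustration; mutate is a hypothetical helper name.

```python
# Hypothetical observations; the column names follow the Sales - Cost = Profit example.
rows = [
    {"Store": "North", "Sales": 100.0, "Cost": 60.0},
    {"Store": "South", "Sales": 80.0, "Cost": 90.0},
]


def mutate(row):
    """Return a copy of the row with derived columns added."""
    profit = row["Sales"] - row["Cost"]
    # If...Then rule (an assumption for illustration): loss-making rows
    # get double weight so they stand out in later summaries.
    weight = 2.0 if profit < 0 else 1.0
    return {**row, "Profit": profit, "Weight": weight}


mutated = [mutate(row) for row in rows]

for row in mutated:
    print(row)
```

Returning a new row instead of changing the original keeps the source data intact, which helps if you need to re-run or audit the calculation later.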

With our data in a tidy format and having all the variables we need, we can now move to the next part: Exploration.

Data Exploration

Personally, this is my favorite part. This can also be a time-consuming part of the process, but it is all about trying to find insights and possibly poking holes in your data. You can histogram, bar chart, line graph, and visualize your data in 1,000 ways. You can also keep summarizing or transforming your data until you reach your Aha! moment. It is important to give yourself ample time in this part of the process to understand what you are looking at, and be sure to ask lots of questions like “Is this what I expected?” or “What is the root cause for what I am seeing?” Likewise, you may also discover that you are missing variables or needed observations during this part, and that means you need to go back to the beginning of the entire Data Science Process to refine your work! (To be fair, you may not need to re-do everything, but just walk quickly or mentally through the steps to make sure you’ve got everything covered.) Tools like Tableau, Qlik, and Domo are really helpful in this stage of the game.
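If you do not have a visualization tool handy, even a few lines of Python can start the exploration. The sketch below uses only the standard library and made-up daily sales figures; it prints summary statistics and a crude text histogram with a bin width of 10 (an arbitrary choice for this example).

```python
from collections import Counter
from statistics import mean, median

# Made-up daily sales figures for illustration only.
daily_sales = [12, 15, 14, 13, 90, 16, 14, 15, 13, 14]

summary = {
    "count": len(daily_sales),
    "mean": mean(daily_sales),
    "median": median(daily_sales),
    "min": min(daily_sales),
    "max": max(daily_sales),
}

# A crude text histogram: bucket values into bins of width 10.
bin_width = 10
histogram = Counter((value // bin_width) * bin_width for value in daily_sales)

print(summary)
for bin_start in sorted(histogram):
    print(f"{bin_start:>3}-{bin_start + bin_width - 1:<3} {'#' * histogram[bin_start]}")
```

Here the mean sits well above the median because of a single large value, which is exactly the kind of result that should prompt the question “Is this what I expected?”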

Also, during this stage, you may want to create a model tournament where you compare different tools and their algorithms against each other. Especially if we are trying to build a model where we want to predict future values, we will want to try regressions, random forests, or even neural nets against each other to see which method explains our data the best. Alteryx makes this really easy with some of their pre-defined tools. R and Python also have some of the most robust tools to customize your model, but building and comparing results takes quite a bit of understanding about the functions that drive the models.
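The idea of a model tournament can be sketched in a few lines of plain Python. This is only a toy stand-in under stated assumptions: the data points are made up, the two candidates (a mean baseline and a least-squares line) replace the regressions, random forests, and neural nets mentioned above, and the scoring metric (mean absolute error on a small holdout set) is my choice for the example.

```python
from statistics import mean

# Made-up (x, y) observations with a roughly linear trend.
data = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8),
        (5, 10.1), (6, 12.2), (7, 13.8), (8, 16.1)]
train, test = data[:6], data[6:]  # simple holdout split


def fit_mean_baseline(points):
    """Baseline candidate: always predict the average y seen in training."""
    average = mean(y for _, y in points)
    return lambda x: average


def fit_linear(points):
    """Candidate: ordinary least squares line y = slope * x + intercept."""
    x_bar = mean(x for x, _ in points)
    y_bar = mean(y for _, y in points)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in points)
             / sum((x - x_bar) ** 2 for x, _ in points))
    intercept = y_bar - slope * x_bar
    return lambda x: slope * x + intercept


def mean_absolute_error(model, points):
    """Average size of the prediction error on the given points."""
    return mean(abs(model(x) - y) for x, y in points)


# The tournament: fit every candidate on the training data,
# score each one on the held-out data, and keep the lowest error.
candidates = {"mean baseline": fit_mean_baseline, "linear regression": fit_linear}
scores = {name: mean_absolute_error(fit(train), test)
          for name, fit in candidates.items()}
winner = min(scores, key=scores.get)

print(scores)
print("Winner:", winner)
```

The important design point is that every candidate is scored on data it did not see during fitting; otherwise the tournament tends to reward whichever model memorizes the training data best.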

Data Modeling

Once you have explored and found some golden analytical nuggets, it is time to refine your model and prepare for the last stages. Here, we decide on the main visualizations and models that we will need for our report. This is also a great opportunity, if you haven’t done it along the way, to comment your code. Nothing is as frustrating as creating a workflow or script and then having to return to it months later when something breaks, only to find there are no comments! This’ll make you want to blow it all up and start from scratch! If you didn’t do it during your data exploration phase, this is where you would create your model tournament and pick the best model(s) to explain your data.

Reporting & Productionizing

If you didn’t decide this up front, here is where you need to decide how this report or model will be consumed by end users. Is this a one-off? Is this going to be something to drive departments within your business? Does this need to be a formal report or is it a footnote in a larger report?

This all goes into understanding your audience. If you are reporting to executives, perhaps they just want to know the accuracy and some highlights of how your model works. If it is a fellow analyst or your boss, maybe they want to know more of the technical details of what you did. This also goes into data storytelling and working on your ever-growing presentation skills.

From here, do you need to have a formal report, or are you deploying your model to be used by others or run every so often? (Pro tip: if you are just deploying, always, always, always create a report to explain the model and how it works, even if nobody asks for it!) If it’s a formal report, make sure you have someone review it before presenting it so as to cover any gaps in logic. If it’s a productionized deployment, make sure you have clear instructions on how to use the model and how it is supposed to work.

Last, but not least, you want everything you did to be reproducible. If you cannot reproduce your results, then did they really happen in the first place? Maybe this is a harsh criticism, but it is important to think about, especially as you are trying to build credibility. You want another analyst to be able to run your same report and get the same results. If you are productionizing a model, you want to test its results every so often to see if it needs to be tweaked. (If it does, guess what? You’re back to the beginning of your Data Science Process! Yay!)
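One small, common source of irreproducible results is randomness, such as random sampling or random train/test splits. The sketch below shows one way to pin that down in Python by fixing a seed. The record identifiers, the seed value, and the sample_for_review helper are all hypothetical and exist only for this example.

```python
import random


def sample_for_review(record_ids, sample_size, seed):
    """Pick a random sample; the same seed always yields the same sample."""
    rng = random.Random(seed)  # local generator, so global state is untouched
    return rng.sample(record_ids, sample_size)


record_ids = list(range(1, 101))  # hypothetical record identifiers

first_run = sample_for_review(record_ids, 5, seed=42)
second_run = sample_for_review(record_ids, 5, seed=42)

# Another analyst re-running the report with the same seed gets the same rows.
assert first_run == second_run
print(first_run)
```

Recording the seed alongside the report (together with the data source and the code version) gives the next analyst what they need to reproduce the same sample.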

Congratulations!

If you have just used this method for a report or productionized model, you have completed your first Data Science Process! If not, try to follow this for your next big project. You will find it takes a few more steps than you might be used to or have done in the past, but the dividends from this practice make you so much better at your job. From here, I wish you nothing but happy data moments and career success by using this methodology!
