My Data Science Cookbook

Last week I wrote about how to transfer my knowledge into a digital format (see My Digital Brain). This week, I will show you exactly how that brain works for me during a data science project. Before jumping into the project, however, here is how the data science portion of my brain is structured in Evernote (EN):

  • Data Science Cookbook Stack – Main working area; contains notes and code snippets to be used in each step of the Data Science Process below.
  • Data Science Process Stack – High-level notes on different types of Machine Learning models.
  • Data Science Projects Stack – Projects that I have worked on.
  • Data Science Studies Stack – Research papers, courses, and code-camps I have used.
My Evernote Data Science Stacks

This is a growing process, so I don’t expect each of these to stay this way for too long. (I can already envision the DS Process Stack splitting out into Algorithms, Code Deployment, Models, etc.)

Data Science Process

As a recap, here are the steps of a typical Data Science project:

  • Business Understanding
  • Data Access
  • Data Cleaning
  • Exploratory Data Analysis (EDA)
  • Data Modeling
  • Data Deployment
  • (Any iteration of the above)

You can also read more about the Data Science Process in my blog post here.

Quick Note: Structuring the Process as a Task List and Keeping Myself Accountable

(Before I start, I can take each of these steps and create a new Habitica To-Do with sub-tasks for each step. I can then set a due date to keep myself accountable, and also set a difficulty for my reward. Since this is for fun and nothing too serious, I will keep it light and make each step a medium task to complete. Okay, onto the actual project…)

My Habitica Project To-Do List

Business Problem and Understanding

First and foremost, I will have to pick a problem to solve. A good starter problem in the data science world is the classic Titanic competition on Kaggle. The first two steps of our project are completed for us because we have a clear objective and the data has been provided. For those unfamiliar, here is the summary of the Titanic challenge: Given a labeled dataset on the passengers of the Titanic, can we predict who survives? We are given a sample dataset with the column Survived, and based on that data, we need to extrapolate this to new, unlabeled records. (We do this by predicting a 1 or 0 for each new record – with 1 meaning survived). Then, submit that to Kaggle’s competition and Kaggle will inform us how accurate we are with our predictions. In this instance, Kaggle is giving us the objective, the “business” context, data dictionary, and the raw data, so we have everything we need to get started! (We can check this off our Habitica list!)

My Brain Researching Models

One commonly glossed-over part of Business Understanding is research. Yes, Kaggle has outlined everything for us, but there is much more we need to do before starting. For one, we need to decide on our models. In this case, we are predicting a binary 1/0, so we can use Logistic Regression, Neural Networks, SVM, Random Forest, or Decision Trees to classify new records (please note: this is not an exhaustive list, just some of the more common algorithms). If I didn’t know that already, I would need to research which methods could be used, with something like Kaggle, Google, or even StackOverflow. Since I do have a general idea of what I want to do, I can look into articles I may have already read about these methods and how they have been applied. In this case, I can use a search within Evernote to help me out:

Doing a search for Logistic Regression within my instance of Evernote

Seeing from my search above, I have 74 notes that talk about just Logistic Regression (shocking to me as well!). Evernote also has a sort by relevance within the search, which allows me to quickly skip notes that might have mentioned Logistic Regression only in passing. Go figure, the top result is an infographic all about Logistic Regression!
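
For reference, the candidate models listed earlier all exist in scikit-learn, so a quick side-by-side comparison is cheap. The snippet below is only a sketch: it uses a synthetic dataset from make_classification as a stand-in for a binary 1/0 target (not the Titanic data), and the settings passed to each model are arbitrary assumptions, not tuned choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a binary 1/0 target; this is NOT the Titanic data.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# One entry per algorithm named in the text; settings are illustrative only.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "neural_network": MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0),
    "svm": SVC(),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "decision_tree": DecisionTreeClassifier(random_state=0),
}

# Mean 5-fold cross-validated accuracy for each candidate.
scores = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in candidates.items()
}

# Print the candidates from best to worst mean accuracy.
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

A ranking on synthetic data says nothing about the Titanic problem itself; the point is only that the same loop works unchanged once real features are in place.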

My Brain Researching Code

Now for this project, I will be using Python and you can see my notebook here. In the Python world, one of the most famous packages for machine learning is scikit-learn (sklearn). Within sklearn, we can see all kinds of models to use, and we can spend days researching each one of them. Instead of scrolling through endless pages upon pages of documentation, I can do myself a favor. I can use my digital brain to help me out. My common process looks like this:

  • Find an interesting model, then use Evernote WebClipper to capture either the whole page or part of the page (this sends it to my EN as a new note).
  • On the new note, I make my own remarks at the top like “read more about this”, “how does data need to be prepped for this model?”, or simply “implement this”.
  • I can also use EN’s tag system to classify this new page as “-To Read” or “-To Try”. I also create a tag for the Kaggle Project, and one for when I expect to complete the project (2021Q1).
  • Lastly, I can create a new notebook specifically for this project (in this case, I am going to put it in my Logistic Regression notebook since this is pretty general information). A mixture of notebooks/tagging will be helpful for “future me” when I need to refer to how I did something or what my thoughts were while tackling the project.
Using WebClipper to make notes on SkLearn’s Logistic Regression

I then repeat this for any models that I am interested in. Sometimes (okay, most of the time), the documentation can be too academic for me. In those cases, I can also go to Medium.com to find articles on people using or talking about the model. Again, I can clip and send those to EN with a tag/notebook. I very often take my 5-10 minute coffee breaks to just search for interesting articles for projects and then send them to EN to read later. It’s a surprisingly efficient process.

My Brain Researching the Problem

I also need to research the problem. (NOTE: I am not trying to cheat here, so I have to be careful what I search for. There are many people who have received 100% accuracy on the Titanic dataset. While I could just implement what they did by copy/pasting, I wouldn’t learn anything from it. Plus, I love challenges. I do recommend that you review others’ work, but only after you have tried to solve the problem yourself. This is akin to trying to solve a math problem yourself and then seeing the answer in the back of the book instead of the other way around!). Searching for similar problems (e.g. almost anything with logistic regression) could help me with this specific problem. Perhaps a similar solution used a lot of data transformations and one-hot-encoding which I could apply to this problem…
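
As a cookbook-style reminder of what one-hot encoding looks like, here is a minimal sketch using pandas’ get_dummies. The rows are made up; only the column names (Sex, Embarked, Fare) follow the Titanic data dictionary, and choosing these two columns to encode is my assumption for illustration.

```python
import pandas as pd

# Made-up rows with Titanic-style column names; not real passenger data.
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "Q", "S"],
    "Fare": [7.25, 71.28, 7.75, 8.05],
})

# One-hot encode the categorical columns; numeric columns pass through unchanged.
encoded = pd.get_dummies(df, columns=["Sex", "Embarked"], dtype=int)

print(encoded.columns.tolist())
```

Each category becomes its own 0/1 column, which is the form most sklearn models expect for categorical inputs.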

EDA and Note-Taking using Evernote

Once I feel like I have done an adequate amount of research, I can then start digging into the data. Kaggle offers some great high-level statistics on their datasets, but sometimes it is a good idea to test your chops on the data yourself. We can get counts of NULLs, see the IQRs of each of the numeric variables, or get distributions of Age for each Pclass and Survived status. This is a constant loop of Exploratory Data Analysis and Data Cleaning.

A little of my EDA with the Titanic dataset
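
The three checks mentioned above (NULL counts, IQRs, and Age by Pclass and Survived) can be sketched in a few lines of pandas. This is not the code from my notebook: the DataFrame below is a handful of made-up rows that only borrow the Titanic column names, so the numbers it prints mean nothing.

```python
import pandas as pd

# Made-up rows with Titanic-style column names; not real passenger data.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0, 0, 1],
    "Pclass": [3, 1, 3, 3, 1, 2, 3, 2],
    "Age": [22.0, 38.0, None, 35.0, 54.0, None, 2.0, 27.0],
    "Fare": [7.25, 71.28, 7.92, 8.05, 51.86, 13.00, 21.08, 11.13],
})

# Count of missing values per column.
null_counts = df.isna().sum()

# Interquartile range (Q3 - Q1) of each numeric column.
numeric = df.select_dtypes("number")
iqr = numeric.quantile(0.75) - numeric.quantile(0.25)

# Summary statistics of Age for each Pclass / Survived combination.
age_summary = df.groupby(["Pclass", "Survived"])["Age"].describe()

print(null_counts, iqr, age_summary, sep="\n\n")
```

With the real data, the same three objects are a reasonable first page of notes to paste into EN.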

While I go through this cycle, I can use EN to take notes on each of my variables. Admittedly, most of my notes will live in the Python notebook itself, but some high-level notes will be great to have in EN for other studies. For instance, if I saw that Age is mostly NULL, could I impute it with averages computed per group of passengers rather than one overall average? (e.g. If the average age in the known records is 29, what are we assuming about those that are NULL? A NULL age could be due to passengers under a certain age (infants/children) not needing to specify an age – in this case, 29 would be way off the mark). This is a learning process for me, and leaving myself breadcrumbs on my thoughts and methods will pay dividends in the future.
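
To make the imputation question concrete, here is a small sketch contrasting one global fill value with a per-group fill. The rows are made up, grouping by Pclass and Sex is just one possible choice (an assumption, not a recommendation), and Age_imputed and Age_missing are hypothetical column names.

```python
import pandas as pd

# Made-up rows with Titanic-style column names; not real passenger data.
df = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Sex": ["female", "female", "female", "male", "male", "male"],
    "Age": [38.0, None, 42.0, 4.0, None, 20.0],
})

# Option 1: a single global mean for every missing Age.
global_fill = df["Age"].fillna(df["Age"].mean())

# Option 2: the median Age of each Pclass/Sex group, aligned row by row.
group_median = df.groupby(["Pclass", "Sex"])["Age"].transform("median")
df["Age_imputed"] = df["Age"].fillna(group_median)

# Keep a flag so a model can still see which ages were originally missing.
df["Age_missing"] = df["Age"].isna().astype(int)

print(df)
```

Neither option answers why the ages are missing, which is why the flag column is worth keeping alongside whichever fill is used.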

With each step of the process – but more so with data cleaning and EDA – I keep a cookbook of code. Personally, I have had to learn R, SQL, Alteryx, Spark, and Python, and it all gets a little blurry. I still sometimes get stuck on “is it plt.pairplot or sns.pairplot or is it just pair_plot?” (it is seaborn’s sns.pairplot). Since my memory isn’t the greatest, I can use EN’s tagging system to find what I want by going to my “data visualization” tag and then searching for pairplot. I may have also created a really handy function/macro for a previous project that I could reuse for this project. Again, EN makes me work smarter, not harder. I have done this all at some point, so there’s no sense in repeating a Google search or revisiting a code course I took 2 years ago.

As I work through my EDA and Data Cleaning, I can also time-box myself using Forest. Setting my time up in 30-minute increments allows me to work through the problem but not spin in circles for too long. The extra bit of pressure to complete this step of the process in a given time also gets my brain working in high gear.

Setting a Forest Timer for Kaggle

Feature Engineering, Modeling, Scoring

Once I believe I’ve got all features cleaned and ready, I can start my modeling. Again, I can pull code from my other projects to set up train/test splits and even reuse some model code. All of this is pulled from my EN cookbook. I iterate through this process a couple of times – adjusting hyperparameters, cross-validating, and so on – until I can pick the best model. Taking that model, I then apply it to a new, unlabeled dataset. Finally, I output the results to a .csv on Kaggle, and then submit those results to the competition. As of this writing, the notebook mentioned above has an accuracy of 77%. Back to the drawing board!
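
The loop described above (split, tune, cross-validate, predict, write a .csv) might look roughly like the sketch below. It is not my notebook’s code: the data is synthetic, the grid of C values and the file name submission.csv are assumptions, and the PassengerId values are placeholders. The two output column names match what Kaggle’s Titanic competition expects.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the cleaned features; this is NOT the Titanic data.
X_all, y_all = make_classification(n_samples=300, n_features=6, random_state=0)
X, y = X_all[:250], y_all[:250]  # labeled records
X_new = X_all[250:]              # plays the role of the unlabeled file

# Hold out part of the labeled data for a final check.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Try several regularization strengths with 5-fold cross-validation.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1, 10]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

# Accuracy of the best model (refit on all training rows) on the held-out rows.
holdout_accuracy = search.score(X_test, y_test)
print(search.best_params_, round(holdout_accuracy, 3))

# Predict the unlabeled records and write them in submission layout.
submission = pd.DataFrame({
    "PassengerId": range(1, len(X_new) + 1),  # placeholder IDs
    "Survived": search.predict(X_new),
})
submission.to_csv("submission.csv", index=False)
```

Keeping a held-out split separate from the cross-validation folds gives a second opinion on accuracy before spending a Kaggle submission.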

Wrapping Up the Project

After each step, I can check off my tasks in Habitica and see how much time I have spent with Forest! I can capture all of this information into a new note on the project within EN and put my recap in there. Maybe next time I will fill in NULLs differently, or I will square a numeric feature to see if that gives me any improvement. Maybe I won’t spend too much time on EDA, but I will focus more on optimizing models using something like grid search. Fortunately (and unfortunately), there is a lot you can do to explore and manipulate the data, and being organized can help you cover a lot of ground very quickly!
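
For the “square a numeric feature” idea, a quick before/after comparison is enough to decide whether it is worth keeping. In the sketch below the data is synthetic and deliberately built so that the target depends on the square of the feature; real data will rarely be this clear-cut.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data, constructed so the target depends on x squared.
rng = np.random.default_rng(0)
x = rng.normal(size=(300, 1))
y = (x[:, 0] ** 2 > 0.5).astype(int)

X_base = x                          # original feature only
X_squared = np.hstack([x, x ** 2])  # original feature plus its square

model = LogisticRegression(max_iter=1000)

# Mean 5-fold cross-validated accuracy with and without the squared column.
base_score = cross_val_score(model, X_base, y, cv=5).mean()
squared_score = cross_val_score(model, X_squared, y, cv=5).mean()

print(f"without square: {base_score:.3f}")
print(f"with square:    {squared_score:.3f}")
```

The same two-line comparison makes a handy cookbook entry for any engineered feature, not just squares.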

I hope you found this post helpful! Please let me know your thoughts in the comments. Also, if you liked this post and want to see more like it, sign up on my email list to be alerted when new posts are available!
