Much like all of data science, there is no “one size fits all” in how you do the work. However, my experience has taught me a few basics that are either commonly overlooked or forgotten entirely when working on data projects.
I have inherited projects, and even created projects, that ended up as a complete train wreck in a folder: data, workflows, PDFs, and previous versions of everything all thrown together. One thing I repeat to myself constantly is that everything has a place. It’s important to keep your data, your data processes, and your reports separate from each other. It’s also really important to have a README file in your main project folder. It should include the project name, date of creation, an explanation of the project, the stakeholders, and who owns the project (hint: it’s you). Before anyone goes digging through those folders, they should be able to read the README and get a general sense of what they’re messing with.
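As a sketch, a minimal README.txt along those lines might look like the following (every value below is a placeholder, not a prescription):

```
Project:      <project name>
Created:      <date of creation>
Owner:        <you>
Stakeholders: <teams or people who requested this>
Description:  One or two paragraphs explaining what this project
              does, where the data comes from, and where the
              outputs land.
```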
The first thing to do on any data project is document, document, document. Create diagrams showing how you are pulling data, from where, how it will be treated, and any process steps in between. Add comments as you go along in your code, and leave yourself TODOs on anything that needs follow-up. This doesn’t have to be live-updated or 100% accurate when you start out, but it has to be robust enough that you could hand the project to someone else without them guessing what the heck was going on.
The next thing to think about with any project is how to “undo” your work and reproduce what you have done. This is typically handled by archiving the base data, writing logs, and having good documentation on your work. It doesn’t have to be verbose: at a bare minimum, your logs and comments should explain what you’re doing and how it was done. Archiving the base data should also be standard for data analysis and machine learning projects. If someone asks how you arrived at a number, but your base data was manually scrubbed to the point that it’s no longer recognizable, you’ll have a lot of explaining to do!
Another thing that you should seriously consider for most projects is Git. I am not saying you have to be a software developer or be committing to GitHub every day, but learning a few things about Git will take you a long way. You can also keep your commits local, in case you are worried about publishing things to the cloud. Version control is your best friend when things hit the fan. If you change a script and it accidentally deletes a folder in your project, you can restore that folder from a previous commit. You can also see how the project has changed over time at a high level, which is helpful when you have to explain why and how certain changes were put into place (or when you have to explain to your manager everything you’ve had to do to a project to satisfy your customers).
I am guilty of not following this advice to the letter, but every time I do follow it, I thank myself. I cannot tell you how many times I have had to build something only to be questioned about it 6–9 months later. The wonderful thing is that if someone has a question, you can refer them to the documentation rather than explaining it all yourself. So help yourself and others by doing things the right way once.
Know when to automate…
There are a couple of maxims in the data world. The first one is more of a joke: “I’ll only do this once…” (IWODTO). It refers to the manual data cleaning that happens on projects. Excel users are notorious for this: they open up a file, notice something is wrong, fix it or tweak a formula, and then save over their original file. Then next month, a new file arrives, they run their process, and it doesn’t work anymore! Going back to the previous section, document your steps. If you ultimately want to reproduce your work and automate it, avoid the IWODTO trap.
The second maxim is “If you do it 3 times, turn it into a function.” While we should hesitate to jump the gun on automation, if there are processes you run constantly, it serves you well to learn to write reusable functions. For instance, if I have a process that parses the month name from a date and adds it to the front of a file name, then I should write that as a helper function and import it as needed across my projects. (For those of us familiar with Alteryx, this is the same idea as importing your own custom formulas and macros.)
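As a sketch of that exact example, here is the kind of helper you might keep in a shared utilities module (the function name is mine, purely illustrative):

```python
from datetime import date
from pathlib import Path


def prepend_month(path: Path, when: date) -> Path:
    """Add the month name to the front of a file name,
    e.g. 'sales.csv' run in March becomes 'March_sales.csv'."""
    month = when.strftime("%B")  # full month name, e.g. "March"
    return path.with_name(f"{month}_{path.name}")
```

Once this lives in one place, every project that needs the naming convention imports the same function — fix a bug once and every pipeline benefits.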
The last thing I would check before automating is stability (or, inversely, scope creep). Many times we start on a project only to find more data is required or the business needs something slightly different. A lot of this has to do with not having clear questions or expectations up front, but that is for a different post. If you can recreate the end state the business needs through your process without manual intervention: congratulations, you have an automated product! Now make sure you document it, version control it, and stay on the lookout for scope creep!
Output Your Findings Every Time…
This one took me a while to learn, but it is extremely helpful for showing your findings and how they change over time. If you have a function that outputs data quality charts, write those to a PDF and store them in a dedicated folder with the date in the file name. That way, when you re-run your script or workflow, it outputs to the same folder under a new file name, and you can compare how things have changed since the last run.
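As one possible convention (the function name, folder, and naming scheme here are my own), a tiny helper can generate those date-stamped output paths:

```python
from datetime import date
from pathlib import Path


def dated_output(folder: Path, stem: str, suffix: str = ".pdf") -> Path:
    """Build a date-stamped output path like
    reports/2024-06-01_quality_charts.pdf, so each run lands
    beside the previous ones instead of overwriting them."""
    folder.mkdir(parents=True, exist_ok=True)
    return folder / f"{date.today():%Y-%m-%d}_{stem}{suffix}"
```

Every plotting or reporting step then calls `dated_output(...)` for its save path, and the output folder becomes a time series of runs for free.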
Actual Project Folder Structure
So far I’ve explained the “what” and the “why” behind each of these practices; now I’ll show the “how” of structuring my projects. For a typical project, I do something similar to the following:
|—> .git
|—> 0.Logfiles
|—> 1.RawData
|—> 2.ProcessedData
|—> 3.Notebooks (or Workflows)
|—> 4.Images and Diagrams
|—> 5.Models and Reports
.gitignore (specifying files not to track, if needed)
README.txt
License (if needed)
- Your .git folder handles version control.
- Your #0-2 folders handle your breadcrumbs.
- Folders #3 & #4 hold your code, documents, and development work.
- Finally, folder #5 handles your outputs.
Even if you aren’t working on a project that requires every one of these folders, it’s still good practice to create them in case you need them later or the project changes over time. And if this seems like a lot, I have a useful library for you Python users: CookieCutterDataScience. This library lets you quickly spin up these folders and more for any project! What’s also great about Cookie Cutter is that you can define your own templates and call them up anytime locally.
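And if you’d rather not pull in a template library at all, a few lines of Python can spin up a similar skeleton. This is only a sketch — the folder names mirror the structure above, and the `scaffold` function name is my own:

```python
from pathlib import Path

# Folder names mirror the project structure described above.
FOLDERS = [
    "0.Logfiles",
    "1.RawData",
    "2.ProcessedData",
    "3.Notebooks",
    "4.ImagesAndDiagrams",
    "5.ModelsAndReports",
]


def scaffold(project_dir: Path) -> None:
    """Create the standard project folders plus a stub README."""
    for name in FOLDERS:
        (project_dir / name).mkdir(parents=True, exist_ok=True)
    readme = project_dir / "README.txt"
    if not readme.exists():  # never clobber an existing README
        readme.write_text("Project: \nCreated: \nOwner: \nDescription: \n")
```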
So what are your thoughts? Is there anything in particular that you do for each data project you work on? Please leave a comment below.