A lot of analysts use R or Python to sift through massive amounts of data, and they are uhhh-mazing tools with more packages than you can count. However, it can be difficult for those of us who, like me, do not have a programming background to pull insights out without knowing the right packages and commands. So today I thought I would show off some of the capabilities of Alteryx to quickly pull insights out of a large data set using some beginner and intermediate techniques.
The data set I am using comes from Data.gov – a free website that contains tons of large data sets. Specifically, it is the Consumer Complaint Database (approx. 561MB, or 1.1M records, found here; you can also view the Data Dictionary here). I went ahead and downloaded the data as a .csv for today, but there are other tools that can pull directly from the site to help us out.
So first, let’s bring our data into Alteryx…
Input Tool, Preview Pane, & Results Pane
Just by dragging and dropping a .csv into my workflow (or I can point to a file location), Alteryx will automatically recognize the .csv and even give me a preview of 100 records so I can see what I am dealing with. I can also run the workflow to see the Results pane, which gives me a heads up on how many NULLs or missing values I have, among other things. It’s also important to note that by default, Alteryx will pull .csv’s in as V_Strings with a length of 254 characters on every field. This may not mean much, but for a field like Product, where my longest value is 76 characters, I am reserving 178 characters per record for nothing. And while that isn’t too huge on its own, when it happens across all 18 fields and the data is then run through several tools, it can take a toll. So, now I’ll introduce my next favorite tool, the Auto Field Tool!
Auto Field, Select, and Data Cleanse
As you can see in the images above, I was able to convert a lot of those fields to more appropriate types as well as reduce their lengths from 254 down to as low as 3! Again, this will make a huge difference as I run more complex tools.
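For the curious, the check the Auto Field tool performs is conceptually simple: scan every value and size each field to its longest occupant. A minimal sketch of that idea in plain Python (the sample rows here are hypothetical stand-ins for the real 1.1M-record file):

```python
import csv
import io

# Hypothetical sample standing in for the downloaded Consumer Complaint .csv
sample = io.StringIO(
    "Product,State,Company\n"
    "Mortgage,TX,JPMORGAN CHASE & CO.\n"
    "Credit card,CA,JPMORGAN CHASE & CO.\n"
)

reader = csv.DictReader(sample)
max_len = {field: 0 for field in reader.fieldnames}

# One pass over the rows, tracking the widest value seen per column
for row in reader:
    for field, value in row.items():
        max_len[field] = max(max_len[field], len(value))

print(max_len)  # e.g. Product only ever needs 11 characters here, not 254
```

The same one-pass idea scales to the full file since nothing but the running maximums is held in memory.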
When I start to Browse some of this data, I notice that dates still aren’t being recognized and that I still have a lot of NULL values. In this case, I can change the Date Received field to a date type by using the DateTime tool. For the NULL values, I will use the Data Cleansing Tool.
My Date Received field is now a true date type (important for time-series analyses and date-based calculations), and I can see that my fields are mostly “green” with a couple of “gray” values (blank values). This all ran in just under 37 seconds on my machine – pretty impressive for over a million records!
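What the DateTime and Data Cleansing tools do here maps to familiar parsing logic. A rough sketch, assuming the file stores dates as MM/DD/YYYY strings (the exact NULL spellings below are assumptions for illustration):

```python
from datetime import datetime

def parse_date(value):
    """Convert an MM/DD/YYYY string to a date, or None if blank/invalid."""
    try:
        return datetime.strptime(value, "%m/%d/%Y").date()
    except (ValueError, TypeError):
        return None

def cleanse(value):
    """Mimic the Data Cleansing tool: collapse NULL-ish values to ''."""
    if value is None or value.strip().upper() in ("", "N/A", "NULL"):
        return ""
    return value.strip()

print(parse_date("12/01/2011"))  # 2011-12-01
print(cleanse("  N/A "))         # '' (treated as a blank)
```

Anything that fails to parse comes back as None instead of crashing the run, which is roughly how the GUI tools leave unconvertible values flagged rather than halting the workflow.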
Exploring the Data
Okay, so what? I have cleansed my data, what now? Now comes the fun of exploration! As you can see on my workflows above, I put a Browse tool at the end of each workflow. This allows me to see the metadata as well as some quick visualizations around the data.
For instance, I wonder which product is mentioned the most in this data set? Well, the Browse tool answers that!
Not only can I see that “Mortgage” is the most frequently complained about product, but I can also see the frequency of the other products as well as the metadata on this entire field.
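Under the hood, that frequency panel is just a tally. The equivalent in plain Python is a one-liner with `collections.Counter` (hypothetical in-memory rows standing in for the real file):

```python
from collections import Counter

# Hypothetical product values; the real file has ~1.1M of these
products = ["Mortgage", "Debt collection", "Mortgage",
            "Credit reporting", "Mortgage"]

counts = Counter(products)
print(counts.most_common(1))  # [('Mortgage', 3)]
```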
What if I wanted to know how far this data stretches? Again, the Browse tool can help us answer that!
Looking at the min and max, we can see this data stretches from December 2011 all the way to June 2018!
Pulling out Insights
Okay, so I’ve spent some time getting, manipulating, and exploring the data. It is now time to pull out some insights and report on them. Before we can pull out insights, we have to have concrete questions we are trying to answer! While it is neat to use these tools, it’s important to note that you can slim down your tool selection, workflow, and even field list if you have something very specific you are trying to answer. For instance, if I wanted to know whether companies were giving timely responses to mortgage complaints, I could Filter out products that aren’t mortgages and also get rid of over half of the fields. On the other hand, if I was trying to predict whether a certain state’s mortgage complaints would go unanswered in a timely manner this upcoming July, I would need quite a few fields to answer that question. Again, the basis for any good analysis always starts with a good, concrete question!
To keep things simple, let’s answer one of the previous questions with a slight twist: How is my customer service with FHA mortgage customers for J.P. Morgan? Maybe I am part of J.P. Morgan’s customer service team and I want to make sure that my customer service isn’t lacking.
One level I will look at is the number of disputes by year. If people aren’t disputing my responses, I have most likely resolved the issue they were having. So after Filtering my data to JPMORGAN > FHA > Disputed != (“N/A”, “No”), I can then summarize the number of Consumer Disputed = “Yes” records by year. After I get this number, I can use the Charting tool to create a nice little visual that I can then pass along to my customer service manager to show how this indicator has been trending:
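If you wanted to mirror that Filter-then-Summarize sequence in code, it is a filtered group-by. A minimal sketch with hypothetical rows (the field names below follow the spirit of the data dictionary, but their exact spellings here are assumptions):

```python
from collections import Counter

# Hypothetical records; fields loosely mirror the complaint database
rows = [
    {"Company": "JPMORGAN CHASE & CO.", "Sub-product": "FHA mortgage",
     "Consumer disputed?": "Yes", "Date received": "03/15/2014"},
    {"Company": "JPMORGAN CHASE & CO.", "Sub-product": "FHA mortgage",
     "Consumer disputed?": "No",  "Date received": "06/02/2014"},
    {"Company": "JPMORGAN CHASE & CO.", "Sub-product": "FHA mortgage",
     "Consumer disputed?": "Yes", "Date received": "01/20/2015"},
]

# Filter to J.P. Morgan FHA disputes, then count by year
disputes_by_year = Counter(
    row["Date received"][-4:]  # year is the last 4 chars of MM/DD/YYYY
    for row in rows
    if row["Company"] == "JPMORGAN CHASE & CO."
    and row["Sub-product"] == "FHA mortgage"
    and row["Consumer disputed?"] == "Yes"
)
print(dict(disputes_by_year))  # {'2014': 1, '2015': 1}
```

The chained conditions in the generator play the role of the stacked Filter tools, and the `Counter` keyed on year is the Summarize step.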
*Field conversion errors were due to the free-form text field for the consumer complaint. Alteryx had issues cutting this field off when the character length was greater than 254. This is avoidable, but it was unnecessary to change anything as it did not impact my analysis.
With my workflow, I created logical containers for each group of tools. The first container is my data cleanse to get the data ready. The second container filters down to the level of detail that I am interested in. Here, I used multiple Filter tools to show my process, although I could have combined all four rules into one Filter tool. Notice, however, that there is a “T” and an “F” – if I wanted to use the False output of a Filter tool for another analysis, I could easily do so! Lastly, I take that filtered data, summarize the distinct count of complaint IDs by year, create a chart, and use the Browse tool to view my handiwork.
After all that, it looks like we had a steady rise of disputes until the tipping point in 2014, and it has declined since! Perhaps there was a big initiative with customer service when we noticed we were trending higher in disputes? Or perhaps the number of FHA mortgages decreased after a given time, meaning there wasn’t as big a pool of FHA mortgages to complain about. This goes back to the other side of being an analyst – it is not enough to be able to get and munge data; you have to understand your data and your business to pull out truly great insights! So here is what happens when we find out how many disputed resolutions contributed to the total number of complaints:
We can see that although the total number of disputes goes down, it stays fairly constant relative to the overall number of complaints. (Also, 2011 and 2017 do not have as many data points as the other years, which makes their numbers misleading.) Therefore, I have held the status quo throughout the years, and my resolutions aren’t making my customers feel any better. Lastly, I can make the case that about 20% of all complaints going forward will be disputed! Time to work on my negotiation skills!
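That roughly-20% figure is just the ratio of disputed complaints to total complaints per year. Sketched with made-up yearly counts (the real values come from the summarize step above):

```python
# Hypothetical yearly totals for illustration only
complaints = {"2013": 9500, "2014": 11000, "2015": 10200}
disputes   = {"2013": 1900, "2014": 2300,  "2015": 2000}

# Dispute rate = disputed complaints / total complaints, per year
dispute_rate = {year: disputes[year] / complaints[year] for year in complaints}

for year, rate in sorted(dispute_rate.items()):
    print(f"{year}: {rate:.1%}")
```

If the rate hovers around the same value year over year even while raw dispute counts fall, that is exactly the "status quo" signal described above.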
I hope everyone finds this helpful! If there is anything you feel I glazed over, please feel free to drop me a line!