For those of you who were able to join us at the Alteryx NA Central meeting this week (all 150+ of you! WOW!), you saw my presentation on a predictive workflow within Alteryx using Python, R, and the native Alteryx tools. Given the popularity of this talk, I thought I would give a quick write-up on how this was built and what it is doing. But first, we need to set down some foundations.
To begin, we need to talk about supervised learning and classification. Supervised learning is a machine learning problem in which we have labelled data, meaning we already know the correct answer for each record based on a feature in our data (Buy vs. Did Not Buy; Category A vs. Category B vs. Category C; Sales Price). Contrast this with unsupervised learning, where the data is not labelled, and we may not even know the proper categories. Finally, we will be trying to predict a categorical value given some data, which means we will be putting new data into classes, or classifying new data.
Understanding The Problem
In this workflow, we will break down how to predict different Iris flower types. This is a well-known classification problem: we have measurements on three different types of Iris, and given new measurements, we want to predict which type a new flower is: Setosa, Versicolor, or Virginica. We will be using data to teach the algorithms what makes each group, and then compare how well each model performs. And again, all of this can be achieved within Alteryx with little-to-no code!
Bringing In the Data
To begin, we will bring down the data using the Python Tool in Alteryx:
Here is the Production code that you can use to copy and paste into your own Python Tool within Alteryx:
#################################
# List all non-standard packages to be imported by your
# script here (only missing packages will be installed)
from ayx import Package
#Package.installPackages(['pandas','numpy'])

#################################
# Bring in required libraries (Alteryx API is done by default)
from ayx import Alteryx
import sklearn
import pandas as pd
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn import tree

#################################
# Bring in the Iris dataset from SciKitLearn
iris = datasets.load_iris()

#################################
# View the data
iris

#################################
# Turn the data into a DataFrame, which we need to write out of the
# Python Tool. Also, we have to do some transformations to the data
# as it comes in two arrays.
df = pd.DataFrame(np.concatenate((iris.data, np.array([iris.target]).T), axis=1),
                  columns=iris.feature_names + ['target'])

#################################
# Write out of the Python Tool. We have to use "reset_index"
# in order for the data to write out properly. This also gives us
# a RecordID we can use downstream, if needed.
Alteryx.write(df.reset_index(), 1)
Side note: I am choosing to use the Python Tool here as I believe it will help others expand their toolkit. For those who sweat at the thought of coding, you could very easily manually download the files and bring them into Alteryx. That’s the beautiful thing about all of this – there is more than one way to read/manipulate data, and they are all equally powerful!
Exploratory Data Analysis (EDA)
A quick note on EDA: this is probably the most fun, most time-consuming, and most frustrating part of the workflow, depending on your approach. Because these predictive tools are sensitive to the data you feed into them, you will need to make sure you are correcting and treating the data before the model tournament (e.g. correct/remove NULLs, correct data types, etc.). As an example for this problem, you will want to make sure each of the features you are feeding in is a continuous variable (floats or doubles; not strings!). Another hard lesson learned: your column names should not contain special characters or spaces. (These tools are all based on the R language, and special characters can cause nasty problems in the underlying code.) We can easily clean up column names and data types using the Select Tool. For NULLs, we can use the Data Cleanse Tool. For missing values, you'll have to impute (fill in) with your choice of the mean, median, or a dummy value. You will find that EDA is a bit of an iterative process, as you may have to take an initial look, do some data cleanup, and then explore some more.
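If you prefer to do this cleanup in the Python Tool instead, here is a minimal pandas sketch of the same steps. The messy sample data (and the resulting column names) are made up purely for illustration:

```python
import pandas as pd
import numpy as np

# Hypothetical messy input: column names with spaces and special characters,
# a NULL, and a numeric column that was read in as strings.
df = pd.DataFrame({
    "sepal length (cm)": ["5.1", "4.9", None, "4.6"],
    "petal width (cm)": [0.2, np.nan, 0.2, 0.2],
})

# Clean column names: strip special characters, replace spaces with underscores
df.columns = (df.columns
              .str.replace(r"[^\w\s]", "", regex=True)
              .str.strip()
              .str.replace(r"\s+", "_", regex=True))

# Fix data types: strings -> floats (None becomes NaN)
df["sepal_length_cm"] = pd.to_numeric(df["sepal_length_cm"])

# Impute missing values with the column median
df = df.fillna(df.median())
```

The same three ideas (rename, retype, impute) map directly onto the Select, Data Cleanse, and Imputation tools in Alteryx.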
Because of the size of this data (4 features + 1 target column; 150 rows before creating train/test sets), this is small enough to where we could feed it into the algorithms with low computational cost (AKA it’ll run fast). However, we should do our due diligence and at least check for some structure in our data.
I used a Fields Analysis Tool to check the “health” of my data. Most importantly, I wanted to make sure I didn’t have any NULLs that would cause problems downstream. This same check could also be done in a Browse Tool, but I felt it was important to keep this in my workflow for future use cases where there could be many more features and different data types. Browse is good for digging into your data; Fields Analysis is better for quick insights.
The Correlations Tool is a neat little tool in the predictive toolkit that lets us check different correlations by looking at a heat map, along with a corresponding scatter plot for each pair to see the relationship. Clicking through each one, we can get an idea of how the variables interact. Interestingly enough, the fields “petal_length” and “petal_width” are very positively correlated.
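Under the hood, this is a Pearson correlation matrix, and you can reproduce the same numbers with a couple of lines of pandas (the `as_frame=True` option assumes a reasonably recent scikit-learn):

```python
import pandas as pd
from sklearn import datasets

# Load the Iris data directly as a DataFrame (features + 'target' column)
iris = datasets.load_iris(as_frame=True)
df = iris.frame

# Pearson correlation matrix of the four measurement columns
corr = df.drop(columns="target").corr()
print(corr.round(2))

# Petal length vs. petal width: strongly positively correlated
print(corr.loc["petal length (cm)", "petal width (cm)"])
```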
So seeing a correlation is good, but how does this translate into groups? For instance, are my flowers mixed across the correlation, or are they pretty well clustered? Fortunately, we can use the Interactive Charting Tool to layer on the target variable to find out! Looking at the picture below, the groups are pretty well-defined, with only some light mixing of the latter two groups (this will come into play later). For now, we have the information we need to move forward.
Creating Training and Testing Sets
Now that we have a better idea of what is happening, we are ready to create Estimation (training) and Validation (testing) sets. We can use the Create Samples Tool and set each of these to 50% since this data is so small. Just note that as you apply this to new problems, you will always want to have a healthy amount of training data (typically 60-80% of the data) and a good amount of testing data (20-40%). These are just rules of thumb and may vary with your business problem. You can also have a “hold-out” amount of data to help keep your entire process from overfitting. Remember, we want to predict new flowers that are not in our data, not just accurately predict flowers within our current data.
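The Create Samples Tool's 50/50 split corresponds to scikit-learn's `train_test_split`. This sketch also adds stratification, so each Iris type is evenly represented in both halves:

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X, y = iris.data, iris.target

# 50/50 estimation/validation split, mirroring the Create Samples settings.
# stratify=y keeps the three classes balanced across both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=42, stratify=y)

print(X_train.shape, X_test.shape)  # 75 rows in each half
```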
With all the data prep and investigation done, here are the tools I am using – all of which are set up as Out of the Box:
- Decision tree
- Random Forest
- Support Vector Machines
- Neural Networks
- Boosted Model
Important note: I did not use Logistic Regression in this case because that Alteryx tool handles only binary classification – basically just yes or no. Since I am trying to predict 3 classes, and I didn’t want to figure out how to trick out multiple Logistic Regression Tools and pick the highest probability from each of the outputs, I decided to keep things simple and use tools that can predict multi-class.
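As an aside, if you are working in code rather than the Alteryx tools, scikit-learn's `LogisticRegression` handles multi-class out of the box – it effectively does the "one model per class, pick the highest probability" trick for you:

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

iris = datasets.load_iris()

# A single LogisticRegression fits all three classes directly
clf = LogisticRegression(max_iter=1000).fit(iris.data, iris.target)

# One probability column per class, summing to 1 for each row
proba = clf.predict_proba(iris.data[:1])
print(proba.shape)  # (1, 3)
```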
Just to be transparent, here is an example of what my settings were for the Neural Network Tool:
For each of these models, you will want to make sure that you are not feeding in the “target” variable you are trying to predict. I mean, how easy is it to get the right answer when it’s given in the data?! You can see in the above picture, I have selected what I wanted to predict (the “target” field), but I have deselected that field from my predictor fields.
Each of these tools has an “O” output, which outputs the actual model that you will use downstream. They also output distinct information from their “R” or “I” outputs (the R report output and Interactive report output, respectively). These can be very helpful for understanding what the model is doing and also include some visualizations. Unfortunately, we will have to discuss each model and its visualizations in a different post, as that is a whole other thing to understand.
Outputs and Scoring
For this next part, we have different types of outputs. Looking at the picture below, we will work from bottom up:
Now that we have each of the models/algorithms, we can feed them into the Model Comparison Tool (available from the Alteryx Public Gallery) against our test data to see how each performed! To begin, we use a Union Tool to bring all of the “O” outputs together as individual rows, and then feed that into the “M” input of the Model Comparison Tool. For data, we will use our validation data from our Create Samples Tool. Since the validation data has labels on the data, we can see each model’s overall accuracy!
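The same model tournament can be sketched in scikit-learn: five multi-class models trained on the estimation set and scored on the validation set. The settings here are library defaults, not the exact Alteryx tool configurations:

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.5, random_state=42,
    stratify=iris.target)

# Rough analogues of the five Alteryx tools; all handle multi-class natively
models = {
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(random_state=0),
    "svm": SVC(random_state=0),
    "neural_net": MLPClassifier(max_iter=2000, random_state=0),
    "boosted": GradientBoostingClassifier(random_state=0),
}

# Fit each model on the estimation set, score on the validation set
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # overall accuracy

for name, acc in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {acc:.3f}")
```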
In the report above, we can see the overall Accuracy and F1 score of each model. We want the highest possible value for each of these categories. For those of you new to this, F1 takes a balance of Precision and Recall (actual calculation listed in the report) and gives us a better understanding of truly how well the model predicted classes. Accuracy is not everything!
(Case in point, I could have high accuracy but bad precision or recall. As an example, if I was trying to predict cancer out of 10,000 people, I could predict “Negative” for almost everyone and have really high accuracy. However, the recall (actually predicting Positive when they truly had cancer) would be awful.)
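That cancer-screening thought experiment is easy to verify in code – here is a hypothetical "always Negative" model run against 10,000 people, 100 of whom actually have cancer:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score, f1_score

# Made-up screening labels: 10,000 people, 100 true positives (label 1)
y_true = np.zeros(10_000, dtype=int)
y_true[:100] = 1

# A lazy "model" that predicts Negative for everyone
y_pred = np.zeros(10_000, dtype=int)

print(accuracy_score(y_true, y_pred))                 # 0.99 -- looks great!
print(recall_score(y_true, y_pred, zero_division=0))  # 0.0 -- misses every case
print(f1_score(y_true, y_pred, zero_division=0))      # 0.0
```

High accuracy, zero recall, zero F1 – exactly why accuracy is not everything.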
Finally, we can look at the confusion matrix for each model. This gives us the actual vs. predicted classes on each record. For the most part, every model performed perfectly when classifying Setosa. However, each model performed slightly differently on Virginica and Versicolor. Looking at our EDA scatter plot above, we can see that these were the two types of Iris that mixed together slightly. There isn’t a clear line we could draw between them, so it makes sense that the algorithms would have trouble as well!
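If you want to reproduce a confusion matrix outside the Model Comparison Tool, scikit-learn's `confusion_matrix` gives the same actual-vs-predicted table (a decision tree is used here as a stand-in for any of the five models):

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.5, random_state=42,
    stratify=iris.target)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Rows = actual class, columns = predicted class, in the order
# [setosa, versicolor, virginica]; off-diagonal counts are misclassifications
cm = confusion_matrix(y_test, model.predict(X_test))
print(cm)
```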
Now that we have compared models, we can see the Neural Network was our best-performing model. But how do we actually use it to predict? We can use the Model Comparison Tool to get a prediction output, but this doesn’t append on to our data. Instead, we can use the Score Tool! Here, we input our new data (in this case, the validation data) and the model, and we will get appended columns of probabilities for each of the classes. So in the first row and the first appended column, we see a 97% probability that that flower is an Iris Setosa. Next, we can translate this into an actual prediction by using a Formula Tool with a series of If statements:
# Predict
if max([X_setosa], [X_versicolor], [X_virginica]) == [X_setosa] then "setosa"
elseif max([X_setosa], [X_versicolor], [X_virginica]) == [X_versicolor] then "versicolor"
else "virginica"
endif
We can then use a Select Tool to drop the [X_…] columns. As our testing data has the target, we can verify that our prediction matches the target to get an overall accuracy on new records (this was already done in the Model Comparison Tool, but hand-checking on small sets of data is never a bad idea). For new records, you can imagine that you would have the first 4 columns (the measurements) without the target, so you would just be left with the Predict column to use for your analysis.
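The Formula-plus-Select pattern above is just an argmax over the probability columns. Here is the same logic in pandas, with made-up probability values standing in for the Score Tool's output:

```python
import pandas as pd

# Hypothetical scored output: one probability column per class,
# like the Score Tool's appended [X_...] fields
scored = pd.DataFrame({
    "X_setosa":     [0.97, 0.01, 0.10],
    "X_versicolor": [0.02, 0.80, 0.30],
    "X_virginica":  [0.01, 0.19, 0.60],
})

# Same logic as the Formula Tool: take the class with the highest probability
prob_cols = ["X_setosa", "X_versicolor", "X_virginica"]
scored["Predict"] = (scored[prob_cols]
                     .idxmax(axis=1)
                     .str.replace("X_", "", regex=False))

# Then drop the probability columns, like the Select Tool
result = scored.drop(columns=prob_cols)
print(result)
```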
Saving/Deploying the model
Now that we have made a model and we can interpret its output, we can make this model available for other workflows. One simple way to do this is to write it out to a shared location as a .yxdb file. Another way is to integrate this whole process into a macro by taking in data, transforming it, modeling it, choosing the best model through sorting/filtering, scoring your data, and then outputting it back into a workflow. Lastly – and this is much more structured for developers and business users alike – you can use Alteryx Promote to push your model to a server, and the business can pull it down into their workflows using the Score Tool via API calls.
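For the code-based route, the Python equivalent of writing the model to a shared location is serializing it with pickle. The file name and location here are just an example, and the model settings are illustrative defaults:

```python
import os
import pickle
import tempfile

from sklearn import datasets
from sklearn.neural_network import MLPClassifier

iris = datasets.load_iris()

# Train the winning model (settings are illustrative, not tuned)
model = MLPClassifier(max_iter=2000, random_state=0).fit(iris.data, iris.target)

# Save the trained model to a shared location (a temp dir for this example)
path = os.path.join(tempfile.gettempdir(), "iris_model.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)

# ...then load it later in another workflow and score new data
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored.predict(iris.data[:1]))
```

One caveat worth knowing: only unpickle files from locations you trust, since loading a pickle can execute arbitrary code.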
So now that we’ve seen how to build and deploy a simple model, you can let your imagination run wild for solving business problems. However, there are still a couple of important things to think about as you are building:
- Maintenance – How often will you need to re-run and maintain this model? You can imagine that over time, you may get more and more data that causes your model to not be as robust as it was before.
- Bias – Is your model perpetuating a negative feedback loop? For instance, you might have data on something like crime zones, and you can predict which ones are the worst, but you are not getting data where crime is not being reported. This would imply that crime will only happen where you predict it and that your model is biased towards the data it has received. This is also very important to think about when your model will impact people’s lives.
- Business Justifications for Accuracy – Much like my bias spiel above, do you truly understand the implications of your model and what you will need? Some models may only need 60% accuracy (like sending mailers to customers) while others may need high accuracy and recall (like predicting cancer). It’s important to understand the full business problem and context while developing.
I hope you found this post informative and encouraging! If there is anything you feel like I left off, needs clarification, or you just straight-up have questions, please feel free to reach out to me or leave a comment below! Thanks to everyone again, and I cannot wait to see you at the next meet-up!