Predictive Workflows in Alteryx

For those of you who were able to join us at the Alteryx NA Central meeting this week (all 150+ of you! WOW!), you got to see my presentation on a predictive workflow within Alteryx using Python, R, and the native Alteryx tools. Given the popularity of this talk, I thought I would give a quick write-up on how it was built and what it is doing. But first, we need to set down some foundations.

To begin, we need to talk about supervised learning and classification. Supervised learning is a machine learning problem in which we have labelled data, meaning each record carries a known outcome (Buy vs Did Not Buy; Category A vs Category B vs Category C; a sales price). Contrast this with unsupervised learning, where the data is not labelled, and we may not even know the proper categories in the data. Finally, since we will be trying to predict a categorical value given some data, we will be putting new data into classes, i.e., classifying new data.
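
If it helps to see the distinction in code, here is a minimal scikit-learn sketch (not part of the workflow itself): a classifier is handed both the features and the labels, while a clustering algorithm only ever sees the features.

# Supervised: the model learns from features (X) AND known labels (y)
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

clf = DecisionTreeClassifier().fit(X, y)     # trained on labelled examples
print(clf.predict(X[:1]))                    # predicts a class label (0, 1, or 2)

# Unsupervised: KMeans never sees y; it invents its own groupings
km = KMeans(n_clusters=3, n_init=10).fit(X)
print(km.labels_[:5])                        # cluster ids with no built-in meaning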

Understanding The Problem

In this workflow, we will break down how to predict different Iris flower types. This is a well-known classification problem: we have measurements on three different types of iris, and given new measurements, we want to predict which type a new flower is: Setosa, Versicolor, or Virginica. We will use data to teach the algorithms what defines each group, and then compare how well each model performs. And again, all of this can be achieved within Alteryx with little-to-no code!

[Workflow overview: four containers labeled Data Input (the Python Tool writing into the workflow), Data Exploration, Model Tournament (Decision Tree, Random Forest, and the other models), and Validate/Score/Write-Out.]

Bringing In the Data

To begin, we will bring down the data using the Python Tool in Alteryx:

[Screenshot of the Python Tool: a notebook-style editor showing the package imports and the code listed below, along with the raw printout of the Iris dataset.]
View of using the Python Tool in Alteryx – very similar to a standard Jupyter Notebook!

Here is the Production code that you can use to copy and paste into your own Python Tool within Alteryx:

#################################
# List all non-standard packages to be imported by your
# script here (only missing packages will be installed)
from ayx import Package
#Package.installPackages(['pandas','numpy'])
 
 
#################################

#Bring in required libraries (Alteryx API is done by default)

from ayx import Alteryx
 
import sklearn
import pandas as pd
import numpy as np
 
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn import tree
#################################
# Bring in the Iris dataset from SciKitLearn
iris = datasets.load_iris()
#################################
# View the data
iris
#################################
# Turn the data into a Dataframe, which we need to write out of the 
# Python Tool. Also, we have to do some transformations to the data
# as it comes in two arrays.

df = pd.DataFrame(np.concatenate((iris.data, np.array([iris.target]).T), axis=1),
                  columns=iris.feature_names + ['target'])
#################################
# Write out of the Python Tool. Also, we have to use "reset_index"
# in order for the data to write out properly. This also gives us
# a RecordID we can use downstream, if needed.

Alteryx.write(df.reset_index(),1)
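
Before moving on, it never hurts to eyeball the DataFrame in the same Python Tool; a couple of quick checks (optional, not in the original workflow) confirm the shape and types:

# Quick sanity checks before writing downstream
print(df.shape)    # (150, 5): 150 rows, 4 features + target
print(df.dtypes)   # every column should be float64
print(df.head())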

Side note: I am choosing to use the Python Tool here as I believe it will help others expand their toolkit. For those who sweat at the thought of coding, you could very easily manually download the files and bring them into Alteryx. That’s the beautiful thing about all of this – there is more than one way to read/manipulate data, and they are all equally powerful!

Exploratory Data Analysis (EDA)

A quick note on EDA: this is probably the most fun, most time-consuming, and most frustrating part of the workflow, depending on your approach. Because these predictive tools are sensitive to the data you feed into them, you will need to correct and treat the data before the model tournament (e.g., correct/remove NULLs, fix data types, etc.). For this problem, you will want to make sure each feature you feed in is a continuous variable (floats or doubles; not strings!). Another hard lesson learned: your column names should not have any special characters, and avoid spaces (these tools are all based on the R language, and special characters can cause nasty problems in the underlying code). We can easily clean up column names and data types using the Select Tool. For NULLs, we can use the Data Cleanse Tool. For missing values, you’ll have to impute (fill in) with your choice of mean, median, or a dummy value. You will find that EDA is a bit of an iterative process: you may have to take an initial look, do some data cleanup, and then explore some more.
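
If you prefer to do this same cleanup inside the Python Tool rather than with the Select and Data Cleanse Tools, a minimal pandas sketch might look like the following (the rename pattern and the median imputation are illustrative choices, not the workflow's exact steps):

# Strip units and spaces out of the column names,
# e.g. "sepal length (cm)" -> "sepal_length"
df.columns = (df.columns
                .str.replace(r"\s*\(cm\)", "", regex=True)
                .str.strip()
                .str.replace(" ", "_"))

# Make sure every feature is a true continuous type, not a string
feature_cols = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
df[feature_cols] = df[feature_cols].astype(float)

# Impute any missing values with the median (one of several valid choices)
df[feature_cols] = df[feature_cols].fillna(df[feature_cols].median())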

[Screenshot of the Select Tool: the measurement fields cast to Float and renamed to sepal_length, sepal_width, petal_length, and petal_width.]
Changing the data types and renaming columns. This process was done right after the Python Tool.
[Screenshot of the Data Exploration container, holding tools for class names, field summaries, and correlations. Pit-fall note from the canvas: leaving the target as a continuous variable works better for interactive charting.]

Because of the size of this data (4 features + 1 target column; 150 rows before creating train/test sets), it is small enough that we could feed it straight into the algorithms with low computational cost (AKA it’ll run fast). However, we should do our due diligence and at least check for some structure in our data.

I used a Field Summary Tool to check the “health” of my data. Most importantly, I wanted to make sure I didn’t have any NULLs that would cause problems downstream. The same check could be done in a Browse Tool, but I felt it was important to keep this in my workflow for future use cases where there could be many more features and different data types. Browse is good for digging into your data; the Field Summary is better for quick insights.

The Correlations Tool is a neat little tool in the predictive toolkit that lets us check different correlations through a heat map, with a corresponding scatter plot for each pair to show the relationship. Clicking through each one, we can get an idea of how the variables interact. Interestingly enough, the fields “petal_length” and “petal_width” are very strongly positively correlated.

[Correlation matrix with scatterplot: petal_length vs petal_width shows a correlation of 0.96. In the tool, blue = -1 and red = +1; hover over pixels in the matrix to see the values, and click one to see the corresponding scatterplot. The variables are clustered by degree of correlation, so highly correlated variables appear adjacent to each other.]
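
If you want to double-check the tool’s numbers in code, pandas will produce the same matrix in one line (assuming the cleaned-up column names from the Select Tool step):

# Pearson correlations across the four features; petal_length and
# petal_width should come back around 0.96, matching the heat map
corr = df[["sepal_length", "sepal_width", "petal_length", "petal_width"]].corr()
print(corr.round(2))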

So seeing a correlation is good, but how does this translate into groups? For instance, are my flowers mixed across the correlation, or are they pretty well clustered? Fortunately, we can use the Interactive Chart Tool to layer the target variable on top and find out! Looking at the picture below, the groups are pretty well defined, with only some light mixing of the latter two (this will come into play later). For now, we have the information we need to move forward.

[Scatter plot titled “Petal Length vs Width”: petal_length vs petal_width, colored by Iris type.]
Checking the clustering of Iris Types
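
For reference, the same chart is only a few lines of matplotlib if you ever want it outside the Interactive Chart Tool (a sketch, assuming the DataFrame built earlier):

import matplotlib.pyplot as plt

# Color each point by its class to see how cleanly the groups separate
plt.scatter(df["petal_length"], df["petal_width"], c=df["target"], cmap="viridis")
plt.xlabel("petal_length")
plt.ylabel("petal_width")
plt.title("Petal Length vs Width")
plt.show()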

Creating Training and Testing Sets

Now that we have a better idea of what is happening, we are ready to create Estimation (training) and Validation (testing) sets. We can use the Create Samples Tool and set each of these to 50% since the data is so small. Just note that as you apply this to new problems, you will typically want a healthy amount of training data (60-80% of the data) and a good amount of testing data (20-40%). These are just rules of thumb and may vary with your business problem. You can also hold out an additional slice of data to help keep your entire process from overfitting. Remember, we want to predict new flowers that are not in our data, not just accurately predict the flowers we already have.
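
For those following along in the Python Tool, the Create Samples Tool’s 50/50 split is one call to scikit-learn (the stratify argument, which keeps the three classes evenly represented in both halves, is my addition):

from sklearn.model_selection import train_test_split

X = df[["sepal_length", "sepal_width", "petal_length", "petal_width"]]
y = df["target"]

# 50% estimation (training) / 50% validation (testing),
# mirroring the Create Samples Tool settings
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=42, stratify=y)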

Model Tournament

[Screenshot of the Model Tournament container: Decision Tree, Random Forest, and the other model tools.]
Each of these models is being fed the “Estimation” data from my Create Samples Tool.

With all the data prep and investigation done, here are the tools I am using – all of them left at their out-of-the-box settings:

  • Decision Tree
  • Random Forest
  • Support Vector Machines
  • Neural Networks
  • Boosted Model

Important note: I did not use Logistic Regression in this case because that Alteryx tool handles only binary classification – basically just yes or no. Since I am trying to predict three classes, and I didn’t want to rig up multiple Logistic Regression Tools and pick the highest probability from each of the outputs, I decided to keep things simple and use tools that can predict multi-class out of the box.
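
As an aside, here is what roughly the same tournament looks like in scikit-learn, reusing the X_train/X_test split from the earlier sketch; note that, unlike the Alteryx Logistic Regression Tool, scikit-learn’s LogisticRegression handles multi-class targets natively. This is a sketch of rough counterparts, not the workflow’s actual configuration:

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

# Rough scikit-learn counterparts of the five Alteryx tools
models = {
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
    "Support Vector Machine": SVC(probability=True),
    "Neural Network": MLPClassifier(max_iter=1000),
    "Boosted Model": GradientBoostingClassifier(),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: {model.score(X_test, y_test):.4f}")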

Just to be transparent, here is an example of what my settings were for the Neural Network Tool:

[Screenshot of the Neural Network Tool configuration: model name “NN Iris”; target field set to “target”; predictor fields such as sepal_width, petal_length, and petal_width checked, with “target” itself deselected; plus options for sampling weights, the number of nodes in the hidden layer, and effect plots.]

For each of these models, you will want to make sure you are not feeding in the “target” variable you are trying to predict. I mean, how easy is it to get the right answer when it’s given in the data?! You can see in the picture above that I have selected what I want to predict (the “target” field) but have deselected that same field from my predictor fields.
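
To drive the point home, here is what that kind of target leakage looks like in code (an illustrative sketch): hand the model its own answer and it will score perfectly while telling you nothing about new flowers.

from sklearn.tree import DecisionTreeClassifier

# Leaking the target into the predictors guarantees a meaningless 1.0
leaky_X = df[["petal_length", "petal_width", "target"]]
clf = DecisionTreeClassifier().fit(leaky_X, df["target"])
print(clf.score(leaky_X, df["target"]))   # 1.0, because the answer was an input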

Each of these tools has an “O” output, which outputs the actual model object you will use downstream. They also output distinct information from their “R” or “I” outputs (the R report output and the Interactive report output, respectively). These can be very helpful for understanding what the model is doing and include some visualizations. Unfortunately, we will have to discuss each model and its visualizations in a different post, as that is a whole other thing to understand.

Outputs and Scoring

For this next part, we have a few different types of outputs. Looking at the picture below, we will work from the bottom up:

[Screenshot of the Validate/Score/Write-Out container.]

Model Comparison

[Model Comparison Report: a fit-and-error-measures table (overall Accuracy, F1, and per-class accuracy for setosa, versicolor, and virginica) covering the six models, followed by a confusion matrix of actual vs predicted classes for each of DT Iris, Iris BoostedM, NN Iris, RF Iris, and SVM Iris. Every model scored 1.0000 on setosa, and NN Iris posted the highest overall accuracy at 0.9867.]

The report defines its measures as follows:

  • Model: model names in the current comparison.
  • Accuracy: overall accuracy, the number of correct predictions across all classes divided by the total sample count.
  • Accuracy_[class name]: the number of cases correctly predicted to be Class [class name] divided by the total number of cases that actually belong to Class [class name]; this measure is also known as recall.
  • AUC: area under the ROC curve, only available for two-class classification.
  • F1: the F1 score, 2 * precision * recall / (precision + recall). Precision is the number of cases correctly predicted to be in a class divided by the total number of cases predicted to be in that class. In situations where there are three or more classes, average precision and average recall values across classes are used to calculate the F1 score.

Now that we have each of the models/algorithms, we can feed them into the Model Comparison Tool (available from the Alteryx Public Gallery) along with our test data to see how each performed! To begin, we use a Union Tool to bring all of the “O” outputs together as individual rows, and then feed that into the “M” input of the Model Comparison Tool. For data, we use the validation output from our Create Samples Tool. Since the validation data is labelled, we can see each model’s overall accuracy!

In the report above, we can see the overall Accuracy and F1 score of each model. We want the highest possible value in each of these categories. For those of you new to this, F1 balances Precision and Recall (the actual calculation is listed in the report) and gives us a better understanding of how well the model truly predicted each class. Accuracy is not everything!

(Case in point: I could have high accuracy but bad precision or recall. For example, if I were trying to predict cancer among 10,000 people, I could predict “Negative” for almost everyone and still have really high accuracy. However, the recall (actually predicting Positive when someone truly had cancer) would be awful.)
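
You can reproduce that effect with a few lines of scikit-learn (illustrative numbers): predict “Negative” for all 10,000 people when only 100 actually have cancer, and accuracy looks great while recall collapses to zero.

import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 10,000 people, 100 of whom actually have cancer (1 = positive)
y_true = np.array([1] * 100 + [0] * 9900)
y_pred = np.zeros(10000, dtype=int)   # lazily call everyone "Negative"

print(accuracy_score(y_true, y_pred))   # 0.99, which looks impressive
print(recall_score(y_true, y_pred))     # 0.0, missing every real case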

Finally, we can look at the confusion matrix for each model. This gives us the actual vs predicted class for each record. For the most part, every model classified Setosa perfectly. However, each model performed slightly differently on Virginica and Versicolor. Looking back at our EDA scatter plot, we can see that these were the two types of Iris that mixed together slightly. There isn’t a clear line we could draw between them, so it makes sense that the algorithms would have trouble as well!

Scoring

[Score Tool output: the original measurement columns with three appended probability columns, X_setosa, X_versicolor, and X_virginica. In the first rows, X_setosa sits around 0.97 while X_versicolor is near 0.02 and X_virginica is below 0.01.]

Now that we have compared models, we can see the Neural Network was our best performer. But how will it actually predict? The Model Comparison Tool can produce a prediction output, but it doesn’t append onto our data. Instead, we can use the Score Tool! Here, we input our new data (in this case, the validation data) and the model, and we get appended columns of probabilities for each class. So in the first row and first appended column, we see a 97% probability that that flower is an Iris Setosa. Next, we can translate this into an actual prediction using a Formula Tool with a series of If statements:

// Predict
If Max([X_setosa], [X_versicolor], [X_virginica]) = [X_setosa] then "setosa"
elseif Max([X_setosa], [X_versicolor], [X_virginica]) = [X_versicolor] then "versicolor"
else "virginica"
endif
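
If you ever move this step into the Python Tool, the same logic is a one-liner with idxmax (a sketch, assuming the Score Tool output lives in a DataFrame called scored):

# Pick the class whose probability column is highest for each row
prob_cols = ["X_setosa", "X_versicolor", "X_virginica"]
scored["Predict"] = (scored[prob_cols]
                     .idxmax(axis=1)
                     .str.replace("X_", "", regex=False))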

We can then use a Select Tool to drop the [X_…] columns. Since our testing data includes the target, we can verify that the prediction matches it to get an overall accuracy on new records (this was already done in the Model Comparison Tool, but hand-checking small sets of data is never a bad idea). For truly new records, you would have the first 4 columns (the measurements) without a target, leaving just the Predict column to use for your analysis.

Saving/Deploying the model

Now that we have built a model and can interpret its output, we can make it available to other workflows. One simple way is to write it out to a shared location as a .yxdb file. Another is to wrap the whole process in a macro: take in data, transform it, model it, choose the best model through sorting/filtering, score the data, and output it back into a workflow. Lastly, and much more structured for developers and business users alike, you can use Alteryx Promote to push your model to a server, and the business can pull it into their workflows through the Score Tool via API calls.
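
Outside of Alteryx, the equivalent of the shared .yxdb pattern is pickling the model object; here is a minimal joblib sketch (the path and variable names are illustrative):

import joblib

# Persist the winning model so other scripts/workflows can reuse it
joblib.dump(best_model, "shared/models/iris_nn.pkl")

# ...and later, in the consuming workflow:
model = joblib.load("shared/models/iris_nn.pkl")
predictions = model.predict(X_new)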

Next Steps

So now that we’ve seen how to build and deploy a simple model, you can let your imagination run wild for solving business problems. However, there are still a couple of important things to think about as you are building:

  • Maintenance – How often will you need to re-run and maintain this model? You can imagine that over time, incoming data may drift until your model is not as robust as it was before.
  • Bias – Is your model perpetuating a negative feedback loop? For instance, you might have data on reported crime by zone and predict which zones are the worst, but you are not getting data from places where crime goes unreported. The model then only “finds” crime where it already predicts it, because it is biased toward the data it has received. This is especially important to think about when your model will impact people’s lives.
  • Business Justifications for Accuracy – Much like my bias spiel above, do you truly understand the implications of your model and the accuracy you actually need? Some models may only need 60% accuracy (like sending mailers to customers), while others need high accuracy and recall (like predicting cancer). It’s important to understand the full business problem and context while developing.

I hope you found this post informative and encouraging! If there is anything you feel like I left off, needs clarification, or you just straight-up have questions, please feel free to reach out to me or leave a comment below! Thanks to everyone again, and I cannot wait to see you at the next meet-up!
