Dataset: SMS Spam Collection Data Set.
Dataset size and schema: 5,574 rows, 2 string columns.
Dataset description: A single file containing short texts along with correct binary categorization (spam or ham). Strongly biased toward the ham class (~87%).
Business purpose: Reducing the amount of text message spam in cellular networks.
Dataset credits: Lichman, M. (2013). UCI Machine Learning Repository http://archive.ics.uci.edu/ml. Irvine, CA: University of California, School of Information and Computer Science.
The data is provided in the form of a 2-column, tab-separated CSV file. For convenience, we add a header row at the top of the file with the column names, class and text, separated by a Tab. Using the Read Data Frame operation with a Data Source in CSV format and Tab selected as the column separator, we can pull the data into Seahorse. Let's take a look at a sample by clicking the report icon that appears after running this single-node workflow.
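Seahorse runs on Apache Spark, so the same step can be sketched in plain PySpark. The file name below is illustrative; only the tab separator and the header row match the setup described above.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sms-spam").getOrCreate()

# Read the tab-separated file; the header row carries the column names "class" and "text".
data = spark.read.csv("SMSSpamCollection.csv", sep="\t", header=True, inferSchema=True)

# Quick look at the class distribution, similar to what the column report shows.
data.groupBy("class").count().show()
```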
We can see the distribution of the class column by clicking the graph icon in its header. This tells us that there are far more "good" messages than spam, which leads to the conclusion that the trivial classifier that always returns "not spam" can serve as a baseline against which we can later compare our model.
Now we need to do some preprocessing before this data is usable in any classifier.
We will use the String Indexer and Tokenize with Regex operations to transform the data into a more usable form. The String Indexer replaces the labels ham and spam with the values 0 and 1, and the Tokenize with Regex splits the messages into separate words. We use the [a-zA-Z]+ regex pattern for the tokenizer and set the gaps parameter to false to indicate that the pattern describes the actual tokens, not the gaps between them.
Afterwards, we use a Remove Stop Words operation to filter out the most common stop words.
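These operations map directly onto Spark ML feature transformers. A rough PySpark equivalent, with illustrative column names (label, words, filtered_words), could look like this:

```python
from pyspark.ml.feature import StringIndexer, RegexTokenizer, StopWordsRemover

# Map the "ham"/"spam" labels to numeric values; the most frequent label (ham) becomes 0.
indexer = StringIndexer(inputCol="class", outputCol="label")

# gaps=False means the pattern matches the tokens themselves, not the separators between them.
tokenizer = RegexTokenizer(inputCol="text", outputCol="words",
                           pattern="[a-zA-Z]+", gaps=False)

# Drop the most common English stop words.
remover = StopWordsRemover(inputCol="words", outputCol="filtered_words")

indexed = indexer.fit(data).transform(data)
tokenized = tokenizer.transform(indexed)
preprocessed = remover.transform(tokenized)
```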
The preprocessing part of our workflow looks like this:
The next step is to Split our data into two parts - the data that we'll use to train our model and the data on which our model will be tested. We use a 0.7 split ratio, which means that 70% of the data will be used for training.
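In PySpark terms this is a single call to randomSplit; the seed is an illustrative addition that just makes the split reproducible:

```python
# 70% of the rows go to the training set, 30% to the test set.
train, test = preprocessed.randomSplit([0.7, 0.3], seed=42)
```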
The first thing to do here is to change the representation of the text column. Instead of an array of words, we'd like a numeric vector in which each column indicates whether (and how many times) a specific word appears in the corresponding message. That's exactly what a Count Vectorizer is for: it builds a dictionary of words from the entire dataset (each word is mapped to a number) and, for each row, creates a vector in which column N holds the number of occurrences of word N in the message. Aside from setting the name of the output column, let's leave its parameters at their defaults. In particular, Min Different Documents should be 1. We apply the Count Vectorizer operation after splitting the dataset in order to avoid knowledge of the test dataset leaking into our model.
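The corresponding PySpark sketch fits the vectorizer on the training split only and keeps the fitted transformer around for the test split later; minDF=1.0 mirrors leaving Min Different Documents at 1:

```python
from pyspark.ml.feature import CountVectorizer

# Build the dictionary from the training data only, so nothing leaks from the test set.
vectorizer = CountVectorizer(inputCol="filtered_words", outputCol="features", minDF=1.0)
vectorizer_model = vectorizer.fit(train)

train_features = vectorizer_model.transform(train)
```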
Now we can do the actual training. We use a Fit operation to train a Logistic Regression classifier on our data.
That's it - the output of Fit is our trained model. We can apply it to the test part of the dataset and find out how accurate it is.
Since we split our data only after the preprocessing, all we need to do to test our model is transform the test set with the transformer created by the Count Vectorizer operation and then with the Logistic Regression model.
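In PySpark this amounts to fitting a LogisticRegression estimator on the vectorized training data and pushing the test split through the same two transformers; the column names follow the earlier sketches:

```python
from pyspark.ml.classification import LogisticRegression

# Train the classifier on the vectorized training data.
lr = LogisticRegression(featuresCol="features", labelCol="label")
lr_model = lr.fit(train_features)

# Apply the same Count Vectorizer transformer and the trained model to the test split.
test_features = vectorizer_model.transform(test)
predictions = lr_model.transform(test_features)
```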
Afterwards, all that's left to do is to evaluate our model. This can be done either by manually implementing a simple accuracy metric with a Python Transformation, or by using the available evaluation tools, such as the Binary Classification Evaluator, which provides two metrics: Area under ROC and Area under PR. However, we suggest using accuracy here, since it is an intuitive metric with a simple interpretation: it is just the fraction of correctly classified messages.
A simple Python Transformation can compute this metric and produce a human-readable table.
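A minimal sketch of such a transformation is given below. It assumes the indexed label column is called label, the classifier's output column is called prediction, and that the SparkSession is available as spark inside the transformation; these are assumptions, not details fixed by the workflow above.

```python
def transform(dataframe):
    # Fraction of messages whose predicted class matches the true label.
    total = dataframe.count()
    correct = dataframe.filter(dataframe.label == dataframe.prediction).count()
    return spark.createDataFrame(
        [(correct, total, correct / float(total))],
        ["correct", "total", "accuracy"])
```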
Further exploration of the resulting DataFrame with a Python Notebook yields the messages that were incorrectly classified by our model - a good starting point for improving it.
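In the notebook, one way to list those messages (again assuming label and prediction columns on a DataFrame named predictions) is:

```python
# Show test messages where the predicted class differs from the true label.
misclassified = predictions.filter(predictions.label != predictions.prediction)
misclassified.select("text", "class", "prediction").show(truncate=False)
```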
And here is the entire workflow. The accuracy of the final model on the test dataset is above 97%, which is a big improvement over the 87% baseline of the naive model that always returns "ham".
Now that we have the full workflow, we are free to experiment a bit with the algorithms' parameters. For example, we could change Min Different Documents in the Count Vectorizer to include only words appearing in more than a single text message. Another idea is to tweak the regex in the Tokenize with Regex - maybe it should simply be replaced with a Tokenize operation? Maybe we could change the Logistic Regression parameters or even use a completely different classification algorithm?
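In the PySpark sketches from before, two of those tweaks would look roughly like this (the regParam value is just an illustrative choice, not a recommendation from the original workflow):

```python
from pyspark.ml.feature import CountVectorizer
from pyspark.ml.classification import LogisticRegression

# Require a word to appear in at least 2 messages before it enters the dictionary.
vectorizer = CountVectorizer(inputCol="filtered_words", outputCol="features", minDF=2.0)

# Tweak the classifier itself, e.g. by adding some regularization.
lr = LogisticRegression(featuresCol="features", labelCol="label", regParam=0.01)
```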