SPAM Detection with Decision Trees

A huge number of spam emails are sent across the Internet every day. Most email providers offer a spam filter that automatically flags likely spam messages and separates them from regular mail. These filters rely on a number of techniques. In this homework problem, I built a CART model to detect spam mail. The reported performance also depends on the spam/non-spam labels in the test subset being specified correctly.

CART Modeling

# Necessary libraries
library(rpart)      # to construct CART models
library(rpart.plot) # plotting functions for rpart trees
library(rattle)     # for fancyRpartPlot() visualization
library(dplyr)      # for data manipulation
library(readr)      # for reading rectangular data files

# Begin by loading the dataset
load("D:/Users/tkartalkaya/Desktop/Verisetleri/spam_data.RData")
head(spam_data)
## # A tibble: 6 x 59
##   train_test spam_or_not    V1    V2    V3    V4    V5    V6    V7    V8
##        <dbl>       <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1          0           1  0.00  0.64  0.64     0  0.32  0.00  0.00  0.00
## 2          0           1  0.21  0.28  0.50     0  0.14  0.28  0.21  0.07
## 3          0           1  0.06  0.00  0.71     0  1.23  0.19  0.19  0.12
## 4          0           1  0.00  0.00  0.00     0  0.63  0.00  0.31  0.63
## 5          0           1  0.00  0.00  0.00     0  0.63  0.00  0.31  0.63
## 6          0           1  0.00  0.00  0.00     0  1.85  0.00  0.00  1.85
## # ... with 49 more variables: V9 <dbl>, V10 <dbl>, V11 <dbl>, V12 <dbl>,
## #   V13 <dbl>, V14 <dbl>, V15 <dbl>, V16 <dbl>, V17 <dbl>, V18 <dbl>,
## #   V19 <dbl>, V20 <dbl>, V21 <dbl>, V22 <dbl>, V23 <dbl>, V24 <dbl>,
## #   V25 <dbl>, V26 <dbl>, V27 <dbl>, V28 <dbl>, V29 <dbl>, V30 <dbl>,
## #   V31 <dbl>, V32 <dbl>, V33 <dbl>, V34 <dbl>, V35 <dbl>, V36 <dbl>,
## #   V37 <dbl>, V38 <dbl>, V39 <dbl>, V40 <dbl>, V41 <dbl>, V42 <dbl>,
## #   V43 <dbl>, V44 <dbl>, V45 <dbl>, V46 <dbl>, V47 <dbl>, V48 <dbl>,
## #   V49 <dbl>, V50 <dbl>, V51 <dbl>, V52 <dbl>, V53 <dbl>, V54 <dbl>,
## #   V55 <dbl>, V56 <int>, V57 <int>
# How many emails are in the dataset?
dim(spam_data)
## [1] 4601   59
# Convert the dependent variable to a factor.
spam_data$spam_or_not <- as.factor(spam_data$spam_or_not)

# How many of the emails are spam? (spam_or_not == 0 means not spam, so FALSE counts the spam emails)
table(spam_data$spam_or_not == 0)
## 
## FALSE  TRUE 
##  1813  2788
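# The same class balance expressed as proportions (an optional sketch using
# base R on the spam_data object loaded above): about 39% of the emails are spam.
prop.table(table(spam_data$spam_or_not))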
# Split the data into training and test sets.
spam_data_train <- spam_data %>% filter(train_test == 0) %>% select(-train_test)
spam_data_test  <- spam_data %>% filter(train_test == 1) %>% select(-train_test)
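# Optional sanity check on the split: the training set should hold the 4101 rows
# used to fit the model below and the test set the remaining 500.
dim(spam_data_train)
dim(spam_data_test)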
# Build the model on the training data:
# a CART model called spam_data_CART, trained with the default rpart parameters.
spam_data_CART <- rpart(spam_or_not ~ ., data = spam_data_train, method = 'class')
printcp(spam_data_CART) # display the cross-validation results
## 
## Classification tree:
## rpart(formula = spam_or_not ~ ., data = spam_data_train, method = "class")
## 
## Variables actually used in tree construction:
## [1] V16 V25 V52 V53 V57 V7 
## 
## Root node error: 1605/4101 = 0.39137
## 
## n= 4101 
## 
##         CP nsplit rel error  xerror     xstd
## 1 0.481620      0   1.00000 1.00000 0.019473
## 2 0.143925      1   0.51838 0.53956 0.016285
## 3 0.049221      2   0.37445 0.42056 0.014795
## 4 0.037383      3   0.32523 0.34766 0.013680
## 5 0.030530      4   0.28785 0.31963 0.013200
## 6 0.011838      5   0.25732 0.28037 0.012471
## 7 0.010000      6   0.24548 0.26604 0.012186
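# The cross-validated error in the cp table above can also be inspected
# graphically (optional; plotcp() ships with rpart):
plotcp(spam_data_CART)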
fancyRpartPlot(spam_data_CART)

summary(spam_data_CART) # detailed summary of splits
## Call:
## rpart(formula = spam_or_not ~ ., data = spam_data_train, method = "class")
##   n= 4101 
## 
##           CP nsplit rel error    xerror       xstd
## 1 0.48161994      0 1.0000000 1.0000000 0.01947331
## 2 0.14392523      1 0.5183801 0.5395639 0.01628457
## 3 0.04922118      2 0.3744548 0.4205607 0.01479536
## 4 0.03738318      3 0.3252336 0.3476636 0.01367989
## 5 0.03052960      4 0.2878505 0.3196262 0.01319973
## 6 0.01183801      5 0.2573209 0.2803738 0.01247074
## 7 0.01000000      6 0.2454829 0.2660436 0.01218607
## 
## Variable importance
## V53  V7 V23 V24 V52 V56 V20  V9 V25 V57 V26 V55 V16 V21 V50  V5 V31 
##  29  13  10  10   7   7   5   5   4   2   2   1   1   1   1   1   1 
## 
## Node number 1: 4101 observations,    complexity param=0.4816199
##   predicted class=0  expected loss=0.391368  P(node) =1
##     class counts:  2496  1605
##    probabilities: 0.609 0.391 
##   left son=2 (3092 obs) right son=3 (1009 obs)
##   Primary splits:
##       V53 < 0.0555 to the left,  improve=647.0601, (0 missing)
##       V52 < 0.0795 to the left,  improve=637.7163, (0 missing)
##       V7  < 0.01   to the left,  improve=527.5502, (0 missing)
##       V16 < 0.075  to the left,  improve=498.3240, (0 missing)
##       V21 < 0.595  to the left,  improve=482.5029, (0 missing)
##   Surrogate splits:
##       V23 < 0.055  to the left,  agree=0.842, adj=0.356, (0 split)
##       V24 < 0.035  to the left,  agree=0.835, adj=0.331, (0 split)
##       V20 < 0.025  to the left,  agree=0.797, adj=0.173, (0 split)
##       V56 < 71.5   to the left,  agree=0.796, adj=0.169, (0 split)
##       V9  < 0.18   to the left,  agree=0.794, adj=0.164, (0 split)
## 
## Node number 2: 3092 observations,    complexity param=0.1439252
##   predicted class=0  expected loss=0.2309185  P(node) =0.7539624
##     class counts:  2378   714
##    probabilities: 0.769 0.231 
##   left son=4 (2809 obs) right son=5 (283 obs)
##   Primary splits:
##       V7  < 0.055  to the left,  improve=285.7257, (0 missing)
##       V52 < 0.0915 to the left,  improve=250.0214, (0 missing)
##       V16 < 0.135  to the left,  improve=236.2023, (0 missing)
##       V21 < 0.445  to the left,  improve=139.5265, (0 missing)
##       V55 < 3.6835 to the left,  improve=136.5273, (0 missing)
##   Surrogate splits:
##       V56 < 131.5  to the left,  agree=0.911, adj=0.032, (0 split)
##       V20 < 1.635  to the left,  agree=0.910, adj=0.018, (0 split)
##       V54 < 0.826  to the left,  agree=0.910, adj=0.018, (0 split)
##       V4  < 8.115  to the left,  agree=0.909, adj=0.007, (0 split)
##       V17 < 4.325  to the left,  agree=0.909, adj=0.007, (0 split)
## 
## Node number 3: 1009 observations,    complexity param=0.0305296
##   predicted class=1  expected loss=0.1169475  P(node) =0.2460376
##     class counts:   118   891
##    probabilities: 0.117 0.883 
##   left son=6 (61 obs) right son=7 (948 obs)
##   Primary splits:
##       V25 < 0.4    to the right, improve=79.95414, (0 missing)
##       V26 < 0.12   to the right, improve=38.11919, (0 missing)
##       V52 < 0.0495 to the left,  improve=34.98541, (0 missing)
##       V37 < 0.085  to the right, improve=32.13206, (0 missing)
##       V27 < 0.21   to the right, improve=31.55716, (0 missing)
##   Surrogate splits:
##       V26 < 0.305  to the right, agree=0.966, adj=0.443, (0 split)
##       V31 < 0.05   to the right, agree=0.949, adj=0.164, (0 split)
##       V28 < 0.025  to the right, agree=0.947, adj=0.131, (0 split)
##       V27 < 0.225  to the right, agree=0.945, adj=0.098, (0 split)
##       V29 < 0.075  to the right, agree=0.945, adj=0.098, (0 split)
## 
## Node number 4: 2809 observations,    complexity param=0.04922118
##   predicted class=0  expected loss=0.1626913  P(node) =0.6849549
##     class counts:  2352   457
##    probabilities: 0.837 0.163 
##   left son=8 (2452 obs) right son=9 (357 obs)
##   Primary splits:
##       V52 < 0.3775 to the left,  improve=164.13240, (0 missing)
##       V16 < 0.225  to the left,  improve=139.18240, (0 missing)
##       V55 < 3.687  to the left,  improve= 68.03708, (0 missing)
##       V21 < 0.865  to the left,  improve= 58.60639, (0 missing)
##       V25 < 0.025  to the right, improve= 57.38702, (0 missing)
##   Surrogate splits:
##       V16 < 2.415  to the left,  agree=0.876, adj=0.028, (0 split)
##       V23 < 0.62   to the left,  agree=0.876, adj=0.028, (0 split)
##       V24 < 3.305  to the left,  agree=0.874, adj=0.006, (0 split)
##       V13 < 2.53   to the left,  agree=0.873, adj=0.003, (0 split)
##       V18 < 3.93   to the left,  agree=0.873, adj=0.003, (0 split)
## 
## Node number 5: 283 observations
##   predicted class=1  expected loss=0.09187279  P(node) =0.06900756
##     class counts:    26   257
##    probabilities: 0.092 0.908 
## 
## Node number 6: 61 observations
##   predicted class=0  expected loss=0.09836066  P(node) =0.01487442
##     class counts:    55     6
##    probabilities: 0.902 0.098 
## 
## Node number 7: 948 observations
##   predicted class=1  expected loss=0.0664557  P(node) =0.2311631
##     class counts:    63   885
##    probabilities: 0.066 0.934 
## 
## Node number 8: 2452 observations
##   predicted class=0  expected loss=0.09747145  P(node) =0.597903
##     class counts:  2213   239
##    probabilities: 0.903 0.097 
## 
## Node number 9: 357 observations,    complexity param=0.03738318
##   predicted class=1  expected loss=0.3893557  P(node) =0.08705194
##     class counts:   139   218
##    probabilities: 0.389 0.611 
##   left son=18 (168 obs) right son=19 (189 obs)
##   Primary splits:
##       V57 < 64.5   to the left,  improve=53.08715, (0 missing)
##       V56 < 10.5   to the left,  improve=49.11356, (0 missing)
##       V55 < 2.654  to the left,  improve=43.90820, (0 missing)
##       V16 < 0.04   to the left,  improve=38.03825, (0 missing)
##       V21 < 0.765  to the left,  improve=21.76457, (0 missing)
##   Surrogate splits:
##       V56 < 12.5   to the left,  agree=0.854, adj=0.690, (0 split)
##       V55 < 2.8055 to the left,  agree=0.754, adj=0.476, (0 split)
##       V21 < 0.115  to the left,  agree=0.731, adj=0.429, (0 split)
##       V50 < 0.008  to the left,  agree=0.709, adj=0.381, (0 split)
##       V5  < 0.065  to the left,  agree=0.697, adj=0.357, (0 split)
## 
## Node number 18: 168 observations,    complexity param=0.01183801
##   predicted class=0  expected loss=0.3214286  P(node) =0.04096562
##     class counts:   114    54
##    probabilities: 0.679 0.321 
##   left son=36 (147 obs) right son=37 (21 obs)
##   Primary splits:
##       V16 < 0.775  to the left,  improve=19.108840, (0 missing)
##       V55 < 2.654  to the left,  improve=11.583360, (0 missing)
##       V52 < 0.8045 to the left,  improve= 8.926531, (0 missing)
##       V56 < 8.5    to the left,  improve= 7.012698, (0 missing)
##       V57 < 22.5   to the left,  improve= 6.549351, (0 missing)
##   Surrogate splits:
##       V49 < 0.294  to the left,  agree=0.893, adj=0.143, (0 split)
##       V1  < 1.39   to the left,  agree=0.887, adj=0.095, (0 split)
##       V6  < 1.145  to the left,  agree=0.887, adj=0.095, (0 split)
##       V18 < 3.84   to the left,  agree=0.887, adj=0.095, (0 split)
##       V55 < 3.757  to the left,  agree=0.887, adj=0.095, (0 split)
## 
## Node number 19: 189 observations
##   predicted class=1  expected loss=0.1322751  P(node) =0.04608632
##     class counts:    25   164
##    probabilities: 0.132 0.868 
## 
## Node number 36: 147 observations
##   predicted class=0  expected loss=0.2312925  P(node) =0.03584492
##     class counts:   113    34
##    probabilities: 0.769 0.231 
## 
## Node number 37: 21 observations
##   predicted class=1  expected loss=0.04761905  P(node) =0.005120702
##     class counts:     1    20
##    probabilities: 0.048 0.952
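# The variable importance reported in the summary can be extracted directly
# from the fitted object if needed (optional sketch):
spam_data_CART$variable.importance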
# What is the test-set accuracy of spam_data_CART, using a threshold of 0.5 for predictions?
predPercCART.test <- predict(spam_data_CART, newdata = spam_data_test)[ , 2]
head(predPercCART.test)
##          1          2          3          4          5          6 
## 0.93354430 0.90812721 0.09747145 0.93354430 0.23129252 0.93354430
predCART.test <- ifelse(predPercCART.test > 0.5, 1, 0)
table(predCART.test, spam_data_test$spam_or_not)
##              
## predCART.test   0   1
##             0 278  40
##             1  14 168
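# Accuracy implied by the confusion matrix above: (278 + 168) / 500 = 0.892,
# i.e. about 89% of the test emails are classified correctly.
conf_matrix <- table(predCART.test, spam_data_test$spam_or_not)
sum(diag(conf_matrix)) / sum(conf_matrix)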

Plots

# Finally, let's get a graphical representation of the tree.
# Pick the cp value that minimizes the cross-validated error.
bestcp <- spam_data_CART$cptable[which.min(spam_data_CART$cptable[, "xerror"]), "CP"]

# Prune the tree using the best cp.
tree.pruned <- prune(spam_data_CART, cp = bestcp)

# Basic tree plot
plot(tree.pruned)
text(tree.pruned, cex = 0.8, use.n = TRUE, xpd = TRUE)

# More readable plot using the best cp: put the class counts inside the nodes.
only_count <- function(x, labs, digits, varlen)
{
  paste(x$frame$n)
}

boxcols <- c("green", "orange")[tree.pruned$frame$yval]

par(xpd = TRUE)
prp(tree.pruned, faclen = 0, cex = 0.8, node.fun = only_count, box.col = boxcols,
    main = 'Classification Tree for Spam')
legend("bottomright", legend = c("not_spam", "spam"), fill = c("green", "orange"),
       title = "Group")
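# Optionally, the pruned tree can be scored on the test set the same way as the
# unpruned model (a sketch; the result depends on the cp value selected above):
predPruned.test <- ifelse(predict(tree.pruned, newdata = spam_data_test)[, 2] > 0.5, 1, 0)
table(predPruned.test, spam_data_test$spam_or_not)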

Conclusion

CART is a technique for estimating the probability of belonging to each class through a hierarchy of splits. The figures above provide a visual of the technique in action.

The tree consists of splits that lead to terminal nodes, and each split is essentially an if/else statement. The first split tests whether V53 < 0.056 (V53 corresponds to char_freq_$); emails satisfying this condition follow the left branch and the rest follow the right branch. Take the far right node in the basic tree.pruned plot as an example: the 63/885 printed under the label "1" means that this node contains 63 emails that are actually not spam and 885 that are actually spam, all of which the tree predicts as spam.
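These per-node counts can also be cross-checked programmatically; a minimal sketch that reads them from the fitted object's frame slot (yval holds the predicted class index as stored by rpart):

tree.pruned$frame[, c("var", "n", "yval")]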

References

  1. https://mef-bda503.github.io/files/assignment_spam_data.html

  2. https://mef-bda503.github.io/files/intro_to_ml_2.html

  3. https://github.com/feelosophy13/spam_prediction_with_text_analytics/blob/master/script.R

  4. https://rpubs.com/minma/cart_with_rpart