Support Vector Machines

Introduction to Support Vector Machines

Support Vector Machine

In this module, we introduce the Support Vector Machine (SVM) algorithm, a powerful but conceptually simple supervised learning approach. For classification tasks, the SVM algorithm attempts to divide the data in the feature space into distinct categories. By default, this division is performed by constructing hyperplanes that optimally separate the data. For regression, the hyperplanes are constructed to model the distribution of the data. In both cases, these hyperplanes capture linear structure in a non-probabilistic manner. By employing the kernel trick, however, we can transform a non-linear data set into one that is (approximately) linearly separable, thus enabling SVM to be applied to non-linear problems.

SVMs are powerful algorithms that have gained widespread popularity. This is due partly to the fact that they are effective in high dimensional feature spaces, including those problems where the number of features is similar to or slightly exceeds the number of instances. They can also be memory efficient since only the support vectors are needed to compute the hyperplanes. Finally, by using different kernels, SVM can be applied to a wide range of learning tasks. On the other hand, these models are black boxes, and it can be difficult to explain how they operate, especially on new instances. They do not, by default, provide probability estimates, since the hyperplane is constructed to cleanly divide the training data.

In this module, we first explore the basic formalism of the SVM algorithm, including the construction of hyperplanes and the kernel trick, which enables SVM to be applied to non-linear problems. Next, we explore the application of SVM to classification problems, which is known as support vector classification, or SVC. To introduce this topic, we will once again use the Iris data to construct an SVC estimator and explore the resulting performance and decision surface, before looking at the effect of different hyperparameter values. Next, we will switch to a more complex data set, the adult data demonstrated in the Introduction to Decision Tree module, with which we will explore unbalanced classes and more advanced classification performance metrics such as the ROC curve, AUC, and lift curve. Finally, we will apply SVM to regression problems, which is known as support vector regression; for this we will return to the Iris data, this time predicting a continuous feature.

Formalism

As was the case with the decision tree, one of the simplest machine learning algorithms to understand and employ is the support vector machine. For classification tasks, this algorithm simply divides the data with hyperplanes into the resulting classes, while for regression, the hyperplanes form a predictive model of the underlying data. These hyperplanes, by default, produce a linear classifier (or regressor) since they are restricted to be linear in the features. However, unlike a decision tree, SVM produces a black box model; we can’t examine the model, especially in higher dimensions, to understand why specific predictions are made. In order to construct the optimal set of hyperplanes, SVM assumes the features are normalized and can be compared equally. Thus, for proper use of an SVM on a data set, we must normalize the features.

Given a set of data with \(n\) features, we can construct many different hyperplanes that divide the data. The SVM algorithm selects the optimal hyperplane by finding the one that produces the largest separation between the classes, which is known as the margin. The hyperplane that accomplishes this goal is known as the maximum-margin hyperplane. For high dimensional data, a set of hyperplanes is constructed, which accomplishes this same task. In cases where the data cannot be cleanly separated, many SVM implementations map the data into a higher dimensional space by using a kernel function, where the data can be linearly separated, and construct the optimal set of hyperplanes in that space. This same process can be used to transform a non-linear feature space into a linear (or approximately linear) space where the traditional SVM formalism can be applied.
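To make the notion of the margin concrete, the standard hard-margin formulation of the linear SVM can be written as an optimization problem. The notation below is included only for reference and is not used elsewhere in this module: \(\mathbf{w}\) is the normal vector of the hyperplane, \(b\) its offset, and each training point \(\mathbf{x}_i\) has label \(y_i \in \{-1, +1\}\).

\[
\min_{\mathbf{w},\, b} \; \frac{1}{2}\lVert \mathbf{w} \rVert^2 \quad \text{subject to} \quad y_i\left(\mathbf{w}^\top \mathbf{x}_i + b\right) \ge 1 \ \text{for all } i
\]

Minimizing \(\lVert \mathbf{w} \rVert\) is equivalent to maximizing the margin \(2 / \lVert \mathbf{w} \rVert\), the distance between the two parallel hyperplanes that just touch each class.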

In the rest of this section, we demonstrate the construction of hyperplanes by using the Iris data set. To simplify the visualization of these data and the resulting hyperplanes, we use only two dimensions. Since SVM, by default, provides a linear classification, these hyperplanes will generate linear divisions between classes. After this, we demonstrate how a kernel can be employed to transform a non-linear problem into a linear classification task.

Hyperplanes

To demonstrate how hyperplanes can divide data, we will use the standard Iris classification data set. We first split the Iris data into training and testing sets with the createDataPartition helper function; the features are normalized (centered and scaled) inside the call to the train function. For visualization, we plot only two dimensions, Sepal Width and Petal Width, which makes it easier to see the training data, the test data, and the hyperplanes.

The following Code chunk uses the training data to generate an SVC (don't worry about the details of doing this right now; they are introduced in the next section). Next, we make a scatter plot of the data, colored by label, and display the test predictions with a different symbol. We then generate a grid of points through this space and apply the trained SVC over this grid (note that this is similar to how we construct decision surfaces for classification tasks). Finally, the algorithm generates a separate hyperplane to divide each pair of classes. The support vectors used to compute these hyperplanes are drawn from the training data.

The hyperplanes shown in the plot are denoted by the solid gray line. We also plot the confidence interval (or upper and lower one-sigma standard deviations) for these hyperplanes, the upper boundary as a blue dashed line, and the lower boundary as a red dashed line. The confidence interval in this case provides an estimate for the uncertainty in the location of the true hyperplane given these training data.

The support vectors are those training data that are used to finalize the selection of the best hyperplane. The training points that anchor the support vectors are enclosed in diamonds. Each support vector extends from its point to the hyperplane (meeting the hyperplane at a right angle). The margin for each support vector is the distance from the support to the hyperplane (or the length of the support vector). It is the combination of these distances that we seek to maximize when we compute the best hyperplane.

library(caret)
library(tidyverse)
#set the seed :)
set.seed(1)
#get our samples
#using the iris data
#lets split the data 60/40

trainIndex <- createDataPartition(iris$Species, p = .6, list = FALSE, times = 1)

#look at the first few
#head(trainIndex)

#grab the data
SVMTrain <- iris[ trainIndex,]
SVMTest  <- iris[-trainIndex,]


iris_SVM <- train(
  form = factor(Species) ~ .,
  data = SVMTrain,
  #here we add classProbs because we want probs
  trControl = trainControl(method = "cv", number = 10,
                           classProbs =  TRUE),
  method = "svmLinear",
  preProcess = c("center", "scale"),
  tuneLength = 10)

iris_SVM
Support Vector Machines with Linear Kernel 

90 samples
 4 predictor
 3 classes: 'setosa', 'versicolor', 'virginica' 

Pre-processing: centered (4), scaled (4) 
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 81, 81, 81, 81, 81, 81, ... 
Resampling results:

  Accuracy   Kappa
  0.9666667  0.95 

Tuning parameter 'C' was held constant at a value of 1
summary(iris_SVM)
Length  Class   Mode 
     1   ksvm     S4 
svm_Pred<-predict(iris_SVM,SVMTest,type="prob")

knitr::kable(svm_Pred)%>%
  kableExtra::kable_styling("striped")%>%
  kableExtra::scroll_box(width = "100%",height="300px")
setosa versicolor virginica
0.9119900 0.0667780 0.0212320
0.9765986 0.0122676 0.0111338
0.9610351 0.0239773 0.0149876
0.9765749 0.0119585 0.0114666
0.9313294 0.0509379 0.0177327
0.9897054 0.0027805 0.0075140
0.9796212 0.0090689 0.0113100
0.9699558 0.0153175 0.0147267
0.8910858 0.0775756 0.0313386
0.9664487 0.0196413 0.0139099
0.9408605 0.0403384 0.0188011
0.9313729 0.0459459 0.0226812
0.9252151 0.0544007 0.0203842
0.9677491 0.0192658 0.0129851
0.9816657 0.0092066 0.0091277
0.6082704 0.3585779 0.0331517
0.9227571 0.0492752 0.0279676
0.9650908 0.0172381 0.0176711
0.8958313 0.0797438 0.0244249
0.9546095 0.0296492 0.0157413
0.0148830 0.8755405 0.1095765
0.0105902 0.8928634 0.0965464
0.0145113 0.9657228 0.0197658
0.0240048 0.9340854 0.0419098
0.0076557 0.9901118 0.0022325
0.0549154 0.9289241 0.0161605
0.0230323 0.9559715 0.0209962
0.0205240 0.9731425 0.0063335
0.0420572 0.2132697 0.7446731
0.0182047 0.9581886 0.0236067
0.0081877 0.9374497 0.0543626
0.0206738 0.4701043 0.5092219
0.0172608 0.9773076 0.0054315
0.0266853 0.9644647 0.0088500
0.0223888 0.3318986 0.6457126
0.0363084 0.7313350 0.2323566
0.0525471 0.7533945 0.1940585
0.0173036 0.8980045 0.0846920
0.0188206 0.9541401 0.0270393
0.0316689 0.9438493 0.0244817
0.0089516 0.0056648 0.9853836
0.0189757 0.0420633 0.9389610
0.0091081 0.0016704 0.9892215
0.0050442 0.0015931 0.9933626
0.0098296 0.0232422 0.9669282
0.0147971 0.0452636 0.9399392
0.0140854 0.0175944 0.9683202
0.0091980 0.0009526 0.9898494
0.0195324 0.2140618 0.7664057
0.0230968 0.2482140 0.7286892
0.0168276 0.2930305 0.6901419
0.0173931 0.6274968 0.3551101
0.0182957 0.3913999 0.5903043
0.0048002 0.0015258 0.9936740
0.0123775 0.0008602 0.9867622
0.0246667 0.0682397 0.9070936
0.0125091 0.0088514 0.9786395
0.0085294 0.0010148 0.9904558
0.0078415 0.0004231 0.9917355
0.0139480 0.0789715 0.9070805
svmtestpred<-cbind(svm_Pred,SVMTest)

svmtestpred<-svmtestpred%>%
  mutate(prediction=if_else(setosa>versicolor & setosa>virginica,"setosa",
                            if_else(versicolor>setosa & versicolor>virginica, "versicolor",
                                    if_else(virginica>setosa & virginica>versicolor,"virginica", "PROBLEM"))))

table(svmtestpred$prediction)

    setosa versicolor  virginica 
        20         18         22 
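As an aside, the nested if_else above can be replaced by selecting, for each row, the column with the largest predicted probability. This is a minimal sketch using base R's max.col, assuming svm_Pred still holds the three class probabilities:

#pick the class whose predicted probability is largest in each row
svmtestpred$prediction <- colnames(svm_Pred)[max.col(as.matrix(svm_Pred), ties.method = "first")]

table(svmtestpred$prediction)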
confusionMatrix(factor(svmtestpred$prediction),factor(svmtestpred$Species))
Confusion Matrix and Statistics

            Reference
Prediction   setosa versicolor virginica
  setosa         20          0         0
  versicolor      0         17         1
  virginica       0          3        19

Overall Statistics
                                         
               Accuracy : 0.9333         
                 95% CI : (0.838, 0.9815)
    No Information Rate : 0.3333         
    P-Value [Acc > NIR] : < 2.2e-16      
                                         
                  Kappa : 0.9            
                                         
 Mcnemar's Test P-Value : NA             

Statistics by Class:

                     Class: setosa Class: versicolor Class: virginica
Sensitivity                 1.0000            0.8500           0.9500
Specificity                 1.0000            0.9750           0.9250
Pos Pred Value              1.0000            0.9444           0.8636
Neg Pred Value              1.0000            0.9286           0.9737
Prevalence                  0.3333            0.3333           0.3333
Detection Rate              0.3333            0.2833           0.3167
Detection Prevalence        0.3333            0.3000           0.3667
Balanced Accuracy           1.0000            0.9125           0.9375
supportvectors<-SVMTrain[iris_SVM$finalModel@SVindex,]

ggplot(data=SVMTest, mapping = aes(x=Sepal.Width,y=Petal.Width,color=Species))+
  geom_point(alpha=0.5)+
  geom_point(data=svmtestpred, mapping = aes(x=Sepal.Width,y=Petal.Width, color=prediction),shape=6,size=3)+
    geom_point(data=supportvectors, mapping = aes(x=Sepal.Width,y=Petal.Width),shape=4,size=4)+
  theme(legend.title = element_blank())+ggtitle("SVM Demonstration")

svmfit = e1071::svm(Species~., data = SVMTrain, kernel = "linear", cost = 1, scale = TRUE)

# Plot Results
plot(svmfit, SVMTrain, Petal.Width ~ Sepal.Length,
     slice = list(Sepal.Width = 3, Petal.Length = 4))


Call:
svm(formula = y ~ ., data = dat, kernel = "linear", cost = 10, 
    scale = FALSE)


Parameters:
   SVM-Type:  C-classification 
 SVM-Kernel:  linear 
       cost:  10 

Number of Support Vectors:  6


# construct larger random data set
x <- matrix(rnorm(200*2), ncol = 2)
x[1:100,] <- x[1:100,] + 2.5
x[101:150,] <- x[101:150,] - 2.5
y <- c(rep(1,150), rep(2,50))
dat <- data.frame(x=x,y=as.factor(y))

# Plot data
ggplot(data = dat, aes(x = x.2, y = x.1, color = y, shape = y)) + 
  geom_point(size = 2) +
  scale_color_manual(values=c("#000000", "#FF0000")) +
  theme(legend.position = "none")
library(e1071)

# set pseudorandom number generator
set.seed(123)
# sample training data and fit model
train <- base::sample(200,100, replace = FALSE)
svmfit <- svm(y~., data = dat[train,],  kernel = "linear", cost = 10, scale = FALSE)
# plot classifier
plot(svmfit, dat)


Non-Linear Kernels

In many real-world cases, the data of interest are non-linear. In these cases, we can still successfully employ SVM by transforming the data into a space where they are linearly separable. This process is known as the kernel trick, since we employ a kernel function to perform the mapping.
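For example, the radial (RBF) kernel used in the following Code chunk measures the similarity between two points as

\[
K(\mathbf{x}, \mathbf{x}') = \exp\left(-\gamma \lVert \mathbf{x} - \mathbf{x}' \rVert^2\right),
\]

so nearby points receive values close to one and distant points receive values close to zero; the gamma hyperparameter controls how quickly that similarity decays. (This formula is included for reference; the notation is not used elsewhere in the module.)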

svmfit <- svm(y~., data = dat[train,], kernel = "radial", gamma = 1, cost = 1)
# plot classifier
plot(svmfit, dat)

Options for SVM

Exercise 1

Use the first code chunk with the iris data.

  1. Change the training-to-testing split size, for example from 60%:40% to 75%:25%, and then to 50%:50%. Compare the results to the 60/40 split.

  2. Change the kernel: use svmPoly for the method (that is, change method = "svmLinear" to method = "svmPoly") and compare the linear results with the polynomial results.

SVM: Hyperparameters1

Perhaps the most important hyperparameter for SVC is the kernel hyperparameter, which specifies the type of transformation that should be applied to the training data to determine the optimal set of hyperplanes. In the previous example, we computed and displayed the decision surface for a linear kernel. In the following Code chunks, we compute and display decision surfaces for SVCs that employ different kernel functions: linear, polynomial, radial, and sigmoid. The resulting figures demonstrate how these different kernels affect the classification.

Note how the resulting decision surfaces are no longer dominated by linear divisions. Each of the last three decision surfaces has curved boundaries in the original space, since they transform the original data by using non-linear functions. On these data, the radial kernel performs the best.
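Beyond the choice of kernel, each kernel has its own tuning hyperparameters, such as cost for every kernel, gamma for the radial kernel, and degree for the polynomial kernel. As a minimal sketch (the grid of values here is purely illustrative), e1071's tune function can search over combinations of these hyperparameters by cross-validation:

#grid search over cost and gamma for a radial kernel (illustrative values)
tuned <- e1071::tune(e1071::svm, y ~ ., data = dat[train,],
                     kernel = "radial",
                     ranges = list(cost = c(0.1, 1, 10, 100),
                                   gamma = c(0.5, 1, 2)))

#cross-validated error for each combination, and the refit best model
summary(tuned)
tuned$best.model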

Linear

svmfit <- svm(y~., data = dat[train,],  kernel = "linear")
# plot classifier
plot(svmfit, dat)

Radial

svmfit <- svm(y~., data = dat[train,], kernel = "radial")
# plot classifier
plot(svmfit, dat)

Sigmoid

svmfit <- svm(y~., data = dat[train,], kernel = "sigmoid")
# plot classifier
plot(svmfit, dat)

Polynomial

svmfit <- svm(y~., data = dat[train,], kernel = "polynomial")
# plot classifier
plot(svmfit, dat)

Exercise 2

  1. Look back at the SVMs run on the iris data. What were the hyperparameters chosen for the Linear kernel, the Radial kernel, and the Polynomial kernel? What does each of these hyperparameters control? (See the hint below.)
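As a hint, a caret train object stores the selected hyperparameter values in its bestTune component and the fitted kernlab model in finalModel (the latter is already used above to extract the support vectors):

#selected hyperparameter values and the underlying ksvm model
iris_SVM$bestTune
iris_SVM$finalModel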

Classification: Adult Data

We now transition to a more complex data set, the adult data from the UCI machine learning repository. These data are fully documented online at the UCI website.

# install.packages("devtools")
# devtools::install_github("tyluRp/ucimlr")

adult<-ucimlr::adult


knitr::kable(head(adult,100))%>%
  kableExtra::kable_styling("striped")%>%
  kableExtra::scroll_box(width = "100%",height="300px")
age workclass fnlwgt education education_num marital_status occupation relationship race sex capital_gain capital_loss hours_per_week native_country income
50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 0 0 13 United-States <=50K
38 Private 215646 HS-grad 9 Divorced Handlers-cleaners Not-in-family White Male 0 0 40 United-States <=50K
53 Private 234721 11th 7 Married-civ-spouse Handlers-cleaners Husband Black Male 0 0 40 United-States <=50K
28 Private 338409 Bachelors 13 Married-civ-spouse Prof-specialty Wife Black Female 0 0 40 Cuba <=50K
37 Private 284582 Masters 14 Married-civ-spouse Exec-managerial Wife White Female 0 0 40 United-States <=50K
49 Private 160187 9th 5 Married-spouse-absent Other-service Not-in-family Black Female 0 0 16 Jamaica <=50K
52 Self-emp-not-inc 209642 HS-grad 9 Married-civ-spouse Exec-managerial Husband White Male 0 0 45 United-States >50K
31 Private 45781 Masters 14 Never-married Prof-specialty Not-in-family White Female 14084 0 50 United-States >50K
42 Private 159449 Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 5178 0 40 United-States >50K
37 Private 280464 Some-college 10 Married-civ-spouse Exec-managerial Husband Black Male 0 0 80 United-States >50K
30 State-gov 141297 Bachelors 13 Married-civ-spouse Prof-specialty Husband Asian-Pac-Islander Male 0 0 40 India >50K
23 Private 122272 Bachelors 13 Never-married Adm-clerical Own-child White Female 0 0 30 United-States <=50K
32 Private 205019 Assoc-acdm 12 Never-married Sales Not-in-family Black Male 0 0 50 United-States <=50K
40 Private 121772 Assoc-voc 11 Married-civ-spouse Craft-repair Husband Asian-Pac-Islander Male 0 0 40 NA >50K
34 Private 245487 7th-8th 4 Married-civ-spouse Transport-moving Husband Amer-Indian-Eskimo Male 0 0 45 Mexico <=50K
25 Self-emp-not-inc 176756 HS-grad 9 Never-married Farming-fishing Own-child White Male 0 0 35 United-States <=50K
32 Private 186824 HS-grad 9 Never-married Machine-op-inspct Unmarried White Male 0 0 40 United-States <=50K
38 Private 28887 11th 7 Married-civ-spouse Sales Husband White Male 0 0 50 United-States <=50K
43 Self-emp-not-inc 292175 Masters 14 Divorced Exec-managerial Unmarried White Female 0 0 45 United-States >50K
40 Private 193524 Doctorate 16 Married-civ-spouse Prof-specialty Husband White Male 0 0 60 United-States >50K
54 Private 302146 HS-grad 9 Separated Other-service Unmarried Black Female 0 0 20 United-States <=50K
35 Federal-gov 76845 9th 5 Married-civ-spouse Farming-fishing Husband Black Male 0 0 40 United-States <=50K
43 Private 117037 11th 7 Married-civ-spouse Transport-moving Husband White Male 0 2042 40 United-States <=50K
59 Private 109015 HS-grad 9 Divorced Tech-support Unmarried White Female 0 0 40 United-States <=50K
56 Local-gov 216851 Bachelors 13 Married-civ-spouse Tech-support Husband White Male 0 0 40 United-States >50K
19 Private 168294 HS-grad 9 Never-married Craft-repair Own-child White Male 0 0 40 United-States <=50K
54 NA 180211 Some-college 10 Married-civ-spouse NA Husband Asian-Pac-Islander Male 0 0 60 South >50K
39 Private 367260 HS-grad 9 Divorced Exec-managerial Not-in-family White Male 0 0 80 United-States <=50K
49 Private 193366 HS-grad 9 Married-civ-spouse Craft-repair Husband White Male 0 0 40 United-States <=50K
23 Local-gov 190709 Assoc-acdm 12 Never-married Protective-serv Not-in-family White Male 0 0 52 United-States <=50K
20 Private 266015 Some-college 10 Never-married Sales Own-child Black Male 0 0 44 United-States <=50K
45 Private 386940 Bachelors 13 Divorced Exec-managerial Own-child White Male 0 1408 40 United-States <=50K
30 Federal-gov 59951 Some-college 10 Married-civ-spouse Adm-clerical Own-child White Male 0 0 40 United-States <=50K
22 State-gov 311512 Some-college 10 Married-civ-spouse Other-service Husband Black Male 0 0 15 United-States <=50K
48 Private 242406 11th 7 Never-married Machine-op-inspct Unmarried White Male 0 0 40 Puerto-Rico <=50K
21 Private 197200 Some-college 10 Never-married Machine-op-inspct Own-child White Male 0 0 40 United-States <=50K
19 Private 544091 HS-grad 9 Married-AF-spouse Adm-clerical Wife White Female 0 0 25 United-States <=50K
31 Private 84154 Some-college 10 Married-civ-spouse Sales Husband White Male 0 0 38 NA >50K
48 Self-emp-not-inc 265477 Assoc-acdm 12 Married-civ-spouse Prof-specialty Husband White Male 0 0 40 United-States <=50K
31 Private 507875 9th 5 Married-civ-spouse Machine-op-inspct Husband White Male 0 0 43 United-States <=50K
53 Self-emp-not-inc 88506 Bachelors 13 Married-civ-spouse Prof-specialty Husband White Male 0 0 40 United-States <=50K
24 Private 172987 Bachelors 13 Married-civ-spouse Tech-support Husband White Male 0 0 50 United-States <=50K
49 Private 94638 HS-grad 9 Separated Adm-clerical Unmarried White Female 0 0 40 United-States <=50K
25 Private 289980 HS-grad 9 Never-married Handlers-cleaners Not-in-family White Male 0 0 35 United-States <=50K
57 Federal-gov 337895 Bachelors 13 Married-civ-spouse Prof-specialty Husband Black Male 0 0 40 United-States >50K
53 Private 144361 HS-grad 9 Married-civ-spouse Machine-op-inspct Husband White Male 0 0 38 United-States <=50K
44 Private 128354 Masters 14 Divorced Exec-managerial Unmarried White Female 0 0 40 United-States <=50K
41 State-gov 101603 Assoc-voc 11 Married-civ-spouse Craft-repair Husband White Male 0 0 40 United-States <=50K
29 Private 271466 Assoc-voc 11 Never-married Prof-specialty Not-in-family White Male 0 0 43 United-States <=50K
25 Private 32275 Some-college 10 Married-civ-spouse Exec-managerial Wife Other Female 0 0 40 United-States <=50K
18 Private 226956 HS-grad 9 Never-married Other-service Own-child White Female 0 0 30 NA <=50K
47 Private 51835 Prof-school 15 Married-civ-spouse Prof-specialty Wife White Female 0 1902 60 Honduras >50K
50 Federal-gov 251585 Bachelors 13 Divorced Exec-managerial Not-in-family White Male 0 0 55 United-States >50K
47 Self-emp-inc 109832 HS-grad 9 Divorced Exec-managerial Not-in-family White Male 0 0 60 United-States <=50K
43 Private 237993 Some-college 10 Married-civ-spouse Tech-support Husband White Male 0 0 40 United-States >50K
46 Private 216666 5th-6th 3 Married-civ-spouse Machine-op-inspct Husband White Male 0 0 40 Mexico <=50K
35 Private 56352 Assoc-voc 11 Married-civ-spouse Other-service Husband White Male 0 0 40 Puerto-Rico <=50K
41 Private 147372 HS-grad 9 Married-civ-spouse Adm-clerical Husband White Male 0 0 48 United-States <=50K
30 Private 188146 HS-grad 9 Married-civ-spouse Machine-op-inspct Husband White Male 5013 0 40 United-States <=50K
30 Private 59496 Bachelors 13 Married-civ-spouse Sales Husband White Male 2407 0 40 United-States <=50K
32 NA 293936 7th-8th 4 Married-spouse-absent NA Not-in-family White Male 0 0 40 NA <=50K
48 Private 149640 HS-grad 9 Married-civ-spouse Transport-moving Husband White Male 0 0 40 United-States <=50K
42 Private 116632 Doctorate 16 Married-civ-spouse Prof-specialty Husband White Male 0 0 45 United-States >50K
29 Private 105598 Some-college 10 Divorced Tech-support Not-in-family White Male 0 0 58 United-States <=50K
36 Private 155537 HS-grad 9 Married-civ-spouse Craft-repair Husband White Male 0 0 40 United-States <=50K
28 Private 183175 Some-college 10 Divorced Adm-clerical Not-in-family White Female 0 0 40 United-States <=50K
53 Private 169846 HS-grad 9 Married-civ-spouse Adm-clerical Wife White Female 0 0 40 United-States >50K
49 Self-emp-inc 191681 Some-college 10 Married-civ-spouse Exec-managerial Husband White Male 0 0 50 United-States >50K
25 NA 200681 Some-college 10 Never-married NA Own-child White Male 0 0 40 United-States <=50K
19 Private 101509 Some-college 10 Never-married Prof-specialty Own-child White Male 0 0 32 United-States <=50K
31 Private 309974 Bachelors 13 Separated Sales Own-child Black Female 0 0 40 United-States <=50K
29 Self-emp-not-inc 162298 Bachelors 13 Married-civ-spouse Sales Husband White Male 0 0 70 United-States >50K
23 Private 211678 Some-college 10 Never-married Machine-op-inspct Not-in-family White Male 0 0 40 United-States <=50K
79 Private 124744 Some-college 10 Married-civ-spouse Prof-specialty Other-relative White Male 0 0 20 United-States <=50K
27 Private 213921 HS-grad 9 Never-married Other-service Own-child White Male 0 0 40 Mexico <=50K
40 Private 32214 Assoc-acdm 12 Married-civ-spouse Adm-clerical Husband White Male 0 0 40 United-States <=50K
67 NA 212759 10th 6 Married-civ-spouse NA Husband White Male 0 0 2 United-States <=50K
18 Private 309634 11th 7 Never-married Other-service Own-child White Female 0 0 22 United-States <=50K
31 Local-gov 125927 7th-8th 4 Married-civ-spouse Farming-fishing Husband White Male 0 0 40 United-States <=50K
18 Private 446839 HS-grad 9 Never-married Sales Not-in-family White Male 0 0 30 United-States <=50K
52 Private 276515 Bachelors 13 Married-civ-spouse Other-service Husband White Male 0 0 40 Cuba <=50K
46 Private 51618 HS-grad 9 Married-civ-spouse Other-service Wife White Female 0 0 40 United-States <=50K
59 Private 159937 HS-grad 9 Married-civ-spouse Sales Husband White Male 0 0 48 United-States <=50K
44 Private 343591 HS-grad 9 Divorced Craft-repair Not-in-family White Female 14344 0 40 United-States >50K
53 Private 346253 HS-grad 9 Divorced Sales Own-child White Female 0 0 35 United-States <=50K
49 Local-gov 268234 HS-grad 9 Married-civ-spouse Protective-serv Husband White Male 0 0 40 United-States >50K
33 Private 202051 Masters 14 Married-civ-spouse Prof-specialty Husband White Male 0 0 50 United-States <=50K
30 Private 54334 9th 5 Never-married Sales Not-in-family White Male 0 0 40 United-States <=50K
43 Federal-gov 410867 Doctorate 16 Never-married Prof-specialty Not-in-family White Female 0 0 50 United-States >50K
57 Private 249977 Assoc-voc 11 Married-civ-spouse Prof-specialty Husband White Male 0 0 40 United-States <=50K
37 Private 286730 Some-college 10 Divorced Craft-repair Unmarried White Female 0 0 40 United-States <=50K
28 Private 212563 Some-college 10 Divorced Machine-op-inspct Unmarried Black Female 0 0 25 United-States <=50K
30 Private 117747 HS-grad 9 Married-civ-spouse Sales Wife Asian-Pac-Islander Female 0 1573 35 NA <=50K
34 Local-gov 226296 Bachelors 13 Married-civ-spouse Protective-serv Husband White Male 0 0 40 United-States >50K
29 Local-gov 115585 Some-college 10 Never-married Handlers-cleaners Not-in-family White Male 0 0 50 United-States <=50K
48 Self-emp-not-inc 191277 Doctorate 16 Married-civ-spouse Prof-specialty Husband White Male 0 1902 60 United-States >50K
37 Private 202683 Some-college 10 Married-civ-spouse Sales Husband White Male 0 0 48 United-States >50K
48 Private 171095 Assoc-acdm 12 Divorced Exec-managerial Unmarried White Female 0 0 40 England <=50K
32 Federal-gov 249409 HS-grad 9 Never-married Other-service Own-child Black Male 0 0 40 United-States <=50K
76 Private 124191 Masters 14 Married-civ-spouse Exec-managerial Husband White Male 0 0 40 United-States >50K

With the data now loaded into a data frame, we can move on to creating our training and testing data sets and employing support vector classification.

set.seed(1)
#lets split the data 60/40
#obtain stratified sample
adult<-na.omit(adult)

strat_sample <- adult %>%
                  group_by(income) %>%
                  sample_n(size=1000)



adult2<-strat_sample%>%
  mutate(income2=if_else(income==">50K","high","low"))

trainIndex <- createDataPartition(adult2$income2, p = .6, list = FALSE, times = 1)

#look at the first few
#head(trainIndex)

#grab the data
SVMTrain <- adult2[ trainIndex,]
SVMTest  <- adult2[-trainIndex,]

adult_SVM <- train(
  form = factor(income2) ~ age+fnlwgt+education_num+factor(occupation)+factor(race)+factor(sex)+capital_gain+capital_loss+hours_per_week,
  data = SVMTrain,
  #here we add classProbs because we want probs
  trControl = trainControl(method = "cv", number = 10,
                           classProbs =  TRUE),
  method = "svmLinear",
  preProcess = c("center", "scale"),
  tuneLength = 10)

adult_SVM
Support Vector Machines with Linear Kernel 

1200 samples
   9 predictor
   2 classes: 'high', 'low' 

Pre-processing: centered (23), scaled (23) 
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 1080, 1080, 1080, 1080, 1080, 1080, ... 
Resampling results:

  Accuracy   Kappa    
  0.7491667  0.4983333

Tuning parameter 'C' was held constant at a value of 1
summary(adult_SVM)
Length  Class   Mode 
     1   ksvm     S4 
svm_Pred<-predict(adult_SVM,SVMTest,type="prob")

knitr::kable(svm_Pred)%>%
  kableExtra::kable_styling("striped")%>%
  kableExtra::scroll_box(width = "100%",height="300px")
high low
0.2298858 0.7701142
0.5330057 0.4669943
0.1316893 0.8683107
0.1256735 0.8743265
0.2809993 0.7190007
0.0815047 0.9184953
0.3465596 0.6534404
0.5336569 0.4663431
0.4988050 0.5011950
0.7617270 0.2382730
0.4958880 0.5041120
0.0517233 0.9482767
0.3801871 0.6198129
0.4569662 0.5430338
0.6407200 0.3592800
0.6378163 0.3621837
0.3774407 0.6225593
0.1350271 0.8649729
0.3814795 0.6185205
0.3367334 0.6632666
0.0787305 0.9212695
0.7092115 0.2907885
0.4546531 0.5453469
0.1423667 0.8576333
0.1769277 0.8230723
0.4986068 0.5013932
0.3244419 0.6755581
0.4419024 0.5580976
0.1178395 0.8821605
0.3887635 0.6112365
0.5893909 0.4106091
0.0268380 0.9731620
0.2180456 0.7819544
0.3895422 0.6104578
0.1397225 0.8602775
0.1638762 0.8361238
0.1411433 0.8588567
0.4138488 0.5861512
0.7467694 0.2532306
0.9193351 0.0806649
0.5510110 0.4489890
0.1557906 0.8442094
0.6649308 0.3350692
0.5285029 0.4714971
0.0328148 0.9671852
0.2221032 0.7778968
0.2499383 0.7500617
0.0525109 0.9474891
0.0762576 0.9237424
0.3367808 0.6632192
0.0340452 0.9659548
0.5313022 0.4686978
0.0988553 0.9011447
0.3171202 0.6828798
0.7198858 0.2801142
0.7668867 0.2331133
0.0485200 0.9514800
0.4761942 0.5238058
0.0631308 0.9368692
0.1583881 0.8416119
0.3422983 0.6577017
0.0714493 0.9285507
0.8892643 0.1107357
0.3972739 0.6027261
0.4361551 0.5638449
0.1185618 0.8814382
0.6240382 0.3759618
0.3233164 0.6766836
0.4823618 0.5176382
0.1303821 0.8696179
0.2491436 0.7508564
0.2169764 0.7830236
0.1005420 0.8994580
0.2956025 0.7043975
0.2453193 0.7546807
0.3088340 0.6911660
0.8778762 0.1221238
0.5258827 0.4741173
0.9334787 0.0665213
0.4493963 0.5506037
0.2440632 0.7559368
0.1503131 0.8496869
0.2751675 0.7248325
0.2922035 0.7077965
0.0825594 0.9174406
0.4419773 0.5580227
0.5739482 0.4260518
0.2571443 0.7428557
0.3370905 0.6629095
0.2759191 0.7240809
0.0710084 0.9289916
0.2531664 0.7468336
0.0297986 0.9702014
0.0496748 0.9503252
0.4651968 0.5348032
0.0868628 0.9131372
0.7247856 0.2752144
0.6605513 0.3394487
0.5425958 0.4574042
0.7852043 0.2147957
0.2988353 0.7011647
0.1569330 0.8430670
0.5461940 0.4538060
0.8094839 0.1905161
0.1522417 0.8477583
0.0413380 0.9586620
0.3366710 0.6633290
0.6185888 0.3814112
0.3492165 0.6507835
0.5913590 0.4086410
0.4466232 0.5533768
0.1445539 0.8554461
0.3433107 0.6566893
0.0878677 0.9121323
0.3779073 0.6220927
0.2322042 0.7677958
0.1728286 0.8271714
0.2055380 0.7944620
0.1513738 0.8486262
0.7897300 0.2102700
0.6144217 0.3855783
0.4289459 0.5710541
0.0593315 0.9406685
0.2176802 0.7823198
0.3000130 0.6999870
0.5520823 0.4479177
0.3317818 0.6682182
0.1156814 0.8843186
0.9151708 0.0848292
0.2537162 0.7462838
0.3183018 0.6816982
0.0173990 0.9826010
0.3326536 0.6673464
0.6199448 0.3800552
0.4983956 0.5016044
0.1438643 0.8561357
0.0785071 0.9214929
0.1075513 0.8924487
0.3981930 0.6018070
0.1402571 0.8597429
0.1220680 0.8779320
0.2786983 0.7213017
0.0485072 0.9514928
0.2328261 0.7671739
0.1652898 0.8347102
0.3207233 0.6792767
0.3753139 0.6246861
0.2746279 0.7253721
0.2825416 0.7174584
0.5101881 0.4898119
0.2491259 0.7508741
0.2053736 0.7946264
0.6421537 0.3578463
0.0435202 0.9564798
0.4809367 0.5190633
0.1915729 0.8084271
0.1971034 0.8028966
0.1036388 0.8963612
0.4461896 0.5538104
0.3724018 0.6275982
0.0509872 0.9490128
0.0885065 0.9114935
0.2497923 0.7502077
0.1251483 0.8748517
0.5403636 0.4596364
0.2969192 0.7030808
0.8203751 0.1796249
0.0482874 0.9517126
0.3238921 0.6761079
0.1139523 0.8860477
0.1440124 0.8559876
0.2100698 0.7899302
0.4989461 0.5010539
0.4455765 0.5544235
0.2349304 0.7650696
0.3577987 0.6422013
0.1759688 0.8240312
0.1227527 0.8772473
0.4398292 0.5601708
0.0226161 0.9773839
0.1405094 0.8594906
0.1376299 0.8623701
0.0727469 0.9272531
0.2014835 0.7985165
0.2161456 0.7838544
0.2024991 0.7975009
0.5784239 0.4215761
0.4447847 0.5552153
0.1625302 0.8374698
0.0297427 0.9702573
0.5750666 0.4249334
0.0592767 0.9407233
0.3253257 0.6746743
0.0608541 0.9391459
0.8816738 0.1183262
0.5318553 0.4681447
0.0508449 0.9491551
0.0528116 0.9471884
0.1288250 0.8711750
0.3830280 0.6169720
0.3423462 0.6576538
0.1068159 0.8931841
0.0893764 0.9106236
0.3903923 0.6096077
0.3625536 0.6374464
0.9821441 0.0178559
0.1946445 0.8053555
0.4590459 0.5409541
0.0609067 0.9390933
0.1264294 0.8735706
0.1202146 0.8797854
0.1446316 0.8553684
0.3359699 0.6640301
0.2695158 0.7304842
0.3405867 0.6594133
0.2360056 0.7639944
0.2203121 0.7796879
0.0955991 0.9044009
0.5474179 0.4525821
0.3432556 0.6567444
0.2928164 0.7071836
0.2261825 0.7738175
0.1144646 0.8855354
0.0952108 0.9047892
0.2474737 0.7525263
0.2991989 0.7008011
0.1898298 0.8101702
0.4162792 0.5837208
0.6658973 0.3341027
0.4396011 0.5603989
0.2118269 0.7881731
0.1363500 0.8636500
0.2799980 0.7200020
0.0574848 0.9425152
0.4083507 0.5916493
0.2226116 0.7773884
0.5057358 0.4942642
0.7817352 0.2182648
0.7116623 0.2883377
0.4557896 0.5442104
0.4013575 0.5986425
0.1853605 0.8146395
0.2070047 0.7929953
0.1302657 0.8697343
0.2007116 0.7992884
0.4653387 0.5346613
0.0617496 0.9382504
0.3603469 0.6396531
0.1216741 0.8783259
0.6751969 0.3248031
0.3009942 0.6990058
0.1244660 0.8755340
0.0332685 0.9667315
0.6837230 0.3162770
0.0990721 0.9009279
0.5066505 0.4933495
0.4291731 0.5708269
0.2758318 0.7241682
0.3905442 0.6094558
0.0585285 0.9414715
0.1188539 0.8811461
0.3442067 0.6557933
0.0574626 0.9425374
0.7370516 0.2629484
0.2771862 0.7228138
0.9727284 0.0272716
0.3336696 0.6663304
0.6351135 0.3648865
0.5042175 0.4957825
0.4935839 0.5064161
0.2425768 0.7574232
0.4983596 0.5016404
0.7494811 0.2505189
0.2819377 0.7180623
0.2281022 0.7718978
0.1302140 0.8697860
0.5112567 0.4887433
0.0938057 0.9061943
0.2825134 0.7174866
0.7315996 0.2684004
0.3288201 0.6711799
0.0651174 0.9348826
0.6100466 0.3899534
0.2943843 0.7056157
0.8278530 0.1721470
0.3323172 0.6676828
0.0357647 0.9642353
0.1920922 0.8079078
0.3243908 0.6756092
0.8298799 0.1701201
0.4934494 0.5065506
0.1463131 0.8536869
0.3739562 0.6260438
0.3268379 0.6731621
0.1718458 0.8281542
0.2410416 0.7589584
0.3357486 0.6642514
0.4569658 0.5430342
0.3405932 0.6594068
0.4244736 0.5755264
0.2397825 0.7602175
0.2094153 0.7905847
0.3115667 0.6884333
0.3361418 0.6638582
0.1019792 0.8980208
0.2394216 0.7605784
0.3220529 0.6779471
0.2609476 0.7390524
0.8749518 0.1250482
0.2268541 0.7731459
0.4188170 0.5811830
0.4529951 0.5470049
0.1597642 0.8402358
0.2416751 0.7583249
0.1526768 0.8473232
0.7849912 0.2150088
0.3642923 0.6357077
0.4838354 0.5161646
0.0172603 0.9827397
0.4809572 0.5190428
0.2997730 0.7002270
0.6589780 0.3410220
0.5786960 0.4213040
0.4136996 0.5863004
0.0354855 0.9645145
0.0925139 0.9074861
0.8746500 0.1253500
0.3178674 0.6821326
0.5200157 0.4799843
0.1094776 0.8905224
0.0356159 0.9643841
0.0791809 0.9208191
0.1202353 0.8797647
0.7737679 0.2262321
0.1947133 0.8052867
0.7901797 0.2098203
0.1317221 0.8682779
0.5392775 0.4607225
0.0586444 0.9413556
0.3465434 0.6534566
0.8049025 0.1950975
0.6555482 0.3444518
0.4505686 0.5494314
0.0135768 0.9864232
0.1798094 0.8201906
0.1787803 0.8212197
0.0411358 0.9588642
0.1519016 0.8480984
0.4014749 0.5985251
0.1285270 0.8714730
0.3162577 0.6837423
0.0207346 0.9792654
0.0464553 0.9535447
0.0429720 0.9570280
0.4778974 0.5221026
0.2624594 0.7375406
0.3735712 0.6264288
0.3521915 0.6478085
0.2529342 0.7470658
0.2404974 0.7595026
0.1429879 0.8570121
0.3839298 0.6160702
0.1123006 0.8876994
0.0693670 0.9306330
0.1312659 0.8687341
0.2203047 0.7796953
0.4917373 0.5082627
0.5441394 0.4558606
0.1606906 0.8393094
0.5020775 0.4979225
0.0294843 0.9705157
0.3248665 0.6751335
0.0655108 0.9344892
0.0978820 0.9021180
0.3015840 0.6984160
0.2160991 0.7839009
0.5147651 0.4852349
0.1263796 0.8736204
0.6716549 0.3283451
0.3325230 0.6674770
0.7940113 0.2059887
0.2064080 0.7935920
0.1876578 0.8123422
0.0413496 0.9586504
0.5458474 0.4541526
0.2909039 0.7090961
0.0755082 0.9244918
0.5169199 0.4830801
0.1048727 0.8951273
0.2259089 0.7740911
0.2448550 0.7551450
0.2904885 0.7095115
0.4592639 0.5407361
0.2969656 0.7030344
0.2589176 0.7410824
0.2229491 0.7770509
0.7250873 0.2749127
0.2045018 0.7954982
0.5227361 0.4772639
0.6984755 0.3015245
0.9451696 0.0548304
0.7495110 0.2504890
0.5437915 0.4562085
0.8936105 0.1063895
0.8801705 0.1198295
0.5749465 0.4250535
0.1985768 0.8014232
0.5158127 0.4841873
0.3946228 0.6053772
0.4198336 0.5801664
0.7667393 0.2332607
0.6885877 0.3114123
0.7485346 0.2514654
0.7734899 0.2265101
0.4741566 0.5258434
0.7234942 0.2765058
0.6614630 0.3385370
0.6388283 0.3611717
0.9638214 0.0361786
0.9564810 0.0435190
0.9747515 0.0252485
0.7013611 0.2986389
0.8237832 0.1762168
0.8230671 0.1769329
0.8361796 0.1638204
0.3609223 0.6390777
0.9314637 0.0685363
1.0000000 0.0000000
0.9872173 0.0127827
0.8849667 0.1150333
0.7556975 0.2443025
0.4793520 0.5206480
0.1774030 0.8225970
0.3876311 0.6123689
0.7149009 0.2850991
0.6086918 0.3913082
0.9891669 0.0108331
0.6553708 0.3446292
0.5066759 0.4933241
0.3297095 0.6702905
0.5893879 0.4106121
0.8576160 0.1423840
0.3487322 0.6512678
0.5612598 0.4387402
0.5076853 0.4923147
0.6118926 0.3881074
0.8130791 0.1869209
0.9353724 0.0646276
0.3953734 0.6046266
0.3413585 0.6586415
0.8023120 0.1976880
0.3048340 0.6951660
0.8489235 0.1510765
0.4301030 0.5698970
0.9867436 0.0132564
0.7748167 0.2251833
0.9122985 0.0877015
0.5250530 0.4749470
0.7520944 0.2479056
0.9440630 0.0559370
0.7773578 0.2226422
0.6809456 0.3190544
0.7939638 0.2060362
0.7197433 0.2802567
0.8698642 0.1301358
0.3412808 0.6587192
0.6896806 0.3103194
0.2494882 0.7505118
0.4877538 0.5122462
0.3818725 0.6181275
0.3614687 0.6385313
0.3932880 0.6067120
0.9871871 0.0128129
0.8576737 0.1423263
0.5192008 0.4807992
0.9822320 0.0177680
0.4172219 0.5827781
0.8728709 0.1271291
0.5081818 0.4918182
0.8257106 0.1742894
0.9365307 0.0634693
0.5971244 0.4028756
0.4113292 0.5886708
0.6094391 0.3905609
0.6020694 0.3979306
0.3676342 0.6323658
0.3689307 0.6310693
0.7807175 0.2192825
0.7265910 0.2734090
0.3575844 0.6424156
0.8275538 0.1724462
0.9983830 0.0016170
0.5225489 0.4774511
0.9262741 0.0737259
0.7564699 0.2435301
0.3502089 0.6497911
0.8778747 0.1221253
0.9823994 0.0176006
0.6368512 0.3631488
0.3713030 0.6286970
0.2901606 0.7098394
0.5948032 0.4051968
0.7424490 0.2575510
0.9714052 0.0285948
0.8217538 0.1782462
0.5850126 0.4149874
0.7751345 0.2248655
0.8788189 0.1211811
0.9127589 0.0872411
0.7189324 0.2810676
0.8247163 0.1752837
0.9944897 0.0055103
0.5635365 0.4364635
0.7081489 0.2918511
0.8457624 0.1542376
0.8217713 0.1782287
0.9361965 0.0638035
0.6571396 0.3428604
0.9469687 0.0530313
0.6679039 0.3320961
0.3662390 0.6337610
0.1353950 0.8646050
0.7203299 0.2796701
0.8037103 0.1962897
0.5801313 0.4198687
0.8777692 0.1222308
0.1849274 0.8150726
0.7574552 0.2425448
0.7035945 0.2964055
0.7101104 0.2898896
0.9397339 0.0602661
0.8697660 0.1302340
0.5530858 0.4469142
0.9564711 0.0435289
1.0000000 0.0000000
0.7472861 0.2527139
0.4898438 0.5101562
0.8777135 0.1222865
0.8449216 0.1550784
0.8064412 0.1935588
0.4220825 0.5779175
0.7634547 0.2365453
0.9900348 0.0099652
0.4466670 0.5533330
0.8184308 0.1815692
0.4734109 0.5265891
0.5385043 0.4614957
0.8468837 0.1531163
0.7034853 0.2965147
0.7603807 0.2396193
0.9993026 0.0006974
0.9551000 0.0449000
0.3168629 0.6831371
0.6673877 0.3326123
0.8571452 0.1428548
0.4708052 0.5291948
0.7951256 0.2048744
0.3224132 0.6775868
0.8161645 0.1838355
0.3993722 0.6006278
0.2431834 0.7568166
0.9704583 0.0295417
0.5185468 0.4814532
0.5144539 0.4855461
0.5097621 0.4902379
0.3182060 0.6817940
0.7438338 0.2561662
0.3375915 0.6624085
0.3803587 0.6196413
0.9156769 0.0843231
0.9375244 0.0624756
0.4140165 0.5859835
0.1605616 0.8394384
0.9315907 0.0684093
0.3546898 0.6453102
0.3867605 0.6132395
0.9869486 0.0130514
0.3354193 0.6645807
0.7941213 0.2058787
0.8771686 0.1228314
0.7861615 0.2138385
0.3894639 0.6105361
0.3702634 0.6297366
0.9504416 0.0495584
0.9890958 0.0109042
0.2164738 0.7835262
0.7923666 0.2076334
0.5786187 0.4213813
0.6179173 0.3820827
0.9922112 0.0077888
0.2358606 0.7641394
0.9460469 0.0539531
0.7677454 0.2322546
0.4327615 0.5672385
0.3741935 0.6258065
0.9285794 0.0714206
1.0000000 0.0000000
0.5129678 0.4870322
0.4834312 0.5165688
0.5588346 0.4411654
0.2190966 0.7809034
0.7722984 0.2277016
0.9466899 0.0533101
0.7672096 0.2327904
0.6259445 0.3740555
0.4040882 0.5959118
0.7334570 0.2665430
0.7655839 0.2344161
0.5342878 0.4657122
0.4335564 0.5664436
0.7687409 0.2312591
0.8115833 0.1884167
0.7594367 0.2405633
0.8725719 0.1274281
0.7237934 0.2762066
0.3273091 0.6726909
1.0000000 0.0000000
0.4463396 0.5536604
0.8991056 0.1008944
0.6660197 0.3339803
0.8026648 0.1973352
0.1318157 0.8681843
0.6931784 0.3068216
0.6951226 0.3048774
0.9663438 0.0336562
0.6438646 0.3561354
0.7236637 0.2763363
0.5234890 0.4765110
0.3949860 0.6050140
0.2872608 0.7127392
0.4553043 0.5446957
0.8604079 0.1395921
0.7164289 0.2835711
0.3672569 0.6327431
0.3080166 0.6919834
0.7823454 0.2176546
0.5089942 0.4910058
0.3710407 0.6289593
0.6447013 0.3552987
0.6476921 0.3523079
0.8090871 0.1909129
0.2897449 0.7102551
0.2573655 0.7426345
0.5065743 0.4934257
0.4973454 0.5026546
0.9151375 0.0848625
0.2585220 0.7414780
0.3137697 0.6862303
0.8815412 0.1184588
0.5165378 0.4834622
0.2887595 0.7112405
0.7948187 0.2051813
0.8673883 0.1326117
0.4570330 0.5429670
0.8194691 0.1805309
0.7722619 0.2277381
0.4381033 0.5618967
0.4992987 0.5007013
0.8211480 0.1788520
0.4883085 0.5116915
0.8265114 0.1734886
0.2396470 0.7603530
0.3005724 0.6994276
0.7750899 0.2249101
0.4824107 0.5175893
0.7895885 0.2104115
0.3597846 0.6402154
0.7098168 0.2901832
0.5684569 0.4315431
0.8392271 0.1607729
0.4057890 0.5942110
0.7320033 0.2679967
0.3125608 0.6874392
0.7389559 0.2610441
0.9512146 0.0487854
0.6139107 0.3860893
1.0000000 0.0000000
0.4126807 0.5873193
0.8064383 0.1935617
0.8676532 0.1323468
0.7170095 0.2829905
0.6036759 0.3963241
0.4682400 0.5317600
0.2686380 0.7313620
0.3545171 0.6454829
0.9997057 0.0002943
0.6403927 0.3596073
0.5532675 0.4467325
0.1486354 0.8513646
1.0000000 0.0000000
0.4000965 0.5999035
0.7200523 0.2799477
0.5479540 0.4520460
0.7509986 0.2490014
0.6548618 0.3451382
0.7005645 0.2994355
0.6743960 0.3256040
0.9826967 0.0173033
0.1249561 0.8750439
0.6953989 0.3046011
0.6990534 0.3009466
0.7855017 0.2144983
0.6756954 0.3243046
0.7343849 0.2656151
0.6541569 0.3458431
0.4536366 0.5463634
0.8741733 0.1258267
0.7701916 0.2298084
0.5988770 0.4011230
0.5356322 0.4643678
0.9249948 0.0750052
0.3293119 0.6706881
0.7860330 0.2139670
0.8296150 0.1703850
0.2843814 0.7156186
1.0000000 0.0000000
0.8978159 0.1021841
0.8419341 0.1580659
0.7521686 0.2478314
0.8115936 0.1884064
0.9880707 0.0119293
0.4534803 0.5465197
0.8082856 0.1917144
0.7272316 0.2727684
0.1004556 0.8995444
0.7866387 0.2133613
0.9493692 0.0506308
0.9877911 0.0122089
0.9994863 0.0005137
0.4839945 0.5160055
0.4969700 0.5030300
0.7196644 0.2803356
0.5965981 0.4034019
0.7678521 0.2321479
0.9652011 0.0347989
0.4198042 0.5801958
0.7553829 0.2446171
0.7660610 0.2339390
0.6152015 0.3847985
0.4588119 0.5411881
0.8033645 0.1966355
0.5552260 0.4447740
0.2029033 0.7970967
0.6811544 0.3188456
0.8758827 0.1241173
0.4315573 0.5684427
0.9840478 0.0159522
0.8752037 0.1247963
0.2154201 0.7845799
0.9046823 0.0953177
0.9576644 0.0423356
0.7443416 0.2556584
0.6105802 0.3894198
0.3131720 0.6868280
0.8463113 0.1536887
0.7255025 0.2744975
0.5182707 0.4817293
0.3189069 0.6810931
0.4541079 0.5458921
0.9349405 0.0650595
0.6709060 0.3290940
0.6246675 0.3753325
0.9342970 0.0657030
0.9880141 0.0119859
0.2169591 0.7830409
0.6866609 0.3133391
0.8092495 0.1907505
0.6330750 0.3669250
0.6217990 0.3782010
0.9716277 0.0283723
0.3634945 0.6365055
0.4766357 0.5233643
0.4392632 0.5607368
0.9862878 0.0137122
0.4409559 0.5590441
0.2829687 0.7170313
0.4772767 0.5227233
0.7896336 0.2103664
0.5120947 0.4879053
0.1146456 0.8853544
0.9727512 0.0272488
0.7171907 0.2828093
0.6092091 0.3907909
0.7895809 0.2104191
0.2844713 0.7155287
0.8726368 0.1273632
0.8193421 0.1806579
0.9895863 0.0104137
0.5522083 0.4477917
0.8234543 0.1765457
0.8056758 0.1943242
0.7710204 0.2289796
0.9490031 0.0509969
0.8074944 0.1925056
0.6579108 0.3420892
0.7639545 0.2360455
0.6481976 0.3518024
0.2941543 0.7058457
0.6809186 0.3190814
0.3561624 0.6438376
svmtestpred<-cbind(svm_Pred,SVMTest)

svmtestpred<-svmtestpred%>%
  mutate(prediction=if_else(high>=.5,"high","low"))

confusionMatrix(factor(svmtestpred$prediction),factor(svmtestpred$income2))
Confusion Matrix and Statistics

          Reference
Prediction high low
      high  284  82
      low   116 318
                                          
               Accuracy : 0.7525          
                 95% CI : (0.7211, 0.7821)
    No Information Rate : 0.5             
    P-Value [Acc > NIR] : < 2e-16         
                                          
                  Kappa : 0.505           
                                          
 Mcnemar's Test P-Value : 0.01902         
                                          
            Sensitivity : 0.7100          
            Specificity : 0.7950          
         Pos Pred Value : 0.7760          
         Neg Pred Value : 0.7327          
             Prevalence : 0.5000          
         Detection Rate : 0.3550          
   Detection Prevalence : 0.4575          
      Balanced Accuracy : 0.7525          
                                          
       'Positive' Class : high            
                                          

Support Vector Classification: Unbalanced Classes

So far, we have dealt with classification tasks where we have sufficient examples of all classes in our training data set. Sometimes, however, we are given data sets that are unbalanced, where one or more classes are underrepresented in the training data. In general, this can become very problematic, and can lead to subtle biases that might be difficult to find until it is too late.

A classic example where unbalanced classes can arise is in fraud detection. For a company to remain in business, fraud should be a rare event, ideally well below one percent. Imagine you have been given a set of transactions, and your task is to predict fraud. In this case you might have 9,900 negative examples, and only 100 positive examples. If we simply want to achieve the highest performance model, we can always predict no fraud and our model will be accurate 99% of the time! Clearly this is not appropriate.

This naive approach, known as the zero model, is to always predict the class with the most training labels. While uninformative as a model, it provides a useful baseline for performance, as the short sketch below demonstrates.
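The following minimal sketch (reusing the SVMTrain and SVMTest objects created above) computes the zero-model baseline for the adult sample: predict the majority class of the training data for every test instance and measure the resulting accuracy.

#majority class in the training data
majority_class <- names(which.max(table(SVMTrain$income2)))

#zero model: predict the majority class for every test instance
mean(SVMTest$income2 == majority_class)

Because we drew a stratified sample with equal numbers of high and low earners, this baseline is only about 50% here; on the raw, unbalanced adult data it would be much higher.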

Original <- train(
  form = factor(income2) ~ age+fnlwgt+education_num+factor(occupation)+factor(race)+factor(sex)+
    capital_gain+capital_loss+hours_per_week,
  data = SVMTrain,
  #add roc for AUC
  metric = "ROC",
  #here we add classProbs because we want probs
  trControl = trainControl(method = "cv", number = 10,
                           classProbs =  TRUE,
                           summaryFunction = twoClassSummary),
  method = "svmRadial",
  preProcess = c("center", "scale"),
  tuneLength = 10)


down_inside<-train(
  form = factor(income2) ~ age+fnlwgt+education_num+factor(occupation)+factor(race)+factor(sex)+
    capital_gain+capital_loss+hours_per_week,
  data = SVMTrain,
  #add roc for AUC
  metric = "ROC",
  #here we add classProbs because we want probs
  trControl = trainControl(method = "cv", number = 10,
                           classProbs =  TRUE,
                           summaryFunction = twoClassSummary,
                           sampling = "down"),
  method = "svmRadial",
  preProcess = c("center", "scale"),
  tuneLength = 10)

up_inside<-train(
  form = factor(income2) ~ age+fnlwgt+education_num+factor(occupation)+factor(race)+factor(sex)+
    capital_gain+capital_loss+hours_per_week,
  data = SVMTrain,
  #add roc for AUC
  metric = "ROC",
  #here we add classProbs because we want probs
  trControl = trainControl(method = "cv", number = 10,
                           classProbs =  TRUE,
                           summaryFunction = twoClassSummary,
                           sampling = "up"),
  method = "svmRadial",
  preProcess = c("center", "scale"),
  tuneLength = 10)

smote_inside<-train(
  form = factor(income2) ~ age+fnlwgt+education_num+factor(occupation)+factor(race)+factor(sex)+
    capital_gain+capital_loss+hours_per_week,
  data = SVMTrain,
  #add roc for AUC
  metric = "ROC",
  #here we add classProbs because we want probs
  trControl = trainControl(method = "cv", number = 10,
                           classProbs =  TRUE,
                           summaryFunction = twoClassSummary,
                           sampling = "smote"),
  method = "svmRadial",
  preProcess = c("center", "scale"),
  tuneLength = 10)

inside_models <- list(original = Original,
                      down = down_inside,
                      up = up_inside,
                      SMOTE = smote_inside)

inside_resampling <- resamples(inside_models)

summary(inside_resampling, metric = "ROC")

Call:
summary.resamples(object = inside_resampling, metric = "ROC")

Models: original, down, up, SMOTE 
Number of resamples: 10 

ROC 
              Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
original 0.7650000 0.8059722 0.8168056 0.8150278 0.8281944 0.8558333
down     0.7452778 0.7890972 0.8076389 0.8141389 0.8334722 0.9155556
up       0.7658333 0.7908333 0.8191667 0.8175278 0.8303472 0.8972222
SMOTE    0.7683333 0.8053472 0.8233333 0.8192778 0.8384028 0.8475000
         NA's
original    0
down        0
up          0
SMOTE       0
test_roc <- function(model, data) {
  library(pROC)
  roc_obj <- roc(data$income2, 
                 predict(model, data, type = "prob")[, "high"],
                 levels = c("low", "high"))
  ci(roc_obj)
  }

inside_test <- lapply(inside_models, test_roc, data = SVMTest)
inside_test <- lapply(inside_test, as.vector)
inside_test <- do.call("rbind", inside_test)
colnames(inside_test) <- c("lower", "ROC", "upper")
inside_test <- as.data.frame(inside_test)

knitr::kable(inside_test)
lower ROC upper
original 0.8058726 0.8331250 0.8603774
down 0.7975653 0.8254562 0.8533472
up 0.8048410 0.8321875 0.8595340
SMOTE 0.8054667 0.8327563 0.8600458

The ROC Curve and AUC

While the standard performance metrics are useful for a standard classification task, many algorithms now generate a probabilistic classification. As a result, we need a method to not only compare different estimators, but also to determine the optimal threshold for an estimator. To support this decision, we employ the receiver operating characteristic (ROC) curve. Originally developed during World War II to quantify the performance of individuals using radar systems, the ROC curve displays the relationship between the false positive rate (along the x-axis) and the true positive rate (along the y-axis) as a function of the probability threshold.

The ROC curve starts at the lower left, where nothing has been classified as positive. From here the estimator is used to determine the true and false positives for very high probability thresholds. At this point, the curve should shoot upward from the lower left, since we expect a good classifier (and what other type of classifier would we build?) to perform well at high thresholds. In general, as the probability threshold is lowered, we will begin to predict more false positives, and thus the curve will shift to the right.

To generate a ROC curve, we need to compute the false and true positive rates at a series of probability thresholds. Given an ROC curve, another performance metric that can be measured is the area under the curve, or AUC. In the ideal case this metric has a value of one, or perfect classification, while a random classification has a value of 0.5. This metric provides a useful comparison between different estimators on the same data.

origroc<-pROC::roc(SVMTest$income2, 
                 predict(Original, SVMTest, type = "prob")[, "high"],
                 levels = c("low", "high"))

downroc<-pROC::roc(SVMTest$income2, 
                 predict(down_inside, SVMTest, type = "prob")[, "high"],
                 levels = c("low", "high"))

uproc<-pROC::roc(SVMTest$income2, 
                 predict(up_inside, SVMTest, type = "prob")[, "high"],
                 levels = c("low", "high"))

smoteroc<-pROC::roc(SVMTest$income2, 
                 predict(smote_inside, SVMTest, type = "prob")[, "high"],
                 levels = c("low", "high"))



pROC::ggroc(list(origroc,downroc,uproc,smoteroc))
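To compare these estimators numerically, we can also report the area under each ROC curve; pROC's auc function computes it directly from the roc objects created above (a minimal sketch):

#area under the ROC curve for each sampling strategy
pROC::auc(origroc)
pROC::auc(downroc)
pROC::auc(uproc)
pROC::auc(smoteroc)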

When interpreting a ROC curve, the goal is to come as close as possible to the perfect classifier, which reaches a 100% true positive rate at a zero false positive rate. At a minimum, an estimator should perform better than the baseline diagonal, which is essentially a random guess.

The Gain and Lift Charts

Gain

While the ROC curve provides a lot of detail into the performance of an estimator, and allows the performance of different estimators to be compared, we will sometimes want a different performance metric. For example, we may not have a sufficient budget to target all customers on which we have data. Thus, we can build a classifier to estimate which of our customers makes more than fifty thousand dollars a year. However, even this subset might be too large. What we really want is a way to select those instances from our classified data that we believe have the highest likelihood of being in our target category. Two charts that are useful to accomplish this goal are the gain chart and the related lift chart.

A classic example of where a lift chart is often used is in marketing. If we have a limited budget, we want to optimally target customers who will respond positively to ads or a marketing campaign. As a result, we would build a model and compute lift chart. This chart can be used to infer a cutoff point in the sample, above which we have the highest likelihood of a positive response. This will allow us to target customers optimally given a limited budget.

An alternative example is preventing customer churn, which is where customers migrate from one company to another. We can use a classifier to predict how likely a customer is to churn. With a lift chart, we can identify those customers most likely to churn and focus our limited retention budget on keeping them with our company.

pred<-ROCR::prediction(svmtestpred$high ,factor(svmtestpred$income2))

gain <- ROCR::performance(pred, "tpr", "rpp")

plot(gain, main = "Gain Chart")

Lift

In the following Code chunk, we compute the lift curve for our support vector classifier. In this case, the lift is simply the gain for the estimator divided by the baseline response. To interpret the lift curve, recall that its value at any point indicates the relative improvement of our model over random targeting.

From this curve, we can see that our support vector classifier performs very well. More importantly, if this classification were being used to target customers, we could optimize our results and spend less money by targeting those customers who fall to the left in this chart. In some cases, you may see these curves converted into a profit curve, where the value of the prediction is included. In this example, we might assume a cost per targeted ad to convert the lift curve into a cost curve, which would allow us to determine how many (and which) individuals we should target given a budget. However, these types of curves are generally domain specific, so we do not present them in this module.

pred<-ROCR::prediction(svmtestpred$high ,factor(svmtestpred$income2))

perf <- ROCR::performance(pred,"lift","rpp")

plot(perf, main="Lift curve", colorize=T) 

Another explanation

Don’t be afraid to search for better explanations.

Support Vector Machine: Regression

To this point, we have only applied the support vector machine algorithm to classification tasks. This algorithm can also be applied to regression tasks by using hyperplanes to model the data distribution. Basically, the algorithm works on regression problems by mapping the training data into a higher dimensional space, which might involve a non-linear kernel, and performing something akin to linear regression on the data in this higher dimensional space. Recall that, for best performance, SVMs require that the data be normalized prior to use.
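More precisely, support vector regression fits a function that is allowed to deviate from each training target by up to \(\varepsilon\) without penalty; only points falling outside this \(\varepsilon\)-tube (the support vectors) contribute to the loss. For reference, this epsilon-insensitive loss can be written as shown below (the symbols \(\varepsilon\), \(y\), and \(f(\mathbf{x})\) are introduced here only for this formula):

\[
L_\varepsilon\bigl(y, f(\mathbf{x})\bigr) = \max\bigl(0,\ \lvert y - f(\mathbf{x}) \rvert - \varepsilon\bigr)
\]

Minimizing this loss together with the \(\frac{1}{2}\lVert \mathbf{w} \rVert^2\) regularization term from the classification setting yields the flattest function that stays within \(\varepsilon\) of most of the training points.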

library(caret)
library(tidyverse)
#set the seed :)
set.seed(1)
#get our samples
#using the iris data
#lets split the data 60/40

trainIndex <- createDataPartition(iris$Sepal.Width, p = .6, list = FALSE, times = 1)

#look at the first few
#head(trainIndex)

#grab the data
SVMTrain <- iris[ trainIndex,]
SVMTest  <- iris[-trainIndex,]


iris_SVM <- train(
  form = Sepal.Width ~ Sepal.Length+Petal.Width+Petal.Length,
  data = SVMTrain,
  trControl = trainControl(method = "cv", number = 10),
  method = "svmRadial",
  preProcess = c("center", "scale"),
  tuneLength = 10)

iris_SVM
Support Vector Machines with Radial Basis Function Kernel 

92 samples
 3 predictor

Pre-processing: centered (3), scaled (3) 
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 82, 83, 83, 82, 82, 82, ... 
Resampling results across tuning parameters:

  C       RMSE       Rsquared   MAE      
    0.25  0.3045308  0.5760268  0.2312821
    0.50  0.2983330  0.5812156  0.2241842
    1.00  0.2958941  0.5832409  0.2237046
    2.00  0.2938179  0.5827772  0.2223655
    4.00  0.3009134  0.5676933  0.2288198
    8.00  0.3209230  0.5319581  0.2499194
   16.00  0.3372921  0.5113972  0.2638796
   32.00  0.3485226  0.4911218  0.2784212
   64.00  0.3810849  0.4388651  0.2993569
  128.00  0.4045562  0.3974911  0.3097310

Tuning parameter 'sigma' was held constant at a value of 1.966425
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were sigma = 1.966425 and C = 2.
summary(iris_SVM)
Length  Class   Mode 
     1   ksvm     S4 
svm_Pred<-predict(iris_SVM,SVMTest)

knitr::kable(svm_Pred)%>%
  kableExtra::kable_styling("striped")%>%
  kableExtra::scroll_box(width = "100%",height="300px")
x
3.322331
3.099127
3.391576
3.651888
3.049898
3.419540
3.365906
3.280495
3.038826
3.746301
3.620086
3.484209
3.446001
3.201112
3.691927
3.482859
3.506821
3.091146
3.599222
2.921077
3.074149
2.772444
2.640705
3.020858
2.535498
2.804590
3.012512
3.029405
2.631514
2.437158
2.709163
2.979379
2.723458
2.636843
2.722852
2.746948
2.799339
2.832003
2.445453
2.783559
3.068987
2.889967
3.078554
2.986732
2.865496
2.981215
2.954453
2.868738
2.983954
3.190911
3.084567
2.903104
3.010603
3.162211
3.182064
3.256616
2.924487
2.936688
svmtestpred<-cbind(svm_Pred,SVMTest)
#root mean squared error
RMSE(svmtestpred$svm_Pred,svmtestpred$Sepal.Width)
[1] 0.2791149
#best measure ever...RSquared 
cor(svmtestpred$svm_Pred,svmtestpred$Sepal.Width)^2
[1] 0.5812884
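If you also want the mean absolute error that was reported during cross-validation above, caret provides an MAE helper that works in the same way as RMSE (a minimal sketch, assuming the current svmtestpred data frame):

#mean absolute error on the test set
MAE(svmtestpred$svm_Pred,svmtestpred$Sepal.Width)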

👆😆

Exercise 3

  1. Run a regression using the adult data from the Classification: Adult Data section. Note: be sure to sample the data, i.e. use adult2.

fin


  1. https://uc-r.github.io/svm↩︎