Introduction to Bagging and Random Forest

In this module, we introduce the concept of *bagging*, which is shorthand for bootstrap aggregation, where random samples of the data are used to construct multiple decision trees. Since each tree sees only part of the data, each tree is less accurate than if it had been constructed over the full data set. For this reason, each tree is known as a *weak learner*. A more powerful meta-estimator is subsequently constructed by averaging over these many weak learners. The approach of constructing weak learners and combining them into a more powerful estimator is at the heart of several very powerful machine learning techniques, including the random forest.

We first introduce the formalism behind bagging, including a discussion of the concept of bootstrapping. Next, we move on to a discussion of the random forest algorithm, which will include its application to both classification and regression tasks.

One of the simplest machine learning algorithms to understand is the decision tree. Often, a decision tree is made as large as possible to provide the best predictive model, as this produces a high purity in the leaf nodes. Doing so, however, can lead to overfitting, where the model predicts very accurately on the training data but fails to generalize to the test data; as a result, the accuracy on the test data is much lower.

A simple approach to overcoming the overfitting problem is to train many decision trees, each on a different subset of the data, and to average the resulting predictions. This process is known as bootstrap aggregation, which is often shortened to bagging. Of these two terms, aggregation is the simpler to understand: one simply aggregates (or averages) the predictions of the many trees.

The term bootstrap is a statistical term that describes how a sample can be constructed from an original data set. Given a data set, there are two simple ways to construct a new sample. As a specific example, consider building a list of shows you wish to watch from an online provider like Netflix or Amazon by placing them in a virtual cart. In the first approach, you take a show off the virtual shelf and place it in your cart. This is known as sampling without replacement, since the show is now present only in your cart. In the second approach, you take a show and place it in your cart, but a copy of the show remains on the virtual shelf. This is known as sampling with replacement, since the original instance is replaced and can be selected again.

Sampling with replacement has several advantages that make it important for machine learning. First, we can construct many large samples from our original data set, where each sample is not limited by the size of the original data set. For example, if our original data set contained 100 entries, sampling without replacement would mean we could only create ten new samples that each had ten entries. On the other hand, sampling with replacement means we could create 100 (or more) new samples that each have ten (or more) entries.
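As a small illustration of the difference, the following base R sketch draws from a toy vector of 100 entries. The vector `entries`, the seed, and the sample counts are assumptions chosen for illustration only.

```r
set.seed(42)
entries <- 1:100                      # toy stand-in for a 100-entry data set

# Without replacement: no entry can appear twice, and a sample
# can never be larger than the data set itself.
no_repl <- sample(entries, size = 10, replace = FALSE)
anyDuplicated(no_repl)                # 0, repeats are impossible

# With replacement: we can draw as many samples, of any size, as we like.
boot_samples <- replicate(200, sample(entries, size = 100, replace = TRUE),
                          simplify = FALSE)
length(boot_samples)                  # 200 samples of 100 entries each

# A sample drawn with replacement usually contains repeated entries,
# so it holds fewer than 100 distinct values.
length(unique(boot_samples[[1]]))
```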

Building many samples from a parent population allows us to build an estimator on each sample and average (or aggregate) the results. This is demonstrated in the following figure, where an original data set is used to train a number of decision trees. In this case, each tree is constructed from a bootstrap sample of the original data set. The predictions from these trees are aggregated at the end to make a final prediction.

Beyond improved prediction, bagging algorithms provide an additional benefit. Since each tree (or other learning algorithm in the case of a *Bagging* estimator) is constructed from a subsample of the original data, the performance of that tree can be tested on the original data that were not used in its construction. These data are known as out-of-bag data, and provide a useful metric for the performance of each individual tree used in the ensemble.
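The out-of-bag idea can be seen directly in base R. In this minimal sketch, the number of rows `n` and the seed are arbitrary assumptions; the point is only that the rows never drawn into a bootstrap sample form a ready-made test set.

```r
set.seed(1)
n <- 244                                # hypothetical number of rows

# Row indices drawn with replacement: the "in-bag" rows
in_bag  <- sample(seq_len(n), size = n, replace = TRUE)

# Rows that were never drawn: the "out-of-bag" rows
out_bag <- setdiff(seq_len(n), in_bag)

# Distinct in-bag rows plus out-of-bag rows account for every row
length(unique(in_bag)) + length(out_bag) == n

# Fraction of rows left out; for large n this is close to exp(-1)
length(out_bag) / n
```

Because roughly a third of the rows are left out of each bootstrap sample, every tree in a bagged ensemble has a sizable set of unseen rows on which it can be evaluated.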

Before introducing the random forest, we first explore the construction and use of bootstrap samples.

Formally, a bootstrap refers to any statistical process that relies on the generation of random samples with replacement. To demonstrate the benefit of the bootstrap, we will bootstrap the `size` feature from the *tips* data set, which is the number of patrons served by the restaurant for a meal.

**Bootstrapping** is a method that can be used to estimate the standard error of any statistic and produce a confidence interval for the statistic.

The basic process for bootstrapping is as follows:

- Take *k* repeated samples with replacement from a given dataset.
- For each sample, calculate the statistic you're interested in.
- This results in *k* different estimates for a given statistic, which you can then use to calculate the standard error of the statistic and create a confidence interval for the statistic.
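The process above can be sketched in base R without any packages. The data vector `x`, the value of `k`, and the choice of a percentile interval are assumptions made for this illustration.

```r
set.seed(0)
x <- c(2, 3, 3, 2, 4, 4, 2, 4, 2, 2, 2, 4, 2, 4, 2, 2)  # hypothetical data
k <- 1000                                                # number of resamples

# Draw k samples with replacement and compute the statistic on each one
boot_means <- replicate(k, mean(sample(x, size = length(x), replace = TRUE)))

# The spread of the k estimates is the bootstrap standard error
se <- sd(boot_means)

# Their 2.5% and 97.5% quantiles give a percentile confidence interval
ci <- quantile(boot_means, probs = c(0.025, 0.975))
se
ci
```

Swapping `mean` for another function, such as `median`, gives the standard error and interval for that statistic instead.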

We can use the rsample package:

```
install.packages("rsample")
```

The initial power of the bootstrap results from our ability to extend the creation of this one sample to many.

```
library(curl)
load(curl("https://raw.githubusercontent.com/Professor-Hunt/ACC8143/main/data/tips.rda"))
set.seed(0)
library(rsample)
library(tidyverse)
#perform bootstrapping with 100 replications
resample1 <- bootstraps(as.data.frame(tips$size), times = 100)
#view results of bootstrapping
knitr::kable(head(summary(resample1),5))
```

splits.Length | splits.Class | splits.Mode | id |
---|---|---|---|

4 | boot_split | list | Length:100 |

4 | boot_split | list | Class :character |

4 | boot_split | list | Mode :character |

4 | boot_split | list | NA |

4 | boot_split | list | NA |

```
#info for a specific sample
resample1$splits[[1]]
```

```
<Analysis/Assess/Total>
<244/91/244>
```

```
#mean of the full data set (the data slot of a split holds the original data, not the resample)
mean(resample1$splits[[1]]$data$`tips$size`)
```

`[1] 2.569672`

```
#standard deviation of the full data set
sd(resample1$splits[[1]]$data$`tips$size`)
```

`[1] 0.9510998`

Of course, we do not need to compute these statistics across the entire sample; we can instead compute the mean for each bootstrap sample, creating an array of means. In this case, we can consider each sample mean to be an estimate of the mean of the parent population. We can average (i.e., aggregate) these sample means to provide an estimate of the population mean, along with a measure of the uncertainty in this estimate, by computing the standard deviation of our sample means.

```
#get all of them
mean_values<-purrr::map_dbl(resample1$splits,
  function(x) {
    dat <- as.data.frame(x)$`tips$size`
    mean(dat)
  })
#view the whole dataset
knitr::kable(mean_values)%>%
kableExtra::kable_styling("striped")%>%
kableExtra::scroll_box(width = "50%",height="300px")
```

x |
---|

2.553279 |

2.536885 |

2.524590 |

2.495902 |

2.512295 |

2.692623 |

2.520492 |

2.565574 |

2.491803 |

2.565574 |

2.635246 |

2.581967 |

2.581967 |

2.532787 |

2.512295 |

2.565574 |

2.627049 |

2.647541 |

2.565574 |

2.553279 |

2.545082 |

2.532787 |

2.500000 |

2.545082 |

2.540984 |

2.532787 |

2.524590 |

2.565574 |

2.573771 |

2.545082 |

2.581967 |

2.536885 |

2.561475 |

2.520492 |

2.577869 |

2.635246 |

2.549180 |

2.631147 |

2.569672 |

2.504098 |

2.532787 |

2.610656 |

2.540984 |

2.590164 |

2.606557 |

2.471312 |

2.602459 |

2.622951 |

2.512295 |

2.516393 |

2.577869 |

2.655738 |

2.553279 |

2.528688 |

2.586066 |

2.668033 |

2.622951 |

2.590164 |

2.725410 |

2.709016 |

2.520492 |

2.594262 |

2.581967 |

2.536885 |

2.590164 |

2.573771 |

2.536885 |

2.569672 |

2.717213 |

2.450820 |

2.434426 |

2.540984 |

2.540984 |

2.565574 |

2.696721 |

2.475410 |

2.577869 |

2.627049 |

2.577869 |

2.520492 |

2.639344 |

2.668033 |

2.540984 |

2.471312 |

2.606557 |

2.618853 |

2.573771 |

2.553279 |

2.663934 |

2.553279 |

2.610656 |

2.569672 |

2.540984 |

2.606557 |

2.487705 |

2.606557 |

2.545082 |

2.524590 |

2.553279 |

2.524590 |

```
#estimate of the population mean
mean(mean_values)
```

`[1] 2.568484`

```
#get all of them
sd_values<-purrr::map_dbl(resample1$splits,
  function(x) {
    dat <- as.data.frame(x)$`tips$size`
    sd(dat)
  })
#view the whole dataset
knitr::kable(sd_values)%>%
kableExtra::kable_styling("striped")%>%
kableExtra::scroll_box(width = "50%",height="300px")
```

x |
---|

0.9391077 |

0.9041924 |

0.9000513 |

0.8725251 |

0.9664391 |

1.0500769 |

0.9662995 |

0.9250755 |

0.9049103 |

0.9978389 |

0.9396105 |

1.0331156 |

0.9672851 |

0.9616894 |

0.9361578 |

0.9025588 |

0.9319687 |

1.0297799 |

1.0382615 |

0.9214128 |

1.0433093 |

0.8671834 |

0.9184886 |

0.8946666 |

0.8856194 |

0.9994602 |

0.9226292 |

0.9161352 |

0.9288963 |

0.8713645 |

1.0048456 |

0.9615403 |

0.9981008 |

0.9956646 |

1.0051728 |

0.9782341 |

0.9216416 |

0.9401937 |

1.0458926 |

0.8725251 |

0.9179009 |

0.9072463 |

0.8809604 |

0.9876113 |

0.9476623 |

0.8529763 |

0.9695753 |

1.0369937 |

0.8436720 |

0.9915993 |

0.8970012 |

1.0405008 |

0.9214128 |

0.8673292 |

0.9796124 |

1.0543085 |

0.9671805 |

0.9362749 |

1.0669029 |

1.1080331 |

0.9403999 |

0.9830497 |

0.9924834 |

0.8622586 |

0.9665526 |

0.9420932 |

0.8950436 |

0.9809200 |

1.1644383 |

0.8423215 |

0.8887750 |

0.9825434 |

0.9130741 |

0.9295134 |

1.0052483 |

0.8954675 |

0.9633628 |

1.0364975 |

0.9590815 |

0.8816780 |

0.9477335 |

0.9981684 |

0.8620727 |

0.8673292 |

0.8940349 |

1.0093254 |

0.9244554 |

0.9169357 |

0.9653303 |

0.8988044 |

0.9686703 |

0.9682524 |

1.0032666 |

1.0107030 |

0.8771519 |

0.9733687 |

0.9307373 |

0.9576473 |

0.9303023 |

0.8815728 |

```
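#spread of the bootstrap standard deviations, i.e. the uncertainty in the estimate
#(mean(sd_values) would give the estimate of the population standard deviation itself)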
sd(sd_values)
```

`[1] 0.06087578`

This simple example has demonstrated how bootstrap aggregation, in this case of the sample means, can provide a powerful estimator of a population statistic. In each case, we generate multiple samples with replacement, compute statistics across these samples, and aggregate the result at the end. This concept underlies all bagging estimators.

- Redo the bootstrap analysis, but use 10 samples. Change `times = 100` to `times = 10`. How does the population estimate for mean change?
- Redo the bootstrap analysis, but find the median. What is the difference in population estimate for the median vs mean?

A random forest employs bagging to create a set of decision trees from a given data set. Each tree is constructed from a bootstrap sample, and the final prediction is generated by aggregating the predictions of the individual trees, just like the previous code example demonstrated by using the mean of the sample means to estimate the mean of the parent population. However, the random forest introduces one additional random concept into the tree construction.

Normally, when deciding on a split point during the construction of a decision tree, all features are evaluated and the one that produces the largest decrease in impurity (equivalently, the largest information gain) is selected as the feature on which to split, along with the value at which to split that feature. In a random forest, only a random subset of the features is considered at each split, and the best feature on which to split is selected from this subset.
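To make the split choice concrete, here is a minimal base R sketch that scores candidate splits by their reduction in Gini impurity, first over a random subset of features and then over all features. The helper names `gini` and `best_split`, the subset size `mtry <- 2`, and the use of midpoints as thresholds are assumptions for illustration; this is not the implementation used by any particular package.

```r
# Gini impurity of a vector of class labels
gini <- function(y) {
  p <- table(y) / length(y)
  1 - sum(p^2)
}

# Best (feature, threshold) pair among the candidate features,
# scored by the reduction in weighted Gini impurity
best_split <- function(data, target, candidates) {
  y <- data[[target]]
  best <- list(feature = NA, threshold = NA, gain = -Inf)
  for (f in candidates) {
    values <- sort(unique(data[[f]]))
    # try the midpoints between consecutive observed values
    for (t in head(values, -1) + diff(values) / 2) {
      left  <- y[data[[f]] <= t]
      right <- y[data[[f]] >  t]
      child <- (length(left) * gini(left) + length(right) * gini(right)) / length(y)
      gain  <- gini(y) - child
      if (gain > best$gain) best <- list(feature = f, threshold = t, gain = gain)
    }
  }
  best
}

set.seed(1)
features   <- setdiff(names(iris), "Species")
mtry       <- 2                                      # size of the random subset
candidates <- sample(features, mtry)                 # random forest: a subset only
split_rf   <- best_split(iris, "Species", candidates)
split_tree <- best_split(iris, "Species", features)  # plain tree: all features
```

The split found from the subset can never beat the split found from all features, which is one reason each individual tree in the forest is weaker than a fully optimized tree.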

This extra randomness produces individual decision trees that are less sensitive to small-scale fluctuations in the training data; taken on its own, such a tree tends to under-fit. As a result, each newly created decision tree is a weak learner, since it is not constructed from all available information. Yet, because each decision tree is constructed from different sets of features, aggregating their predictions yields a final random forest prediction that is improved and less affected by overfitting.
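For classification, aggregation amounts to a vote among the trees. The sketch below uses a made-up matrix of votes from five hypothetical trees on three observations; the votes themselves are assumptions, and only the tallying logic is the point.

```r
# Hypothetical votes: one row per tree, one column per observation
votes <- rbind(
  tree1 = c("setosa", "versicolor", "virginica"),
  tree2 = c("setosa", "virginica",  "virginica"),
  tree3 = c("setosa", "versicolor", "versicolor"),
  tree4 = c("setosa", "versicolor", "virginica"),
  tree5 = c("setosa", "virginica",  "virginica")
)
classes <- c("setosa", "versicolor", "virginica")

# Share of trees voting for each class (rows), per observation (columns)
vote_share <- apply(votes, 2, function(v) {
  table(factor(v, levels = classes)) / length(v)
})

# The ensemble predicts the class with the largest share of votes
forest_pred <- classes[apply(vote_share, 2, which.max)]
forest_pred
```

Individual trees disagree on the second and third observations, but the majority vote still settles on a single class for each.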

Each tree in the random forest is constructed from a different combination of features. As a result, we can use the *out-of-bag* performance from each tree to rank the importance of the features used to construct the trees in the forest. This allows for robust estimates of feature importance to be computed after constructing a random forest, which can provide useful insight into the nature of a training data set.
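One common way to turn held-out performance into an importance ranking is permutation importance: shuffle one feature at a time and measure how much the error grows. The sketch below is only an illustration of that idea under loud assumptions. It uses a linear model as a stand-in for a tree ensemble, a single held-out set rather than per-tree out-of-bag samples, and an arbitrary 100-row training split.

```r
set.seed(1)
# Stand-in model (an assumption): a linear regression instead of a forest
train_rows <- sample(nrow(iris), 100)
held_out   <- iris[-train_rows, ]
fit <- lm(Petal.Width ~ Sepal.Length + Sepal.Width + Petal.Length,
          data = iris[train_rows, ])

# Root mean squared error of a model on a data set
rmse <- function(model, data) {
  sqrt(mean((data$Petal.Width - predict(model, data))^2))
}
baseline <- rmse(fit, held_out)

# Shuffle one feature at a time and record how much the error grows
features <- c("Sepal.Length", "Sepal.Width", "Petal.Length")
importance <- sapply(features, function(f) {
  shuffled <- held_out
  shuffled[[f]] <- sample(shuffled[[f]])
  rmse(fit, shuffled) - baseline
})
sort(importance, decreasing = TRUE)
```

A feature whose shuffling barely changes the error contributes little to the predictions, while a large increase marks a feature the model relies on.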

Having completed the discussion on bootstrap aggregation, and introduced the random forest algorithm, we can now transition to putting this powerful ensemble algorithm to work.

```
set.seed(1)
#let's split the data 60/40
library(caret)
trainIndex <- createDataPartition(iris$Species, p = .6, list = FALSE, times = 1)
#grab the data
irisTrain <- iris[ trainIndex,]
irisTest <- iris[-trainIndex,]
ggplot(data=irisTrain)+
  geom_point(mapping = aes(x=Petal.Length,y=Petal.Width,color=Species),alpha=0.5) +
  labs(color = "Training Species")+
  geom_point(data=irisTest, mapping = aes(x=Petal.Length,y=Petal.Width,shape=Species)) +
  labs(shape = "Testing Species") +
  ggtitle("The data")+
  theme(plot.title = element_text(hjust=0.5, size=10, face='bold'))
```

The model

```
set.seed(1)
IrisRF<- train(
form = factor(Species) ~ .,
data = irisTrain,
#here we add classProbs because we want probs
trControl = trainControl(method = "cv", number = 10,
classProbs = TRUE),
method = "rf",
tuneLength = 3)#why 3?
IrisRF
```

```
Random Forest
90 samples
4 predictor
3 classes: 'setosa', 'versicolor', 'virginica'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 81, 81, 81, 81, 81, 81, ...
Resampling results across tuning parameters:
mtry Accuracy Kappa
2 0.9666667 0.95
3 0.9666667 0.95
4 0.9666667 0.95
Accuracy was used to select the optimal model using the
largest value.
The final value used for the model was mtry = 2.
```

```
summary(IrisRF)
```

```
Length Class Mode
call 4 -none- call
type 1 -none- character
predicted 90 factor numeric
err.rate 2000 -none- numeric
confusion 12 -none- numeric
votes 270 matrix numeric
oob.times 90 -none- numeric
classes 3 -none- character
importance 4 -none- numeric
importanceSD 0 -none- NULL
localImportance 0 -none- NULL
proximity 0 -none- NULL
ntree 1 -none- numeric
mtry 1 -none- numeric
forest 14 -none- list
y 90 factor numeric
test 0 -none- NULL
inbag 0 -none- NULL
xNames 4 -none- character
problemType 1 -none- character
tuneValue 1 data.frame list
obsLevels 3 -none- character
param 0 -none- list
```

```
IrisRF_Pred<-predict(IrisRF,irisTest,type="prob")
knitr::kable(IrisRF_Pred)%>%
kableExtra::kable_styling("striped")%>%
kableExtra::scroll_box(width = "50%",height="300px")
```

| | setosa | versicolor | virginica |
---|---|---|---|

2 | 0.996 | 0.004 | 0.000 |

5 | 1.000 | 0.000 | 0.000 |

8 | 1.000 | 0.000 | 0.000 |

11 | 0.998 | 0.002 | 0.000 |

13 | 0.996 | 0.004 | 0.000 |

16 | 0.964 | 0.036 | 0.000 |

17 | 0.998 | 0.002 | 0.000 |

19 | 0.964 | 0.036 | 0.000 |

24 | 1.000 | 0.000 | 0.000 |

28 | 1.000 | 0.000 | 0.000 |

30 | 1.000 | 0.000 | 0.000 |

32 | 0.994 | 0.006 | 0.000 |

35 | 1.000 | 0.000 | 0.000 |

37 | 0.940 | 0.060 | 0.000 |

38 | 1.000 | 0.000 | 0.000 |

42 | 0.946 | 0.050 | 0.004 |

44 | 1.000 | 0.000 | 0.000 |

45 | 1.000 | 0.000 | 0.000 |

46 | 0.996 | 0.004 | 0.000 |

50 | 1.000 | 0.000 | 0.000 |

53 | 0.000 | 0.686 | 0.314 |

55 | 0.000 | 0.956 | 0.044 |

59 | 0.000 | 0.956 | 0.044 |

60 | 0.006 | 0.982 | 0.012 |

63 | 0.000 | 0.942 | 0.058 |

65 | 0.000 | 1.000 | 0.000 |

66 | 0.002 | 0.954 | 0.044 |

70 | 0.000 | 0.998 | 0.002 |

71 | 0.004 | 0.112 | 0.884 |

76 | 0.000 | 0.956 | 0.044 |

77 | 0.000 | 0.814 | 0.186 |

78 | 0.000 | 0.114 | 0.886 |

81 | 0.000 | 1.000 | 0.000 |

83 | 0.000 | 1.000 | 0.000 |

84 | 0.000 | 0.448 | 0.552 |

85 | 0.058 | 0.894 | 0.048 |

86 | 0.014 | 0.968 | 0.018 |

87 | 0.002 | 0.950 | 0.048 |

95 | 0.000 | 1.000 | 0.000 |

97 | 0.000 | 1.000 | 0.000 |

103 | 0.000 | 0.002 | 0.998 |

104 | 0.000 | 0.010 | 0.990 |

105 | 0.000 | 0.000 | 1.000 |

106 | 0.000 | 0.000 | 1.000 |

108 | 0.000 | 0.000 | 1.000 |

112 | 0.000 | 0.000 | 1.000 |

114 | 0.000 | 0.034 | 0.966 |

115 | 0.000 | 0.000 | 1.000 |

124 | 0.000 | 0.056 | 0.944 |

127 | 0.000 | 0.132 | 0.868 |

130 | 0.000 | 0.446 | 0.554 |

134 | 0.000 | 0.480 | 0.520 |

135 | 0.000 | 0.496 | 0.504 |

136 | 0.000 | 0.000 | 1.000 |

137 | 0.000 | 0.004 | 0.996 |

138 | 0.000 | 0.008 | 0.992 |

142 | 0.000 | 0.002 | 0.998 |

144 | 0.000 | 0.002 | 0.998 |

145 | 0.000 | 0.002 | 0.998 |

147 | 0.000 | 0.048 | 0.952 |

```
irisrftestpred<-cbind(IrisRF_Pred,irisTest)
irisrftestpred<-irisrftestpred%>%
mutate(prediction=if_else(setosa>versicolor & setosa>virginica,"setosa",
if_else(versicolor>setosa & versicolor>virginica, "versicolor",
if_else(virginica>setosa & virginica>versicolor,"virginica", "PROBLEM"))))
table(irisrftestpred$prediction)
```

```
setosa versicolor virginica
20 17 23
```

```
confusionMatrix(factor(irisrftestpred$prediction),factor(irisrftestpred$Species))
```

```
Confusion Matrix and Statistics
Reference
Prediction setosa versicolor virginica
setosa 20 0 0
versicolor 0 17 0
virginica 0 3 20
Overall Statistics
Accuracy : 0.95
95% CI : (0.8608, 0.9896)
No Information Rate : 0.3333
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.925
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: setosa Class: versicolor Class: virginica
Sensitivity 1.0000 0.8500 1.0000
Specificity 1.0000 1.0000 0.9250
Pos Pred Value 1.0000 1.0000 0.8696
Neg Pred Value 1.0000 0.9302 1.0000
Prevalence 0.3333 0.3333 0.3333
Detection Rate 0.3333 0.2833 0.3333
Detection Prevalence 0.3333 0.2833 0.3833
Balanced Accuracy 1.0000 0.9250 0.9625
```
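The headline numbers in this output can be recomputed by hand from the confusion matrix, which is a useful check on what each statistic means. The sketch below re-enters the counts shown above; the object names are arbitrary.

```r
# Confusion matrix from the output above (rows = prediction, columns = reference)
cm <- matrix(c(20,  0,  0,
                0, 17,  0,
                0,  3, 20),
             nrow = 3, byrow = TRUE,
             dimnames = list(prediction = c("setosa", "versicolor", "virginica"),
                             reference  = c("setosa", "versicolor", "virginica")))

accuracy    <- sum(diag(cm)) / sum(cm)   # 57 / 60 = 0.95
sensitivity <- diag(cm) / colSums(cm)    # correct / all true members of the class
ppv         <- diag(cm) / rowSums(cm)    # correct / all predictions of the class
round(sensitivity, 4)
round(ppv, 4)
```

The three versicolor flowers predicted as virginica lower the versicolor sensitivity to 17/20 and the virginica positive predictive value to 20/23, matching the values reported by `confusionMatrix`.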

As the previous example demonstrated, the random forest is easy to use and often provides impressive results. In addition, by its very nature, a random forest provides an implicit measure of the importance of the individual features in generating the final predictions. While an individual decision tree provides this information, the random forest provides an aggregated result that is generally more insightful and less sensitive to fluctuations in the training data that might bias the importance values determined by a decision tree. In the calculation of feature importance from a random forest, higher values indicate a more important feature.

| | Overall |
---|---|

Petal.Width | 100.00000 |

Petal.Length | 85.27446 |

Sepal.Length | 14.94842 |

Sepal.Width | 0.00000 |

1. Use the tips data and predict `sex`. Use only continuous variables. What are the results?
2. Use the tips data and predict `sex`. Use all of the variables. Did the model improve from 1?
3. What are the most important variables from 2?

```
set.seed(1)
#let's split the data 60/40
library(caret)
trainIndex <- createDataPartition(iris$Species, p = .6, list = FALSE, times = 1)
#grab the data
train <- iris[ trainIndex,]
test <- iris[-trainIndex,]
IrisRF<- train(
form = factor(Species) ~ .,
data = train,
#here we add classProbs because we want probs
trControl = trainControl(method = "cv", number = 10,
classProbs = TRUE),
method = "rf",
tuneGrid=data.frame(mtry=1))
pl = seq(min(iris$Petal.Length), max(iris$Petal.Length), by=0.1)
pw = seq(min(iris$Petal.Width), max(iris$Petal.Width), by=0.1)
# generates the boundaries for your graph
lgrid <- expand.grid(Petal.Length=pl,
Petal.Width=pw,
Sepal.Length = 5.4,
Sepal.Width=3.1)
IrisRFGrid2 <- predict(IrisRF, newdata=lgrid)
IrisRFGrid <- as.numeric(IrisRFGrid2)
# get the points from the test data...
testPred <- predict(IrisRF, newdata=test)
testPred <- as.numeric(testPred)
# this gets the points for the testPred...
test$Pred <- testPred
probs <- matrix(IrisRFGrid, length(pl), length(pw))
ggplot(data=lgrid) + stat_contour(aes(x=Petal.Length, y=Petal.Width, z=IrisRFGrid),bins=10) +
geom_point(aes(x=Petal.Length, y=Petal.Width, colour=IrisRFGrid2),alpha=.2) +
geom_point(data=test, aes(x=Petal.Length, y=Petal.Width, shape=Species), size=2) +
labs(shape = "Testing Species") +
geom_point(data=train, aes(x=Petal.Length, y=Petal.Width, color=Species), size=2, alpha=0.75)+
theme_bw()+
labs(color = "Training Species")+
ggtitle("Decision Surface mtry=1")
```

A random forest can also be used to perform regression; in this case, however, the goal is to create trees whose leaf nodes contain data that are nearby in the overall feature space. To predict a continuous value from a tree, we either have leaf nodes with only one instance, and use the target value of that instance as our prediction, or we compute a summary statistic, such as the mean or median, from the target values of the instances in the appropriate leaf node. In the end, the random forest aggregates the individual tree regression predictions into a final prediction.
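The two levels of averaging can be written out in a few lines of base R. All numbers below are made up for illustration: `leaf_targets` stands for the target values of the training instances in one leaf, and `tree_predictions` for the outputs of five hypothetical trees on the same new observation.

```r
# Hypothetical leaf: target values of the training instances that landed in it
leaf_targets <- c(3.0, 3.2, 3.1, 2.9)
tree_prediction <- mean(leaf_targets)        # one tree's prediction for this leaf

# Hypothetical predictions from five trees for the same new observation
tree_predictions <- c(3.05, 2.90, 3.20, 3.10, 3.00)
forest_prediction <- mean(tree_predictions)  # the ensemble averages the trees
forest_prediction
```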

```
set.seed(1)
IrisRF<- train(
form = Sepal.Width ~ .,
data = irisTrain,
#no classProbs here: this is a regression problem
trControl = trainControl(method = "cv", number = 10),
method = "rf",
tuneLength = 3)#why 3?
IrisRF
```

```
Random Forest
90 samples
4 predictor
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 81, 81, 80, 80, 81, 81, ...
Resampling results across tuning parameters:
mtry RMSE Rsquared MAE
2 0.3143813 0.5629816 0.2465506
3 0.3151144 0.5667041 0.2440860
5 0.3250253 0.5507325 0.2510316
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was mtry = 2.
```

```
summary(IrisRF)
```

```
Length Class Mode
call 4 -none- call
type 1 -none- character
predicted 90 -none- numeric
mse 500 -none- numeric
rsq 500 -none- numeric
oob.times 90 -none- numeric
importance 5 -none- numeric
importanceSD 0 -none- NULL
localImportance 0 -none- NULL
proximity 0 -none- NULL
ntree 1 -none- numeric
mtry 1 -none- numeric
forest 11 -none- list
coefs 0 -none- NULL
y 90 -none- numeric
test 0 -none- NULL
inbag 0 -none- NULL
xNames 5 -none- character
problemType 1 -none- character
tuneValue 1 data.frame list
obsLevels 1 -none- logical
param 0 -none- list
```

```
IrisRF_Pred<-predict(IrisRF,irisTest)
knitr::kable(IrisRF_Pred)%>%
kableExtra::kable_styling("striped")%>%
kableExtra::scroll_box(width = "50%",height="300px")
```

| | x |
---|---|

2 | 3.260807 |

5 | 3.277757 |

8 | 3.276376 |

11 | 3.625285 |

13 | 3.292194 |

16 | 3.780434 |

17 | 3.618677 |

19 | 3.727191 |

24 | 3.635188 |

28 | 3.591418 |

30 | 3.261777 |

32 | 3.696874 |

35 | 3.253701 |

37 | 3.701131 |

38 | 3.291458 |

42 | 3.302038 |

44 | 3.441303 |

45 | 3.635421 |

46 | 3.385235 |

50 | 3.277757 |

53 | 2.909713 |

55 | 2.974078 |

59 | 2.889188 |

60 | 2.528744 |

63 | 2.622728 |

65 | 2.617352 |

66 | 2.897548 |

70 | 2.601872 |

71 | 2.980619 |

76 | 2.887018 |

77 | 3.027017 |

78 | 2.965986 |

81 | 2.457501 |

83 | 2.670909 |

84 | 2.845308 |

85 | 2.657226 |

86 | 2.918538 |

87 | 3.007142 |

95 | 2.857735 |

97 | 2.864092 |

103 | 3.094478 |

104 | 2.919063 |

105 | 3.038040 |

106 | 3.145111 |

108 | 3.058734 |

112 | 2.931478 |

114 | 2.810783 |

115 | 2.952067 |

124 | 2.919760 |

127 | 2.924434 |

130 | 2.848639 |

134 | 2.699231 |

135 | 2.728545 |

136 | 3.257096 |

137 | 3.127867 |

138 | 2.942718 |

142 | 3.114100 |

144 | 3.106644 |

145 | 3.227185 |

147 | 2.875940 |

```
irisrftestpred<-cbind(IrisRF_Pred,irisTest)
#root mean squared error
RMSE(irisrftestpred$IrisRF_Pred,irisrftestpred$Sepal.Width)
```

`[1] 0.254781`

```
#best measure ever...RSquared
cor(irisrftestpred$IrisRF_Pred,irisrftestpred$Sepal.Width)^2
```

`[1] 0.6169456`
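Both measures are simple enough to compute by hand, which clarifies what they do and do not capture. The vectors below are hypothetical values, not results from the model above.

```r
# Hypothetical observed and predicted values
observed  <- c(3.0, 3.4, 2.8, 3.1, 2.5)
predicted <- c(3.1, 3.3, 2.9, 3.0, 2.8)

# Root mean squared error: the typical size of a prediction error,
# in the same units as the response
rmse <- sqrt(mean((observed - predicted)^2))

# R-squared computed as the squared Pearson correlation
r2 <- cor(observed, predicted)^2
rmse
r2
```

Note that the squared correlation measures only how well the predictions track the observations linearly; predictions that are consistently too high by a constant would keep the same value, so it is worth reading it alongside the RMSE.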

- Rerun the regression analysis, but change `form = Sepal.Width ~ .,` to `form = Sepal.Width ~ Sepal.Length+Petal.Length+Petal.Width+factor(Species),`. Does anything change? Why?

```
iristrain_2levels<-irisTrain%>%
filter(Species!="setosa")
set.seed(1)
IrisRF<- train(
form = factor(Species) ~ .,
data = iristrain_2levels,
#here we add classProbs because we want probs
trControl = trainControl(method = "cv", number = 10,
classProbs = TRUE),
method = "rf",
tuneLength = 3)#why 3?
IrisRF
```

```
Random Forest
60 samples
4 predictor
2 classes: 'versicolor', 'virginica'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 54, 54, 54, 54, 54, 54, ...
Resampling results across tuning parameters:
mtry Accuracy Kappa
2 0.95 0.9
3 0.95 0.9
4 0.95 0.9
Accuracy was used to select the optimal model using the
largest value.
The final value used for the model was mtry = 2.
```

```
summary(IrisRF)
```

```
Length Class Mode
call 4 -none- call
type 1 -none- character
predicted 60 factor numeric
err.rate 1500 -none- numeric
confusion 6 -none- numeric
votes 120 matrix numeric
oob.times 60 -none- numeric
classes 2 -none- character
importance 4 -none- numeric
importanceSD 0 -none- NULL
localImportance 0 -none- NULL
proximity 0 -none- NULL
ntree 1 -none- numeric
mtry 1 -none- numeric
forest 14 -none- list
y 60 factor numeric
test 0 -none- NULL
inbag 0 -none- NULL
xNames 4 -none- character
problemType 1 -none- character
tuneValue 1 data.frame list
obsLevels 2 -none- character
param 0 -none- list
```

```
iristest_2levels<-irisTest%>%
filter(Species!="setosa")
IrisRF_Pred<-predict(IrisRF,iristest_2levels,type="prob")
irisrftestpred<-cbind(IrisRF_Pred,iristest_2levels)
rocobj<-pROC::roc(factor(irisrftestpred$Species), irisrftestpred$versicolor)
rocobj
```

```
Call:
roc.default(response = factor(irisrftestpred$Species), predictor = irisrftestpred$versicolor)
Data: irisrftestpred$versicolor in 20 controls (factor(irisrftestpred$Species) versicolor) > 20 cases (factor(irisrftestpred$Species) virginica).
Area under the curve: 0.9725
```
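The area under the ROC curve has a direct interpretation: it is the share of (positive, negative) pairs in which the positive case receives the higher score. The sketch below computes it that way in base R on made-up scores and labels; it illustrates the definition and is not how pROC computes the value internally.

```r
# Hypothetical scores: higher means "more likely positive"
scores <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2)
labels <- c(1,   1,   0,   1,   0,    1,   0,   0)   # 1 = positive, 0 = negative

pos <- scores[labels == 1]
neg <- scores[labels == 0]

# Compare every positive score with every negative score;
# a correctly ordered pair counts 1 and a tie counts one half
pairs <- outer(pos, neg, "-")
auc <- mean((pairs > 0) + 0.5 * (pairs == 0))
auc
```

Here 13 of the 16 pairs are ordered correctly, so the area is 13/16. A value of 0.5 corresponds to random ordering and 1 to perfect separation.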

```
plot(rocobj)
```

```
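#the scores are versicolor probabilities, so make versicolor the positive class
#(label.ordering lists the negative class first, then the positive class)
pred<-ROCR::prediction(irisrftestpred$versicolor ,factor(irisrftestpred$Species), label.ordering = c("virginica", "versicolor"))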
gain <- ROCR::performance(pred, "tpr", "rpp")
plot(gain, main = "Gain Chart")
```

```
perf <- ROCR::performance(pred,"lift","rpp")
plot(perf, main="Lift curve")
```

fin