CART: Classification and Regression Trees
Zhiguang Huo (Caleb)
Wednesday October 25, 2017
library(ElemStatLearn)
library(rpart)
library(rpart.plot)
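# Motivating example: fit a tree for svi using the other prostate variables, then plot it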
afit <- rpart(svi ~ . - train, data = prostate)
rpart.plot(afit)
Purpose: predict whether a student will play cricket in his/her leisure time.
Questions:
Terminology:
Goodness of split (GOS) criteria
Commonly used impurity measures \(M(t)\): Gini index and entropy.
Assume:
\[M_{Gini}(t) = 1 - P(Y = 0 | X \in t)^2 - P(Y = 1 | X \in t)^2\]
Given an impurity function \(M(t)\), the GOS criterion is to find the split \(t_L\) and \(t_R\) of node \(t\) such that the impurity measure is maximally decreased:
\[\arg \max_{t_R, t_L} M(t) - [P(X\in t_L|X\in t) M(t_L) + P(X\in t_R|X\in t) M(t_R)]\]
If split on Gender: \[M_{Gini}(R) - \frac{10}{30}M_{Gini}(G:F) - \frac{20}{30}M_{Gini}(G:M) = 0.5 - 10/30\times 0.32 - 20/30\times 0.455 = 0.09 \]
If split on Class: \[M_{Gini}(R) - \frac{14}{30}M_{Gini}(C:4) - \frac{16}{30}M_{Gini}(C:5) = 0.5 - 14/30\times 0.4902 - 16/30\times 0.4928 = 0.008 \]
If split on Height (cutoff 5.5 ft): \[M_{Gini}(R) - \frac{12}{30}M_{Gini}(H:5.5-) - \frac{18}{30}M_{Gini}(H:5.5+) = 0.5 - 12/30\times 0.4872 - 18/30\times 0.4928 = 0.00944 \]
Splitting on Gender yields the largest decrease in Gini impurity (0.09, versus 0.008 for Class and 0.00944 for Height), so we split on Gender.
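To double-check the arithmetic, here is a small R sketch. The counts (15 of 30 students play overall; 2 of 10 females and 13 of 20 males play) are the ones implied by the impurity values above:
# Gini impurity of a binary node with positive-class proportion p
gini <- function(p) 1 - p^2 - (1 - p)^2
# GOS for the Gender split: root impurity minus weighted child impurities
gini(15/30) - (10/30) * gini(2/10) - (20/30) * gini(13/20)
## [1] 0.09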
Assume:
\[M_{entropy}(t) = - P(Y = 0 | X \in t)\log P(Y = 0 | X \in t) - P(Y = 1 | X \in t)\log P(Y = 1 | X \in t)\]
Given an impurity function \(M(t)\), the GOS criterion is to find the split \(t_L\) and \(t_R\) of node \(t\) such that the impurity measure is maximally decreased:
\[\arg \max_{t_R, t_L} M(t) - [P(X\in t_L|X\in t) M(t_L) + P(X\in t_R|X\in t) M(t_R)]\]
If split on Gender: \[M_{entropy}(R) - \frac{10}{30}M_{entropy}(G:F) - \frac{20}{30}M_{entropy}(G:M) = 0.6931472 - 10/30\times 0.5004024 - 20/30\times 0.6474466 = 0.09471533 \]
If split on Class: \[M_{entropy}(R) - \frac{14}{30}M_{entropy}(C:4) - \frac{16}{30}M_{entropy}(C:5) = 0.6931472 - 14/30\times 0.6833149 - 16/30\times 0.6859298 = 0.008437687 \]
If split on Height (cutoff 5.5 ft): \[M_{entropy}(R) - \frac{12}{30}M_{entropy}(H:5.5-) - \frac{18}{30}M_{entropy}(H:5.5+) = 0.6931472 - 12/30 \times 0.680292 - 18/30\times 0.6859298 = 0.00947252 \]
Again Gender yields the largest decrease, so the entropy criterion also splits on Gender.
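The same check for the entropy criterion, under the same assumed counts:
# Entropy (natural log) of a binary node with positive-class proportion p
entropy <- function(p) -p * log(p) - (1 - p) * log(1 - p)
# GOS for the Gender split under entropy
entropy(15/30) - (10/30) * entropy(2/10) - (20/30) * entropy(13/20)
## [1] 0.09471533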
Complexity for each split:
Construct the tree structure:
How to make a prediction:
For classification, the proportion of class \(k\) among the \(N_m\) training samples in terminal node \(t_m\) is \[\hat{p}_{mk} = \frac{1}{N_m} \sum_{x_i \in t_m} \mathbb{I}(y_i = k),\] and the predicted label is the majority class \(\arg\max_k \hat{p}_{mk}\).
E.g. if a new subject (G:Male, Class:X, H:5.8ft) falls into a node, we just do a majority vote.
For regression, the prediction is the average response of the training samples in the terminal node: \[\hat{c}_{m} = \text{ave}(y_i|x_i \in t_m)\]
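A tiny sketch of both prediction rules by hand; the node members below are made up for illustration:
# Hypothetical training responses falling into the same terminal node
y_node <- c(1, 1, 0, 1)
p_hat <- mean(y_node == 1)            # class-1 proportion, 0.75
as.integer(p_hat > 0.5)               # classification: majority vote -> 1
y_node_reg <- c(2.3, 1.8, 2.9)        # hypothetical continuous responses
mean(y_node_reg)                      # regression: node average -> 2.333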
library(ElemStatLearn)
library(rpart)
library(rpart.plot)
head(prostate)
## lcavol lweight age lbph svi lcp gleason pgg45 lpsa
## 1 -0.5798185 2.769459 50 -1.386294 0 -1.386294 6 0 -0.4307829
## 2 -0.9942523 3.319626 58 -1.386294 0 -1.386294 6 0 -0.1625189
## 3 -0.5108256 2.691243 74 -1.386294 0 -1.386294 7 20 -0.1625189
## 4 -1.2039728 3.282789 58 -1.386294 0 -1.386294 6 0 -0.1625189
## 5 0.7514161 3.432373 62 -1.386294 0 -1.386294 6 0 0.3715636
## 6 -1.0498221 3.228826 50 -1.386294 0 -1.386294 6 0 0.7654678
## train
## 1 TRUE
## 2 TRUE
## 3 TRUE
## 4 TRUE
## 5 TRUE
## 6 TRUE
prostate_train <- subset(prostate, subset = train==TRUE)
prostate_test <- subset(prostate, subset = train==FALSE)
afit <- rpart(svi ~ . - train, data = prostate_train)
afit
## n= 67
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 67 11.641790 0.22388060
## 2) lcavol< 2.523279 52 2.826923 0.05769231
## 4) lpsa< 2.993028 43 0.000000 0.00000000 *
## 5) lpsa>=2.993028 9 2.000000 0.33333330 *
## 3) lcavol>=2.523279 15 2.400000 0.80000000 *
rpart.plot(afit)
For the top node: \(n = 67\) training samples, deviance \(11.64\), and fitted value \(yval = 0.224\) (the proportion of svi = 1 in the training set).
predProb_cart <- predict(object = afit, newdata = prostate_test)
predProb_cart
## 7 9 10 15 22 25 26
## 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## 28 32 34 36 42 44 48
## 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## 49 50 53 54 55 57 62
## 0.0000000 0.0000000 0.0000000 0.0000000 0.8000000 0.0000000 0.0000000
## 64 65 66 73 74 80 84
## 0.0000000 0.0000000 0.0000000 0.3333333 0.3333333 0.8000000 0.8000000
## 95 97
## 0.8000000 0.8000000
atable <- table(predictLabel = predProb_cart>0.5, trueLabel = prostate_test$svi)
atable
## trueLabel
## predictLabel 0 1
## FALSE 21 4
## TRUE 3 2
## accuracy
sum(diag(atable)) / sum(atable)
## [1] 0.7666667
aglm <- glm(svi ~ . - train, data = prostate_train, family = binomial(link = "logit"))
predProb_logistic <- predict(object = aglm, newdata = prostate_test, type="response")
btable <- table(predictLabel = predProb_logistic>0.5, trueLabel = prostate_test$svi)
btable
## trueLabel
## predictLabel 0 1
## FALSE 22 4
## TRUE 2 2
## accuracy
sum(diag(btable)) / sum(btable)
## [1] 0.8
The cost-complexity of \(T\) is \[C_\alpha (T) = \hat{R}(T) + \alpha |T|,\] where \(\hat{R}(T)\) is the training (empirical) risk of tree \(T\), \(|T|\) is the number of terminal nodes, and \(\alpha \ge 0\) is a tuning parameter penalizing tree size.
Another way is to set \(\alpha = \frac{R(T_i) - R(T_{i-1})}{|T_{i-1}| - |T_{i}|}\), where \(T_i\) is obtained from \(T_{i-1}\) by pruning; \(\alpha\) is then the increase in risk per terminal node removed (weakest-link pruning).
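In rpart, cost-complexity pruning is controlled through the complexity parameter cp (a rescaled \(\alpha\)). A minimal sketch on the prostate tree afit fit above, choosing cp by the cross-validated error stored in the cp table:
printcp(afit)  # one row per candidate cp, with cross-validated error (xerror)
# prune at the cp value minimizing the cross-validated error
best_cp <- afit$cptable[which.min(afit$cptable[, "xerror"]), "CP"]
afit_pruned <- prune(afit, cp = best_cp)
rpart.plot(afit_pruned)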
library(titanic)
dim(titanic_train)
## [1] 891 12
head(titanic_train)
## PassengerId Survived Pclass
## 1 1 0 3
## 2 2 1 1
## 3 3 1 3
## 4 4 1 1
## 5 5 0 3
## 6 6 0 3
## Name Sex Age SibSp
## 1 Braund, Mr. Owen Harris male 22 1
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1
## 3 Heikkinen, Miss. Laina female 26 0
## 4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1
## 5 Allen, Mr. William Henry male 35 0
## 6 Moran, Mr. James male NA 0
## Parch Ticket Fare Cabin Embarked
## 1 0 A/5 21171 7.2500 S
## 2 0 PC 17599 71.2833 C85 C
## 3 0 STON/O2. 3101282 7.9250 S
## 4 0 113803 53.1000 C123 S
## 5 0 373450 8.0500 S
## 6 0 330877 8.4583 Q
library("randomForest")
prostate_train <- subset(prostate, subset = train==TRUE)
prostate_test <- subset(prostate, subset = train==FALSE)
rfit <- randomForest(as.factor(svi) ~ . - train, data = prostate_train)
rfit
##
## Call:
## randomForest(formula = as.factor(svi) ~ . - train, data = prostate_train)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 13.43%
## Confusion matrix:
## 0 1 class.error
## 0 49 3 0.05769231
## 1 6 9 0.40000000
library(ggplot2)
imp <- importance(rfit)  # mean decrease in Gini impurity for each covariate
impData <- data.frame(cov = rownames(imp), importance = imp[,1])
ggplot(impData) + aes(x=cov, y=importance, fill=cov) + geom_bar(stat="identity")
pred_rf <- predict(rfit, prostate_test)  # class predictions from the random forest
ctable <- table(pred_rf, trueLabel = prostate_test$svi)
ctable
## trueLabel
## pred_rf 0 1
## 0 22 4
## 1 2 2
## accuracy
sum(diag(ctable)) / sum(ctable)
## [1] 0.8
Apply random forest on Titanic dataset to predict survival.
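A minimal sketch of one possible approach; the choice of predictors and dropping rows with missing Age are assumptions, not the only reasonable options:
library(titanic)
library(randomForest)
# keep a few predictors; Age contains NAs, which randomForest cannot handle
dat <- titanic_train[, c("Survived", "Pclass", "Sex", "Age", "SibSp", "Parch", "Fare")]
dat <- na.omit(dat)
dat$Survived <- as.factor(dat$Survived)  # classification, not regression
dat$Sex <- as.factor(dat$Sex)
rf_titanic <- randomForest(Survived ~ ., data = dat)
rf_titanic  # prints the OOB error estimate and confusion matrix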
Reference: The Elements of Statistical Learning (Hastie, Tibshirani, and Friedman).
knitr::purl("cart.rmd", output = "cart.R", documentation = 2)
## processing file: cart.rmd
## output file: cart.R
## [1] "cart.R "