Biostatistical Computing, PHC 6068

Decision Tree

Zhiguang Huo (Caleb)

Wednesday November 18, 2020

Outline

Supervised machine learning and unsupervised machine learning

  1. Classification (supervised machine learning):
    • With the class label known, build a model (classifier) to predict a future observation.
  2. Clustering (unsupervised machine learning)
    • Without knowing the class label, cluster the data according to their similarity.

Classification (supervised machine learning)

Clustering (unsupervised machine learning)

Decision Tree (what the result looks like)

library(ElemStatLearn) ## for the prostate dataset
library(rpart)         ## CART implementation
library(rpart.plot)    ## plotting rpart trees
afit <- rpart(as.factor(svi) ~ . - train, data = prostate) ## classification tree for svi, excluding the train indicator
rpart.plot(afit)

Decision Tree

Motivating example

Purpose: predict whether a student will play cricket in his/her leisure time, based on three variables.

Decision tree

Questions:

Decision tree structure

Terminology:

How does a tree decide where to split?

Goodness of split (GOS) criteria - impurity

Details:

  • Select the split and cutoff that result in the most pure child nodes.
  • Explore splitting every available node.
  • For each node, explore every possible cutoff.

Gini Index

Assume:

\[M_{Gini}(t) = 1 - P(Y = 0 | X \in t)^2 - P(Y = 1 | X \in t)^2\]

Gini Index

Goodness of split (GOS) criteria using Gini Index

Given an impurity function \(M(t)\), the GOS criterion is to find the split \(t_L\) and \(t_R\) of node \(t\) such that the impurity measure is maximally decreased:

\[\arg \max_{t_R, t_L} M(t) - [P(X\in t_L|X\in t) M(t_L) + P(X\in t_R|X\in t) M(t_R)]\]

Therefore, we will split based on Gender, which gives the largest decrease in impurity.
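As a concrete illustration, here is a minimal R sketch (my own, not part of the lecture code) of the Gini impurity and the resulting goodness of split for a binary outcome; the helper names gini() and gos() and the toy data are made up for illustration.

gini <- function(y){
  p1 <- mean(y == 1)
  1 - (1 - p1)^2 - p1^2   ## M_Gini(t) = 1 - P(Y=0|t)^2 - P(Y=1|t)^2
}

## goodness of split: decrease in impurity when splitting node t by a logical index
gos <- function(y, left){
  pL <- mean(left)        ## P(X in t_L | X in t)
  gini(y) - (pL * gini(y[left]) + (1 - pL) * gini(y[!left]))
}

## toy data loosely mimicking the cricket example (hypothetical values)
y      <- c(1, 1, 1, 1, 1, 1, 0, 0, 0, 0)    ## 1 = plays cricket
gender <- c("M","M","M","M","M","F","F","F","F","M")
gos(y, gender == "M")     ## larger value = better split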

Entropy

Assume:

\[M_{entropy}(t) = - P(Y = 0 | X \in t)\log P(Y = 0 | X \in t) - P(Y = 1 | X \in t)\log P(Y = 1 | X \in t)\]

Entropy

Goodness of split (GOS) criteria using entropy

Given an impurity function \(M(t)\), the GOS criterion is to find the split \(t_L\) and \(t_R\) of node \(t\) such that the impurity measure is maximally decreased:

\[\arg \max_{t_R, t_L} M(t) - [P(X\in t_L|X\in t) M(t_L) + P(X\in t_R|X\in t) M(t_R)]\]

Therefore, we will split based on Gender, which gives the largest decrease in impurity.
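The entropy measure plugs into the same goodness-of-split computation. A minimal sketch (my own illustration; the helper name entropy() is made up) that can replace the gini() function in the sketch above:

entropy <- function(y){
  p <- c(mean(y == 0), mean(y == 1))
  p <- p[p > 0]               ## convention: 0 * log(0) = 0
  -sum(p * log(p))
}

entropy(c(1, 1, 1, 0, 0, 0))  ## balanced node: maximal impurity, log(2)
entropy(c(1, 1, 1, 1, 1, 1))  ## pure node: impurity 0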

Summary of impurity measurement

Complexity for each split:

Decision tree

How to make a prediction:

\[\hat{p}_{mk} = \frac{1}{N_m} \sum_{x_i \in t_m} \mathbb{I}(y_i = k)\]

E.g., if a new subject (Gender: Male, Class: X, Height: 5.8 ft) falls into a terminal node, we predict its class by a majority vote within that node.
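A minimal sketch (my own illustration with hypothetical values) of the terminal-node proportions \(\hat{p}_{mk}\) and the majority vote:

node_y <- c(1, 1, 1, 0, 1)                ## y_i for the training samples in terminal node t_m
p_hat  <- table(node_y) / length(node_y)  ## hat{p}_{mk} for k = 0, 1
p_hat
names(which.max(p_hat))                   ## majority vote: predicted class "1"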

Regression Trees vs Classification Trees
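In rpart, the type of tree is determined by the outcome variable: a factor outcome gives a classification tree (method = "class") and a numeric outcome gives a regression tree (method = "anova"). A minimal sketch (my own illustration) using the prostate data:

## classification tree: binary svi coded as a factor
fit_class <- rpart(as.factor(svi) ~ . - train, data = prostate)

## regression tree: continuous outcome lpsa
fit_reg <- rpart(lpsa ~ . - train, data = prostate)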

Prostate cancer example

library(ElemStatLearn)
library(rpart)
library(rpart.plot)
head(prostate)
##       lcavol  lweight age      lbph svi       lcp gleason pgg45       lpsa
## 1 -0.5798185 2.769459  50 -1.386294   0 -1.386294       6     0 -0.4307829
## 2 -0.9942523 3.319626  58 -1.386294   0 -1.386294       6     0 -0.1625189
## 3 -0.5108256 2.691243  74 -1.386294   0 -1.386294       7    20 -0.1625189
## 4 -1.2039728 3.282789  58 -1.386294   0 -1.386294       6     0 -0.1625189
## 5  0.7514161 3.432373  62 -1.386294   0 -1.386294       6     0  0.3715636
## 6 -1.0498221 3.228826  50 -1.386294   0 -1.386294       6     0  0.7654678
##   train
## 1  TRUE
## 2  TRUE
## 3  TRUE
## 4  TRUE
## 5  TRUE
## 6  TRUE

Training set and testing set

For supervised machine learning, usually we split the data into the training set and the testing set.
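For example, a minimal sketch of a random 70/30 split (my own illustration; the prostate data used below instead come with a pre-defined train indicator column):

set.seed(32611)                              ## hypothetical seed, for reproducibility
n <- nrow(prostate)
train_id <- sample(n, size = round(0.7 * n)) ## indices of the training set
my_train <- prostate[train_id, ]
my_test  <- prostate[-train_id, ]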

Apply cart

prostate_train <- subset(prostate, subset = train==TRUE)
prostate_test <- subset(prostate, subset = train==FALSE)

afit <- rpart(as.factor(svi) ~ . - train, data = prostate_train) ## svi as binary outcome
#afit <- rpart(svi ~ . - train, data = prostate_train) ## svi as continuous outcome
afit
## n= 67 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 67 15 0 (0.77611940 0.22388060)  
##   2) lcavol< 2.523279 52  3 0 (0.94230769 0.05769231) *
##   3) lcavol>=2.523279 15  3 1 (0.20000000 0.80000000) *

Visualize cart result

rpart.plot(afit)

Predicting the testing dataset using the CART model

predProb_cart <- predict(object = afit, newdata = prostate_test)
head(predProb_cart)
##            0          1
## 7  0.9423077 0.05769231
## 9  0.9423077 0.05769231
## 10 0.9423077 0.05769231
## 15 0.9423077 0.05769231
## 22 0.9423077 0.05769231
## 25 0.9423077 0.05769231
atable <- table(predictLabel = predProb_cart[,"1"]>0.5, trueLabel = prostate_test$svi)
atable
##             trueLabel
## predictLabel  0  1
##        FALSE 21  4
##        TRUE   3  2
## accuracy
sum(diag(atable)) / sum(atable)
## [1] 0.7666667

Overfitting problem

Pruning the tree (to reduce overfitting)
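A minimal sketch of cost-complexity pruning with rpart (my own illustration, applied to the afit object fitted above): inspect the cross-validated error for each complexity parameter (cp) value, then prune at the cp with the smallest cross-validated error.

printcp(afit)                                ## cp table, including the xerror column
best_cp <- afit$cptable[which.min(afit$cptable[, "xerror"]), "CP"]
afit_pruned <- prune(afit, cp = best_cp)
rpart.plot(afit_pruned)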

Titanic example (Will be on HW)

library(titanic)
dim(titanic_train)
## [1] 891  12
head(titanic_train)
##   PassengerId Survived Pclass
## 1           1        0      3
## 2           2        1      1
## 3           3        1      3
## 4           4        1      1
## 5           5        0      3
## 6           6        0      3
##                                                  Name    Sex Age SibSp Parch
## 1                             Braund, Mr. Owen Harris   male  22     1     0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1     0
## 3                              Heikkinen, Miss. Laina female  26     0     0
## 4        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1     0
## 5                            Allen, Mr. William Henry   male  35     0     0
## 6                                    Moran, Mr. James   male  NA     0     0
##             Ticket    Fare Cabin Embarked
## 1        A/5 21171  7.2500              S
## 2         PC 17599 71.2833   C85        C
## 3 STON/O2. 3101282  7.9250              S
## 4           113803 53.1000  C123        S
## 5           373450  8.0500              S
## 6           330877  8.4583              Q

Bagging

Bagging

  1. Create Multiple DataSets:
    • Sampling with replacement on the original data
    • Taking row and column fractions of less than 1 helps build robust models that are less prone to overfitting.
  2. Build Multiple Classifiers:
    • Classifiers are built on each data set.
  3. Combine Classifiers:
    • The predictions of all the classifiers are combined using the mean, median, or mode, depending on the problem at hand.
    • The combined predictions are generally more robust than those of a single model (see the sketch below).
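A minimal bagging sketch with classification trees (my own illustration, assuming the prostate_train and prostate_test objects defined earlier): build B trees on bootstrap samples and combine their predicted probabilities by averaging.

set.seed(32611)                    ## hypothetical seed
B <- 100
prob_mat <- replicate(B, {
  boot_id  <- sample(nrow(prostate_train), replace = TRUE)   ## sample rows with replacement
  boot_fit <- rpart(as.factor(svi) ~ . - train, data = prostate_train[boot_id, ])
  predict(boot_fit, newdata = prostate_test)[, "1"]          ## P(svi = 1) from one tree
})
bag_prob <- rowMeans(prob_mat)     ## combine the B classifiers by averaging
table(predictLabel = bag_prob > 0.5, trueLabel = prostate_test$svi)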

Random forest

Random forest algorithm

Assume number of cases in the training set is N, and number of features is M.

  1. Repeat the following procedures B = 500 times (each time is a CART algorithm):
    1. Sample N cases with replacement.
    2. Sample m<M features without replacement.
    3. Apply the CART algorithm without pruning.
  2. Predict new data by aggregating the predictions of the B trees (i.e., majority votes for classification, average for regression).
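In the randomForest package, B corresponds to the ntree argument and m to mtry; for classification, the default mtry is floor(sqrt(M)). A minimal sketch (my own illustration) making these choices explicit:

library(randomForest)
rf_fit <- randomForest(as.factor(svi) ~ . - train, data = prostate_train,
                       ntree = 500,            ## B trees
                       mtry  = floor(sqrt(8))) ## m out of the M = 8 features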

Random forest

Random Forest on prostate cancer example

library("randomForest")
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
prostate_train <- subset(prostate, subset = train==TRUE)
prostate_test <- subset(prostate, subset = train==FALSE)

rffit <- randomForest(as.factor(svi) ~ . - train, data = prostate_train)
rffit
## 
## Call:
##  randomForest(formula = as.factor(svi) ~ . - train, data = prostate_train) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 13.43%
## Confusion matrix:
##    0 1 class.error
## 0 49 3  0.05769231
## 1  6 9  0.40000000

Random Forest on prostate prediction

pred_rf <- predict(rffit, prostate_test)

ctable <- table(pred_rf, trueLabel = prostate_test$svi)
ctable
##        trueLabel
## pred_rf  0  1
##       0 22  4
##       1  2  2
## accuracy
sum(diag(ctable)) / sum(ctable)
## [1] 0.8

Random Forest importance score

library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
## 
##     margin
imp <- importance(rffit) ## MeanDecreaseGini, the default importance measure for classification
impData <- data.frame(cov = rownames(imp), importance = imp[,1])

ggplot(impData) + aes(x=cov, y=importance, fill=cov) + geom_bar(stat="identity")

How to calculate Random Forest importance score

Will be on homework

Apply random forest on Titanic dataset to predict survival.

References