Biostatistical Computing, PHC 6068
Decision Tree
Zhiguang Huo (Caleb)
Wednesday October 24, 2018
Outline
- Classification and regression tree
- Random forest
Supervised machine learning and unsupervised machine learning
- Classification (supervised machine learning):
- With the class label known, learn the features of the classes to predict a future observation.
- The learning performance can be evaluated by the prediction error rate.
- Clustering (unsupervised machine learning)
- Without knowing the class label, cluster the data according to their similarity and learn the features.
- The performance is usually difficult to evaluate and depends on the context of the problem.
Classification (supervised machine learning)
Clustering (unsupervised machine learning)
Decision Tree (what the result looks like)
- Target: build a prediction rule to determine whether svi = 1 or 0
- Top node:
- 0: predicted svi status if a new subject falls into this node
- 0.22: probability that svi = 1
- 100%: all 97 subjects
- Bottom left:
- 0: predicted svi status if a new subject falls into this node
- 0.11: probability that svi = 1
- 87%: 84 subjects fall into this node
- Bottom right:
- 1: predicted svi status if a new subject falls into this node
- 0.92: probability that svi = 1
- 13%: 13 subjects fall into this node
Decision Tree
- A decision tree is a supervised learning algorithm (it has a pre-defined outcome variable).
- It works for both categorical and continuous input and output variables.
- In this technique, we split the population or sample into two or more homogeneous sets (sub-populations) based on the best splitter.
Motivating example
- 30 students with three variables
- Gender (Boy/ Girl)
- Class (IX or X)
- Height (5 to 6 ft)
- 15 out of these 30 play cricket in leisure time.
Purpose: predict whether a student will play cricket in his/her leisure time, based on these three variables.
Decision tree
- The decision tree identifies the most significant variable and its cutoff value that give the most homogeneous sets of the population.
Questions:
- What does the decision tree structure look like?
- How to define homogeneous?
- How to make a prediction for a new person?
Decision tree structure
Terminology:
- Root Node: represents the entire population or sample, before any split.
- Splitting: dividing a node into two or more sub-nodes.
- Decision Node: a sub-node that splits into further sub-nodes.
- Leaf / Terminal Node: a node that does not split further.
- Pruning: removing sub-nodes of a decision node; the opposite process of splitting.
- Branch / Sub-Tree: a subsection of the entire tree.
- Parent and Child Node: a node that is divided into sub-nodes is the parent node of those sub-nodes, and the sub-nodes are its children.
How does a tree decide where to split?
Goodness of split (GOS) criteria
- The decision tree considers splits on all available variables.
- For each variable, it evaluates all possible cutoffs.
- It selects the variable and cutoff that yield the most homogeneous sub-nodes.
Recommended measures of impurity
- Gini Index
- Entropy
- Interpretation of impurity:
- 0 indicates a pure node.
- Larger values indicate a more impure node.
Gini Index
Assume:
- outcome variable \(Y\) is binary (\(Y = 0\) or \(Y = 1\)).
- \(t\) is a node
\[M_{Gini}(t) = 1 - P(Y = 0 | X \in t)^2 - P(Y = 1 | X \in t)^2\]
Gini Index
- root: \[M_{Gini}(R) = 1 - 0.5^2 - 0.5^2 = 0.5\]
- Gender:Female: \[M_{Gini}(G:F) = 1 - 0.2^2 - 0.8^2 = 0.32\]
- Gender:Male: \[M_{Gini}(G:M) = 1 - 0.65^2 - 0.35^2 = 0.455\]
- Class:IX: \[M_{Gini}(C:IX) = 1 - 0.43^2 - 0.57^2 = 0.4902\]
- Class:X: \[M_{Gini}(C:X) = 1 - 0.56^2 - 0.44^2 = 0.4928\]
- Height:5.5+: \[M_{Gini}(H:5.5+) = 1 - 0.56^2 - 0.44^2 = 0.4928\]
- Height:5.5-: \[M_{Gini}(H:5.5-) = 1 - 0.42^2 - 0.58^2 = 0.4872\]
Goodness of split (GOS) criteria using Gini Index
Given an impurity function \(M(t)\), the GOS criterion is to find the split \(t_L\) and \(t_R\) of node \(t\) such that the impurity measure is maximally decreased:
\[\arg \max_{t_R, t_L} M(t) - [P(X\in t_L|X\in t) M(t_L) + P(X\in t_R|X\in t) M(t_R)]\]
If we split on Gender: \[M_{Gini}(R) - \frac{10}{30}M_{Gini}(G:F) - \frac{20}{30}M_{Gini}(G:M) = 0.5 - 10/30\times 0.32 - 20/30\times 0.455 = 0.09\]
If we split on Class: \[M_{Gini}(R) - \frac{14}{30}M_{Gini}(C:IX) - \frac{16}{30}M_{Gini}(C:X) = 0.5 - 14/30\times 0.4902 - 16/30\times 0.4928 = 0.008\]
If we split on Height at 5.5: \[M_{Gini}(R) - \frac{12}{30}M_{Gini}(H:5.5-) - \frac{18}{30}M_{Gini}(H:5.5+) = 0.5 - 12/30\times 0.4872 - 18/30\times 0.4928 = 0.00944\]
Therefore, we split on Gender, which gives the largest decrease in impurity.
- Why 5.5? In practice, we search over all possible height cutoffs and select the best one. (The calculations above are sketched in R below.)
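A minimal R sketch that reproduces the Gini goodness-of-split numbers above (the helper names `gini` and `gos` are introduced here for illustration):

```r
# Gini impurity of a node, where p = P(Y = 1) within the node
gini <- function(p) 1 - p^2 - (1 - p)^2

# goodness of split: parent impurity minus the weighted impurity of the children
gos <- function(impurity, p_parent, p_left, p_right, n_left, n_right) {
  w_left  <- n_left / (n_left + n_right)
  w_right <- n_right / (n_left + n_right)
  impurity(p_parent) - w_left * impurity(p_left) - w_right * impurity(p_right)
}

gos(gini, 0.5, 0.20, 0.65, 10, 20)   # split on Gender: 0.09
gos(gini, 0.5, 0.43, 0.56, 14, 16)   # split on Class:  ~0.008
gos(gini, 0.5, 0.42, 0.56, 12, 18)   # split on Height: ~0.0094
```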
Entropy
Assume:
- outcome variable \(Y\) is binary (\(Y = 0\) or \(Y = 1\)).
- \(t\) is a node
\[M_{entropy}(t) = - P(Y = 0 | X \in t)\log P(Y = 0 | X \in t)
- P(Y = 1 | X \in t)\log P(Y = 1 | X \in t)\]
Entropy
- root: \[M_{entropy}(R) = -0.5 \times \log(0.5) -0.5 \times \log(0.5) = 0.6931472\]
- Gender:Female: \[M_{entropy}(G:F) = -0.2 \times \log(0.2) -0.8 \times \log(0.8) = 0.5004024\]
- Gender:Male: \[M_{entropy}(G:M) = -0.65 \times \log(0.65) -0.35 \times \log(0.35) = 0.6474466\]
- Class:IX: \[M_{entropy}(C:IX) = -0.43 \times \log(0.43) -0.57 \times \log(0.57) = 0.6833149\]
- Class:X: \[M_{entropy}(C:X) = -0.56 \times \log(0.56) -0.44 \times \log(0.44) = 0.6859298\]
- Height:5.5+: \[M_{entropy}(H:5.5+) = -0.56 \times \log(0.56) -0.44 \times \log(0.44) = 0.6859298\]
- Height:5.5-: \[M_{entropy}(H:5.5-) = -0.42 \times \log(0.42) -0.58 \times \log(0.58) = 0.680292\]
Goodness of split (GOS) criteria using entropy
Given an impurity function \(M(t)\), the GOS criterion is to find the split \(t_L\) and \(t_R\) of node \(t\) such that the impurity measure is maximally decreased:
\[\arg \max_{t_R, t_L} M(t) - [P(X\in t_L|X\in t) M(t_L) + P(X\in t_R|X\in t) M(t_R)]\]
If we split on Gender: \[M_{entropy}(R) - \frac{10}{30}M_{entropy}(G:F) - \frac{20}{30}M_{entropy}(G:M) = 0.6931472 - 10/30\times 0.5004024 - 20/30\times 0.6474466 = 0.09471533\]
If we split on Class: \[M_{entropy}(R) - \frac{14}{30}M_{entropy}(C:IX) - \frac{16}{30}M_{entropy}(C:X) = 0.6931472 - 14/30\times 0.6833149 - 16/30\times 0.6859298 = 0.008437687\]
If we split on Height at 5.5: \[M_{entropy}(R) - \frac{12}{30}M_{entropy}(H:5.5-) - \frac{18}{30}M_{entropy}(H:5.5+) = 0.6931472 - 12/30 \times 0.680292 - 18/30\times 0.6859298 = 0.00947252\]
Therefore, we split on Gender, which gives the largest decrease in impurity (see the R sketch below).
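The same goodness-of-split calculation with the entropy impurity, reusing the `gos()` helper sketched after the Gini slides:

```r
# entropy impurity of a node, where p = P(Y = 1) within the node
entropy <- function(p) -p * log(p) - (1 - p) * log(1 - p)

entropy(0.5)                             # 0.6931472 (root node)
gos(entropy, 0.5, 0.20, 0.65, 10, 20)    # split on Gender: ~0.0947
gos(entropy, 0.5, 0.43, 0.56, 14, 16)    # split on Class:  ~0.0084
gos(entropy, 0.5, 0.42, 0.56, 12, 18)    # split on Height: ~0.0095
```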
Summary of impurity measurement
- Categorical outcome
- Gini Index
- Entropy
- Chi-Square
- Continuous outcome
- Variance reduction (reduction in residual sum of squares)
Complexity for each split:
- \(O(np)\)
- For each variable, search over all possible cutoffs (at most \(n\)).
- Repeat this for all \(p\) variables.
Decision tree
Construct the tree structure:
- Continue splitting until a stopping criterion is met.
- Splitting all the way to the end leads to overfitting.
- Not splitting enough leads to underfitting.
- We will discuss when to stop (pruning) later.
How to make a prediction:
\[\hat{p}_{mk} = \frac{1}{N_m} \sum_{x_i \in t_m} \mathbb{I}(y_i = k)\]
E.g., if a new subject (Gender: Male, Class: X, Height: 5.8 ft) falls into a terminal node, we predict by majority vote among the training observations in that node.
Regression Trees vs Classification Trees
- Regression trees are used when the dependent variable is continuous; classification trees are used when the dependent variable is categorical.
- For a regression tree, the value assigned to a terminal node is the mean response of the observations falling in that region: \[\hat{c}_{m} = ave(y_i|x_i \in T_m)\]
- For a classification tree, the class assigned to a terminal node is the mode (majority class) of the observations falling in that region.
- Both types of tree divide the predictor space (the independent variables) into distinct, non-overlapping regions.
- Both follow a top-down greedy approach known as recursive binary splitting.
- In both cases, the splitting process results in a fully grown tree, which is then pruned to reduce overfitting. (A brief rpart sketch of both tree types follows.)
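A minimal sketch contrasting the two tree types with rpart, using built-in R data sets for illustration (iris and mtcars are assumptions, not the course data):

```r
library(rpart)

# classification tree: categorical outcome; terminal nodes predict the majority class
fit_class <- rpart(Species ~ ., data = iris, method = "class")
fit_class

# regression tree: continuous outcome; terminal nodes predict the mean response
fit_reg <- rpart(mpg ~ ., data = mtcars, method = "anova")
fit_reg
```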
Prostate cancer example
## lcavol lweight age lbph svi lcp gleason pgg45 lpsa
## 1 -0.5798185 2.769459 50 -1.386294 0 -1.386294 6 0 -0.4307829
## 2 -0.9942523 3.319626 58 -1.386294 0 -1.386294 6 0 -0.1625189
## 3 -0.5108256 2.691243 74 -1.386294 0 -1.386294 7 20 -0.1625189
## 4 -1.2039728 3.282789 58 -1.386294 0 -1.386294 6 0 -0.1625189
## 5 0.7514161 3.432373 62 -1.386294 0 -1.386294 6 0 0.3715636
## 6 -1.0498221 3.228826 50 -1.386294 0 -1.386294 6 0 0.7654678
## train
## 1 TRUE
## 2 TRUE
## 3 TRUE
## 4 TRUE
## 5 TRUE
## 6 TRUE
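The data frame printed above matches the `prostate` data that ships with the ElemStatLearn package; a minimal loading sketch, assuming that package is the data source (it may need to be installed from a CRAN archive):

```r
# load the prostate data (assumed to come from the ElemStatLearn package)
data(prostate, package = "ElemStatLearn")
head(prostate)
```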
Training set and testing set
For supervised machine learning, we usually split the data into a training set and a testing set.
- Training set: used to build the classifier (i.e., the decision rule in the CART model).
- Testing set: used to evaluate the performance of the classifier built on the training set.
Splitting rules:
- Training set / testing set split (a short sketch follows)
- Cross validation
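A minimal sketch of the split used in the prostate example, based on the `train` indicator column shown earlier (the object names `prostate_train` and `prostate_test` are assumptions chosen to match the later output):

```r
# keep the rows flagged as training / testing by the `train` column
prostate_train <- subset(prostate, train == TRUE)
prostate_test  <- subset(prostate, train == FALSE)
nrow(prostate_train)   # 67 training subjects, matching the CART output below
```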
Apply CART
## n= 67
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 67 15 0 (0.77611940 0.22388060)
## 2) lcavol< 2.523279 52 3 0 (0.94230769 0.05769231) *
## 3) lcavol>=2.523279 15 3 1 (0.20000000 0.80000000) *
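A call along the following lines produces output like the above; this is a sketch, assuming the training data are in `prostate_train` and the fitted object is named `fit_cart` (a name introduced here for illustration):

```r
library(rpart)

# classification tree for svi, excluding the train indicator from the predictors
fit_cart <- rpart(as.factor(svi) ~ . - train, data = prostate_train, method = "class")
fit_cart   # prints the node summary shown above
```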
Visualize CART result
- Top node:
- 0: predicted svi status if a new subject falls into this node
- 0.22: probability that svi = 1
- 100%: all 67 subjects
- Bottom left:
- 0: predicted svi status if a new subject falls into this node
- 0.06: probability that svi = 1
- 78%: 52 subjects fall into this node
- Bottom right:
- 1: predicted svi status if a new subject falls into this node
- 0.80: probability that svi = 1
- 22%: 15 subjects fall into this node
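A node display like the one described above can be drawn with the rpart.plot package (a sketch, assuming the `fit_cart` object from the previous slide):

```r
library(rpart.plot)

# each node shows the predicted class, the probability svi = 1, and the % of subjects
rpart.plot(fit_cart)
```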
Predicting the testing dataset using CART
## 0 1
## 7 0.9423077 0.05769231
## 9 0.9423077 0.05769231
## 10 0.9423077 0.05769231
## 15 0.9423077 0.05769231
## 22 0.9423077 0.05769231
## 25 0.9423077 0.05769231
## trueLabel
## predictLabel 0 1
## FALSE 21 4
## TRUE 3 2
## [1] 0.7666667
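A sketch of the test-set prediction and accuracy calculation, assuming `fit_cart` and `prostate_test` from above (the names `predProb` and `predictLabel` are illustrative):

```r
# class probabilities for the testing subjects
predProb <- predict(fit_cart, newdata = prostate_test, type = "prob")
head(predProb)

# predicted label: TRUE if the estimated P(svi = 1) exceeds 0.5
predictLabel <- predProb[, "1"] > 0.5
table(predictLabel, trueLabel = prostate_test$svi)

# prediction accuracy on the testing set
mean(predictLabel == (prostate_test$svi == 1))
```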
Overfitting
Pruning the tree (to reduce overfitting)
- Breiman et al. (1984) proposed a backward node recombination strategy called minimal cost-complexity pruning.
- The pruned subtrees form a nested sequence: \[T_{max} = T_1 \supset T_2 \supset \ldots \supset T_m\]
The cost-complexity of \(T\) is \[C_\alpha (T) = \hat{R}(T) + \alpha |T|,\] where
- \(\hat{R}(T)\) is the error rate estimated based on tree \(T\).
- \(T\) is the tree structure.
- \(|T|\) is the number of terminal nodes of \(T\).
- \(\alpha\) is a positive complexity parameter that balances bias and variance (model complexity).
- \(\alpha\) is estimated from cross-validation.
Another way is to set \(\alpha = \frac{R(T_i) - R(T_{i-1})}{|T_{i-1}| - |T_{i}|}\)
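In rpart, minimal cost-complexity pruning is controlled through the complexity parameter `cp`, which plays the role of \(\alpha\) (rescaled by the root-node error). A minimal sketch, assuming the `fit_cart` object from above:

```r
# cross-validated error for each value of the complexity parameter
printcp(fit_cart)

# pick the cp value with the smallest cross-validated error, then prune
cp_best    <- fit_cart$cptable[which.min(fit_cart$cptable[, "xerror"]), "CP"]
fit_pruned <- prune(fit_cart, cp = cp_best)
fit_pruned
```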
Titanic example (Will be on HW)
## [1] 891 12
## PassengerId Survived Pclass
## 1 1 0 3
## 2 2 1 1
## 3 3 1 3
## 4 4 1 1
## 5 5 0 3
## 6 6 0 3
## Name Sex Age SibSp
## 1 Braund, Mr. Owen Harris male 22 1
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1
## 3 Heikkinen, Miss. Laina female 26 0
## 4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1
## 5 Allen, Mr. William Henry male 35 0
## 6 Moran, Mr. James male NA 0
## Parch Ticket Fare Cabin Embarked
## 1 0 A/5 21171 7.2500 S
## 2 0 PC 17599 71.2833 C85 C
## 3 0 STON/O2. 3101282 7.9250 S
## 4 0 113803 53.1000 C123 S
## 5 0 373450 8.0500 S
## 6 0 330877 8.4583 Q
Bagging
Bagging
- Create Multiple DataSets:
- Sampling with replacement on the original data
- Using row and column fractions less than 1 helps build robust models that are less prone to overfitting.
- Build Multiple Classifiers:
- Classifiers are built on each data set.
- Combine Classifiers:
- The predictions of all the classifiers are combined using a mean, median or mode value depending on the problem at hand.
- The combined predictions are generally more robust than those from a single model (see the sketch below).
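A minimal sketch of bagging with classification trees, using the built-in iris data for illustration (the data set, B = 100, and the helper names are assumptions):

```r
library(rpart)

set.seed(32608)   # arbitrary seed for reproducibility
B <- 100
n <- nrow(iris)

# build B classifiers, each on a bootstrap sample (rows drawn with replacement)
bag_fits <- lapply(seq_len(B), function(b) {
  boot_id <- sample(n, n, replace = TRUE)
  rpart(Species ~ ., data = iris[boot_id, ], method = "class")
})

# combine the B classifiers by majority vote
bag_predict <- function(fits, newdata) {
  votes <- sapply(fits, function(f) as.character(predict(f, newdata, type = "class")))
  apply(votes, 1, function(v) names(which.max(table(v))))
}

table(bag_predict(bag_fits, iris), iris$Species)
```

Random forest (next) adds a second layer of randomness by also sampling candidate features at each split.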
Random forest
Random forest algorithm
Assume the number of cases in the training set is \(N\) and the number of features is \(M\).
- Repeat the following procedure \(B\) (e.g., 500) times; each repetition grows one CART tree:
- Sample \(N\) cases with replacement (a bootstrap sample).
- At each split, sample \(m < M\) candidate features without replacement.
- Grow the tree with the CART algorithm, without pruning.
- Predict new data by aggregating the predictions of the \(B\) trees (i.e., majority vote for classification, average for regression).
Random forest
Random Forest on prostate cancer example
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Call:
## randomForest(formula = as.factor(svi) ~ . - train, data = prostate_train)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 13.43%
## Confusion matrix:
## 0 1 class.error
## 0 49 3 0.05769231
## 1 6 9 0.40000000
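The Call line printed above corresponds to code along the following lines (a sketch; the seed value and the object name `rf_fit` are assumptions):

```r
library(randomForest)

set.seed(32608)   # arbitrary seed; the value used for the slides is not shown
rf_fit <- randomForest(as.factor(svi) ~ . - train, data = prostate_train)
rf_fit            # prints the OOB error estimate and confusion matrix
```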
Random Forest on prostate prediction
## trueLabel
## pred_rf 0 1
## 0 22 4
## 1 2 2
## [1] 0.8
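A sketch of the test-set prediction, assuming `rf_fit` and `prostate_test` from above:

```r
# predicted class labels for the testing subjects
pred_rf <- predict(rf_fit, newdata = prostate_test)
table(pred_rf, trueLabel = prostate_test$svi)

# prediction accuracy on the testing set
mean(as.character(pred_rf) == as.character(prostate_test$svi))
```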
Random Forest importance score
How to calculate Random Forest importance score
- Within each tree \(t\),
- \(\hat{y}_i^{(t)}\): predicted class before permuting
- \(\hat{y}_{i,\pi_j}^{(t)}\): predicted class after permuting \(x_j\)
\[VI^{(t)}(x_j) = \frac{1}{|n^{(t)}|} \sum_{i \in n^{(t)}} I(y_i = \hat{y}_i^{(t)}) - \frac{1}{|n^{(t)}|} \sum_{i \in n^{(t)}} I(y_i = \hat{y}_{i,\pi_j}^{(t)}),\]
where \(n^{(t)}\) is the set of out-of-bag observations for tree \(t\).
- Raw importance: \[VI(x_j) = \frac{\sum_{t=1}^B VI^{(t)}(x_j)}{B}\]
- Scaled importance, given by the importance() function in R: \[z_j = \frac{VI(x_j)}{\hat{\sigma}/\sqrt{B}}\]
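A sketch of computing these scores in R; permutation importance is only stored when the forest is fit with `importance = TRUE`, so the model is refit here under that assumption:

```r
# refit the forest, requesting permutation importance
rf_fit2 <- randomForest(as.factor(svi) ~ . - train, data = prostate_train,
                        importance = TRUE)

importance(rf_fit2, scale = TRUE)   # scaled (z-score) importance, as in the formula above
varImpPlot(rf_fit2)                 # plot the importance scores
```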
Will be on homework
Apply random forest to the Titanic dataset to predict survival.
- What are the most important factors?
- How does the performance compare to CART?