Supervised machine learning and unsupervised machine learning

Classification (supervised machine learning):
- With the class label known, learn the features of the classes to predict a future observation.
- The learning performance can be evaluated by the prediction error rate.
Clustering (unsupervised machine learning)
- Without knowing the class label, cluster the data according to their similarity and learn the features.
- Normally the performance is difficult to evaluate and depends on the content of the problem.

Classification (supervised machine learning)

Clustering (unsupervised machine learning)

Decision Tree

library(ElemStatLearn)
library(rpart)
library(rpart.plot)

## Warning: package 'rpart.plot' was built under R version 3.3.2

afit <- rpart(svi ~ . - train, data = prostate)
rpart.plot(afit)

Target: build up a prediction rule to determine svi = 1 or 0
100%: all 97 subjects
0.22: probablity svi = 1

Decision Tree

Decision tree is a supervised learning algorithm (having a pre-defined outcome variable).
It works for both categorical and continuous input and output variables.
In this technique, we split the population or sample into two or more homogeneous sets (or sub-populations) based on most significant splitter.

Motivating example

30 students with three variables
- Gender (Boy/ Girl)
- Class (IX/ X)
- Height (5 to 6 ft)
15 out of these 30 play cricket in leisure time.

Purpose: predict whether a student will play cricket in his/her leisure time.

Decision tree

Decision tree identifies the most significant variable and it’s value that gives best homogeneous sets of population.

Questions:

What does the decision tree structure look like?
How to define homogeneous?
How to make a prediction for a new person?

Decision tree structure

Terminology:

Root Node: It represents entire population or sample and no split yet.
Splitting: Dividing a node into two or more sub-nodes.
Decision Node: When a sub-node splits into further sub-nodes, then it is called decision node.
Leaf/ Terminal Node: Nodes do not split is called Leaf or Terminal node.
Pruning: When we remove sub-nodes of a decision node, this process is called pruning. You can say opposite process of splitting.
Branch / Sub-Tree: A sub section of entire tree is called branch or sub-tree.
Parent and Child Node: A node, which is divided into sub-nodes is called parent node of sub-nodes where as sub-nodes are the child of parent node.

How does a tree decide where to split?

Goodness of split (GOS) criteria

Decision tree splits the nodes on all available variables.
Decision tree splits at all possible cutoffs.
Selects the split and cutoff which result in most homogeneous sub-nodes.

Recommended measure of impurity

Gini Index
Entropy
Interpretation about impurity:
- 0 means pure
- large value indicates impure.

Gini Index

Assume:

outcome variable \(Y\) is binary (\(Y = 0\) or \(Y = 1\)).
\(t\) is a node

\[M_{Gini}(t) = 1 - P(Y = 0 | X \in t)^2 - P(Y = 1 | X \in t)^2\]

Gini Index

root: \[M_{Gini}(R) = 1 - 0.5^2 - 0.5^2 = 0.5\]
Gender:Female: \[M_{Gini}(G:F) = 1 - 0.2^2 - 0.8^2 = 0.32\]
Gender:Male: \[M_{Gini}(G:M) = 1 - 0.65^2 - 0.35^2 = 0.455\]
Class:IX: \[M_{Gini}(C:4) = 1 - 0.43^2 - 0.57^2 = 0.4902\]
Class:X: \[M_{Gini}(C:5) = 1 - 0.56^2 - 0.44^2 = 0.4928\]
Height:5.5+: \[M_{Gini}(H:5.5+) = 1 - 0.56^2 - 0.44^2 = 0.4928\]
Height:5.5-: \[M_{Gini}(H:5.5-) = 1 - 0.42^2 - 0.58^2 = 0.4872\]

Goodness of split (GOS) criteria using Gini Index

Given an impurity function \(M(t)\), the GOS criterion is to find the split \(t_L\) and \(t_R\) of note \(t\) such that the impurity measure is maximally decreased:

\[\arg \max_{t_R, t_L} M(t) - [P(X\in t_L|X\in t) M(t_L) + P(X\in t_R|X\in t) M(t_R)]\]

If split on Gender: \[M_{Gini}(R) - \frac{10}{30}M_{Gini}(G:F) - \frac{20}{30}M_{Gini}(G:M) = 0.5 - 10/30\times 0.32 - 20/30\times 0.455 = 0.09 \]
If split on Class: \[M_{Gini}(R) - \frac{14}{30}M_{Gini}(C:4) - \frac{16}{30}M_{Gini}(C:5) = 0.5 - 14/30\times 0.4902 - 16/30\times 0.4928 = 0.008 \]
If split on Height:5.5 \[M_{Gini}(R) - \frac{12}{30}M_{Gini}(H:5.5-) - \frac{18}{30}M_{Gini}(H:5.5+) = 0.5 - 12/30*0.4872 - 18/30*0.4928 = 0.00944 \]

Therefore, we will split based on Gender.

Why 5.5? Actually we need to search all possible height cutoffs to select the best cutoff.

Entropy

Assume:

outcome variable \(Y\) is binary (\(Y = 0\) or \(Y = 1\)).
\(t\) is a node

\[M_{entropy}(t) = - P(Y = 0 | X \in t)\log P(Y = 0 | X \in t) - P(Y = 1 | X \in t)\log P(Y = 1 | X \in t)\]

Entropy

root: \[M_{entropy}(R) = -0.5 \times \log(0.5) -0.5 \times \log(0.5) = 0.6931472\]
Gender:Female: \[M_{entropy}(G:F) = -0.2 \times \log(0.2) -0.8 \times \log(0.8) = 0.5004024\]
Gender:Male: \[M_{entropy}(G:M) = -0.65 \times \log(0.65) -0.35 \times \log(0.35) = 0.6474466\]
Class:IX: \[M_{entropy}(C:4) = -0.43 \times \log(0.43) -0.57 \times \log(0.57) = 0.6833149\]
Class:X: \[M_{entropy}(C:5) = -0.56 \times \log(0.56) -0.44 \times \log(0.44) = 0.6859298\]
Height:5.5+: \[M_{entropy}(H:5.5+) = -0.56 \times \log(0.56) -0.44 \times \log(0.44) = 0.6859298\]
Height:5.5-: \[M_{entropy}(H:5.5-) = -0.42 \times \log(0.42) -0.58 \times \log(0.58) = 0.680292\]

Goodness of split (GOS) criteria using entropy

Given an impurity function \(M(t)\), the GOS criterion is to find the split \(t_L\) and \(t_R\) of note \(t\) such that the impurity measure is maximally decreased:

\[\arg \max_{t_R, t_L} M(t) - [P(X\in t_L|X\in t) M(t_L) + P(X\in t_R|X\in t) M(t_R)]\]

If split on Gender: \[M_{entropy}(R) - \frac{10}{30}M_{entropy}(G:F) - \frac{20}{30}M_{entropy}(G:M) = 0.6931472 - 10/30\times 0.5004024 - 20/30\times 0.6474466 = 0.09471533 \]
If split on Class: \[M_{entropy}(R) - \frac{14}{30}M_{entropy}(C:4) - \frac{16}{30}M_{entropy}(C:5) = 0.6931472 - 14/30\times 0.6833149 - 16/30\times 0.6859298 = 0.008437687 \]
If split on Height:5.5 \[M_{entropy}(R) - \frac{12}{30}M_{entropy}(H:5.5-) - \frac{18}{30}M_{entropy}(H:5.5+) = 0.6931472 - 12/30 \times 0.680292 - 18/30\times 0.6859298 = 0.00947252 \]

Summary of impurity measurement

Categorical outcome
- Gini Index
- Entropy
- Chi-Square
Continuous outcome
- Reduction in Variance

Complexity for each split:

\(O(np)\)

Decision tree

Construct the tree structure:

continue to split the tree until we stop at certain criteria
- Overfitting if we split to the end.
- Underfitting if we didn’t split enough.
- Will discuss when to stop later.

How to make a prediction:

\[\hat{p}_{mk} = \frac{1}{N_m} \sum_{x_i \in t_m} \mathbb{I}(y_i = k)\]

E.g. if a new subject (G:Male, Class:X, H:5.8ft) falls into a node, we just do a majority vote.

Regression tree

\[\hat{c}_{m} = ave(y_i|x_i \in T_m)\]

Regression Trees vs Classification Trees

Regression trees are used when dependent variable is continuous. Classification trees are used when dependent variable is categorical.
In case of regression tree, the value obtained by terminal nodes in the training data is the mean response of observation falling in that region.
In case of classification tree, the value (class) obtained by terminal node in the training data is the mode of observations falling in that region.
Both the trees divide the predictor space (independent variables) into distinct and non-overlapping regions.
Both the trees follow a top-down greedy approach known as recursive binary splitting.
In both the cases, the splitting process results in fully grown trees. Then we prune the tree to tackle overfitting.

Prostate cancer example

library(ElemStatLearn)
library(rpart)
library(rpart.plot)
head(prostate)

##       lcavol  lweight age      lbph svi       lcp gleason pgg45       lpsa
## 1 -0.5798185 2.769459  50 -1.386294   0 -1.386294       6     0 -0.4307829
## 2 -0.9942523 3.319626  58 -1.386294   0 -1.386294       6     0 -0.1625189
## 3 -0.5108256 2.691243  74 -1.386294   0 -1.386294       7    20 -0.1625189
## 4 -1.2039728 3.282789  58 -1.386294   0 -1.386294       6     0 -0.1625189
## 5  0.7514161 3.432373  62 -1.386294   0 -1.386294       6     0  0.3715636
## 6 -1.0498221 3.228826  50 -1.386294   0 -1.386294       6     0  0.7654678
##   train
## 1  TRUE
## 2  TRUE
## 3  TRUE
## 4  TRUE
## 5  TRUE
## 6  TRUE

Apply cart

prostate_train <- subset(prostate, subset = train==TRUE)
prostate_test <- subset(prostate, subset = train==FALSE)

afit <- rpart(svi ~ . - train, data = prostate_train)
afit

## n= 67 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
## 1) root 67 11.641790 0.22388060  
##   2) lcavol< 2.523279 52  2.826923 0.05769231  
##     4) lpsa< 2.993028 43  0.000000 0.00000000 *
##     5) lpsa>=2.993028 9  2.000000 0.33333330 *
##   3) lcavol>=2.523279 15  2.400000 0.80000000 *

Visualize cart result

rpart.plot(afit)

For the top node:

100%: all 67 subjects
0.22: probablity svi = 1

predicting the testing dataset using CART subject

predProb_cart <- predict(object = afit, newdata = prostate_test)
predProb_cart

##         7         9        10        15        22        25        26 
## 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 
##        28        32        34        36        42        44        48 
## 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 
##        49        50        53        54        55        57        62 
## 0.0000000 0.0000000 0.0000000 0.0000000 0.8000000 0.0000000 0.0000000 
##        64        65        66        73        74        80        84 
## 0.0000000 0.0000000 0.0000000 0.3333333 0.3333333 0.8000000 0.8000000 
##        95        97 
## 0.8000000 0.8000000

atable <- table(predictLabel = predProb_cart>0.5, trueLabel = prostate_test$svi)
atable

##             trueLabel
## predictLabel  0  1
##        FALSE 21  4
##        TRUE   3  2

## accuracy
sum(diag(atable)) / sum(atable)

## [1] 0.7666667

Compare with logistic regression

aglm <- glm(svi ~ . - train, data = prostate_train, family = binomial(link = "logit"))
predProb_logistic <- predict(object = aglm, newdata = prostate_test, type="response")
btable <- table(predictLabel = predProb_logistic>0.5, trueLabel = prostate_test$svi)
btable

##             trueLabel
## predictLabel  0  1
##        FALSE 22  4
##        TRUE   2  2

## accuracy
sum(diag(btable)) / sum(btable)

## [1] 0.8

Why CART not logistic regression or linear regression

If the relationship between dependent & independent variable is well approximated by a linear model, linear regression will outperform tree based model.
If there is a high non-linearity & complex relationship between dependent & independent variables, a tree model will outperform a classical regression method.
If you need to build a model which is easy to explain to people, a decision tree model will always do better than a linear model. Decision tree models are even simpler to interpret than linear regression!
CART can also accommodate missing data.

Model complexity (how deep should we keep the tree)

Like every other model, a tree based model also needs to balance the tradeoff of bias and variance.
Bias means, ‘how much on an average are the predicted values different from the actual value.’
Variance means, ‘how different will the predictions of the model be at the same point if different samples are taken from the same population’.

Pruning the tree

Brieman et al (1984) proposed a backward node recombinaiton strategy called minimal cost-complexity pruning.
The subtrees are nested sequence: \[T_{max} = T_1 \subset T_1 \subset \ldots \subset T_m\]
The cost-complexity of \(T\) is \[C_\alpha (T) = \hat{R}(T) + \alpha |T|,\] where
\(\hat{R}(T)\) is the error rate estimated based on tree \(T\).
\(T\) is the tree structure.
\(|T|\) is the number of terminal nodes of \(T\).
\(\alpha\) is a positive complexity parameter that balance between bias and variance (complexity).
\(\alpha\) is estimated from cross-validation.
Another way is to set \(\alpha = \frac{R(T_i) - R(T_{i-1})}{|T_{i-1}| - |T_{i}|}\)

Titanic example (Will be on HW)

library(titanic)
dim(titanic_train)

## [1] 891  12

head(titanic_train)

##   PassengerId Survived Pclass
## 1           1        0      3
## 2           2        1      1
## 3           3        1      3
## 4           4        1      1
## 5           5        0      3
## 6           6        0      3
##                                                  Name    Sex Age SibSp
## 1                             Braund, Mr. Owen Harris   male  22     1
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1
## 3                              Heikkinen, Miss. Laina female  26     0
## 4        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1
## 5                            Allen, Mr. William Henry   male  35     0
## 6                                    Moran, Mr. James   male  NA     0
##   Parch           Ticket    Fare Cabin Embarked
## 1     0        A/5 21171  7.2500              S
## 2     0         PC 17599 71.2833   C85        C
## 3     0 STON/O2. 3101282  7.9250              S
## 4     0           113803 53.1000  C123        S
## 5     0           373450  8.0500              S
## 6     0           330877  8.4583              Q

Bagging

Create Multiple DataSets:
- Sampling with replacement on the original data
- Taking row and column fractions less than 1 helps in making robust models, less prone to overfitting
Build Multiple Classifiers:
- Classifiers are built on each data set.
Combine Classifiers:
- The predictions of all the classifiers are combined using a mean, median or mode value depending on the problem at hand.
- The combined values are generally more robust than a single model.

Random forest algorithm

Assume number of cases in the training set is N. Then, sample of these N cases with replacement, which will be the training set for growing the tree.
If there are M input variables, a number m<M is specified such that at each node, m variables are selected at random out of the M. The best split on these m is used to split the node. The value of m is held constant while we grow the forest.
Each tree is grown to the largest extent possible and there is no pruning.
Predict new data by aggregating the predictions of the ntree trees (i.e., majority votes for classification, average for regression).

Random Forest on prostate cancer example

library("randomForest")

## randomForest 4.6-12

## Type rfNews() to see new features/changes/bug fixes.

prostate_train <- subset(prostate, subset = train==TRUE)
prostate_test <- subset(prostate, subset = train==FALSE)

rfit <- randomForest(as.factor(svi) ~ . - train, data = prostate_train)
rfit

## 
## Call:
##  randomForest(formula = as.factor(svi) ~ . - train, data = prostate_train) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 13.43%
## Confusion matrix:
##    0 1 class.error
## 0 49 3  0.05769231
## 1  6 9  0.40000000

Random Forest on prostate cancer example

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 3.3.2

## 
## Attaching package: 'ggplot2'

## The following object is masked from 'package:randomForest':
## 
##     margin

imp <- importance(rfit)
impData <- data.frame(cov = rownames(imp), importance=imp[,1])

ggplot(impData) + aes(x=cov, y=importance, fill=cov) + geom_bar(stat="identity")

Random Forest on prostate prediction

pred_logistic <- predict(rfit, prostate_test)

ctable <- table(pred_logistic, trueLabel = prostate_test$svi)
ctable

##              trueLabel
## pred_logistic  0  1
##             0 22  4
##             1  2  2

## accuracy
sum(diag(btable)) / sum(btable)

## [1] 0.8

Will be on homework

Apply random forest on Titanic dataset to predict survival.

What are the importance factors?
How does the performance compare to CART and logistic regression.

reference

https://www.analyticsvidhya.com/blog/2016/04/complete-tutorial-tree-based-modeling-scratch-in-python/
Element of stat learning

knitr::purl("cart.rmd", output = "cart.R ", documentation = 2)

## 
## 
## processing file: cart.rmd

## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |...                                                              |   5%
  |                                                                       
  |......                                                           |   9%
  |                                                                       
  |.........                                                        |  14%
  |                                                                       
  |............                                                     |  18%
  |                                                                       
  |...............                                                  |  23%
  |                                                                       
  |..................                                               |  27%
  |                                                                       
  |.....................                                            |  32%
  |                                                                       
  |........................                                         |  36%
  |                                                                       
  |...........................                                      |  41%
  |                                                                       
  |..............................                                   |  45%
  |                                                                       
  |................................                                 |  50%
  |                                                                       
  |...................................                              |  55%
  |                                                                       
  |......................................                           |  59%
  |                                                                       
  |.........................................                        |  64%
  |                                                                       
  |............................................                     |  68%
  |                                                                       
  |...............................................                  |  73%
  |                                                                       
  |..................................................               |  77%
  |                                                                       
  |.....................................................            |  82%
  |                                                                       
  |........................................................         |  86%
  |                                                                       
  |...........................................................      |  91%
  |                                                                       
  |..............................................................   |  95%
  |                                                                       
  |.................................................................| 100%

## output file: cart.R

## [1] "cart.R "

Biostatistical Computing, PHC 6068

Decision Tree

Outline

Supervised machine learning and unsupervised machine learning

Classification (supervised machine learning)

Clustering (unsupervised machine learning)

Decision Tree

Decision Tree

Motivating example

Decision tree

Decision tree structure

How does a tree decide where to split?

Recommended measure of impurity

Gini Index

Gini Index

Goodness of split (GOS) criteria using Gini Index

Entropy

Entropy

Goodness of split (GOS) criteria using entropy

Summary of impurity measurement

Decision tree

Regression tree

Regression Trees vs Classification Trees

Prostate cancer example

Apply cart

Visualize cart result

predicting the testing dataset using CART subject

Compare with logistic regression

Why CART not logistic regression or linear regression

Model complexity (how deep should we keep the tree)

Pruning the tree

Titanic example (Will be on HW)

Bagging

Bagging

Random forest

Random forest algorithm

Random Forest on prostate cancer example

Random Forest on prostate cancer example

Random Forest on prostate prediction

Will be on homework

reference