Biostatistical Computing, PHC 6068

Supervised Machine Learning

Zhiguang Huo (Caleb)

Monday November 15, 2021

Outline

Supervised machine learning and unsupervised machine learning

  1. Classification (supervised machine learning):
    • With the class label known, build a model (classifier) to predict labels of future observations.
  2. Clustering (unsupervised machine learning):
    • Without knowing the class label, cluster the data according to their similarity.

Classification (supervised machine learning)

Clustering (unsupervised machine learning)

Decision Tree (what the result looks like)

library(ElemStatLearn)
library(rpart)
library(rpart.plot)
afit <- rpart(as.factor(svi) ~ . - train, data = prostate)
rpart.plot(afit)

Decision Tree

Motivating example

Purpose: predict whether a student will play cricket in his/her leisure time, based on three variables (gender, class, and height).

Decision tree

Questions:

Decision tree structure

Terminology:

How does a tree decide where to split?

Gini Index

Assume:

\[M_{Gini}(t) = 1 - P(Y = 0 | X \in t)^2 - P(Y = 1 | X \in t)^2\]
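A minimal sketch in R of this calculation, using a hypothetical toy node with binary 0/1 outcomes:

gini <- function(y){
  p1 <- mean(y == 1)          ## P(Y = 1 | X in t)
  1 - p1^2 - (1 - p1)^2       ## M_Gini(t)
}
gini(c(0, 0, 1, 1, 1))        ## toy node with 2 zeros and 3 ones: 1 - 0.4^2 - 0.6^2 = 0.48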

Gini Index

Goodness of split (GOS) criteria using Gini Index

Given an impurity function \(M(t)\), the GOS criterion is to find the split of node \(t\) into \(t_L\) and \(t_R\) such that the impurity measure is maximally decreased:

\[\arg \max_{t_R, t_L} M(t) - [P(X\in t_L|X\in t) M(t_L) + P(X\in t_R|X\in t) M(t_R)]\]

Therefore, we will split on Gender, which gives the maximal decrease in impurity.
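A minimal sketch of this goodness-of-split calculation for a hypothetical gender split (toy 0/1 outcomes invented for illustration; gini() as in the sketch above):

gini <- function(y){p1 <- mean(y == 1); 1 - p1^2 - (1 - p1)^2}  ## node impurity, as above

y      <- c(1, 1, 1, 0, 0, 0, 1, 0, 1, 0)                       ## outcomes of the subjects in node t
gender <- c("F", "F", "F", "F", "M", "M", "M", "M", "M", "M")   ## candidate splitting variable
pL <- mean(gender == "F")                                        ## P(X in t_L | X in t)
gini(y) - (pL * gini(y[gender == "F"]) + (1 - pL) * gini(y[gender == "M"]))  ## decrease in impurity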

Entropy

Assume:

\[M_{entropy}(t) = - P(Y = 0 | X \in t)\log P(Y = 0 | X \in t) - P(Y = 1 | X \in t)\log P(Y = 1 | X \in t)\]
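A minimal sketch in R of the entropy impurity (natural log, with the convention that 0 log 0 is treated as 0):

entropyImpurity <- function(y){
  p <- c(mean(y == 0), mean(y == 1))      ## P(Y = 0 | X in t), P(Y = 1 | X in t)
  -sum(ifelse(p == 0, 0, p * log(p)))     ## M_entropy(t)
}
entropyImpurity(c(0, 0, 1, 1, 1))         ## toy node with P(Y = 1 | X in t) = 0.6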

Entropy

Goodness of split (GOS) criteria using entropy

Given an impurity function \(M(t)\), the GOS criterion is to find the split of node \(t\) into \(t_L\) and \(t_R\) such that the impurity measure is maximally decreased:

\[\arg \max_{t_R, t_L} M(t) - [P(X\in t_L|X\in t) M(t_L) + P(X\in t_R|X\in t) M(t_R)]\]

Therefore, we will split on Gender, which gives the maximal decrease in impurity.

Summary of impurity measurement

Complexity for each split:

Decision tree

How to make a prediction:

\[\hat{p}_{mk} = \frac{1}{N_m} \sum_{x_i \in t_m} \mathbb{I}(y_i = k)\]

E.g., if a new subject (Gender: Male, Class: X, Height: 5.8 ft) falls into a terminal node, we just do a majority vote within that node.
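A minimal sketch with rpart, assuming the afit tree fitted in the earlier prostate example: type = "prob" returns the terminal-node class proportions \(\hat{p}_{mk}\), and type = "class" applies the majority vote.

head(predict(afit, type = "prob"))    ## class proportions of the terminal node each subject falls in
head(predict(afit, type = "class"))   ## majority-vote class label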

Regression Trees vs Classification Trees

Prostate cancer example

library(ElemStatLearn)
library(rpart)
library(rpart.plot)
knitr::kable(head(prostate))
| lcavol | lweight | age | lbph | svi | lcp | gleason | pgg45 | lpsa | train |
|---|---|---|---|---|---|---|---|---|---|
| -0.5798185 | 2.769459 | 50 | -1.386294 | 0 | -1.386294 | 6 | 0 | -0.4307829 | TRUE |
| -0.9942523 | 3.319626 | 58 | -1.386294 | 0 | -1.386294 | 6 | 0 | -0.1625189 | TRUE |
| -0.5108256 | 2.691243 | 74 | -1.386294 | 0 | -1.386294 | 7 | 20 | -0.1625189 | TRUE |
| -1.2039728 | 3.282789 | 58 | -1.386294 | 0 | -1.386294 | 6 | 0 | -0.1625189 | TRUE |
| 0.7514161 | 3.432373 | 62 | -1.386294 | 0 | -1.386294 | 6 | 0 | 0.3715636 | TRUE |
| -1.0498221 | 3.228826 | 50 | -1.386294 | 0 | -1.386294 | 6 | 0 | 0.7654678 | TRUE |

Training set and testing set

For supervised machine learning, usually we split the data into the training set and the testing set.

Apply cart

prostate_train <- subset(prostate, subset = train==TRUE)
prostate_test <- subset(prostate, subset = train==FALSE)

afit <- rpart(as.factor(svi) ~ . - train, data = prostate_train) ## svi as binary outcome
#afit <- rpart(svi ~ . - train, data = prostate_train) ## svi as continuous outcome
afit
## n= 67 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 67 15 0 (0.77611940 0.22388060)  
##   2) lcavol< 2.523279 52  3 0 (0.94230769 0.05769231) *
##   3) lcavol>=2.523279 15  3 1 (0.20000000 0.80000000) *

Visualize cart result

rpart.plot(afit)

Predicting the testing dataset using the CART classifier

predProb_cart <- predict(object = afit, newdata = prostate_test)
head(predProb_cart)
##            0          1
## 7  0.9423077 0.05769231
## 9  0.9423077 0.05769231
## 10 0.9423077 0.05769231
## 15 0.9423077 0.05769231
## 22 0.9423077 0.05769231
## 25 0.9423077 0.05769231
atable <- table(predictLabel = predProb_cart[,"1"]>0.5, trueLabel = prostate_test$svi)
atable
##             trueLabel
## predictLabel  0  1
##        FALSE 21  4
##        TRUE   3  2
## accuracy
sum(diag(atable)) / sum(atable)
## [1] 0.7666667

Overfitting problem

Pruning the tree (to reduce overfitting)
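A minimal sketch of cost-complexity pruning with rpart, assuming the afit tree fitted earlier (cp is the complexity parameter; rpart reports a cross-validated error for each cp value in the cptable):

printcp(afit)                                ## cross-validated error (xerror) for each cp value
plotcp(afit)                                 ## visualize the CV error versus cp
bestCP <- afit$cptable[which.min(afit$cptable[, "xerror"]), "CP"]
afit_pruned <- prune(afit, cp = bestCP)      ## keep the subtree with the smallest CV error
rpart.plot(afit_pruned)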

Titanic example (Will be on HW)

library(titanic)
dim(titanic_train)
## [1] 891  12
knitr::kable(head(titanic_train))
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.2500 | | S |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.9250 | | S |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.0500 | | S |
| 6 | 0 | 3 | Moran, Mr. James | male | NA | 0 | 0 | 330877 | 8.4583 | | Q |

Bagging

Bagging

  1. Create Multiple DataSets:
    • Sampling with replacement on the original data
    • Taking row and column fractions of less than 1 helps in building robust models that are less prone to overfitting.
  2. Build Multiple Classifiers:
    • Classifiers are built on each data set.
  3. Combine Classifiers:
    • The predictions of all the classifiers are combined using a mean, median or mode value depending on the problem at hand.
    • The combined values are generally more robust than a single model (see the sketch after this list).
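A minimal sketch of these three steps on the prostate data, assuming rpart as the base classifier, B = 100 bootstrap data sets, row resampling only, an arbitrary seed, and a majority vote to combine the classifiers (prostate_train and prostate_test as defined in the earlier example):

library(rpart)
set.seed(32611)
B <- 100
bagPred <- matrix(NA, nrow = nrow(prostate_test), ncol = B)
for(b in 1:B){
  ## 1. create a bootstrap data set by sampling rows with replacement
  aboot <- prostate_train[sample(nrow(prostate_train), replace = TRUE), ]
  ## 2. build a classifier on each bootstrap data set
  bfit <- rpart(as.factor(svi) ~ . - train, data = aboot)
  bagPred[, b] <- predict(bfit, prostate_test)[, "1"] > 0.5
}
## 3. combine the B classifiers by majority vote
bagVote <- as.numeric(rowMeans(bagPred) > 0.5)
table(predictLabel = bagVote, trueLabel = prostate_test$svi)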

Random forest

Random forest algorithm

Assume the number of cases in the training set is N and the number of features is M.

  1. Repeat the following procedures B = 500 times (each time is a CART algorithm):
    1. Sample N cases with replacement.
    2. Sample m<M features without replacement.
    3. Apply the CART algorithm without pruning.
  2. Predict new data by aggregating the predictions of the B trees (i.e., majority vote for classification, average for regression); see the sketch below mapping these choices to randomForest arguments.
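A minimal sketch of how these choices map onto randomForest arguments (assumptions: prostate_train from the earlier example; ntree plays the role of B and mtry the role of m):

library(randomForest)
rf_sketch <- randomForest(as.factor(svi) ~ . - train, data = prostate_train,
                          ntree = 500,      ## B = 500 trees
                          mtry = 3,         ## m features sampled at each split (m < M)
                          replace = TRUE)   ## bootstrap the N cases with replacement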

Random forest

Random Forest on prostate cancer example

library("randomForest")
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
prostate_train <- subset(prostate, subset = train==TRUE)
prostate_test <- subset(prostate, subset = train==FALSE)

rffit <- randomForest(as.factor(svi) ~ . - train, data = prostate_train)
rffit
## 
## Call:
##  randomForest(formula = as.factor(svi) ~ . - train, data = prostate_train) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 14.93%
## Confusion matrix:
##    0 1 class.error
## 0 48 4  0.07692308
## 1  6 9  0.40000000

Random Forest on prostate prediction

pred_rf <- predict(rffit, prostate_test)

ctable <- table(pred_rf, trueLabel = prostate_test$svi)
ctable
##        trueLabel
## pred_rf  0  1
##       0 22  4
##       1  2  2
## accuracy
sum(diag(ctable)) / sum(ctable)
## [1] 0.8

Random Forest importance score

library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
## 
##     margin
imp <- importance(rffit)
impData <- data.frame(cov = rownames(imp), importance=imp[,1])

ggplot(impData) + aes(x=cov, y=importance, fill=cov) + geom_bar(stat="identity")

How to calculate Random Forest importance score
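As a hedged note, the default score reported by the randomForest package is the total decrease in Gini impurity from splits on each variable, averaged over all trees; growing the forest with importance = TRUE additionally gives a permutation-based score (mean decrease in out-of-bag accuracy). A minimal sketch, assuming prostate_train from above:

rffit2 <- randomForest(as.factor(svi) ~ . - train, data = prostate_train,
                       importance = TRUE)   ## also compute permutation-based importance
importance(rffit2)                          ## MeanDecreaseAccuracy and MeanDecreaseGini columns
varImpPlot(rffit2)                          ## built-in importance plot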

Outline for cross validation

Holdout method to evaluate machine learning accuracy

Holdout method example

library(ElemStatLearn)
library(randomForest)

prostate0 <- prostate
prostate0$svi <- as.factor(prostate0$svi)
trainLabel <- prostate$train
prostate0$train <- NULL

prostate_train <- subset(prostate0, trainLabel)
prostate_test <- subset(prostate0, !trainLabel)

rfFit <- randomForest(svi ~ ., data=prostate_train)
rfFit
## 
## Call:
##  randomForest(formula = svi ~ ., data = prostate_train) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 11.94%
## Confusion matrix:
##    0  1 class.error
## 0 49  3  0.05769231
## 1  5 10  0.33333333
rfPred <- predict(rfFit, prostate_test)
rfTable <- table(rfPred, truth=prostate_test$svi)
rfTable
##       truth
## rfPred  0  1
##      0 22  4
##      1  2  2
1 - sum(diag(rfTable))/sum(rfTable) ## error rate
## [1] 0.2

k-fold cross validation

Algorithm:

k-fold cross validation example

## Create K = 5 equally sized folds

K <- 5
folds <- cut(seq(1, nrow(prostate0)), breaks=K, labels=FALSE)

EKs <- numeric(K)

for(k in 1:K){
  atrain <- subset(prostate0, folds != k)
  atest <- subset(prostate0, folds == k)

  krfFit <- randomForest(svi ~ ., data=atrain)
  krfPred <- predict(krfFit, atest)
  krfTable <- table(krfPred, truth=atest$svi)
  
  EKs[k] <- 1 - sum(diag(krfTable))/sum(krfTable)
}

EKs
## [1] 0.00000000 0.05263158 0.00000000 0.31578947 0.30000000
mean(EKs)
## [1] 0.1336842

Leave one out cross validation

For K-fold cross validation

Leave one out cross validation example

n <- nrow(prostate0)

EKs <- numeric(n)

for(k in 1:n){
  #cat("leaving the",k,"sample out!","\n")
  atrain <- prostate0[-k,]
  atest <- prostate0[k,]

  krfFit <- randomForest(svi ~ ., data=atrain)
  krfPred <- predict(krfFit, atest)
  krfTable <- table(krfPred, truth=atest$svi)
  
  EKs[k] <- 1 - sum(diag(krfTable))/sum(krfTable)
}

EKs
##  [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [39] 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 1 0 1 1 0 0
## [77] 0 0 0 1 0 1 1 1 0 0 0 1 0 1 0 1 0 0 0 0 0
mean(EKs)
## [1] 0.1443299

Important note for cross validation

Choosing tuning parameters

Some machine learning classifiers contain tuning parameters

The tuning parameter is crucial since it controls the amount of regularization (model complexity).

Tuning parameter selection

Lasso problem

\[\hat{\beta}^{lasso} = \arg\min_{\beta \in \mathbb{R}^P} \frac{1}{2}\|y - X\beta\|_2^2 + \lambda \|\beta\|_1\]

library(lars)
## Loaded lars 1.2
library(ggplot2)
options(stringsAsFactors=F)
x <- as.matrix(prostate[,1:8])
y <- prostate[,9]
lassoFit <- lars(x, y) ## lar for least angle regression

lambdas <- seq(0,5,0.5)
coefLassoFit <- coef(lassoFit, s=lambdas, mode="lambda") 
adataFrame <- NULL
for(i in 1:ncol(coefLassoFit)){
  acoef <- colnames(coefLassoFit)[i]
  adataFrame0 <- data.frame(lambda=lambdas, coef = coefLassoFit[,i], variable=acoef)
  adataFrame <- rbind(adataFrame, adataFrame0)
}

ggplot(adataFrame) + aes(x=lambda, y=coef, col=variable) + 
  geom_point() +
  geom_path()

How to select the best tuning parameter

Tuning parameter selection algorithm

Choose the tuning parameter \(\hat{\lambda} = \arg \min_{\lambda\in (\lambda_1, \ldots, \lambda_B)} e(\lambda)\)

Implement tuning parameter selection for lasso problem

K <- 10
folds <- cut(seq(1,nrow(prostate)),breaks=K,labels=FALSE)
lambdas <- seq(0.5,5,0.5)

rMSEs <- matrix(NA,nrow=length(lambdas),ncol=K)
rownames(rMSEs) <- lambdas
colnames(rMSEs) <- 1:K

for(k in 1:K){
  atrainx <- as.matrix(prostate[!folds==k,1:8])
  atestx <- as.matrix(prostate[folds==k,1:8])
 
  atrainy <- prostate[!folds==k,9]
  atesty <- prostate[folds==k,9]

  klassoFit <- lars(atrainx, atrainy) ## lar for least angle regression

  predictValue <- predict.lars(klassoFit, atestx, s=lambdas, type="fit",  mode="lambda")
  rMSE <- apply(predictValue$fit,2, function(x) sqrt(sum((x - atesty)^2))) ## root of the sum of squared prediction errors
  
  rMSEs[,k] <- rMSE
}

rMSEdataFrame <- data.frame(lambda=lambdas, rMSE = rowMeans(rMSEs))
ggplot(rMSEdataFrame) + aes(x=lambda, y = rMSE) + geom_point() + geom_path()

Common mistake about tuning parameter selection in machine learning

Combine selecting tuning parameters and evaluating prediction accuracy

  1. Split the data into training set and testing set.
  2. In training set, use cross validation to determine the best tuning parameter.
  3. Use the best tuning parameter and the entire training set to build a classifier.
  4. Evaluation:
    • Predict the testing set
    • Report the accuracy (see the sketch after this list).

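A minimal sketch of these four steps for the lasso on the prostate data, assuming a random 2/3 versus 1/3 training/testing split, 5-fold cross validation within the training set, mean squared error as the accuracy measure, and an arbitrary seed:

library(ElemStatLearn); library(lars)
set.seed(32611)

x <- as.matrix(prostate[, 1:8]); y <- prostate[, 9]
lambdas <- seq(0.5, 5, 0.5)

## Step 1: split the data into a training set and a testing set
trainID <- sample(nrow(prostate), round(2/3 * nrow(prostate)))

## Step 2: cross validation within the training set to pick the tuning parameter
K <- 5
folds <- sample(cut(seq_along(trainID), breaks = K, labels = FALSE))
cvErr <- matrix(NA, nrow = length(lambdas), ncol = K)
for(k in 1:K){
  itrain <- trainID[folds != k]; itest <- trainID[folds == k]
  kfit  <- lars(x[itrain, ], y[itrain])
  kpred <- predict.lars(kfit, x[itest, ], s = lambdas, type = "fit", mode = "lambda")$fit
  cvErr[, k] <- colMeans((kpred - y[itest])^2)
}
bestLambda <- lambdas[which.min(rowMeans(cvErr))]

## Step 3: refit on the entire training set with the best tuning parameter
finalFit <- lars(x[trainID, ], y[trainID])

## Step 4: predict the testing set and report the accuracy (MSE here)
testPred <- predict.lars(finalFit, x[-trainID, ], s = bestLambda, type = "fit", mode = "lambda")$fit
mean((testPred - y[-trainID])^2)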
For Step 1, can we also use cross validation?

Nested Cross validation

  1. Use K-fold cross validation (outer) to split the original data into a training set and a testing set.
  2. For \(k^{th}\) fold training set, use cross validation (inner) to determine the best tuning parameter of the \(k^{th}\) fold.
  3. Use the best tuning parameter of the \(k^{th}\) fold and the \(k^{th}\) fold training set to build a classifier.
  4. Predict the \(k^{th}\) fold testing set and report the accuracy.
  5. Evaluation:
    • Report the average accuracy from the outer K-fold cross validation (see the sketch after this list).
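A minimal sketch of nested cross validation for the same lasso problem, assuming 5 outer folds, 5 inner folds, mean squared error as the accuracy measure, and an arbitrary seed:

library(ElemStatLearn); library(lars)
set.seed(32611)
x <- as.matrix(prostate[, 1:8]); y <- prostate[, 9]
lambdas <- seq(0.5, 5, 0.5)

outerK <- 5; innerK <- 5
outerFolds <- sample(cut(seq_len(nrow(prostate)), breaks = outerK, labels = FALSE))
outerErr <- numeric(outerK)

for(k in 1:outerK){
  ## 1. outer split: fold k is the testing set, the rest is the training set
  trainID <- which(outerFolds != k); testID <- which(outerFolds == k)

  ## 2. inner cross validation on the k-th training set to pick lambda
  innerFolds <- sample(cut(seq_along(trainID), breaks = innerK, labels = FALSE))
  innerErr <- matrix(NA, nrow = length(lambdas), ncol = innerK)
  for(j in 1:innerK){
    itrain <- trainID[innerFolds != j]; itest <- trainID[innerFolds == j]
    jfit  <- lars(x[itrain, ], y[itrain])
    jpred <- predict.lars(jfit, x[itest, ], s = lambdas, type = "fit", mode = "lambda")$fit
    innerErr[, j] <- colMeans((jpred - y[itest])^2)
  }
  bestLambda <- lambdas[which.min(rowMeans(innerErr))]

  ## 3. refit on the k-th training set with the best lambda; 4. predict the k-th testing set
  kfit  <- lars(x[trainID, ], y[trainID])
  kpred <- predict.lars(kfit, x[testID, ], s = bestLambda, type = "fit", mode = "lambda")$fit
  outerErr[k] <- mean((kpred - y[testID])^2)
}

## 5. report the average accuracy over the outer folds
mean(outerErr)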

References