Zhiguang Huo (Caleb)
Monday November 15, 2021
Purpose: predict whether a student will play cricket in his/her leisure time, based on his/her three variables.
Questions:
Terminology:
Assume:
\[M_{Gini}(t) = 1 - P(Y = 0 | X \in t)^2 - P(Y = 1 | X \in t)^2\]
Given an impurity function \(M(t)\), the GOS criterion is to find the split \(t_L\) and \(t_R\) of note \(t\) such that the impurity measure is maximally decreased:
\[\arg \max_{t_R, t_L} M(t) - [P(X\in t_L|X\in t) M(t_L) + P(X\in t_R|X\in t) M(t_R)]\]
If split on Gender: \[M_{Gini}(R) - \frac{10}{30}M_{Gini}(G:F) - \frac{20}{30}M_{Gini}(G:M) = 0.5 - 10/30\times 0.32 - 20/30\times 0.455 = 0.09 \]
If split on Class: \[M_{Gini}(R) - \frac{14}{30}M_{Gini}(C:4) - \frac{16}{30}M_{Gini}(C:5) = 0.5 - 14/30\times 0.4902 - 16/30\times 0.4928 = 0.008 \]
If split on Height:5.5 \[M_{Gini}(R) - \frac{12}{30}M_{Gini}(H:5.5-) - \frac{18}{30}M_{Gini}(H:5.5+) = 0.5 - 12/30*0.4872 - 18/30*0.4928 = 0.00944 \]
Therefore, we will split based on Gender (maximized decrease).
Assume:
\[M_{entropy}(t) = - P(Y = 0 | X \in t)\log P(Y = 0 | X \in t) - P(Y = 1 | X \in t)\log P(Y = 1 | X \in t)\]
Given an impurity function \(M(t)\), the GOS criterion is to find the split \(t_L\) and \(t_R\) of note \(t\) such that the impurity measure is maximally decreased:
\[\arg \max_{t_R, t_L} M(t) - [P(X\in t_L|X\in t) M(t_L) + P(X\in t_R|X\in t) M(t_R)]\]
If split on Gender: \[M_{entropy}(R) - \frac{10}{30}M_{entropy}(G:F) - \frac{20}{30}M_{entropy}(G:M) = 0.6931472 - 10/30\times 0.5004024 - 20/30\times 0.6474466 = 0.09471533 \]
If split on Class: \[M_{entropy}(R) - \frac{14}{30}M_{entropy}(C:4) - \frac{16}{30}M_{entropy}(C:5) = 0.6931472 - 14/30\times 0.6833149 - 16/30\times 0.6859298 = 0.008437687 \]
If split on Height:5.5 \[M_{entropy}(R) - \frac{12}{30}M_{entropy}(H:5.5-) - \frac{18}{30}M_{entropy}(H:5.5+) = 0.6931472 - 12/30 \times 0.680292 - 18/30\times 0.6859298 = 0.00947252 \]
Therefore, we will split based on Gender (maximized decrease).
Complexity for each split:
How to make a prediction:
\[\hat{p}_{mk} = \frac{1}{N_m} \sum_{x_i \in t_m} \mathbb{I}(y_i = k)\]
E.g. if a new subject (G:Male, Class:X, H:5.8ft) falls into a terminal node, we just do a majority vote.
lcavol | lweight | age | lbph | svi | lcp | gleason | pgg45 | lpsa | train |
---|---|---|---|---|---|---|---|---|---|
-0.5798185 | 2.769459 | 50 | -1.386294 | 0 | -1.386294 | 6 | 0 | -0.4307829 | TRUE |
-0.9942523 | 3.319626 | 58 | -1.386294 | 0 | -1.386294 | 6 | 0 | -0.1625189 | TRUE |
-0.5108256 | 2.691243 | 74 | -1.386294 | 0 | -1.386294 | 7 | 20 | -0.1625189 | TRUE |
-1.2039728 | 3.282789 | 58 | -1.386294 | 0 | -1.386294 | 6 | 0 | -0.1625189 | TRUE |
0.7514161 | 3.432373 | 62 | -1.386294 | 0 | -1.386294 | 6 | 0 | 0.3715636 | TRUE |
-1.0498221 | 3.228826 | 50 | -1.386294 | 0 | -1.386294 | 6 | 0 | 0.7654678 | TRUE |
For supervised machine learning, usually we split the data into the training set and the testing set.
prostate_train <- subset(prostate, subset = train==TRUE)
prostate_test <- subset(prostate, subset = train==FALSE)
afit <- rpart(as.factor(svi) ~ . - train, data = prostate_train) ## svi as binary outcome
#afit <- rpart(svi ~ . - train, data = prostate_train) ## svi as continuous outcome
afit
## n= 67
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 67 15 0 (0.77611940 0.22388060)
## 2) lcavol< 2.523279 52 3 0 (0.94230769 0.05769231) *
## 3) lcavol>=2.523279 15 3 1 (0.20000000 0.80000000) *
## 0 1
## 7 0.9423077 0.05769231
## 9 0.9423077 0.05769231
## 10 0.9423077 0.05769231
## 15 0.9423077 0.05769231
## 22 0.9423077 0.05769231
## 25 0.9423077 0.05769231
## trueLabel
## predictLabel 0 1
## FALSE 21 4
## TRUE 3 2
## [1] 0.7666667
The cost-complexity of \(T\) is \[C_\alpha (T) = \hat{R}(T) + \alpha |T|,\] where
Another way is to set \(\alpha = \frac{R(T_i) - R(T_{i-1})}{|T_{i-1}| - |T_{i}|}\)
## [1] 891 12
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.2500 | S | |
2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.9250 | S | |
4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.1000 | C123 | S |
5 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.0500 | S | |
6 | 0 | 3 | Moran, Mr. James | male | NA | 0 | 0 | 330877 | 8.4583 | Q |
Assume number of cases in the training set is N, and number of features is M.
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
prostate_train <- subset(prostate, subset = train==TRUE)
prostate_test <- subset(prostate, subset = train==FALSE)
rffit <- randomForest(as.factor(svi) ~ . - train, data = prostate_train)
rffit
##
## Call:
## randomForest(formula = as.factor(svi) ~ . - train, data = prostate_train)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 14.93%
## Confusion matrix:
## 0 1 class.error
## 0 48 4 0.07692308
## 1 6 9 0.40000000
pred_rf <- predict(rffit, prostate_test)
ctable <- table(pred_rf, trueLabel = prostate_test$svi)
ctable
## trueLabel
## pred_rf 0 1
## 0 22 4
## 1 2 2
## [1] 0.8
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
##
## margin
library(ElemStatLearn)
library(randomForest)
prostate0 <- prostate
prostate0$svi <- as.factor(prostate0$svi)
trainLabel <- prostate$train
prostate0$train <- NULL
prostate_train <- subset(prostate0, trainLabel)
prostate_test <- subset(prostate0, !trainLabel)
rfFit <- randomForest(svi ~ ., data=prostate_train)
rfFit
##
## Call:
## randomForest(formula = svi ~ ., data = prostate_train)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 11.94%
## Confusion matrix:
## 0 1 class.error
## 0 49 3 0.05769231
## 1 5 10 0.33333333
## truth
## rfPred 0 1
## 0 22 4
## 1 2 2
## [1] 0.2
Algorithm:
Split the original dataset \(D\) into \(K\) folds, and obtain \(D_1, \ldots, D_k,\ldots, D_K\) such that \(D_1, \ldots, D_K\) are disjoint; \(|D_k|\) (size of \(D_k\)) is roughly the same for different \(k\) and \(\sum_k |D_k| = |D|\).
For each iteration \(k = 1, \ldots, K\), use the \(D_k\) as the testing dataset and \(D_{-k} = D - D_k\) as the training dataset. Build the classifier \(f^k\) based on \(D_{-k}\). Then predict \(D_k\) and obtain the prediction error rate \(E_k\)
Use \(E = \frac{1}{K} \sum_k E_k\) as the final prediction error rate.
#Create K = 5 equally size folds
K <- 5
folds <- cut(seq(1,nrow(prostate0)),breaks=5,labels=FALSE)
EKs <- numeric(K)
for(k in 1:K){
atrain <- subset(prostate0, folds != k)
atest <- subset(prostate0, folds == k)
krfFit <- randomForest(svi ~ ., data=atrain)
krfPred <- predict(krfFit, atest)
krfTable <- table(krfPred, truth=atest$svi)
EKs[k] <- 1 - sum(diag(krfTable))/sum(krfTable)
}
EKs
## [1] 0.00000000 0.05263158 0.00000000 0.31578947 0.30000000
## [1] 0.1336842
For K-fold cross validation
n <- nrow(prostate0)
EKs <- numeric(n)
for(k in 1:n){
#cat("leaving the",k,"sample out!","\n")
atrain <- prostate0[-k,]
atest <- prostate0[k,]
krfFit <- randomForest(svi ~ ., data=atrain)
krfPred <- predict(krfFit, atest)
krfTable <- table(krfPred, truth=atest$svi)
EKs[k] <- 1 - sum(diag(krfTable))/sum(krfTable)
}
EKs
## [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [39] 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 1 0 1 1 0 0
## [77] 0 0 0 1 0 1 1 1 0 0 0 1 0 1 0 1 0 0 0 0 0
## [1] 0.1443299
Some machine learning classifiers contain tuning parameters
The tuning parameter is crucial since it controls the amount of regularization (model complexity).
\[\hat{\beta}^{lasso} = \frac{1}{2}\arg\min_{\beta \in \mathbb{R}^P} \|y - X\beta\|_2^2 + \lambda \|\beta\|_1\]
## Loaded lars 1.2
library(ggplot2)
options(stringsAsFactors=F)
x <- as.matrix(prostate[,1:8])
y <- prostate[,9]
lassoFit <- lars(x, y) ## lar for least angle regression
lambdas <- seq(0,5,0.5)
coefLassoFit <- coef(lassoFit, s=lambdas, mode="lambda")
adataFrame <- NULL
for(i in 1:ncol(coefLassoFit)){
acoef <- colnames(coefLassoFit)[i]
adataFrame0 <- data.frame(lambda=lambdas, coef = coefLassoFit[,i], variable=acoef)
adataFrame <- rbind(adataFrame, adataFrame0)
}
ggplot(adataFrame) + aes(x=lambda, y=coef, col=variable) +
geom_point() +
geom_path()
Split the original dataset \(D\) into \(K\) folds, and obtain \(D_1, \ldots, D_k,\ldots, D_K\) such that \(D_1, \ldots, D_K\) are disjoint; \(|D_k|\) (size of \(D_k\)) is roughly the same for different \(k\) and \(\sum_k |D_k| = |D|\).
pre-specify a range of tuning parameters \(\lambda_1, \ldots, \lambda_B\)
Use \(e(\lambda_b) = \frac{1}{K} \sum_k e_k(\lambda_b)\) as the average (MSE, rMSE or classification error rate) for tuning parameter \((\lambda_b)\).
Choose the tuning parameter \(\hat{\lambda} = \arg \min_{\lambda\in (\lambda_1, \ldots, \lambda_B)} e(\lambda)\)
Choose the tuning parameter \(\hat{\lambda} = \arg \min_{\lambda\in (\lambda_1, \ldots, \lambda_B)} e(\lambda)\)
Continuous outcome MSE: \[e(\lambda) = \frac{1}{K} \sum_{k=1}^K \sum_{i \in D_k} (y_i - \hat{f}_\lambda^{-k}(x_i))^2\]
Continuous outcome rMSE: \[e(\lambda) = \frac{1}{K} \sum_{k=1}^K \sqrt{\sum_{i \in D_k} (y_i - \hat{f}_\lambda^{-k}(x_i))^2}\]
Categorical outcome variable classification error rate: \[e(\lambda) = \frac{1}{K} \sum_{k=1}^K \frac{1}{|D_k|}\sum_{i \in D_k} \mathbb{I}(y_i \ne \hat{f}_\lambda^{-k}(x_i))\]
K <- 10
folds <- cut(seq(1,nrow(prostate)),breaks=K,labels=FALSE)
lambdas <- seq(0.5,5,0.5)
rMSEs <- matrix(NA,nrow=length(lambdas),ncol=K)
rownames(rMSEs) <- lambdas
colnames(rMSEs) <- 1:K
for(k in 1:K){
atrainx <- as.matrix(prostate[!folds==k,1:8])
atestx <- as.matrix(prostate[folds==k,1:8])
atrainy <- prostate[!folds==k,9]
atesty <- prostate[folds==k,9]
klassoFit <- lars(atrainx, atrainy) ## lar for least angle regression
predictValue <- predict.lars(klassoFit, atestx, s=lambdas, type="fit", mode="lambda")
rMSE <- apply(predictValue$fit,2, function(x) sqrt(sum((x - atesty)^2))) ## square root MSE
rMSEs[,k] <- rMSE
}
rMSEdataFrame <- data.frame(lambda=lambdas, rMSE = rowMeans(rMSEs))
ggplot(rMSEdataFrame) + aes(x=lambda, y = rMSE) + geom_point() + geom_path()
For Step 1, can we also use cross validation?