Zhiguang Huo (Caleb)
Wednesday October 17, 2018
\[\min_C \sum_{k=1}^K \sum_{i \in C_k} \|x_i - \bar{x}_{C_k}\|^2,\]
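Here \(K\) is the number of clusters, \(C_k\) the set of points assigned to cluster \(k\), and \(\bar{x}_{C_k}\) its mean vector. As a quick illustration (not part of the original lecture code), this objective can be evaluated for any data matrix and label vector:
## total within-cluster sum of squares for a given assignment (illustrative helper)
kmeansObjective <- function(data, labels){
  wss <- 0
  for(k in unique(labels)){
    sub <- data[labels == k, , drop = FALSE]
    center <- colMeans(sub)                    ## cluster mean
    wss <- wss + sum(sweep(sub, 2, center)^2)  ## squared deviations from the mean
  }
  return(wss)
}
After the assignment step below, kmeansObjective(d, groups) gives the current value of this objective.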
library(MASS) ## for mvrnorm
K <- 3
set.seed(32611)
## random initialization: draw K centers from a standard bivariate normal
centers <- mvrnorm(K, mu = c(0,0), Sigma = diag(c(1,1)))
colnames(centers) <- c("x", "y")
plot(d, pch=19) ## d: 2-column data matrix, assumed to be defined earlier in the lecture
points(centers, col = 2:4, pch=9, cex=2)
## squared L2 norm (sum of squared entries); which.min over squared
## distances picks the same nearest center as over the distances themselves
l2n <- function(avec){
  return(sum(avec^2))
}
## update group labels: assign each point to its nearest center
groupsDist <- matrix(0, nrow = nrow(d), ncol = K)
for(k in 1:K){
  vecDiff <- t(d) - centers[k,]     ## subtract center k from every point (column-wise)
  al2n <- apply(vecDiff, 2, l2n)    ## squared Euclidean distance to center k
  groupsDist[,k] <- al2n
}
groups <- apply(groupsDist,1,which.min)
plot(d, pch=19, col=groups + 1)
points(centers, pch=9, cex=2)
## update centers: recompute each center as the mean of its assigned points
for(k in 1:K){
  asubset <- d[groups == k, , drop = FALSE]  ## drop = FALSE keeps matrix form for singleton clusters
  centers[k,] <- colMeans(asubset)
}
groups0 <- groups ## save the previous clustering result, in order to test convergence
plot(d, pch=19)
points(centers, col = 2:4, pch=9, cex=2)
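In practice the assignment and update steps are alternated until the labels stop changing. A minimal sketch of the full loop, assuming d, K, centers, and l2n are as defined above:
## alternate assignment and update steps until the cluster labels converge
groups0 <- rep(0, nrow(d))
repeat{
  ## assignment step: nearest center by squared Euclidean distance
  groupsDist <- matrix(0, nrow = nrow(d), ncol = K)
  for(k in 1:K){
    groupsDist[,k] <- apply(t(d) - centers[k,], 2, l2n)
  }
  groups <- apply(groupsDist, 1, which.min)
  ## update step: each center becomes the mean of its assigned points
  for(k in 1:K){
    centers[k,] <- colMeans(d[groups == k, , drop = FALSE])
  }
  if(all(groups == groups0)) break  ## converged: no label changed
  groups0 <- groups
}
In practice one would call the built-in kmeans() function, as done later in this lecture.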
iris.data <- iris[,1:4]
ir.pca <- prcomp(iris.data,
                 center = TRUE,
                 scale. = TRUE)
PC1 <- ir.pca$x[,"PC1"]
PC2 <- ir.pca$x[,"PC2"]
variance <- ir.pca$sdev^2 / sum(ir.pca$sdev^2)
v1 <- paste0("variance: ",signif(variance[1] * 100,3), "%")
v2 <- paste0("variance: ",signif(variance[2] * 100,3), "%")
plot(PC1, PC2, col=as.numeric(iris$Species),pch=19, xlab=v1, ylab=v2)
legend("topright", legend = levels(iris$Species), col = unique(iris$Species), pch = 19)
par(mfrow=c(1,2))
plot(PC1, PC2, col=as.numeric(iris$Species),pch=19, xlab=v1, ylab=v2, main="true label")
legend("topright", legend = levels(iris$Species), col = unique(iris$Species), pch = 19)
plot(PC1, PC2, col=as.numeric(kmeans_iris$cluster),pch=19, xlab=v1, ylab=v2, main="kmeans label")
legend("topright", legend = unique(kmeans_iris$cluster), col = unique(kmeans_iris$cluster), pch = 19)
Choose the number of clusters \(k\) that maximizes the gap statistic
\[Gap_n(k) = E_n^*[\log(W(k))] - \log(W(k)),\]
where \(W(k)\) measures the within-cluster dispersion (the pooled within-cluster sum of squares) for \(k\) clusters and \(E_n^*\) denotes the expectation under a reference (null) distribution of the data.
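For intuition, a small sketch (illustrative only, assuming d is the simulated data used earlier and taking \(W(k)\) as the total within-cluster sum of squares, which kmeans reports as tot.withinss) computes \(\log(W(k))\) over a range of \(k\); clusGap below handles this, plus the reference distribution, automatically:
## log total within-cluster sum of squares for k = 1, ..., 8
logW <- sapply(1:8, function(k){
  set.seed(32611)  ## seed chosen for reproducibility (assumption, not from the lecture)
  log(kmeans(d, centers = k)$tot.withinss)
})
logW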
library(cluster)
gsP.Z <- clusGap(d, FUN = kmeans, K.max = 8, B = 50)
plot(gsP.Z, main = "k = 3 cluster is optimal")
## Clustering Gap statistic ["clusGap"] from call:
## clusGap(x = d, FUNcluster = kmeans, K.max = 8, B = 50)
## B=50 simulated reference sets, k = 1..8; spaceH0="scaledPCA"
## --> Number of clusters (method 'firstSEmax', SE.factor=1): 3
## logW E.logW gap SE.sim
## [1,] 5.769038 6.051226 0.2821886 0.01881499
## [2,] 5.412645 5.733809 0.3211639 0.02536302
## [3,] 5.118928 5.532351 0.4134237 0.02039893
## [4,] 5.015425 5.344578 0.3291527 0.02571314
## [5,] 4.909918 5.234112 0.3241944 0.01961303
## [6,] 4.840721 5.133776 0.2930550 0.02270741
## [7,] 4.708611 5.048810 0.3401989 0.02019276
## [8,] 4.653654 4.976211 0.3225574 0.02284422
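The recommended number of clusters can also be extracted programmatically from the gap table with cluster::maxSE (a sketch using the gsP.Z object fitted above):
## apply the 'firstSEmax' rule to the gap and SE.sim columns of the result table
maxSE(gsP.Z$Tab[,"gap"], gsP.Z$Tab[,"SE.sim"], method = "firstSEmax")
## should return 3, agreeing with the printout above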
suppressMessages(library(fossil))
set.seed(32611)
g1 <- sample(1:2, size=10, replace=TRUE)
g2 <- sample(1:3, size=10, replace=TRUE)
rand.index(g1, g2)
## [1] 0.4444444
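To make the Rand index concrete: it is the fraction of point pairs on which the two labelings agree, i.e. pairs placed together in both or separated in both. An illustrative from-scratch version (not the fossil implementation):
## Rand index by direct pair counting (illustrative re-implementation)
myRandIndex <- function(a, b){
  n <- length(a)
  agree <- 0; total <- 0
  for(i in 1:(n - 1)){
    for(j in (i + 1):n){
      sameA <- a[i] == a[j]   ## together under labeling a?
      sameB <- b[i] == b[j]   ## together under labeling b?
      agree <- agree + as.numeric(sameA == sameB)
      total <- total + 1
    }
  }
  agree / total
}
myRandIndex(g1, g2)  ## should match rand.index(g1, g2) above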
## the adjusted Rand index corrects the Rand index for chance agreement;
## the call producing the output below is assumed to be fossil::adj.rand.index
adj.rand.index(g1, g2)
## [1] 0.1697417
Hierarchical clustering methods produce a tree or dendrogram.
They avoid committing to a particular number of clusters: cutting the tree at different heights yields a partition for each \(k\).
Select a distance measurement \(d\) (e.g., Euclidean distance) and one of the linkage rules below (see the hclust sketch after the definitions)
single linkage \[d(A, B) = \min_{x\in A, y \in B} d(x,y)\]
average linkage \[d(A, B) = \frac{1}{N_AN_B}\sum_{x\in A, y \in B} d(x,y)\]
centroid linkage \[d(A, B) = d(\bar{x}_A, \bar{x}_B),\] where \(\bar{x}_A\) and \(\bar{x}_B\) are the centroids (mean vectors) of clusters \(A\) and \(B\)
complete linkage \[d(A, B) = \max_{x\in A, y \in B} d(x,y)\]
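The cutree example below uses hc and d0, which are not defined in this excerpt. A plausible construction, assuming d0 is the simulated data matrix used earlier and complete linkage:
## assumed construction of hc: Euclidean distance + complete linkage on d0
hc <- hclust(dist(d0), method = "complete")
plot(hc)  ## dendrogram; cutting at different heights yields different numbers of clusters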
res_cutree <- cutree(hc, k = 1:5) #k = 1 is trivial
set.seed(32611)
result_km <- kmeans(d0, centers = 3)
table(res_cutree[,3], result_km$cluster) ## rows: hierarchical clustering (k = 3), columns: K-means clusters
##
## 1 2 3
## 1 50 0 0
## 2 0 50 0
## 3 0 0 50
## Rand index and adjusted Rand index between the two partitions
## (calls assumed; both equal 1 because the contingency table above is perfectly diagonal)
rand.index(res_cutree[,3], result_km$cluster)
## [1] 1
adj.rand.index(res_cutree[,3], result_km$cluster)
## [1] 1