Frontiers of Biostatistics, PHC 6937

Sparse Clustering

Zhiguang Huo (Caleb)

Wed Feb 7, 2018

Outline

Supervised machine learning and unsupervised machine learning

  1. Classification (supervised machine learning):
    • The class labels are known.
    • Build a classifier using a training dataset.
    • Evaluate its performance on an independent testing dataset.
    • Predict the labels of future data.
  2. Clustering (unsupervised machine learning):
    • The class labels are unknown.
    • Cluster the data according to their similarity and obtain class labels.

Classification (supervised machine learning)

Clustering (unsupervised machine learning)

\(k\)-means

\(k\)-means objective function

\[\min_C \sum_{k=1}^K \sum_{i \in C_k} \|x_i - \bar{x}_{C_k}\|^2,\]
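A minimal R sketch (not from the original slides) that evaluates this objective for a given data matrix and cluster assignment; the function name wcss is hypothetical.

## within-cluster sum of squares: rows of x are observations,
## cluster is a vector of cluster labels; smaller is better
wcss <- function(x, cluster){
  sum(sapply(unique(cluster), function(k){
    xk <- x[cluster == k, , drop = FALSE]
    center <- colMeans(xk)          ## cluster centroid
    sum(sweep(xk, 2, center)^2)     ## squared distances to the centroid
  }))
}

## example: objective value of a 3-cluster k-means solution on the iris features
set.seed(32611)
wcss(as.matrix(iris[, 1:4]), kmeans(iris[, 1:4], 3)$cluster)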

\(k\)-means algorithm

  1. Choose \(K\) centers \(c_1, \ldots, c_K\) as starting values.
  2. Form the clusters \(C_1, \ldots, C_K\) as follows.
    • Let \(g = (g_1, \ldots, g_n)\)
    • \(g_i = \arg \min_k \|x_i - c_k\|^2\).
    • \(C_k = \{i: g_i = k\}\)
  3. For \(k = 1, \ldots, K\), set \[c_k \rightarrow \frac{1}{|C_k|} \sum_{i\in C_k} x_i\]
  4. Repeat steps 2 and 3 until convergence.
  5. Output: centers \(c_1, \ldots, c_K\) and clusters \(C_1, \ldots, C_K\).

\(k\)-means 2-dimensional example

library(MASS)

## simulate three bivariate normal clusters with N = 100 points each
set.seed(32611)
N <- 100
d1 <- mvrnorm(N, c(0, 3),   matrix(c(2, 1, 1, 2), 2, 2))
d2 <- mvrnorm(N, c(-2, -2), matrix(c(2, 1, 1, 2), 2, 2))
d3 <- mvrnorm(N, c(2, -2),  matrix(c(2, 1, 1, 2), 2, 2))
d <- rbind(d1, d2, d3)
colnames(d) <- c("x", "y")
label <- c(rep("1", N), rep("2", N), rep("3", N))

plot(d, pch = 19, col = as.numeric(label))
legend("topleft", legend = unique(label), col = unique(as.numeric(label)), pch = 19)

\(k\)-means demonstration

## randomly initialize K = 3 cluster centers
K <- 3
set.seed(32611)
centers <- mvrnorm(K, mu = c(0, 0), Sigma = diag(c(1, 1)))
colnames(centers) <- c("x", "y")

plot(d, pch = 19)
points(centers, col = 2:4, pch = 9, cex = 2)


## Euclidean (L2) norm of a vector
l2n <- function(avec){
  return(sqrt(sum(avec^2)))
}

## step 2: update group labels by assigning each point to its nearest center
groupsDist <- matrix(0, nrow = nrow(d), ncol = K)
for(k in 1:K){
  vecDiff <- t(d) - centers[k, ]       ## coordinate-wise differences from center k
  al2n <- apply(vecDiff, 2, l2n)       ## Euclidean distance of every point to center k
  groupsDist[, k] <- al2n
}
groups <- apply(groupsDist, 1, which.min)

plot(d, pch=19, col=groups + 1)
points(centers, pch=9, cex=2)



## step 3: recompute each center as the mean of its assigned points
for(k in 1:K){
  asubset <- d[groups == k, ]
  centers[k, ] <- colMeans(asubset)
}

groups0 <- groups   ## save the current labels

plot(d, pch=19)
points(centers, col = 2:4, pch=9, cex=2)
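The two update steps can be repeated until the assignments stop changing; the loop below is a sketch added for illustration, reusing the objects d, centers, K, groupsDist, groups, and l2n defined above (and assuming no cluster ever becomes empty).

## iterate steps 2 and 3 until the group labels no longer change
repeat{
  groupsOld <- groups
  for(k in 1:K){                                          ## step 2: reassign points
    groupsDist[, k] <- apply(t(d) - centers[k, ], 2, l2n)
  }
  groups <- apply(groupsDist, 1, which.min)
  for(k in 1:K){                                          ## step 3: recompute centers
    centers[k, ] <- colMeans(d[groups == k, , drop = FALSE])
  }
  if(all(groups == groupsOld)) break                      ## converged
}

plot(d, pch = 19, col = groups + 1)
points(centers, col = 2:4, pch = 9, cex = 2)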

\(k\)-means in R

set.seed(32611)
akmeans <- kmeans(d, centers = 3)
kmeansCenters <- akmeans$centers
colnames(kmeansCenters) <- c("x","y")

plot(d, pch=19, col=akmeans$cluster + 1)
points(kmeansCenters, pch=9, cex=2)
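Since k-means only finds a local optimum of its objective, it is common to use several random starts and keep the best solution; the nstart argument below is standard in kmeans and shown here as a suggestion.

## 20 random starts; the solution with the smallest total
## within-cluster sum of squares is kept
set.seed(32611)
akmeans20 <- kmeans(d, centers = 3, nstart = 20)
akmeans20$tot.withinss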

When \(p>2\): the iris data example

iris.data <- iris[,1:4]
## project onto the first two principal components for visualization
ir.pca <- prcomp(iris.data,
                 center = TRUE,
                 scale. = TRUE)
PC1 <- ir.pca$x[,"PC1"]
PC2 <- ir.pca$x[,"PC2"]
variance <- ir.pca$sdev^2 / sum(ir.pca$sdev^2)
v1 <- paste0("variance: ", signif(variance[1] * 100, 3), "%")
v2 <- paste0("variance: ", signif(variance[2] * 100, 3), "%")
plot(PC1, PC2, col = as.numeric(iris$Species), pch = 19, xlab = v1, ylab = v2)
legend("topright", legend = levels(iris$Species), col = unique(iris$Species), pch = 19)

Apply \(k\)-means

set.seed(32611)
kmeans_iris <- kmeans(iris.data, 3)

plot(PC1, PC2, col=as.numeric(kmeans_iris$cluster),pch=19, xlab=v1, ylab=v2)
legend("topright", legend = unique(kmeans_iris$cluster), col =  unique(kmeans_iris$cluster), pch = 19)

Put together (original labels and \(k\)-means labels)

par(mfrow=c(1,2))
plot(PC1, PC2, col=as.numeric(iris$Species),pch=19, xlab=v1, ylab=v2, main="true label")
legend("topright", legend = levels(iris$Species), col =  unique(iris$Species), pch = 19)

plot(PC1, PC2, col=as.numeric(kmeans_iris$cluster),pch=19, xlab=v1, ylab=v2, main="kmeans label")
legend("topright", legend = unique(kmeans_iris$cluster), col =  unique(kmeans_iris$cluster), pch = 19)
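Beyond the visual comparison, a cross-tabulation of the true species against the k-means labels shows how well the two agree (the numbering of the clusters themselves is arbitrary).

## confusion table: rows are true species, columns are k-means cluster labels
table(iris$Species, kmeans_iris$cluster)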

When \(p>n\)

dim(iris.data)
## [1] 150   4

The iris data has only \(p = 4\) features for \(n = 150\) samples. In genomic applications, however, we often have many more features than samples (\(p \gg n\)), and most features are uninformative about the cluster structure. This is the setting that motivates sparse clustering.
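To mimic such a setting, a data set with p much larger than n can be simulated as below (an illustrative sketch; the dimensions and effect size are arbitrary).

## n = 60 samples, p = 1000 features; only the first 50 features
## separate two underlying groups of 30 samples each
set.seed(32611)
n <- 60; p <- 1000
xbig <- matrix(rnorm(n * p), nrow = n, ncol = p)
xbig[1:30, 1:50] <- xbig[1:30, 1:50] + 2
dim(xbig)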

A Simple Approach to Sparse Clustering

Review of regularization

Intuition for lasso and ridge regression

Figures from the book The Elements of Statistical Learning.
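As a quick reminder (added here for reference), ridge and lasso regression differ only in the penalty placed on the coefficients:

\[\hat{\beta}^{ridge} = \arg\min_\beta \|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2, \qquad \hat{\beta}^{lasso} = \arg\min_\beta \|y - X\beta\|_2^2 + \lambda \|\beta\|_1.\]

The \(\ell_1\) (lasso) penalty shrinks some coefficients exactly to zero; sparse \(k\)-means borrows this idea to select features.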

Sparse \(k\)-means (1)

The standard \(k\)-means objective

\[\min_C \sum_{k=1}^K \sum_{i \in C_k} \|x_i - \bar{x}_{C_k}\|^2\]

can be written feature by feature:

\[\min_C \sum_{j=1}^p \sum_{k=1}^K \sum_{i \in C_k} (x_{ji} - \bar{x}_{jC_k})^2.\]

A first attempt at sparsity gives each feature \(j\) a non-negative weight \(w_j\) and adds a lasso-type penalty:

\[\min_{C, \textbf{w}} \sum_{j=1}^p w_j \sum_{k=1}^K \sum_{i \in C_k} (x_{ji} - \bar{x}_{jC_k})^2 + \|\textbf{w}\|_1,\] such that \(w_j \ge 0, \forall j\).

Problem with (1)

\[\min_{C, \textbf{w}} \sum_{j=1}^p w_j \sum_{k=1}^K \sum_{i \in C_k} (x_{ji} - \bar{x}_{jC_k})^2 + \|\textbf{w}\|_1,\] such that \(w_j \ge 0, \forall j\).

Every term is non-negative and we are minimizing, so the optimum is the trivial solution \(w_1 = \ldots = w_p = 0\): the penalty only pushes the weights further toward zero.

Sparse \(k\)-means (2)

Note that minimizing the within-cluster sum of squares

\[\min_C \sum_{k=1}^K \sum_{i \in C_k} \|x_i - \bar{x}_{C_k}\|^2\]

is equivalent to maximizing the between-cluster sum of squares

\[\max_C \sum_{i} \|x_i - \bar{x}\|^2 - \sum_{k=1}^K \sum_{i \in C_k} \|x_i - \bar{x}_{C_k}\|^2,\]

since the total sum of squares \(\sum_{i} \|x_i - \bar{x}\|^2\) does not depend on \(C\). We can therefore write the objective function of sparse \(k\)-means as

\[\max_{C, \textbf{w}} \sum_{j=1}^p w_j \left(\sum_{i} (x_{ji} - \bar{x}_j)^2 - \sum_{k=1}^K \sum_{i \in C_k} (x_{ji} - \bar{x}_{jC_k})^2 \right) + \|\textbf{w}\|_1,\] such that \(w_j \ge 0, \forall j\).

Problem with (2)

\[\max_{C, \textbf{w}} \sum_{j=1}^p w_j \left(\sum_{i} (x_{ji} - \bar{x}_j)^2 - \sum_{k=1}^K \sum_{i \in C_k} (x_{ji} - \bar{x}_{jC_k})^2 \right) + \|\textbf{w}\|_1,\] such that \(w_j \ge 0, \forall j\).

Now the objective increases without bound as the \(w_j\) grow: both the between-cluster term and the \(\ell_1\) term reward large weights, so there is no finite maximizer.

Sparse \(k\)-means (3)

Adding an \(\ell_2\) constraint on \(\textbf{w}\) keeps the objective bounded:

\[\max_{C, \textbf{w}} \sum_{j=1}^p w_j \left(\sum_{i} (x_{ji} - \bar{x}_j)^2 - \sum_{k=1}^K \sum_{i \in C_k} (x_{ji} - \bar{x}_{jC_k})^2 \right) + \|\textbf{w}\|_1,\] such that \(w_j \ge 0, \forall j\), \(\| \textbf{w} \|_2^2 \le 1\).

Sparse \(k\)-means (4)

Finally, moving the \(\ell_1\) term into a constraint yields the sparse \(k\)-means formulation:

\[\max_{C, \textbf{w}} \sum_{j=1}^p w_j \left(\sum_{i} (x_{ji} - \bar{x}_j)^2 - \sum_{k=1}^K \sum_{i \in C_k} (x_{ji} - \bar{x}_{jC_k})^2 \right),\] such that \(w_j \ge 0, \forall j\), \(\| \textbf{w}\|_1 \le \mu\), and \(\| \textbf{w} \|_2^2 \le 1\).

The tuning parameter \(\mu\) controls sparsity: with \(\|\textbf{w}\|_2 \le 1\), values of \(\mu\) between \(1\) and \(\sqrt{p}\) range from a single non-zero weight to all features weighted equally.

Optimization for sparse \(k\)-means

\[\max_{C, \textbf{w}} \sum_{j=1}^p w_j \left(\sum_{i} (x_{ji} - \bar{x}_j)^2 - \sum_{k=1}^K \sum_{i \in C_k} (x_{ji} - \bar{x}_{jC_k})^2 \right),\] such that \(w_j \ge 0, \forall j\), \(\| \textbf{w}\|_1 \le \mu\), and \(\| \textbf{w} \|_2^2 \le 1\).

  1. Initialize \(\textbf{w}\) as \(w_1 = \ldots = w_p = \frac{1}{\sqrt{p}}\).
  2. Fixing \(\textbf{w}\), update \(C\) by weighted \(k\)-means.
  3. Fixing \(C\), update \(\textbf{w}\) using the KKT conditions (see the sketch after this list).
  4. Iterate steps 2 and 3 until convergence.
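For step 3, with \(C\) fixed, the weights have a closed-form update derived from the KKT conditions (stated here as a sketch). Let \(a_j = \sum_{i} (x_{ji} - \bar{x}_j)^2 - \sum_{k=1}^K \sum_{i \in C_k} (x_{ji} - \bar{x}_{jC_k})^2\) denote the between-cluster sum of squares of feature \(j\). Then

\[w_j = \frac{S(a_j^+, \Delta)}{\|S(\textbf{a}^+, \Delta)\|_2}, \qquad S(a, \Delta) = \text{sign}(a)(|a| - \Delta)_+,\]

where \(a^+ = \max(a, 0)\), \(\Delta = 0\) if this already gives \(\|\textbf{w}\|_1 \le \mu\), and otherwise \(\Delta > 0\) is chosen (e.g., by binary search) so that \(\|\textbf{w}\|_1 = \mu\). The soft-thresholding operator \(S\) sets the weights of features with small between-cluster sums of squares exactly to zero.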

R package for sparse \(k\)-means

library(sparcl)

## simulate 50 samples and 70 features; only the first 20 features
## separate the first 25 samples from the remaining 25
set.seed(11)
x <- matrix(rnorm(50*70), ncol = 70)
x[1:25, 1:20] <- x[1:25, 1:20] + 1
x <- scale(x, TRUE, TRUE)

## run sparse k-means with K = 2 clusters and L1 bound wbounds = 3
km.out <- KMeansSparseCluster(x, K = 2, wbounds = 3)
## 012
print(km.out)
## Wbound is  3 :
## Number of non-zero weights:  13
## Sum of weights:  3.00005
## Clustering:  1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 
## 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2
plot(km.out)
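In practice the bound wbounds is chosen by the permutation-based gap statistic implemented in sparcl; the sketch below illustrates that workflow (the grid of candidate values and nperms are only illustrative).

## choose the L1 bound by the gap statistic over a grid of candidate values
set.seed(11)
km.perm <- KMeansSparseCluster.permute(x, K = 2, wbounds = seq(2, 7, len = 10), nperms = 5)
print(km.perm)

## refit sparse k-means at the selected bound
km.best <- KMeansSparseCluster(x, K = 2, wbounds = km.perm$bestw)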

Exercise (Will be on homework)

References