Frontiers of Biostatistics, PHC 6937

Sparse Clustering 2

Zhiguang Huo (Caleb)

Fri Feb 9, 2018

Outline

Hierarchical Clustering example

set.seed(32611)
index <- sample(nrow(iris), 30)          # random subset of 30 samples
iris_subset <- iris[index, 1:4]          # the four numeric features
species_subset <- iris$Species[index]

hc <- hclust(dist(iris_subset))          # complete linkage by default
plot(hc)

Hierarchical Clustering example (with color)

suppressMessages(library(dendextend))
dend <- as.dendrogram(hc)
## color leaf labels by species, in the order the leaves appear in the tree
labels_colors(dend) <- as.numeric(species_subset)[order.dendrogram(dend)]

plot(dend)
legend("topright", legend = levels(species_subset), col = 1:3, pch = 1)

\(K\)-means and hierarchical clustering

Hierarchical methods

Hierarchical clustering illustration

Distance between clusters

Select a distance measurement \(d\) between points (e.g. Euclidean distance); a linkage rule (single, complete, or average) then extends \(d\) to a distance between clusters.
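
Common linkage rules (single, complete, average) turn a pointwise distance into a cluster-to-cluster distance; a minimal base-R illustration with two toy clusters (the point counts and shift are arbitrary choices):

```r
## Two toy clusters of points in R^2
set.seed(32611)
A <- matrix(rnorm(5 * 2), ncol = 2)             # cluster A: 5 points
B <- matrix(rnorm(4 * 2, mean = 3), ncol = 2)   # cluster B: 4 points

## all pairwise distances between points of A and points of B
cross_d <- as.matrix(dist(rbind(A, B)))[1:5, 6:9]

single_link   <- min(cross_d)    # single linkage: nearest pair
complete_link <- max(cross_d)    # complete linkage: farthest pair
average_link  <- mean(cross_d)   # average linkage: mean over all pairs

c(single = single_link, average = average_link, complete = complete_link)
```

By construction single \(\le\) average \(\le\) complete, which is why complete linkage tends to produce compact clusters and single linkage elongated ones.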

Hierarchical tree ordering

Hierarchical clustering algorithm

  1. Input: dissimilarity matrix (distance matrix \(d \in \mathbb{R}^{n \times n}\)).
  2. Let \(T_n = \{C_1, C_2, \ldots, C_n\}\), where each \(C_i\) contains the single observation \(i\).
  3. For \(j = n - 1\) to \(1\):
    1. Find \(l,m\) to minimize \(d(C_l, C_m)\) over all \(C_l, C_m \in T_{j+1}\).
    2. Let \(T_j\) be the same as \(T_{j+1}\) except that \(C_l\) and \(C_m\) are replaced with \(C_l \cup C_m\).
  4. Return the nested sets of clusters \(T_1, \ldots, T_n\).
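
The steps above can be transcribed directly into base R; a naive sketch with complete linkage, stopping early at \(K\) clusters (the function name and early stop are my own choices, and `hclust()` is the production implementation):

```r
## Naive agglomerative clustering: repeatedly merge the closest pair of
## clusters under complete linkage, until only K clusters remain.
naive_hclust <- function(d, K = 1) {
  d <- as.matrix(d)
  n <- nrow(d)
  clusters <- lapply(seq_len(n), identity)  # T_n: every point its own cluster
  while (length(clusters) > K) {
    m <- length(clusters)
    best <- c(Inf, NA, NA)                  # (distance, index l, index r)
    for (l in 1:(m - 1)) for (r in (l + 1):m) {
      ## complete linkage: max pointwise distance between the two clusters
      dlr <- max(d[clusters[[l]], clusters[[r]]])
      if (dlr < best[1]) best <- c(dlr, l, r)
    }
    ## merge the closest pair C_l and C_r
    clusters[[best[2]]] <- c(clusters[[best[2]]], clusters[[best[3]]])
    clusters[[best[3]]] <- NULL
  }
  clusters
}

## two well-separated groups of 5 points each
set.seed(32611)
x <- rbind(matrix(rnorm(10, 0), ncol = 2), matrix(rnorm(10, 5), ncol = 2))
res <- naive_hclust(dist(x), K = 2)
```

This brute-force search is \(O(n^2)\) per merge; `hclust()` uses the same greedy agglomeration but with efficient distance updates.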

Review on sparse \(K\)-means

\[\max_{C, \textbf{w}} \sum_{j=1}^p w_j \left(\sum_{i=1}^n (x_{ji} - \bar{x}_j)^2 -\sum_{k=1}^K \sum_{i \in C_k} (x_{ji} - \bar{x}_{jC_k})^2 \right),\] such that \(w_j \ge 0, \forall j\), \(\| \textbf{w}\|_1 \le \mu\), and \(\| \textbf{w} \|_2^2 \le 1\).

\[\max_{C, \textbf{w}} \sum_{j=1}^p w_j f_j(X_j, C),\] such that \(w_j \ge 0, \forall j\), \(\| \textbf{w}\|_1 \le \mu\), and \(\| \textbf{w} \|_2^2 \le 1\).

Standard hierarchical clustering

Consider the following criterion:

\[\max_U \sum_j\sum_{i,i'} d_{i,i',j}U_{i,i'},\] such that \(\sum_{i,i'}U_{i,i'}^2 \le 1\).
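
This criterion is linear in \(U\) with a quadratic constraint, so its maximizer is simply the feature-aggregated dissimilarity matrix rescaled to unit Frobenius norm (by Cauchy–Schwarz). A small base-R check on toy data (variable names are my own):

```r
## Toy data: n samples, p features, with d[, , j] the pairwise absolute
## differences along feature j.
set.seed(32611)
n <- 6; p <- 3
x <- matrix(rnorm(n * p), n, p)
d <- array(0, dim = c(n, n, p))
for (j in 1:p) d[, , j] <- abs(outer(x[, j], x[, j], "-"))

D_total <- apply(d, c(1, 2), sum)           # aggregate over features
U_opt   <- D_total / sqrt(sum(D_total^2))   # unit Frobenius norm

## objective: sum_j sum_{i,i'} d_{i,i',j} U_{i,i'}
obj <- function(U) sum(D_total * U)

## any other feasible U achieves a smaller value
U_rand <- matrix(rnorm(n * n), n, n)
U_rand <- U_rand / sqrt(sum(U_rand^2))
obj(U_opt) >= obj(U_rand)   # TRUE
```

Because the optimal \(U\) just reproduces the overall dissimilarity, standard hierarchical clustering gives every feature equal say; sparsity enters only once weights \(w_j\) are added.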

Sparse hierarchical clustering

\[\max_{U,\textbf{w}} \sum_j w_j \sum_{i,i'} d_{i,i',j}U_{i,i'},\] such that \(\sum_{i,i'}U_{i,i'}^2 \le 1\), \(w_j \ge 0, \forall j\), \(\| \textbf{w}\|_1 \le \mu\), and \(\| \textbf{w} \|_2^2 \le 1\).

\[\max_{\textbf{u},\textbf{w}} \textbf{u}^\top \textbf{D} \textbf{w},\] such that \(\|\textbf{u}\|^2 \le 1\), \(w_j \ge 0, \forall j\), \(\| \textbf{w}\|_1 \le \mu\), and \(\| \textbf{w} \|_2^2 \le 1\). Here \(\textbf{u} \in \mathbb{R}^{n^2}\) is \(U\) stacked into a vector, and \(\textbf{D} \in \mathbb{R}^{n^2 \times p}\) collects the dissimilarities \(d_{i,i',j}\), with rows indexed by sample pairs \((i,i')\) and columns by features \(j\).

Optimization

\[\max_{\textbf{u},\textbf{w}} \textbf{u}^\top \textbf{D} \textbf{w},\] such that \(\|\textbf{u}\|^2 \le 1\), \(w_j \ge 0, \forall j\), \(\| \textbf{w}\|_1 \le \mu\), and \(\| \textbf{w} \|_2^2 \le 1\).

  1. Initialize \(\textbf{w}\) as \(w_1 = \ldots = w_p = \frac{1}{\sqrt{p}}\).
  2. Iterate until convergence:
    1. Update \(\textbf{u}\) fixing \(\textbf{w}\): \(\textbf{u} \leftarrow \textbf{D}\textbf{w} / \|\textbf{D}\textbf{w}\|_2\) (a linear objective maximized over the unit ball).
    2. Update \(\textbf{w}\) fixing \(\textbf{u}\): \(\textbf{w} \leftarrow S(\textbf{a}, \Delta) / \|S(\textbf{a}, \Delta)\|_2\) with \(\textbf{a} = \textbf{D}^\top \textbf{u}\), where \(S(\cdot, \Delta)\) is the soft-thresholding operator; take \(\Delta = 0\) if that already satisfies \(\|\textbf{w}\|_1 \le \mu\), otherwise choose \(\Delta > 0\) so that \(\|\textbf{w}\|_1 = \mu\).
  3. Rewrite \(\textbf{u}\) as \(U \in \mathbb{R}^{n\times n}\).
  4. Perform hierarchical clustering on the \(n \times n\) dissimilarity matrix \(U\).
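
The alternating optimization can be sketched in base R. This is an illustrative re-implementation under assumed names (`update_w`, `sparse_hc_weights`), not the sparcl package code; it stores \(\textbf{D}\) as an \(n^2 \times p\) matrix and assumes \(\mu \ge 1\) so a feasible weight vector always exists:

```r
## Weight update: soft-threshold the nonnegative vector a, normalize to
## unit L2 norm, with the threshold found by binary search so ||w||_1 <= mu.
update_w <- function(a, mu) {
  w <- a / sqrt(sum(a^2))
  if (sum(w) <= mu) return(w)           # threshold 0 already feasible
  lo <- 0; hi <- max(a)
  for (it in 1:50) {                    # binary search for the threshold
    mid <- (lo + hi) / 2
    w <- pmax(a - mid, 0); w <- w / sqrt(sum(w^2))
    if (sum(w) > mu) lo <- mid else hi <- mid
  }
  w <- pmax(a - hi, 0)                  # hi is feasible by construction
  w / sqrt(sum(w^2))
}

## Alternate the u-update and w-update for a fixed number of iterations.
sparse_hc_weights <- function(D, mu, niter = 15) {
  p <- ncol(D)
  w <- rep(1 / sqrt(p), p)              # step 1: equal initial weights
  for (it in 1:niter) {
    u <- D %*% w
    u <- u / sqrt(sum(u^2))             # step 2a: u <- Dw / ||Dw||_2
    w <- update_w(drop(t(D) %*% u), mu) # step 2b: thresholded, normalized w
  }
  list(u = drop(u), w = w)
}

## toy data: only 2 of 5 features separate the two groups of 20 samples
set.seed(32611)
x <- rbind(matrix(rnorm(20 * 5), 20, 5),
           matrix(rnorm(20 * 5, mean = 2), 20, 5))
x[, 3:5] <- rnorm(40 * 3)               # features 3-5 carry no signal
D <- apply(x, 2, function(v) as.vector(abs(outer(v, v, "-"))))
fit <- sparse_hc_weights(D, mu = 1.2)
round(fit$w, 2)  # weight should concentrate on features 1 and 2
```

At convergence \(\textbf{u} \propto \textbf{D}\textbf{w}\), i.e. \(U\) is the \(w\)-weighted dissimilarity matrix, which is then fed to ordinary hierarchical clustering.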

Application example

library(sparcl)

set.seed(1)
## two classes of 50 samples; only the first 25 of the 50 features differ
x <- matrix(rnorm(100*50),ncol=50)
y <- c(rep(1,50),rep(2,50))
x[y==1,1:25] <- x[y==1,1:25]+2

## sparse hierarchical clustering with L1 bound wbound = 4
sparsehc <- HierarchicalSparseCluster(x=x,wbound=4, method="complete")
## 1234567
plot(sparsehc)

## color the dendrogram leaves by the true class labels
ColorDendrogram(sparsehc$hc,y=y)

Homework:

  1. Sparse \(k\)-means.
    • Apply the sparse \(k\)-means algorithm to the iris data to cluster the samples.
    • Fix the number of clusters at \(K = 3\).
    • Choose the tuning parameter wbounds with the authors' permutation approach (KMeansSparseCluster.permute in the sparcl package).
    • Visualize the data via PCA and color the samples by the cluster labels from sparse \(k\)-means.
  2. Sparse hierarchical clustering.
    • Apply the sparse hierarchical clustering algorithm to the iris data to cluster the samples.
    • Choose the tuning parameter wbounds with the authors' permutation approach (HierarchicalSparseCluster.permute).
    • Visualize the dendrogram (coloring the samples by the underlying species) and the feature-selection plot.

References