Frontiers of Biostatistics, PHC 6937
Sparse Clustering 2
Zhiguang Huo (Caleb)
Fri Feb 9, 2018
Outline
- Hierarchical Clustering algorithm
- Sparse Hierarchical Clustering algorithm
Hierarchical Clustering example
set.seed(32611)
index <- sample(nrow(iris), 30)          # randomly pick 30 samples
iris_subset <- iris[index, 1:4]          # keep the four numeric features
species_subset <- iris$Species[index]    # true species labels, used later for coloring
hc <- hclust(dist(iris_subset))          # Euclidean distance, complete linkage (defaults)
plot(hc)
Hierarchical Clustering example (with color)
suppressMessages(library(dendextend))
dend <- as.dendrogram(hc)
# color the leaf labels by the true species, matched to the dendrogram's leaf order
labels_colors(dend) <- as.numeric(species_subset)[order.dendrogram(dend)]
plot(dend)
legend("topright", legend = levels(species_subset), col = 1:3, pch = 1)
\(K\)-means and hierarchical clustering
- \(K\)-means: Partitioning
- hierarchical clustering: Hierarchical
Hierarchical methods
Hierarchical clustering methods produce a tree or dendrogram.
They avoid the need to specify the number of clusters in advance: cutting the tree at a given level yields a partition into any desired number of clusters \(k\).
- The tree can be built in two distinct ways (a short R illustration follows this list):
- bottom-up: agglomerative clustering.
- top-down: divisive clustering.
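As an added illustration (not from the original slides), both strategies are available in R: hclust is agglomerative, and diana from the cluster package is a divisive counterpart. This sketch assumes the iris_subset object defined in the first example.
library(cluster)
hc_agglomerative <- hclust(dist(iris_subset))   # bottom-up: start from singletons and merge
hc_divisive <- diana(iris_subset)               # top-down: start from one cluster and split
par(mfrow = c(1, 2))
plot(hc_agglomerative, main = "Agglomerative")
pltree(hc_divisive, main = "Divisive")          # plot the divisive clustering tree
par(mfrow = c(1, 1))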
Hierarchical clustering illustration
- After two data points (or clusters) are merged, treat the merged group as a single point in the subsequent steps.
Distance between clusters
Select a distance measure \(d\) (e.g. Euclidean distance); a short R comparison of these linkages follows the list.
- single linkage \[d(A, B) = \min_{x\in A, y \in B} d(x,y)\]
- average linkage \[d(A, B) = \frac{1}{N_A N_B}\sum_{x\in A, y \in B} d(x,y)\]
- centroid linkage \[d(A, B) = d(\bar{x}_A, \bar{x}_B),\] where \(\bar{x}_A\) and \(\bar{x}_B\) are the centroids (mean vectors) of clusters \(A\) and \(B\)
- complete linkage \[d(A, B) = \max_{x\in A, y \in B} d(x,y)\]
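A minimal sketch (added here, reusing the iris_subset from the first example) showing how the method argument of hclust selects the linkage; note that squared Euclidean distances are commonly recommended for centroid linkage.
d <- dist(iris_subset)                  # Euclidean distances
par(mfrow = c(2, 2))
for (linkage in c("single", "average", "centroid", "complete")) {
  plot(hclust(d, method = linkage), main = linkage, xlab = "", sub = "")
}
par(mfrow = c(1, 1))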
Hierarchical tree ordering
- For the same input distance matrix, the left-to-right ordering of the leaves is not unique: the two branches at any merge can be swapped without changing the tree.
Hierarchical clustering algorithm
- Input: dissimilarity matrix (distance matrix \(d \in \mathbb{R}^{n \times n}\))
- Let \(T_n = \{C_1, C_2, \ldots, C_n\}\), where each \(C_i = \{x_i\}\) is a singleton cluster.
- For \(j = n - 1\) to \(1\):
- Find \(l,m\) to minimize \(d(C_l, C_m)\) over all \(C_l, C_m \in T_{j+1}\)
- Let \(T_j\) be the same as \(T_{j+1}\) except that \(C_l\) and \(C_m\) are replaced with \(C_l \cup C_m\)
- Return the sets of clusters \(T_1, \ldots, T_n\)
- The result can be represented as a tree, called a dendrogram.
- We can then cut the tree at different places to yield any number of clusters (see the cutree example below).
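For instance, a brief illustration of cutting the tree with cutree (reusing the hc object fitted on the iris subset earlier):
cutree(hc, k = 2)                          # cluster memberships when cutting into 2 clusters
cutree(hc, k = 3)                          # cluster memberships when cutting into 3 clusters
table(cutree(hc, k = 3), species_subset)   # compare the 3-cluster cut with the true species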
Review on sparse \(K\)-means
\[\max_{C, \textbf{w}} \sum_{j=1}^p w_j \left(\sum_{i=1}^n (x_{ji} - \bar{x}_j)^2 -\sum_{k=1}^K \sum_{i \in C_k} (x_{ji} - \bar{x}_{jC_k})^2 \right),\] such that \(w_j \ge 0, \forall j\), \(\| \textbf{w}\|_1 \le \mu\), and \(\| \textbf{w} \|_2^2 \le 1\).
- A general framework for feature selection in clustering:
\[\max_{C, \textbf{w}} \sum_{j=1}^p w_j f_j(X_j, C),\] such that \(w_j \ge 0, \forall j\), \(\| \textbf{w}\|_1 \le \mu\), and \(\| \textbf{w} \|_2^2 \le 1\).
- We will try to fit hierarchical clustering into this framework; a brief sparse \(K\)-means refresher with the sparcl package is sketched below.
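A quick hedged sketch of sparse \(K\)-means with sparcl::KMeansSparseCluster on simulated data (the simulation setup and the tuning value wbounds = 5 are chosen for illustration only):
library(sparcl)
set.seed(32611)
x <- matrix(rnorm(60 * 500), nrow = 60)         # 60 samples, 500 features
x[1:30, 1:50] <- x[1:30, 1:50] + 1.5            # only the first 50 features separate the classes
km_sparse <- KMeansSparseCluster(x, K = 2, wbounds = 5)
table(km_sparse[[1]]$Cs, rep(1:2, each = 30))   # recovered clusters vs. truth
sum(km_sparse[[1]]$ws > 0)                      # number of features with non-zero weight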
Standard hierarchical clustering
- Perform the hierarchical clustering algorithm on a dissimilarity matrix \(d\).
- If we scale (multiply) \(d\) by a positive constant, the resulting tree does not change.
Consider the following criterion:
\[\max_U \sum_j\sum_{i,i'} d_{i,i',j}U_{i,i'},\] Such that \(\sum_{i,i'}U_{i,i'}^2 \le 1\).
By optimization (the KKT conditions), we can show the optimum \(U^*\) satisfies \[U_{i,i'}^* \propto \sum_j d_{i,i',j}\]
Since hierarchical clustering is invariant to scaling the dissimilarity, applying the hierarchical clustering algorithm to \(U^*\) yields the same tree as standard hierarchical clustering on the overall dissimilarity \(\sum_j d_{i,i',j}\). A small numerical check follows.
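Here is a small added illustration of this equivalence, assuming squared-difference per-feature dissimilarities \(d_{i,i',j} = (x_{ij} - x_{i'j})^2\), so that \(\sum_j d_{i,i',j}\) is the squared Euclidean distance:
set.seed(32611)
X <- matrix(rnorm(20 * 5), nrow = 20)                   # n = 20 samples, p = 5 features
d_feature <- lapply(1:ncol(X), function(j) as.matrix(dist(X[, j]))^2)  # per-feature d_{i,i',j}
U <- Reduce(`+`, d_feature)                             # entries proportional to U*
U <- U / sqrt(sum(U^2))                                 # rescale so the squared entries sum to 1
hc_standard <- hclust(as.dist(as.matrix(dist(X))^2), method = "average")
hc_ustar    <- hclust(as.dist(U), method = "average")
all.equal(hc_standard$merge, hc_ustar$merge)            # TRUE: identical merge sequence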
Sparse hierarchical clustering
\[\max_{U,\textbf{w}} \sum_j w_j \sum_{i,i'} d_{i,i',j}U_{i,i'},\] Such that \(\sum_{i,i'}U_{i,i'}^2 \le 1\), \(w_j \ge 0, \forall j\), \(\| \textbf{w}\|_1 \le \mu\), and \(\| \textbf{w} \|_2^2 \le 1\).
- Interpretation
- A feature \(j\) receives a large weight \(w_j\) when its per-feature dissimilarities \(d_{i,i',j}\) separate the samples well.
- Because of the lasso-type constraint \(\| \textbf{w}\|_1 \le \mu\), \(U^*\) depends on only a subset of the features.
- The standard hierarchical clustering algorithm is then applied to \(U^*\).
- Vectorize the variables:
- \(\textbf{u} \in \mathbb{R}^{n^2}\): the vectorized \(U\)
- \(\textbf{D} \in \mathbb{R}^{n^2\times p}\): column \(j\) is the vectorized per-feature dissimilarity \(\{d_{i,i',j}\}\)
- \(\textbf{w} \in \mathbb{R}^{p}\): the feature weights
- The objective function is equivalent to:
\[\max_{\textbf{u},\textbf{w}} \textbf{u}^\top \textbf{D} \textbf{w},\] Such that \(\|\textbf{u}\|^2 \le 1\), \(w_j \ge 0, \forall j\), \(\| \textbf{w}\|_1 \le \mu\), and \(\| \textbf{w} \|_2^2 \le 1\).
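To make the vectorization concrete, here is an added sketch constructing \(\textbf{D}\), \(\textbf{w}\), and \(\textbf{u}\) for a small simulated data set (squared-difference dissimilarities are assumed):
set.seed(32611)
X <- matrix(rnorm(20 * 5), nrow = 20)       # n = 20 samples, p = 5 features
# column j of D is the vectorized n x n matrix of d_{i,i',j} = (x_{ij} - x_{i'j})^2
D <- sapply(1:ncol(X), function(j) as.vector(as.matrix(dist(X[, j]))^2))   # n^2 x p
w <- rep(1 / sqrt(ncol(X)), ncol(X))        # equal weights with ||w||_2 = 1
u <- as.vector(D %*% w)
u <- u / sqrt(sum(u^2))                     # enforce ||u||_2 <= 1
sum(u * (D %*% w))                          # value of the objective u' D w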
Optimization
\[\max_{\textbf{u},\textbf{w}} \textbf{u}^\top \textbf{D} \textbf{w},\] Such that \(\|\textbf{u}\|^2 \le 1\), \(w_j \ge 0, \forall j\), \(\| \textbf{w}\|_1 \le \mu\), and \(\| \textbf{w} \|_2^2 \le 1\).
- Initialize \(\textbf{w}\) as \(w_1 = \ldots = w_p = \frac{1}{\sqrt{p}}\).
- Iterate until convergence:
- Update \(\textbf{u}\) fixing \(\textbf{w}\): \(\textbf{u} = \textbf{D}\textbf{w} / \|\textbf{D}\textbf{w}\|_2\).
- Update \(\textbf{w}\) fixing \(\textbf{u}\): soft-threshold \(\textbf{D}^\top\textbf{u}\), with the threshold chosen so that \(\|\textbf{w}\|_1 \le \mu\), then rescale to unit \(\ell_2\) norm.
- Rewrite \(\textbf{u}\) as \(U \in \mathbb{R}^{n\times n}\).
- Perform hierarchical clustering on the \(n \times n\) dissimilarity matrix \(U\) (a rough end-to-end sketch follows).
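A rough end-to-end sketch of these alternating updates, continuing from the X and D built in the vectorization sketch above. The soft-thresholding helper and the bisection search for the threshold are illustrative only; in practice the sparcl implementation should be used.
soft <- function(a, delta) sign(a) * pmax(abs(a) - delta, 0)   # soft-thresholding operator
mu <- 2                                          # l1 bound on w (tuning parameter, chosen arbitrarily)
w  <- rep(1 / sqrt(ncol(D)), ncol(D))            # initialize w_1 = ... = w_p = 1/sqrt(p)
for (iter in 1:20) {
  u <- as.vector(D %*% w); u <- u / sqrt(sum(u^2))       # update u fixing w
  a <- pmax(as.vector(crossprod(D, u)), 0)               # D' u, truncated to keep w_j >= 0
  delta <- 0                                             # find the smallest threshold with ||w||_1 <= mu
  if (sum(abs(a / sqrt(sum(a^2)))) > mu) {
    lo <- 0; hi <- max(a)
    for (b in 1:50) {
      delta <- (lo + hi) / 2
      w_try <- soft(a, delta); w_try <- w_try / sqrt(sum(w_try^2))
      if (sum(abs(w_try)) > mu) lo <- delta else hi <- delta
    }
  }
  w <- soft(a, delta); w <- w / sqrt(sum(w^2))           # update w fixing u
}
U <- matrix(as.vector(D %*% w), nrow = nrow(X))          # rearrange u into an n x n matrix
sparse_hc <- hclust(as.dist(U), method = "average")      # hierarchical clustering on U
round(w, 3)                                              # sparse feature weights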
Application example
library(sparcl)
set.seed(1)
x <- matrix(rnorm(100*50), ncol=50)        # 100 samples, 50 features
y <- c(rep(1,50), rep(2,50))               # two underlying classes
x[y==1, 1:25] <- x[y==1, 1:25] + 2         # only the first 25 features distinguish the classes
sparsehc <- HierarchicalSparseCluster(x=x, wbound=4, method="complete")
## 1234567
ColorDendrogram(sparsehc$hc,y=y)
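A possible follow-up (added here, not on the original slide) is to choose wbound with sparcl's permutation-based gap statistic and to inspect the selected feature weights; the field names $bestw and $ws follow the sparcl documentation.
perm.out <- HierarchicalSparseCluster.permute(x, wbounds=c(1.5, 2:6), nperms=5)
perm.out$bestw                                                # wbound with the largest gap statistic
sparsehc2 <- HierarchicalSparseCluster(x=x, wbound=perm.out$bestw, method="complete")
plot(sparsehc2$ws, type="h", xlab="Feature", ylab="Weight")   # non-zero weights mark selected features
ColorDendrogram(sparsehc2$hc, y=y)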
Homework:
- Sparse \(k\)-means.
- Apply the sparse \(k\)-means algorithm to the iris data to cluster the samples.
- Fix the number of clusters at \(K=3\).
- Use the permutation method implemented in the sparcl package to choose the tuning parameter wbounds.
- Visualize the data via PCA and color the samples based on labels from sparse \(k\)-means.
- Sparse hierarchical clustering.
- Apply the sparse hierarchical clustering algorithm to the iris data to cluster the samples.
- Use the permutation method implemented in the sparcl package to choose the tuning parameter wbounds.
- Visualize the dendrogram (color the samples with the underlying species) and feature selection plot.
- Suggested homework submission format: knitted HTML via R Markdown.
- Any other format is OK (e.g. MS Word, PDF, …).