Frontiers of Biostatistics, PHC 6937
Sparse Clustering 2
Zhiguang Huo (Caleb)
Fri Feb 9, 2018
Outline
- Hierarchical Clustering algorithm
- Sparse Hierarchical Clustering algorithm
Hierarchical Clustering example
set.seed(32611)
index <- sample(nrow(iris), 30)          # randomly pick 30 samples
iris_subset <- iris[index, 1:4]          # keep the four numeric features
species_subset <- iris$Species[index]    # true species labels, used later for coloring
hc <- hclust(dist(iris_subset))          # Euclidean distance, complete linkage (defaults)
plot(hc)
Hierarchical Clustering example (with color)
suppressMessages(library(dendextend))
dend <- as.dendrogram(hc)
# color the leaf labels by the true species, matched to the dendrogram's leaf order
labels_colors(dend) <- as.numeric(species_subset)[order.dendrogram(dend)]
plot(dend)
legend("topright", legend = levels(species_subset), col = 1:3, pch = 1)
\(K\)-means and hierarchical clustering
- \(K\)-means: Partitioning
- hierarchical clustering: Hierarchical
Hierarchical methods
Hierarchical clustering methods produce a tree or dendrogram.
They avoid the need to specify the number of clusters in advance: cutting the tree at a given level yields a partition into any desired number of clusters \(k\).
- The tree can be built in two distinct ways (a short R illustration follows this list):
- bottom-up: agglomerative clustering.
- top-down: divisive clustering.
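As an added illustration (not from the original slides), both strategies are available in R: hclust is agglomerative, and diana from the cluster package is a divisive counterpart. This sketch assumes the iris_subset object defined in the first example.
library(cluster)
hc_agglomerative <- hclust(dist(iris_subset))   # bottom-up: start from singletons and merge
hc_divisive <- diana(iris_subset)               # top-down: start from one cluster and split
par(mfrow = c(1, 2))
plot(hc_agglomerative, main = "Agglomerative")
pltree(hc_divisive, main = "Divisive")          # plot the divisive clustering tree
par(mfrow = c(1, 1))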
Hierarchical clustering illustration
- After two data points (or clusters) are merged, treat the merged group as a single point in the subsequent steps.
Distance between clusters
Select a distance measure \(d\) (e.g. Euclidean distance); a short R comparison of these linkages follows the list.
- single linkage \[d(A, B) = \min_{x\in A, y \in B} d(x,y)\]
- average linkage \[d(A, B) = \frac{1}{N_A N_B}\sum_{x\in A, y \in B} d(x,y)\]
- centroid linkage \[d(A, B) = d(\bar{x}_A, \bar{x}_B),\] where \(\bar{x}_A\) and \(\bar{x}_B\) are the centroids (mean vectors) of clusters \(A\) and \(B\)
- complete linkage \[d(A, B) = \max_{x\in A, y \in B} d(x,y)\]
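A minimal sketch (added here, reusing the iris_subset from the first example) showing how the method argument of hclust selects the linkage; note that squared Euclidean distances are commonly recommended for centroid linkage.
d <- dist(iris_subset)                  # Euclidean distances
par(mfrow = c(2, 2))
for (linkage in c("single", "average", "centroid", "complete")) {
  plot(hclust(d, method = linkage), main = linkage, xlab = "", sub = "")
}
par(mfrow = c(1, 1))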
Hierarchical tree ordering
- For the same input distance matrix, the left-to-right ordering of the leaves is not unique: the two branches at any merge can be swapped without changing the tree.
Hierarchical clustering algorithm
- Input: dissimilarity matrix (distance matrix \(d \in \mathbb{R}^{n \times n}\))
- Let \(T_n = \{C_1, C_2, \ldots, C_n\}\), where each \(C_i = \{x_i\}\) is a singleton cluster.
- For \(j = n - 1\) to \(1\):
- Find \(l,m\) to minimize \(d(C_l, C_m)\) over all \(C_l, C_m \in T_{j+1}\)
- Let \(T_j\) be the same as \(T_{j+1}\) except that \(C_l\) and \(C_m\) are replaced with \(C_l \cup C_m\)
- Return the sets of clusters \(T_1, \ldots, T_n\)
- The result can be represented as a tree, called a dendrogram.
- We can then cut the tree at different places to yield any number of clusters (see the cutree example below).
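For instance, a brief illustration of cutting the tree with cutree (reusing the hc object fitted on the iris subset earlier):
cutree(hc, k = 2)                          # cluster memberships when cutting into 2 clusters
cutree(hc, k = 3)                          # cluster memberships when cutting into 3 clusters
table(cutree(hc, k = 3), species_subset)   # compare the 3-cluster cut with the true species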
Review on sparse \(K\)-means
\[\max_{C, \textbf{w}} \sum_{j=1}^p w_j \left(\sum_{i=1}^n (x_{ji} - \bar{x}_j)^2 -\sum_{k=1}^K \sum_{i \in C_k} (x_{ji} - \bar{x}_{jC_k})^2 \right),\] such that \(w_j \ge 0, \forall j\), \(\| \textbf{w}\|_1 \le \mu\), and \(\| \textbf{w} \|_2^2 \le 1\).
- A general framework for feature selection in clustering:
\[\max_{C, \textbf{w}} \sum_{j=1}^p w_j f_j(X_j, C),\] such that \(w_j \ge 0, \forall j\), \(\| \textbf{w}\|_1 \le \mu\), and \(\| \textbf{w} \|_2^2 \le 1\).
- We will try to fit hierarchical clustering into this framework; a brief sparse \(K\)-means refresher with the sparcl package is sketched below.
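A quick hedged sketch of sparse \(K\)-means with sparcl::KMeansSparseCluster on simulated data (the simulation setup and the tuning value wbounds = 5 are chosen for illustration only):
library(sparcl)
set.seed(32611)
x <- matrix(rnorm(60 * 500), nrow = 60)         # 60 samples, 500 features
x[1:30, 1:50] <- x[1:30, 1:50] + 1.5            # only the first 50 features separate the classes
km_sparse <- KMeansSparseCluster(x, K = 2, wbounds = 5)
table(km_sparse[[1]]$Cs, rep(1:2, each = 30))   # recovered clusters vs. truth
sum(km_sparse[[1]]$ws > 0)                      # number of features with non-zero weight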
Standard hierarchical clustering
- Perform the hierarchical clustering algorithm on a dissimilarity matrix \(d\).
- If we scale (multiply) \(d\) by a positive constant, the resulting tree does not change.
Consider the following criterion:
\[\max_U \sum_j\sum_{i,i'} d_{i,i',j}U_{i,i'},\] Such that \(\sum_{i,i'}U_{i,i'}^2 \le 1\).
By optimization (the KKT conditions), we can show the optimum \(U^*\) satisfies \[U_{i,i'}^* \propto \sum_j d_{i,i',j}\]
Since hierarchical clustering is invariant to scaling the dissimilarity, applying the hierarchical clustering algorithm to \(U^*\) yields the same tree as standard hierarchical clustering on the overall dissimilarity \(\sum_j d_{i,i',j}\). A small numerical check follows.
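Here is a small added illustration of this equivalence, assuming squared-difference per-feature dissimilarities \(d_{i,i',j} = (x_{ij} - x_{i'j})^2\), so that \(\sum_j d_{i,i',j}\) is the squared Euclidean distance:
set.seed(32611)
X <- matrix(rnorm(20 * 5), nrow = 20)                   # n = 20 samples, p = 5 features
d_feature <- lapply(1:ncol(X), function(j) as.matrix(dist(X[, j]))^2)  # per-feature d_{i,i',j}
U <- Reduce(`+`, d_feature)                             # entries proportional to U*
U <- U / sqrt(sum(U^2))                                 # rescale so the squared entries sum to 1
hc_standard <- hclust(as.dist(as.matrix(dist(X))^2), method = "average")
hc_ustar    <- hclust(as.dist(U), method = "average")
all.equal(hc_standard$merge, hc_ustar$merge)            # TRUE: identical merge sequence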
Sparse hierarchical clustering
\[\max_{U,\textbf{w}} \sum_j w_j \sum_{i,i'} d_{i,i',j}U_{i,i'},\] Such that \(\sum_{i,i'}U_{i,i'}^2 \le 1\), \(w_j \ge 0, \forall j\), \(\| \textbf{w}\|_1 \le \mu\), and \(\| \textbf{w} \|_2^2 \le 1\).
- Interpretation
- A feature \(j\) receives a large weight \(w_j\) when its per-feature dissimilarities \(d_{i,i',j}\) separate the samples well.
- Because of the lasso-type constraint \(\| \textbf{w}\|_1 \le \mu\), \(U^*\) depends on only a subset of the features.
- The standard hierarchical clustering algorithm is then applied to \(U^*\).
- Vectorize the variables:
- \(\textbf{u} \in \mathbb{R}^{n^2}\): the vectorized \(U\)
- \(\textbf{D} \in \mathbb{R}^{n^2\times p}\): column \(j\) is the vectorized per-feature dissimilarity \(\{d_{i,i',j}\}\)
- \(\textbf{w} \in \mathbb{R}^{p}\): the feature weights
- The objective function is equivalent to:
\[\max_{\textbf{u},\textbf{w}} \textbf{u}^\top \textbf{D} \textbf{w},\] Such that \(\|\textbf{u}\|^2 \le 1\), \(w_j \ge 0, \forall j\), \(\| \textbf{w}\|_1 \le \mu\), and \(\| \textbf{w} \|_2^2 \le 1\).
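To make the vectorization concrete, here is an added sketch constructing \(\textbf{D}\), \(\textbf{w}\), and \(\textbf{u}\) for a small simulated data set (squared-difference dissimilarities are assumed):
set.seed(32611)
X <- matrix(rnorm(20 * 5), nrow = 20)       # n = 20 samples, p = 5 features
# column j of D is the vectorized n x n matrix of d_{i,i',j} = (x_{ij} - x_{i'j})^2
D <- sapply(1:ncol(X), function(j) as.vector(as.matrix(dist(X[, j]))^2))   # n^2 x p
w <- rep(1 / sqrt(ncol(X)), ncol(X))        # equal weights with ||w||_2 = 1
u <- as.vector(D %*% w)
u <- u / sqrt(sum(u^2))                     # enforce ||u||_2 <= 1
sum(u * (D %*% w))                          # value of the objective u' D w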
Optimization
\[\max_{\textbf{u},\textbf{w}} \textbf{u}^\top \textbf{D} \textbf{w},\] Such that \(\|\textbf{u}\|^2 \le 1\), \(w_j \ge 0, \forall j\), \(\| \textbf{w}\|_1 \le \mu\), and \(\| \textbf{w} \|_2^2 \le 1\).
- Initialize \(\textbf{w}\) as \(w_1 = \ldots = w_p = \frac{1}{\sqrt{p}}\).
- Iterate until convergence:
- Update \(\textbf{u}\) fixing \(\textbf{w}\): \(\textbf{u} = \textbf{D}\textbf{w} / \|\textbf{D}\textbf{w}\|_2\).
- Update \(\textbf{w}\) fixing \(\textbf{u}\): soft-threshold \(\textbf{D}^\top\textbf{u}\), with the threshold chosen so that \(\|\textbf{w}\|_1 \le \mu\), then rescale to unit \(\ell_2\) norm.
- Rewrite \(\textbf{u}\) as \(U \in \mathbb{R}^{n\times n}\).
- Perform hierarchical clustering on the \(n \times n\) dissimilarity matrix \(U\) (a rough end-to-end sketch follows).
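A rough end-to-end sketch of these alternating updates, continuing from the X and D built in the vectorization sketch above. The soft-thresholding helper and the bisection search for the threshold are illustrative only; in practice the sparcl implementation should be used.
soft <- function(a, delta) sign(a) * pmax(abs(a) - delta, 0)   # soft-thresholding operator
mu <- 2                                          # l1 bound on w (tuning parameter, chosen arbitrarily)
w  <- rep(1 / sqrt(ncol(D)), ncol(D))            # initialize w_1 = ... = w_p = 1/sqrt(p)
for (iter in 1:20) {
  u <- as.vector(D %*% w); u <- u / sqrt(sum(u^2))       # update u fixing w
  a <- pmax(as.vector(crossprod(D, u)), 0)               # D' u, truncated to keep w_j >= 0
  delta <- 0                                             # find the smallest threshold with ||w||_1 <= mu
  if (sum(abs(a / sqrt(sum(a^2)))) > mu) {
    lo <- 0; hi <- max(a)
    for (b in 1:50) {
      delta <- (lo + hi) / 2
      w_try <- soft(a, delta); w_try <- w_try / sqrt(sum(w_try^2))
      if (sum(abs(w_try)) > mu) lo <- delta else hi <- delta
    }
  }
  w <- soft(a, delta); w <- w / sqrt(sum(w^2))           # update w fixing u
}
U <- matrix(as.vector(D %*% w), nrow = nrow(X))          # rearrange u into an n x n matrix
sparse_hc <- hclust(as.dist(U), method = "average")      # hierarchical clustering on U
round(w, 3)                                              # sparse feature weights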
Application example
library(sparcl)
set.seed(1)
x <- matrix(rnorm(100*50), ncol=50)        # 100 samples, 50 features
y <- c(rep(1,50), rep(2,50))               # two underlying classes
x[y==1, 1:25] <- x[y==1, 1:25] + 2         # only the first 25 features distinguish the classes
sparsehc <- HierarchicalSparseCluster(x=x, wbound=4, method="complete")
## 1234567
ColorDendrogram(sparsehc$hc,y=y)
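A possible follow-up (added here, not on the original slide) is to choose wbound with sparcl's permutation-based gap statistic and to inspect the selected feature weights; the field names $bestw and $ws follow the sparcl documentation.
perm.out <- HierarchicalSparseCluster.permute(x, wbounds=c(1.5, 2:6), nperms=5)
perm.out$bestw                                                # wbound with the largest gap statistic
sparsehc2 <- HierarchicalSparseCluster(x=x, wbound=perm.out$bestw, method="complete")
plot(sparsehc2$ws, type="h", xlab="Feature", ylab="Weight")   # non-zero weights mark selected features
ColorDendrogram(sparsehc2$hc, y=y)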
Homework:
- Sparse \(k\)-means.
- Apply the sparse \(k\)-means algorithm to the iris data to cluster the samples.
- Fix the number of clusters at \(K=3\).
- Use the permutation method implemented in the sparcl package to choose the tuning parameter wbounds.
- Visualize the data via PCA and color the samples based on labels from sparse \(k\)-means.
- Sparse hierarchical clustering.
- Apply the sparse hierarchical clustering algorithm to the iris data to cluster the samples.
- Use the permutation method implemented in the sparcl package to choose the tuning parameter wbounds.
- Visualize the dendrogram (color the samples with the underlying species) and feature selection plot.
- Suggested homework submission format: knitted HTML via R Markdown.
- Any other format is OK (e.g. MS Word, PDF, …).