Another motivating example

Usually the first and second principal component directions are used to visualize the data.
Motivating example: iris data, 150 observations, 4 variables (features)

head(iris)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

dim(iris)

## [1] 150   5

Perform PCA
- project these 4 features onto a 2-dimensional space
- visualize the data.

iris.data <- iris[, 1:4]
## PCA on centered features scaled to unit variance
ir.pca <- prcomp(iris.data,
                 center = TRUE,
                 scale. = TRUE)
PC1 <- ir.pca$x[, "PC1"]
PC2 <- ir.pca$x[, "PC2"]
## proportion of variance explained by each PC
variance <- ir.pca$sdev^2 / sum(ir.pca$sdev^2)
v1 <- paste0("PC1, variance: ", signif(variance[1] * 100, 3), "%")
v2 <- paste0("PC2, variance: ", signif(variance[2] * 100, 3), "%")
plot(PC1, PC2, col = as.numeric(iris$Species), pch = 19, xlab = v1, ylab = v2)
## legend colors follow the same integer coding of Species used in plot()
legend("topright", legend = levels(iris$Species), col = seq_along(levels(iris$Species)), pch = 19)

Biplot

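Data points: projections of the samples onto the first two PCs; distances between points in the biplot approximate distances between samples.
Projection of a sample onto an arrow gives the original (scaled) value of that variable.
Arrow length: variance of the variable.
Angle between arrows: correlation between the variables.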

biplot(ir.pca)

Geometrical motivation for PCA (population version)

\({\bf x} \in \mathbb{R}^{G}\) is one sample with \(G\) features.
Without loss of generality,
- assume \(\mathbb{E}({\bf x}) = 0\)
- \(Var({\bf x}) = \mathbb{E}({\bf x} {\bf x}^\top) = \Sigma \in \mathbb{R}^{G\times G}\)
We want to find a direction \(\alpha \in \mathbb{R}^{G}\) such that the variance of the projected value (\(\alpha^\top {\bf x}\)) is maximized. \[\max_\alpha Var(\alpha^\top {\bf x}) = \max_\alpha \alpha^\top \Sigma \alpha, \quad \text{s.t. } \|\alpha\| = 1\]
Solution: \(\alpha = {\bf v}_1\), the first eigenvector of \(\Sigma\) (the eigenvector with the largest eigenvalue).

Proof (page 1):

Since \(\Sigma \in \mathbb{R}^{G\times G}\) is a symmetric, positive semi-definite matrix, there exist an orthogonal matrix \(V \in \mathbb{R}^{G\times G}\) and a diagonal matrix \(D \in \mathbb{R}^{G\times G}\) such that \[\Sigma = V D V^\top,\]

where \(D = \begin{pmatrix} d_1 & 0 & 0 & \\ 0 & d_2 & & \\ 0 & & \ddots & 0\\ & & 0 & d_G \end{pmatrix},\)

\(d_1 \ge d_2 \ge \ldots \ge d_G \ge 0\) are the eigenvalues, and the columns of \({\bf V} = ({\bf v}_1, {\bf v}_2, \ldots, {\bf v}_G)\) are the corresponding eigenvectors.

The columns of \({\bf V}\) are orthonormal
- \({\bf v}_i^\top {\bf v}_j = 0\) if \(i \ne j\)
- \(\|{\bf v}_g\|^2 = {\bf v}_g^\top {\bf v}_g = 1\), \(\forall 1 \le g \le G\)
\(\Sigma {\bf v}_g = d_g {\bf v}_g\)

Proof (page 2):

Any vector \(\alpha \in \mathbb{R}^{G}\) can be written as a linear combination of the \({\bf v}_g\)’s

\(\alpha = \sum_{g=1}^G a_g {\bf v}_g\).
- \(\sum_{g=1}^G a_g^2 = 1\) if \(\|\alpha\|^2 = 1\)
When \(\|\alpha\| = 1\),

\(\begin{aligned} Var(\alpha^\top {\bf x}) &= \alpha^\top \Sigma \alpha \\ &= \left(\sum_{g=1}^G a_g {\bf v}_g\right)^\top \Sigma \left(\sum_{g=1}^G a_g {\bf v}_g\right) \\ &= \left(\sum_{g=1}^G a_g {\bf v}_g\right)^\top \left(\sum_{g=1}^G a_g d_g {\bf v}_g\right) \\ &= \sum_{g=1}^G d_g a_g^2 \|{\bf v}_g\|^2 = \sum_{g=1}^G d_g a_g^2 \\ &\le \sum_{g=1}^G d_1 a_g^2 = d_1 \sum_{g=1}^G a_g^2 \\ &= d_1 \\ \end{aligned}\)

Proof (page 3):

If \(\alpha = {\bf v}_1\),

\(\begin{aligned} Var(\alpha^\top {\bf x}) &= \alpha^\top \Sigma \alpha \\ &= {\bf v}_1^\top \Sigma {\bf v}_1 \\ &= d_1 \\ \end{aligned}\)

Therefore, \(\alpha = {\bf v}_1\) maximizes \(Var(\alpha^\top {\bf x})\) subject to \(\|\alpha\| = 1\).
The projected random variable \(L_1 = {\bf v}_1^\top {\bf x}\) has the largest variance, \(d_1\).

Similarly, \(L_g = {\bf v}_g^\top {\bf x}\) has the largest variance, \(d_g\), among all projections onto unit directions orthogonal to \({\bf v}_1, {\bf v}_2, \ldots, {\bf v}_{g-1}\), for \(2 \le g \le G\).
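
The result can be checked numerically. The sketch below is an illustration rather than part of the proof: it assumes the sample covariance matrix of the standardized iris measurements as a stand-in for \(\Sigma\), and compares the variance along \({\bf v}_1\) with the variance along 1000 random unit directions (the seed and the number of directions are arbitrary choices).

```r
# Sample covariance of the standardized iris features stands in for Sigma
Sigma <- cov(scale(iris[, 1:4]))
eig <- eigen(Sigma, symmetric = TRUE)   # eigenvalues are returned in decreasing order
v1 <- eig$vectors[, 1]
d1 <- eig$values[1]

# Variance of the projection onto v1: v1' Sigma v1
var_v1 <- drop(t(v1) %*% Sigma %*% v1)

# Variance along random unit directions (illustrative choice of 1000 directions)
set.seed(1)
A <- matrix(rnorm(4 * 1000), nrow = 4)
A <- sweep(A, 2, sqrt(colSums(A^2)), "/")   # normalize each column to length 1
var_random <- colSums(A * (Sigma %*% A))    # alpha' Sigma alpha for each column

c(d1 = d1, var_v1 = var_v1, max_random = max(var_random))
```

The first two entries should agree, and the third should not exceed them.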

Variance explained

\(trace(\Sigma)\) is the sum of the variances of all features, and it is invariant under a change of orthonormal basis.
\(trace(\Sigma) = \sum_{g} Var(L_g) = \sum_g d_g\).
Define \(r_g = \frac{d_g}{\sum_{g'} d_{g'}}\), where \(r_g\) represents the proportion of variance explained by the \(g^{th}\) PC.
\(R_g = \sum_{g'=1}^g r_{g'}\) is the cumulative proportion of variance explained by the first \(g\) PCs.
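
As a small illustration, these quantities can be computed directly from an eigen-decomposition. The sketch assumes the (unscaled) sample covariance of the four iris measurements in place of \(\Sigma\); the variable names are chosen here for illustration.

```r
Sigma <- cov(iris[, 1:4])                    # sample covariance in place of Sigma
d <- eigen(Sigma, symmetric = TRUE)$values   # d_1 >= d_2 >= ... >= d_G

total_var <- sum(diag(Sigma))   # trace(Sigma): sum of the feature variances
r <- d / sum(d)                 # r_g: proportion of variance explained by each PC
R <- cumsum(r)                  # R_g: cumulative proportion for the first g PCs

all.equal(total_var, sum(d))    # the trace equals the sum of the eigenvalues
round(rbind(r = r, R = R), 3)
```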

PCA summary

Project the original data onto an orthonormal basis (\(\bf v_1, v_2, \ldots\)) such that the variance of the projected value along \(\bf v_1\) is maximized, then the variance along \(\bf v_2\) is maximized among directions orthogonal to \(\bf v_1\), …
\(L_1 = {\bf v_1}^\top {\bf x}\) is called the first principal component, \(L_2 = {\bf v_2}^\top {\bf x}\) is called the second principal component, …
These projection directions are the eigenvectors obtained by applying eigenvalue decomposition to the covariance matrix.
The proportion of variance explained by the \(g^{th}\) PC is \(r_g = \frac{d_g}{\sum_{g'} d_{g'}}\); the cumulative proportion explained by the first \(g\) PCs is \(R_g = \sum_{g'=1}^g r_{g'}\).

Example 1, HAPMAP data

The International HapMap Project was an organization that aimed to develop a haplotype map (HapMap) of the human genome, to describe the common patterns of human genetic variation.
A subset of HAPMAP data:
- CEU Utah residents with Northern and Western European ancestry from the CEPH collection
- CHB Han Chinese in Beijing, China
- JPT Japanese in Tokyo, Japan
- YRI Yoruba in Ibadan, Nigeria
712,940 SNPs, with genotypes coded as:
- AA: 0
- AB: 0.5
- BB: 1

HAPMAP data visualization

In class exercise (Visualize HAPMAP data using chromosome 1)

Humans have 23 pairs of chromosomes.
Chromosome 1 of the HAPMAP data is available at /ufrc/phc6068/share/data/HAPMAP/chr1Data.Rdata
The annotation is at /ufrc/phc6068/share/data/HAPMAP/relationships_w_pops_121708.txt
These data are available on hpg2.rc.ufl.edu
You can either perform the analysis on HiPerGator, or download the data to your local computer and perform the PCA there.

PCA steps (sample version)

Center the data so that each feature has mean 0. (Sometimes the variance of each feature is also standardized to 1.)
Calculate the sample covariance matrix.
Perform eigenvalue decomposition to obtain the eigenvalues and eigenvectors.
Project the original data onto the first eigenvector,
- the resulting projected value on the first eigenvector is called the first principal component.
- the first eigenvalue is the variance explained by the first principal component.
Repeat the projection step for the \(k^{th}\) eigenvalue and eigenvector, \((2 \le k \le K)\)
- the direction of the \(k^{th}\) eigenvector is perpendicular to all previous eigenvectors.
Select \(K\) using a scree plot, or as the smallest \(K\) for which the cumulative proportion of variance explained exceeds a chosen threshold (e.g. \(>90\%\))
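
The steps above can be sketched in a few lines of base R and compared against `prcomp()`. This is a minimal illustration on the iris data, assuming standardized features and \(K = 2\); the object names are chosen here for illustration.

```r
X <- as.matrix(iris[, 1:4])

# Step 1: center each feature (and here also scale to unit variance)
Xs <- scale(X, center = TRUE, scale = TRUE)

# Step 2: sample covariance matrix
S <- cov(Xs)

# Step 3: eigen-decomposition (eigenvalues come back in decreasing order)
eig <- eigen(S, symmetric = TRUE)

# Steps 4-5: project onto the first K eigenvectors to obtain the PCs
K <- 2
scores <- Xs %*% eig$vectors[, 1:K]

# Compare with prcomp(); eigenvectors are only defined up to sign
ref <- prcomp(X, center = TRUE, scale. = TRUE)
max(abs(abs(scores) - abs(ref$x[, 1:K])))   # should be numerically zero
all.equal(eig$values, ref$sdev^2)           # eigenvalues = PC variances
```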

How many principal components to use?

Scree plot
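
A scree plot can be drawn with `screeplot()` from base R. The sketch below uses the iris data as an assumed example and also applies the cumulative-variance rule with a 90% threshold, which is an illustrative choice.

```r
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

# Scree plot: variance of each PC against its index; look for an "elbow"
screeplot(pca, type = "lines", main = "Scree plot (iris)")

# Threshold rule: smallest K whose cumulative proportion of variance exceeds 90%
prop <- pca$sdev^2 / sum(pca$sdev^2)
K <- which(cumsum(prop) > 0.90)[1]
K
```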

Example 2, yeast cycle data

The goal of the yeast cell cycle analysis project is to identify all genes whose mRNA levels are regulated by the cell cycle. http://genome-www.stanford.edu/cellcycle/
The data set contains the expression profiles of 800 genes and their annotated cell cycle stages
1. G1
2. S
3. S/G2
4. G2/M
5. M/G1
Yeast cells were first synchronized to the same G0 stage by four different chemicals:
- alpha
- cdc15
- cdc28
- elu

Example 2, yeast cycle data PCA for genes

Example 2, yeast cycle data PCA for samples

Will be on HW

Reproduce these two PCA plots

library(impute)
raw = read.table(url("http://Caleb-Huo.github.io/teaching/data/cellCycle/cellCycle.txt"),header=TRUE,as.is=TRUE)
cellCycle = raw
cellCycle[,2:78]<- impute.knn(as.matrix(raw[,2:78]))$data ## missing value imputation

Singular value decomposition (SVD)

\(X \in \mathbb{R}^{n \times p}\) is our data matrix with \(n\) samples and \(p\) features.
- Without loss of generality, \(n < p\)
\(X\) can be factorized as \(U E_0 V_0^\top\)
- \(U \in \mathbb{R}^{n \times n}\) such that \(UU^\top = U^\top U = I_n\).
- \(E_0 \in \mathbb{R}^{n \times p}\). With \(E_0 = (E, {\bf 0}_{n \times (p-n)})\), \(E\in \mathbb{R}^{n \times n}\) is a diagonal matrix.
- \(V_0 \in \mathbb{R}^{p \times p}\) such that \(V_0^\top V_0 = I_p\). With \(V_0 = (V, V_1)\), \(V\in \mathbb{R}^{p \times n}\) contains the first \(n\) columns, so \(V^\top V= I_n\); the remaining columns \(V_1 \in \mathbb{R}^{p \times (p-n)}\) are multiplied by the zero block of \(E_0\).
Equivalently, \(X = U E V^\top\)
- where \(E = \begin{pmatrix} e_1 & 0 & 0 & \\ 0 & e_2 & & \\ 0 & & \ddots & 0\\ & & 0 & e_n \end{pmatrix},\) with singular values \(e_1 \ge e_2 \ge \ldots \ge e_n \ge 0\).
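
To make the dimensions concrete, the following sketch runs base R's `svd()` on a small simulated matrix with \(n < p\). The sizes \(n = 5\), \(p = 8\) and the seed are arbitrary assumptions for illustration.

```r
set.seed(1)
n <- 5; p <- 8
X <- matrix(rnorm(n * p), nrow = n)   # toy data with n < p

s <- svd(X)   # thin SVD: u is n x n, d has length n, v is p x n
dim(s$u); length(s$d); dim(s$v)

# Reconstruct X = U E V'
X_hat <- s$u %*% diag(s$d) %*% t(s$v)
max(abs(X - X_hat))                    # numerically zero

# Orthonormality of the columns of U and V
max(abs(crossprod(s$u) - diag(n)))
max(abs(crossprod(s$v) - diag(n)))
```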

Equivalence between PCA and SVD

\(\begin{aligned} V^\top X^\top X V &= V^\top (U E V^\top)^\top U E V^\top V \\ &= V^\top V E U^\top U E V^\top V \\ &= E^2 \end{aligned}\)

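Performing eigenvalue decomposition on \(X^\top X\) is equivalent to performing SVD on \(X\)
- The columns of \(V\) are the eigenvectors (projection directions)
- The diagonal of \(E^2\) contains the eigenvalues (explained variances, up to the factor \(1/(n-1)\) when \(X\) is centered, since the sample covariance is \(X^\top X/(n-1)\))
- Projected value \(XV = U E V^\top V = U E\)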

Validate the SVD results using iris data

iris.data <- iris[,1:4]
iris.data.center <- scale(iris.data) ## center each feature and scale to unit variance
asvd <- svd(iris.data.center)
UE <- asvd$u %*% diag(asvd$d) ## projected values XV = UE
PC1 <- UE[,1]
PC2 <- UE[,2]
variance <- asvd$d^2 / sum(asvd$d^2) ## proportion of variance explained
v1 <- paste0("variance: ",signif(variance[1] * 100,3), "%")
v2 <- paste0("variance: ",signif(variance[2] * 100,3), "%")
plot(PC1, PC2, col=as.numeric(iris$Species),pch=19, xlab=v1, ylab=v2)
legend("topright", legend = levels(iris$Species), col = seq_along(levels(iris$Species)), pch = 19)
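
Besides comparing the plots by eye, the agreement with `prcomp()` can be checked numerically. The sketch below is a self-contained illustration; it assumes the same centering and scaling as above, and the comparisons are made up to the sign of each component.

```r
X <- scale(iris[, 1:4])          # centered and scaled
n <- nrow(X)
s <- svd(X)
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

# Singular values vs PCA standard deviations: e_g^2 / (n - 1) = sdev_g^2
all.equal(s$d^2 / (n - 1), pca$sdev^2)

# Scores UE vs prcomp scores (columns agree up to sign)
UE <- s$u %*% diag(s$d)
max(abs(abs(UE) - abs(unname(pca$x))))

# Directions V vs prcomp loadings (again up to sign)
max(abs(abs(s$v) - abs(unname(pca$rotation))))
```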

Multi-dimensional scaling (MDS)

Multi-dimensional scaling (MDS) aims to map the data structure (distances or dissimilarities) to a low-dimensional space (usually 2D or 3D Euclidean space).
Flight mileage between ten cities, obtained from http://www.webyer.com/travel/mileage_calculator/.

z <- read.delim(url("https://Caleb-Huo.github.io/teaching/data/UScityDistance/Mileage.txt"), row.names=1)
knitr::kable(z, caption = "distance between cities (miles)")

distance between cities (miles)
	BOST	NY	DC	MIAM	CHIC	SEAT	SF	LA	DENV	GAINESVILLE
BOSTON	0	206	429	1504	963	2976	3095	2979	1949	1219
NY	206	0	233	1308	802	2815	2934	2786	1771	1000
DC	429	233	0	1075	671	2684	2799	2631	1616	780
MIAMI	1504	1308	1075	0	1329	3273	3053	2687	1010	338
CHICAGO	963	802	671	1329	0	2013	2142	2054	996	1047
SEATTLE	2976	2815	2684	3273	2013	0	808	1131	1307	2985
SF	3095	2934	2799	3053	2142	808	0	379	1235	2780
LA	2979	2786	2631	2687	2054	1131	379	0	1059	2409
DENVER	1949	1771	1616	2037	996	1307	1235	1059	0	1740
GAINESVILLE	1219	1000	780	338	1047	2985	2780	2409	1740	0

Multi-dimensional scaling (MDS)

After applying MDS

Will be on HW

Perform MDS on the following data, with both classical method and Sammon’s stress.

Math behind MDS

objective function for classical MDS \[ L = \min \sum_{i<j} (d_{ij} - \delta_{ij})^2\]
parameters:
- \(i, j\): sample index
- \(\delta_{ij}\): the dissimilarity measure between object i and j in the original data space.
- \(d_{ij}\): the distance between the two objects after mapping to the targeted low-dimensional space.
In the loss function above, large distances can dominate the optimization, so the local structure among pairs at short distances is ignored.
- A possible modification is to minimize the relative squared error. \[ L = \min \sum_{i<j} \left(\frac{d_{ij} - \delta_{ij}}{\delta_{ij}}\right)^2 \]

MDS with Sammon’s stress

This modification, however, may over-emphasize the local structure and easily distort the global structure.
A better balance between the two is Sammon’s stress. \[ L = \min \sum_{i<j} \left(\frac{d_{ij} - \delta_{ij}}{\delta_{ij}}\right)^2 \times w_{ij},\] \(w_{ij} = \delta_{ij} / (\sum_{k<l} \delta_{kl})\)
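
To compare the three loss functions on a concrete configuration, the sketch below evaluates each of them (summed over pairs \(i<j\)) for a 2D layout. It assumes the built-in `UScitiesD` distances as the observed \(\delta_{ij}\) and uses the `cmdscale()` layout merely as some configuration to evaluate; neither comes from the slides' data.

```r
delta <- as.matrix(UScitiesD)      # observed dissimilarities (built-in data set)
X <- cmdscale(UScitiesD, k = 2)    # a 2D configuration to evaluate
d <- as.matrix(dist(X))            # distances in the low-dimensional space

idx <- upper.tri(delta)            # selects the pairs with i < j
raw_stress      <- sum((d[idx] - delta[idx])^2)
relative_stress <- sum(((d[idx] - delta[idx]) / delta[idx])^2)
w <- delta[idx] / sum(delta[idx])  # Sammon weights w_ij
sammon_stress   <- sum(((d[idx] - delta[idx]) / delta[idx])^2 * w)

c(raw = raw_stress, relative = relative_stress, sammon = sammon_stress)
```

With these weights, Sammon's stress simplifies to \(\sum_{i<j} (d_{ij} - \delta_{ij})^2 / \delta_{ij}\) divided by \(\sum_{i<j} \delta_{ij}\).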

Note:

Reflection, translation and/or rotation (i.e. isometry) of an MDS solution is also an MDS solution since MDS only considers the preservation of the dissimilarity structure.
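
This invariance is easy to verify numerically. The sketch below takes a 2D configuration (from the built-in `UScitiesD` data, an assumed example) and checks that rotating, reflecting, or translating it leaves all pairwise distances unchanged; the angle and shift are arbitrary.

```r
X <- cmdscale(UScitiesD, k = 2)

theta <- pi / 3                                   # arbitrary rotation angle
Rot <- matrix(c(cos(theta), sin(theta),
                -sin(theta), cos(theta)), 2, 2)   # 2 x 2 rotation matrix
X_rot   <- X %*% Rot                              # rotation
X_ref   <- X %*% diag(c(-1, 1))                   # reflection of the first axis
X_shift <- sweep(X, 2, c(100, -50), "+")          # translation

# Pairwise distances are unchanged in all three cases
max(abs(dist(X) - dist(X_rot)))
max(abs(dist(X) - dist(X_ref)))
max(abs(dist(X) - dist(X_shift)))
```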

Classical multidimensional scaling

\(\delta_{ij}\) is the observed distance between sample \(i\) and \(j\), (\(1\le i \le n\), \(1\le j \le n\)).
\(D = \{d_{ij}\}_{1\le i,j \le n}\) is derived from the Euclidean distances of an unknown \(n \times q\) data matrix \(X \in \mathbb{R}^{n\times q}\) (usually \(q=2\)).
- \(\|X_i - X_j\| = d_{ij}\), where \(X_i\) is the \(i^{th}\) row of \(X\).
- Such a solution is not unique: if \(X\) is a solution, then \(X^* = X + c\) (with \(c \in \mathbb{R}^{q}\) added to every row) is also a solution, since \(\|X_i^* - X_j^*\| = \|(X_i + c) - (X_j + c)\| = \|X_i - X_j\|\)
- So suppose \(X\) is centered (i.e. \(\sum_{i=1}^n X_{ik} = 0\), for all \(1 \le k \le q\)).
- \(B = XX^\top\), \(B_{ij} = b_{ij}\)
The following relationship can be derived
- \(d_{ij}^2 = b_{ii} + b_{jj} - 2b_{ij}\)
- \(\sum_{i=1}^n b_{ij} = 0\)
- \(T = trace(B) = \sum_{i=1}^n b_{ii}\)
- \(\sum_{i=1}^n d_{ij}^2 = T + nb_{jj}\)
- \(\sum_{j=1}^n d_{ij}^2 = T + nb_{ii}\)
- \(\sum_{i=1}^n \sum_{j=1}^n d_{ij}^2 = 2 n T\)

Classical multidimensional scaling (continued)

We can solve for: \[b_{ij} = -\frac{1}{2} (d_{ij}^2 - d_{\cdot j}^2 - d_{i\cdot}^2 + d_{\cdot\cdot}^2)\]
- \(d_{\cdot j}^2 = \frac{1}{n} \sum_{i = 1}^n d_{ij}^2\)
- \(d_{i\cdot}^2 = \frac{1}{n} \sum_{j = 1}^n d_{ij}^2\)
- \(d_{\cdot\cdot}^2 = \frac{1}{n^2} \sum_{i = 1}^n \sum_{j = 1}^n d_{ij}^2\)
Use \(\delta_{ij}\) to approximate \(d_{ij}\).
A solution \(X\) is then given by the eigen-decomposition of \(B\).
- \(B = V\Lambda V^\top\)
- \(X = V_q \Lambda_q^{1/2}\), where \(\Lambda_q\) holds the \(q\) largest eigenvalues and \(V_q\) the corresponding eigenvectors
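
The derivation can be sketched directly in base R: double-center the squared distances to obtain \(B\), take its leading eigenvectors, and compare with `cmdscale()`. The sketch assumes the built-in `UScitiesD` distances as the observed \(\delta_{ij}\) and \(q = 2\); the coordinates are taken as the leading \(q\) eigenvectors scaled by the square roots of their eigenvalues.

```r
delta <- as.matrix(UScitiesD)        # observed distances, used in place of d_ij
n <- nrow(delta)

# Double centering of the squared distances:
# b_ij = -1/2 (d_ij^2 - column mean - row mean + grand mean)
D2 <- delta^2
J <- diag(n) - matrix(1 / n, n, n)   # centering matrix
B <- -0.5 * J %*% D2 %*% J

# Eigen-decomposition of B; keep the q largest eigenvalues
q <- 2
eig <- eigen(B, symmetric = TRUE)
X <- eig$vectors[, 1:q] %*% diag(sqrt(eig$values[1:q]))

# Compare with cmdscale(); coordinates agree up to the sign of each axis
ref <- cmdscale(UScitiesD, k = q)
max(abs(abs(X) - abs(unname(ref))))
```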

MDS in R

classical: cmdscale()
MDS with Sammon’s stress: MASS::sammon()
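
A minimal usage sketch of both functions, assuming the built-in `UScitiesD` distance object as input (not the data sets used elsewhere in these slides):

```r
library(MASS)   # provides sammon()

# Classical MDS: returns an n x k matrix of coordinates
fit_classical <- cmdscale(UScitiesD, k = 2)

# Sammon mapping: returns a list with the coordinates ($points) and final stress ($stress)
fit_sammon <- sammon(UScitiesD, k = 2, trace = FALSE)

head(fit_classical, 3)
head(fit_sammon$points, 3)
fit_sammon$stress
```

By default `sammon()` starts from the classical solution, so the two layouts are usually similar when the classical fit is already good.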

Another MDS example - Genetic dissimilarity between races

z0 <- read.csv(url("https://Caleb-Huo.github.io/teaching/data/SNP_MDS/SNP_PCA.csv"), row.names=1)
knitr::kable(z0, caption = "Genetic dissimilarity between races")

Genetic dissimilarity between races
	X.Kung	Alur	AndhraBrahmin	CEU	CHB	Chinese	Dalit	Hema	Iban	Irula	JPT	Japanese	Khmer.Cambodian	Luhya	Madiga	Mala	Nguni	Pedi	Pygmy	Sotho.Tswana	Stalskoe	Tamil.Brahmin	Tuscan	Urkarah	Utah.N..European	Vietnamese
!Kung	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
Alur	0.050	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
Andhra_Brahmin	0.168	0.135	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
CEU	0.183	0.154	0.033	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
CHB	0.216	0.187	0.075	0.110	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
Chinese	0.213	0.179	0.071	0.108	0.001	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
Dalit	0.178	0.143	0.010	0.056	0.078	0.075	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
Hema	0.052	0.013	0.108	0.123	0.162	0.153	0.116	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
Iban	0.216	0.185	0.075	0.112	0.024	0.020	0.078	0.160	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
Irula	0.193	0.160	0.030	0.073	0.091	0.089	0.031	0.133	0.092	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
JPT	0.218	0.188	0.076	0.112	0.007	0.009	0.079	0.163	0.030	0.092	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
Japanese	0.212	0.179	0.072	0.108	0.008	0.009	0.076	0.153	0.031	0.089	0.003	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
Khmer Cambodian	0.199	0.162	0.055	0.094	0.012	0.009	0.059	0.136	0.013	0.075	0.018	0.018	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
Luhya	0.049	0.009	0.131	0.148	0.180	0.171	0.137	0.011	0.177	0.153	0.181	0.171	0.156	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
Madiga	0.174	0.138	0.005	0.050	0.072	0.069	0.006	0.111	0.073	0.027	0.073	0.070	0.052	0.133	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
Mala	0.174	0.138	0.005	0.053	0.072	0.068	0.006	0.112	0.072	0.026	0.073	0.069	0.051	0.133	0.001	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
Nguni	0.038	0.019	0.146	0.164	0.198	0.191	0.154	0.022	0.197	0.172	0.199	0.190	0.174	0.014	0.150	0.150	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
Pedi	0.036	0.016	0.140	0.157	0.192	0.184	0.147	0.018	0.190	0.165	0.193	0.184	0.167	0.010	0.143	0.143	0.004	NA	NA	NA	NA	NA	NA	NA	NA	NA
Pygmy	0.071	0.065	0.196	0.209	0.239	0.240	0.207	0.077	0.240	0.219	0.240	0.238	0.230	0.073	0.204	0.203	0.075	0.072	NA	NA	NA	NA	NA	NA	NA	NA
Sotho/Tswana	0.029	0.020	0.144	0.161	0.196	0.189	0.152	0.022	0.195	0.169	0.197	0.188	0.172	0.015	0.147	0.147	0.004	0.004	0.072	NA	NA	NA	NA	NA	NA	NA
Stalskoe	0.180	0.142	0.023	0.012	0.104	0.101	0.046	0.110	0.108	0.066	0.105	0.101	0.086	0.136	0.040	0.042	0.153	0.145	0.212	0.150	NA	NA	NA	NA	NA	NA
Tamil_Brahmin	0.170	0.135	0.001	0.033	0.078	0.074	0.010	0.107	0.080	0.032	0.080	0.075	0.058	0.130	0.006	0.006	0.146	0.139	0.199	0.143	0.023	NA	NA	NA	NA	NA
Tuscan	0.179	0.148	0.032	0.005	0.112	0.109	0.055	0.116	0.113	0.072	0.113	0.109	0.094	0.141	0.048	0.051	0.159	0.151	0.208	0.156	0.010	0.031	NA	NA	NA	NA
Urkarah	0.183	0.150	0.028	0.014	0.109	0.106	0.051	0.119	0.111	0.069	0.111	0.105	0.091	0.144	0.044	0.047	0.161	0.154	0.212	0.158	0.007	0.027	0.013	NA	NA	NA
Utah N. European	0.184	0.152	0.033	0.000	0.112	0.109	0.057	0.121	0.114	0.074	0.114	0.109	0.095	0.146	0.051	0.052	0.163	0.156	0.212	0.160	0.012	0.033	0.004	0.014	NA	NA
Vietnamese	0.210	0.175	0.066	0.104	0.006	0.002	0.071	0.148	0.014	0.086	0.014	0.015	0.003	0.167	0.064	0.064	0.186	0.180	0.238	0.185	0.099	0.070	0.105	0.102	0.105	NA
YRI	0.051	0.015	0.141	0.157	0.184	0.177	0.146	0.018	0.181	0.161	0.185	0.177	0.164	0.009	0.142	0.143	0.014	0.011	0.076	0.016	0.147	0.140	0.151	0.154	0.155	0.174

Another MDS example - Genetic dissimilarity between races

library(ggplot2)
library(ggrepel)

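z <- as.dist(cbind(z0,NA)) ## lower-triangular table -> dist object
z_mds = - cmdscale(z,k=2) ## z is already a dissimilarity, so it is passed directly; the sign flip only reflects the plot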

mdsDataFrame <- data.frame(Race=rownames(z_mds),x = z_mds[,1], y = z_mds[,2])

ggplot(data=mdsDataFrame, aes(x=x, y=y, label=Race)) + geom_point(aes(color=Race)) + geom_text_repel(data=mdsDataFrame, aes(label=Race))

Non-negative matrix factorization (NMF)

Non-negative matrix factorization (NMF) is a dimension reduction algorithm in which a non-negative matrix \(X\) is approximately factorized into two non-negative matrices, \(W\) and \(H\).
- All elements of \(W\) and \(H\) must be greater than or equal to zero.
- The method has been applied in image recognition, text mining and bioinformatics.

\[\min_{W\in \mathbb{R}^{p\times r}, H \in \mathbb{R}^{r \times n}} \|X - WH\|_F, \quad \text{s.t. } W \ge 0, H \ge 0,\]
where \(X \in \mathbb{R}^{p \times n}\), \(\|X\|_F = \sqrt{\sum_{j=1}^p \sum_{i=1}^n X_{ji}^2}\), and \(r\) is the rank of the factorization.
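
One common way to attack this problem is with multiplicative updates in the style of Lee and Seung. The sketch below is a bare-bones base R illustration under several assumptions (random start, 200 iterations, a small `eps` in the denominators, the transposed iris data as \(X\)); it is not the implementation used by the NMF package.

```r
set.seed(1)
X <- t(as.matrix(iris[, 1:4]))     # p x n non-negative matrix (p = 4, n = 150)
r <- 2                             # rank of the factorization
W <- matrix(runif(nrow(X) * r), nrow(X), r)   # random non-negative start
H <- matrix(runif(r * ncol(X)), r, ncol(X))
eps <- 1e-9                        # guards against division by zero

loss <- numeric(200)
for (it in seq_along(loss)) {
  # Multiplicative updates keep W and H non-negative at every step
  H <- H * (t(W) %*% X) / (t(W) %*% W %*% H + eps)
  W <- W * (X %*% t(H)) / (W %*% H %*% t(H) + eps)
  loss[it] <- norm(X - W %*% H, type = "F")   # Frobenius norm of the residual
}

range(loss)
```

Because the updates only multiply by non-negative ratios, the constraints hold automatically; the price is that the result depends on the starting values.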

NMF example

library(NMF)

iris.data <- iris[,1:4]
anmf <- nmf(iris.data, rank = 2, method = "lee")

W <- anmf@fit@W
H <- anmf@fit@H

plot(W[,1], W[,2], col=as.numeric(iris$Species),pch=19)
legend("topright", legend = levels(iris$Species), col =  unique(iris$Species), pch = 19)

NMF example

basismap(anmf) ## W

coefmap(anmf) ## H

Biostatistical Computing, PHC 6068

Dimension reduction
