Zhiguang Huo (Caleb)
Monday October 11, 2021
Several popular dimension reduction approaches
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## [1] 150 5
iris.data <- iris[,1:4]
ir.pca <- prcomp(iris.data,
                 center = TRUE,
                 scale = TRUE)
PC1 <- ir.pca$x[,"PC1"]
PC2 <- ir.pca$x[,"PC2"]
variance <- ir.pca$sdev^2 / sum(ir.pca$sdev^2)
v1 <- paste0("variance: ",signif(variance[1] * 100,3), "%")
v2 <- paste0("variance: ",signif(variance[2] * 100,3), "%")
plot(PC1, PC2, col = as.numeric(iris$Species), pch = 19, xlab = v1, ylab = v2)
legend("topright", legend = levels(iris$Species), col = seq_along(levels(iris$Species)), pch = 19)
suppressWarnings(suppressMessages(library(tidyverse)))
iris.data <- iris[,1:4]
ir.pca <- prcomp(iris.data,
                 center = TRUE,
                 scale = TRUE)
PC1 <- ir.pca$x[,"PC1"]
PC2 <- ir.pca$x[,"PC2"]
data_PCA <- tibble(PC1=PC1, PC2=PC2, Species = iris$Species)
variance <- ir.pca$sdev^2 / sum(ir.pca$sdev^2)
v1 <- paste0("variance: ",signif(variance[1] * 100,3), "%")
v2 <- paste0("variance: ",signif(variance[2] * 100,3), "%")
black.bold.text <- element_text(face = "bold", color = "black", size=20)
data_PCA %>%
  ggplot() +
  aes(x = PC1, y = PC2, color = Species) +
  geom_point() +
  labs(x = v1, y = v2) +
  theme_bw() +
  theme(text = black.bold.text)
where \(D = \begin{pmatrix} d_1 & 0 & \cdots & 0 \\ 0 & d_2 & & \vdots \\ \vdots & & \ddots & 0 \\ 0 & \cdots & 0 & d_G \end{pmatrix},\)
\(d_1 \ge d_2 \ge \ldots \ge d_G \ge 0\) are the eigenvalues, and \({\bf V} = ({\bf v}_1, {\bf v}_2, \ldots, {\bf v}_G)\) are the corresponding eigenvectors.
Any vector \(\alpha \in \mathbb{R}^{G}\) with \(\|\alpha\|_2 = 1\) can be written as a linear combination of the \({\bf v}_g\)'s: \(\alpha = \sum_{g=1}^G a_g {\bf v}_g\), with \(\sum_{g=1}^G a_g^2 = 1\).
\(\begin{aligned} Var(\alpha^\top {\bf x}) &= \alpha^\top \Sigma \alpha \\ &= (\sum_{g=1}^G a_g {\bf v}_g)^\top \Sigma (\sum_{g=1}^G a_g {\bf v}_g) \\ &= (\sum_{g=1}^G a_g {\bf v}_g)^\top(\sum_{g=1}^G a_g d_g {\bf v}_g) \\ &= \sum_{g=1}^G d_g a_g^2 \|{\bf v}_g\|_2^2 = \sum_{g=1}^G d_g a_g^2 \\ &\le d_1 \sum_{g=1}^G a_g^2 = d_1 \\ \end{aligned}\)
On the other hand, this upper bound is attained at \(\alpha = {\bf v}_1\): \(\begin{aligned} Var({\bf v}_1^\top {\bf x}) &= {\bf v}_1^\top \Sigma {\bf v}_1 \\ &= d_1 {\bf v}_1^\top {\bf v}_1 \\ &= d_1 \\ \end{aligned}\)
Therefore, \(\alpha = {\bf v}_1\) maximizes \(Var(\alpha^\top {\bf x})\) among all unit vectors \(\alpha\).
And the projected random variable \(L_1 = {\bf v}_1^\top {\bf x}\) has the largest variance \(d_1\)
The entries of \({\bf v}_1 = (v_{11}, \ldots, v_{1G})^\top\) are called the loadings of PC1.
Similarly, \(L_2 = {\bf v}_2^\top {\bf x}\) has the largest variance among all possible projections orthogonal to \(L_1\).
Similarly, \(L_g = {\bf v}_g^\top {\bf x}\) has the largest variance among all possible projections orthogonal to \(L_1, L_2, \ldots, L_{g-1}\) for \(1 \le g \le G\).
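This eigen-decomposition view can be checked numerically: `eigen()` applied to the sample covariance of the scaled data should reproduce `prcomp()`'s variances and, up to sign flips, its scores. A minimal sketch on the iris data:

```r
## Verify that PCA is the eigen-decomposition of the covariance matrix.
X <- scale(iris[, 1:4])            # center and scale, as in prcomp(..., scale = TRUE)
eig <- eigen(cov(X))               # Sigma = V D V^T
d <- eig$values                    # eigenvalues d_1 >= ... >= d_G
V <- eig$vectors                   # columns are the loading vectors v_1, ..., v_G

ir.pca <- prcomp(iris[, 1:4], center = TRUE, scale = TRUE)

## eigenvalues equal the variances of the PCs
all.equal(d, ir.pca$sdev^2)
## scores X %*% V agree with prcomp's scores up to a sign flip per component
max(abs(abs(X %*% V) - abs(ir.pca$x)))
```

The sign ambiguity is expected: each eigenvector is only determined up to multiplication by \(-1\).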
Human beings have 23 pairs of chromosomes.
Chromosome 1 of the HAPMAP data is available at /blue/phc6068/share/data/HAPMAP/chr1Data.Rdata
The annotation is at /blue/phc6068/share/data/HAPMAP/relationships_w_pops_121708.txt
These data are available on hpg.rc.ufl.edu
You can either perform the analysis on HiPerGator, or download the data to your local computer and perform the PCA analysis.
The PCA has been implemented in many R packages. You don’t have to implement these individual steps.
Covariance matrix \(\Sigma = Var({\bf x})\): \[\Sigma = V D V^\top, \mbox{ or } V^\top \Sigma V = D\]
\(v_1^\top x\) is the score of \(x\) on the first principal component.
Writing the SVD of the centered data matrix as \(X = U E V^\top\) (with \(U^\top U = V^\top V = I\)), \(\begin{aligned} V^\top X^\top X V &= V^\top (U E V^\top)^\top U E V^\top V \\ &= V^\top V E U^\top U E V^\top V \\ &= E^2 \end{aligned}\) so the columns of \(V\) are the eigenvectors of \(X^\top X\), and \(UE = XV\) gives the PC scores.
iris.data <- iris[,1:4]
iris.data.center <- scale(iris.data) ## mean 0 and std 1
asvd <- svd(iris.data.center)
UE <- asvd$u %*% diag(asvd$d)
PC1 <- UE[,1]
PC2 <- UE[,2]
variance <- asvd$d^2 / sum(asvd$d^2)
v1 <- paste0("variance: ",signif(variance[1] * 100,3), "%")
v2 <- paste0("variance: ",signif(variance[2] * 100,3), "%")
plot(PC1, PC2, col = as.numeric(iris$Species), pch = 19, xlab = v1, ylab = v2)
legend("topright", legend = levels(iris$Species), col = seq_along(levels(iris$Species)), pch = 19)
\[\min_{W\in \mathbb{R}_{\ge 0}^{p\times r}, H \in \mathbb{R}_{\ge 0}^{r \times n}} \|X - WH\|_F,\]
where \(X \in \mathbb{R}^{p \times n}\) is nonnegative, \(W\) and \(H\) are constrained to be nonnegative, and \(\|X\|_F = \sqrt{\sum_{j=1}^p \sum_{i=1}^n X_{ji}^2}\) is the Frobenius norm.
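This objective can be minimized with the Lee-Seung multiplicative update rules. A minimal base-R sketch on the (nonnegative) iris measurements; the rank r = 2, the iteration count, and the seed are illustrative choices, not tuned values:

```r
## NMF via Lee-Seung multiplicative updates for the Frobenius loss.
set.seed(32611)
X <- t(as.matrix(iris[, 1:4]))     # p x n data matrix, nonnegative by construction
p <- nrow(X); n <- ncol(X); r <- 2 # r is the factorization rank
W <- matrix(runif(p * r), p, r)    # p x r, nonnegative random initialization
H <- matrix(runif(r * n), r, n)    # r x n
eps <- 1e-9                        # guards against division by zero
for (iter in 1:200) {
  ## each update keeps W, H nonnegative and does not increase ||X - WH||_F
  H <- H * (t(W) %*% X) / (t(W) %*% W %*% H + eps)
  W <- W * (X %*% t(H)) / (W %*% H %*% t(H) + eps)
}
norm(X - W %*% H, "F") / norm(X, "F")  # relative reconstruction error
```

In practice dedicated packages (e.g. the NMF package on CRAN) offer more robust initializations and stopping rules; this sketch only illustrates the update mechanics.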
z <- read.delim("https://Caleb-Huo.github.io/teaching/data/UScityDistance/Mileage.txt", row.names=1)
knitr::kable(z, caption = "distance between cities (miles)")
| | BOST | NY | DC | MIAM | CHIC | SEAT | SF | LA | DENV | GAINESVILLE |
|---|---|---|---|---|---|---|---|---|---|---|
| BOSTON | 0 | 206 | 429 | 1504 | 963 | 2976 | 3095 | 2979 | 1949 | 1219 |
| NY | 206 | 0 | 233 | 1308 | 802 | 2815 | 2934 | 2786 | 1771 | 1000 |
| DC | 429 | 233 | 0 | 1075 | 671 | 2684 | 2799 | 2631 | 1616 | 780 |
| MIAMI | 1504 | 1308 | 1075 | 0 | 1329 | 3273 | 3053 | 2687 | 1010 | 338 |
| CHICAGO | 963 | 802 | 671 | 1329 | 0 | 2013 | 2142 | 2054 | 996 | 1047 |
| SEATTLE | 2976 | 2815 | 2684 | 3273 | 2013 | 0 | 808 | 1131 | 1307 | 2985 |
| SF | 3095 | 2934 | 2799 | 3053 | 2142 | 808 | 0 | 379 | 1235 | 2780 |
| LA | 2979 | 2786 | 2631 | 2687 | 2054 | 1131 | 379 | 0 | 1059 | 2409 |
| DENVER | 1949 | 1771 | 1616 | 2037 | 996 | 1307 | 1235 | 1059 | 0 | 1740 |
| GAINESVILLE | 1219 | 1000 | 780 | 338 | 1047 | 2985 | 2780 | 2409 | 1740 | 0 |
Objective function for classical MDS: \[ L = \min \sum_{i<j} (d_{ij} - \delta_{ij})^2,\] where \(\delta_{ij}\) are the observed dissimilarities. Assume the underlying data \(X \in \mathbb{R}^{n\times q}\) (usually \(q=2\)) with \(\|X_i - X_j\|_2 = d_{ij}\).
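Classical MDS is available in base R as `cmdscale()`. As a quick sanity check of the objective, here is a minimal sketch on the built-in `eurodist` road distances (a stand-in dataset, not part of the lecture data): the pairwise distances among the fitted 2-dimensional coordinates should closely track the observed distances.

```r
## Classical MDS on the built-in eurodist distances (21 European cities).
mds <- cmdscale(eurodist, k = 2)   # n x 2 configuration X
fitted_d <- dist(mds)              # ||X_i - X_j||_2 from the embedding
## the fitted distances should approximate the observed road distances
cor(as.vector(fitted_d), as.vector(eurodist))
```

Note the MDS solution is only determined up to rotation, reflection, and translation, so the raw coordinates need not align with geography even when the distances fit well.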
Note:
z0 <- read.csv("https://Caleb-Huo.github.io/teaching/data/SNP_MDS/SNP_PCA.csv", row.names=1)
knitr::kable(z0, caption = "Genetic dissimilarity between races")
X.Kung | Alur | AndhraBrahmin | CEU | CHB | Chinese | Dalit | Hema | Iban | Irula | JPT | Japanese | Khmer.Cambodian | Luhya | Madiga | Mala | Nguni | Pedi | Pygmy | Sotho.Tswana | Stalskoe | Tamil.Brahmin | Tuscan | Urkarah | Utah.N..European | Vietnamese | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
!Kung | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Alur | 0.050 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Andhra_Brahmin | 0.168 | 0.135 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
CEU | 0.183 | 0.154 | 0.033 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
CHB | 0.216 | 0.187 | 0.075 | 0.110 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Chinese | 0.213 | 0.179 | 0.071 | 0.108 | 0.001 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Dalit | 0.178 | 0.143 | 0.010 | 0.056 | 0.078 | 0.075 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Hema | 0.052 | 0.013 | 0.108 | 0.123 | 0.162 | 0.153 | 0.116 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Iban | 0.216 | 0.185 | 0.075 | 0.112 | 0.024 | 0.020 | 0.078 | 0.160 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Irula | 0.193 | 0.160 | 0.030 | 0.073 | 0.091 | 0.089 | 0.031 | 0.133 | 0.092 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
JPT | 0.218 | 0.188 | 0.076 | 0.112 | 0.007 | 0.009 | 0.079 | 0.163 | 0.030 | 0.092 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Japanese | 0.212 | 0.179 | 0.072 | 0.108 | 0.008 | 0.009 | 0.076 | 0.153 | 0.031 | 0.089 | 0.003 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Khmer Cambodian | 0.199 | 0.162 | 0.055 | 0.094 | 0.012 | 0.009 | 0.059 | 0.136 | 0.013 | 0.075 | 0.018 | 0.018 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Luhya | 0.049 | 0.009 | 0.131 | 0.148 | 0.180 | 0.171 | 0.137 | 0.011 | 0.177 | 0.153 | 0.181 | 0.171 | 0.156 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Madiga | 0.174 | 0.138 | 0.005 | 0.050 | 0.072 | 0.069 | 0.006 | 0.111 | 0.073 | 0.027 | 0.073 | 0.070 | 0.052 | 0.133 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Mala | 0.174 | 0.138 | 0.005 | 0.053 | 0.072 | 0.068 | 0.006 | 0.112 | 0.072 | 0.026 | 0.073 | 0.069 | 0.051 | 0.133 | 0.001 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Nguni | 0.038 | 0.019 | 0.146 | 0.164 | 0.198 | 0.191 | 0.154 | 0.022 | 0.197 | 0.172 | 0.199 | 0.190 | 0.174 | 0.014 | 0.150 | 0.150 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Pedi | 0.036 | 0.016 | 0.140 | 0.157 | 0.192 | 0.184 | 0.147 | 0.018 | 0.190 | 0.165 | 0.193 | 0.184 | 0.167 | 0.010 | 0.143 | 0.143 | 0.004 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Pygmy | 0.071 | 0.065 | 0.196 | 0.209 | 0.239 | 0.240 | 0.207 | 0.077 | 0.240 | 0.219 | 0.240 | 0.238 | 0.230 | 0.073 | 0.204 | 0.203 | 0.075 | 0.072 | NA | NA | NA | NA | NA | NA | NA | NA |
Sotho/Tswana | 0.029 | 0.020 | 0.144 | 0.161 | 0.196 | 0.189 | 0.152 | 0.022 | 0.195 | 0.169 | 0.197 | 0.188 | 0.172 | 0.015 | 0.147 | 0.147 | 0.004 | 0.004 | 0.072 | NA | NA | NA | NA | NA | NA | NA |
Stalskoe | 0.180 | 0.142 | 0.023 | 0.012 | 0.104 | 0.101 | 0.046 | 0.110 | 0.108 | 0.066 | 0.105 | 0.101 | 0.086 | 0.136 | 0.040 | 0.042 | 0.153 | 0.145 | 0.212 | 0.150 | NA | NA | NA | NA | NA | NA |
Tamil_Brahmin | 0.170 | 0.135 | 0.001 | 0.033 | 0.078 | 0.074 | 0.010 | 0.107 | 0.080 | 0.032 | 0.080 | 0.075 | 0.058 | 0.130 | 0.006 | 0.006 | 0.146 | 0.139 | 0.199 | 0.143 | 0.023 | NA | NA | NA | NA | NA |
Tuscan | 0.179 | 0.148 | 0.032 | 0.005 | 0.112 | 0.109 | 0.055 | 0.116 | 0.113 | 0.072 | 0.113 | 0.109 | 0.094 | 0.141 | 0.048 | 0.051 | 0.159 | 0.151 | 0.208 | 0.156 | 0.010 | 0.031 | NA | NA | NA | NA |
Urkarah | 0.183 | 0.150 | 0.028 | 0.014 | 0.109 | 0.106 | 0.051 | 0.119 | 0.111 | 0.069 | 0.111 | 0.105 | 0.091 | 0.144 | 0.044 | 0.047 | 0.161 | 0.154 | 0.212 | 0.158 | 0.007 | 0.027 | 0.013 | NA | NA | NA |
Utah N. European | 0.184 | 0.152 | 0.033 | 0.000 | 0.112 | 0.109 | 0.057 | 0.121 | 0.114 | 0.074 | 0.114 | 0.109 | 0.095 | 0.146 | 0.051 | 0.052 | 0.163 | 0.156 | 0.212 | 0.160 | 0.012 | 0.033 | 0.004 | 0.014 | NA | NA |
Vietnamese | 0.210 | 0.175 | 0.066 | 0.104 | 0.006 | 0.002 | 0.071 | 0.148 | 0.014 | 0.086 | 0.014 | 0.015 | 0.003 | 0.167 | 0.064 | 0.064 | 0.186 | 0.180 | 0.238 | 0.185 | 0.099 | 0.070 | 0.105 | 0.102 | 0.105 | NA |
YRI | 0.051 | 0.015 | 0.141 | 0.157 | 0.184 | 0.177 | 0.146 | 0.018 | 0.181 | 0.161 | 0.185 | 0.177 | 0.164 | 0.009 | 0.142 | 0.143 | 0.014 | 0.011 | 0.076 | 0.016 | 0.147 | 0.140 | 0.151 | 0.154 | 0.155 | 0.174 |
library(ggplot2)
library(ggrepel)
z <- as.dist(cbind(z0, NA)) ## pad to a square matrix, then convert to a dist object
z_mds <- -cmdscale(z, k = 2) ## sign flip only changes orientation; MDS is unique up to rotation/reflection
mdsDataFrame <- data.frame(Race=rownames(z_mds),x = z_mds[,1], y = z_mds[,2])
ggplot(data=mdsDataFrame, aes(x=x, y=y, label=Race)) +
geom_point(aes(color=Race)) +
geom_text_repel(data=mdsDataFrame, aes(label=Race)) +
theme(legend.position = "none")
## Warning in if (class(X) == "dist") {: the condition has length > 1 and only the
## first element will be used
## sigma summary: Min. : 0.486505661043274 |1st Qu. : 0.587913800179832 |Median : 0.614872437640536 |Mean : 0.623051089344394 |3rd Qu. : 0.654914112723525 |Max. : 0.796707932771489 |
## Epoch: Iteration #100 error is: 12.990063489305
## Epoch: Iteration #200 error is: 0.277712292102509
## Epoch: Iteration #300 error is: 0.257101580790775
## Epoch: Iteration #400 error is: 0.246665630776044
## Epoch: Iteration #500 error is: 0.224003366859357
## Epoch: Iteration #600 error is: 0.223580775226078
## Epoch: Iteration #700 error is: 0.223572977457498
## Epoch: Iteration #800 error is: 0.223572819069482
## Epoch: Iteration #900 error is: 0.223572816097557
## Epoch: Iteration #1000 error is: 0.223572816039835