协变量是PCA的前三个,数据具体如下:
「表型数据:」
1641 1641 153.973423
1642 1642 113.119301
1643 1643 77.094801
1644 1644 89.293866
1645 1645 94.511433
1646 1646 134.410909
1647 1647 121.246759
1648 1648 45.699443
1649 1649 67.786229
1650 1650 102.225648
「协变量数据」
1641 1641 0.0633225 0.0285328 -0.00734127
1642 1642 0.0684039 0.0212648 -0.000664363
1643 1643 0.0609595 0.0242615 -0.00206211
1644 1644 0.0636631 0.0241012 -0.00275062
1645 1645 0.0741456 0.0293644 0.00068775
1646 1646 0.114417 0.0443451 -0.0408687
1647 1647 0.111599 0.0400401 -0.0320522
1648 1648 0.100862 0.0344925 -0.0386237
1649 1649 0.107986 0.028411 -0.0312307
1650 1650 0.108836 0.0377951 -0.0314308
「基因型数据:」
FID IID PAT MAT SEX PHENOTYPE M4_A M6_T M9_T M11_A M17_A M19_A M22_A M24_A M27_A M31_
1641 1641 0 0 0 -9 0 1 2 1 0 0 1 0 0 0 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 1 1
1642 1642 0 0 0 -9 1 1 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0
1643 1643 0 0 0 -9 1 0 1 1 0 0 1 1 1 1 0 0 0 0 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 1 1 1 1
1644 1644 0 0 0 -9 1 0 1 1 0 0 1 1 1 1 0 0 0 0 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 1 1 1 1
1645 1645 0 0 0 -9 0 1 2 1 0 0 1 0 0 0 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 1 1
1646 1646 0 0 0 -9 1 0 0 0 0 0 0 2 2 2 1 0 0 0 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 1 1 0 0
1647 1647 0 0 0 -9 0 1 1 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 0 0 0
1648 1648 0 0 0 -9 1 0 0 0 0 0 0 1 1 2 1 0 0 0 0 0 1 1 0 0 0 0 0 0 1 1 1 1 1 1 1 0 0
1649 1649 0 0 0 -9 0 1 1 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0
1650 1650 0 0 0 -9 1 0 0 0 0 0 0 1 1 2 1 0 0 0 0 0 1 1 0 0 0 0 0 0 1 1 1 1 1 1 1 0 0
1651 1651 0 0 0 -9 1 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 1 1 0 0 0 0 0 0 1 1 1 1 1 1 1 0 0
1652 1652 0 0 0 -9 0 1 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 0 0 0
1653 1653 0 0 0 -9 1 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 1 1 0 0 0 0 0 0 1 1 1 1 1 1 1 0 0
1654 1654 0 0 0 -9 0 1 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 0 0 0
1655 1655 0 0 0 -9 2 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
1656 1656 0 0 0 -9 2 0 0 0 1 1 2 2 2 2 2 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1657 1657 0 0 0 -9 2 0 0 0 1 1 2 2 2 2 2 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1658 1658 0 0 0 -9 2 0 0 0 1 1 2 2 2 2 2 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
GLM
模型GAPIT软件安装,见:如何安装GAPIT软件:https://zhuanlan.zhihu.com/p/268327005
「分析代码如下:」
library(data.table)
raw = fread("plink.raw",header=T)
raw[1:10,1:10]
map = fread("recode.map",header=F)
head(map)
rr = raw[,!c(1,3:6)]
names(rr) = c("taxa",map$V2)
mm = map[,c(2,1,4)]
names(mm) = c("Name","Chromosome","Position")
head(mm)
rr[1:10,1:10]
fwrite(rr,"mdp_numeric.txt",sep="\t")
fwrite(mm,"mdp_SNP_information.txt",sep="\t")
#
# GWAS 分析
library(data.table)
source("http://zzlab.net/GAPIT/GAPIT.library.R")
source("http://zzlab.net/GAPIT/gapit_functions.txt")
myGd = fread("mdp_numeric.txt",header=T,data.table = F)
myGM = fread("mdp_SNP_information.txt",header = T,data.table=F)
myY = fread("dat_plink.txt",data.table = F)
head(myY)
covar = fread("cov_plink.txt",data.table = F)[,-1]
names(covar)[1] = "Taxa"
head(covar)
myGAPIT = GAPIT(Y = myY[,c(1,3)],GD = myGd, GM = myGM,
# PCA.total=3,
CV = covar,
model="GLM")
「注意:」
GAPIT.GLM.V3.GWAS.Results.csv
结果文件如下:包括:
SNP,Chromosome,Position ,P.value,maf,nobs,Rsquare.of.Model.without.SNP,Rsquare.of.Model.with.SNP,FDR_Adjusted_P-values,effect
M57157,12,44,0.0000000265778189052299,0.375,1000,0.0178325320026741,0.0488405467802565,0.000799726570858368,7.5184135943861
M57155,12,44,0.000000149990719134015,0.3275,1000,0.0178325320026741,0.0454335956044116,0.00115674059672426,7.24232622801506
M57156,12,44,0.000000149990719134015,0.3275,1000,0.0178325320026741,0.0454335956044116,0.00115674059672426,7.24232622801506
M98663,20,73,0.00000015377076726145,0.222,1000,0.0178325320026741,0.0453847644903734,0.00115674059672426,-7.81213756957444
M73233,15,64,0.000000512284630336812,0.2255,1000,0.0178325320026741,0.0430299031398552,0.00290006393781271,-7.52596405927775
M8452,2,69,0.000000578277953701437,0.435,1000,0.0178325320026741,0.0427934752525868,0.00290006393781271,-6.51962588787404
M45998,10,20,0.000000768626039613293,0.411,1000,0.0178325320026741,0.0422387923854602,0.00307979999092427,-6.69120649416338
M82957,17,57,0.00000112940850557263,0.1685,1000,0.0178325320026741,0.0414897712489706,0.00307979999092427,8.48568078650835
M82958,17,57,0.00000112940850557263,0.1685,1000,0.0178325320026741,0.0414897712489706,0.00307979999092427,8.48568078650835
这里,
「上面两者之差,即为该SNP的解释百分比(PVE)」
$$SNP的PVE = Rsquare.of.Model.with.SNP - Rsquare.of.Model.without.SNP$$
首先读入R语言:
r$> re = fread("GAPIT.GLM.V3.GWAS.Results.csv")
r$> head(re)
SNP Chromosome Position P.value maf nobs Rsquare.of.Model.without.SNP Rsquare.of.Model.with.SNP FDR_Adjusted_P-values effect
1: M57157 12 44 2.657782e-08 0.3750 1000 0.01783253 0.04884055 0.0007997266 7.518414
2: M57155 12 44 1.499907e-07 0.3275 1000 0.01783253 0.04543360 0.0011567406 7.242326
3: M57156 12 44 1.499907e-07 0.3275 1000 0.01783253 0.04543360 0.0011567406 7.242326
4: M98663 20 73 1.537708e-07 0.2220 1000 0.01783253 0.04538476 0.0011567406 -7.812138
5: M73233 15 64 5.122846e-07 0.2255 1000 0.01783253 0.04302990 0.0029000639 -7.525964
6: M8452 2 69 5.782780e-07 0.4350 1000 0.01783253 0.04279348 0.0029000639 -6.519626
然后我们根据上面的公式,手动计算PVE值:
r$> re$PVE = re$Rsquare.of.Model.with.SNP - re$Rsquare.of.Model.without.SNP
r$> head(re)
SNP Chromosome Position P.value maf nobs Rsquare.of.Model.without.SNP Rsquare.of.Model.with.SNP FDR_Adjusted_P-values effect PVE
1: M57157 12 44 2.657782e-08 0.3750 1000 0.01783253 0.04884055 0.0007997266 7.518414 0.03100801
2: M57155 12 44 1.499907e-07 0.3275 1000 0.01783253 0.04543360 0.0011567406 7.242326 0.02760106
3: M57156 12 44 1.499907e-07 0.3275 1000 0.01783253 0.04543360 0.0011567406 7.242326 0.02760106
4: M98663 20 73 1.537708e-07 0.2220 1000 0.01783253 0.04538476 0.0011567406 -7.812138 0.02755223
5: M73233 15 64 5.122846e-07 0.2255 1000 0.01783253 0.04302990 0.0029000639 -7.525964 0.02519737
6: M8452 2 69 5.782780e-07 0.4350 1000 0.01783253 0.04279348 0.0029000639 -6.519626 0.02496094
根据PVE的大小,进行排序,降序:
r$> re %>% arrange(-PVE)
SNP Chromosome Position P.value maf nobs Rsquare.of.Model.without.SNP Rsquare.of.Model.with.SNP FDR_Adjusted_P-values effect PVE
1: M57157 12 44 2.657782e-08 0.3750 1000 0.01783253 0.04884055 0.0007997266 7.5184135944 3.100801e-02
2: M57155 12 44 1.499907e-07 0.3275 1000 0.01783253 0.04543360 0.0011567406 7.2423262280 2.760106e-02
3: M57156 12 44 1.499907e-07 0.3275 1000 0.01783253 0.04543360 0.0011567406 7.2423262280 2.760106e-02
4: M98663 20 73 1.537708e-07 0.2220 1000 0.01783253 0.04538476 0.0011567406 -7.8121375696 2.755223e-02
5: M73233 15 64 5.122846e-07 0.2255 1000 0.01783253 0.04302990 0.0029000639 -7.5259640593 2.519737e-02
---
30086: M58296 12 66 9.996606e-01 0.2700 1000 0.01783253 0.01783253 0.9997934933 0.0006040638 1.785355e-10
30087: M751 1 15 9.998043e-01 0.2435 1000 0.01783253 0.01783253 0.9998960441 0.0003736382 5.935820e-11
30088: M21736 5 35 9.998296e-01 0.3020 1000 0.01783253 0.01783253 0.9998960441 0.0002955665 4.500680e-11
30089: M80677 17 12 9.999131e-01 0.3830 1000 0.01783253 0.01783253 0.9999300138 -0.0001448323 1.170910e-11
30090: M57593 12 52 9.999300e-01 0.2735 1000 0.01783253 0.01783253 0.9999300138 -0.0001300807 7.589301e-12
可以看到,结果就给出了PVE从大到小的排序结果。
这里,我们注意到,前三个SNP的PVE分别是:
这里,我们对PVE进行求和:
r$> sum(re$PVE)
[1] 57.6114
可以看到,总的PVE之和为57.611,远远大于1,是因为有些位点高度LD,所以PVE之和就很大。相关问题在 GWAS分析中SNP解释百分比PVE | 第一篇,SNP解释百分比之和为何大于1?中有过介绍。
简单来说,就是单位点的回归分析,计算R方。
这里,我们用同样的数据,在R中进行GLM的GWAS分析。
代码如下:
library(data.table)
geno = fread("plink.raw")[,!c(1,3:6)]
map = fread("recode.map")
m012 = geno
names(m012) = c("IID",map$V2)
phe = fread("phe.txt")
cov1= fread("tcov.txt")
dat = left_join(phe,cov1,by = c("V1"="V1")) %>% dplyr::select(IID =1, y =2, pc1=3,pc2=4,pc3=5)
dat1 = left_join(dat,m012,by="IID")
# dat1$M57157 = as.factor(dat1$M57157)
mod_1 = lm(y ~ 1+pc1 + pc2 + pc3 + M57157,data=dat1);summary(mod_1)
summary(mod_1)$r.squared
mod_2 = lm(y ~ 1+pc1 + pc2 + pc3 ,data=dat1);summary(mod_2)
summary(mod_2)$r.squared
这里,我们将PCA的前三个作为协变量加到回归分析汇总。
包含SNP的回归分析:
r$> mod_1 = lm(y ~ 1+pc1 + pc2 + pc3 + M57157,data=dat1);summary(mod_1)
Call:
lm(formula = y ~ 1 + pc1 + pc2 + pc3 + M57157, data = dat1)
Residuals:
Min 1Q Median 3Q Max
-85.094 -19.174 -0.443 18.218 79.181
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 96.408 1.320 73.048 < 2e-16 ***
pc1 -108.543 27.603 -3.932 9.00e-05 ***
pc2 28.740 28.981 0.992 0.3216
pc3 58.098 27.890 2.083 0.0375 *
M57157 7.518 1.320 5.695 1.62e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 27.6 on 995 degrees of freedom
Multiple R-squared: 0.04884, Adjusted R-squared: 0.04502
F-statistic: 12.77 on 4 and 995 DF, p-value: 3.838e-10
可以看到,R方为0.04884,也可以这样提取:
r$> summary(mod_1)$r.squared
[1] 0.04884055
不包含SNP的回归分析:
r$> mod_2 = lm(y ~ 1+pc1 + pc2 + pc3 ,data=dat1);summary(mod_2)
Call:
lm(formula = y ~ 1 + pc1 + pc2 + pc3, data = dat1)
Residuals:
Min 1Q Median 3Q Max
-89.495 -19.318 -0.099 18.649 85.448
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 102.0467 0.8864 115.129 < 2e-16 ***
pc1 -111.8205 28.0295 -3.989 7.11e-05 ***
pc2 -21.6604 28.0295 -0.773 0.44
pc3 35.1350 28.0295 1.254 0.21
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 28.03 on 996 degrees of freedom
Multiple R-squared: 0.01783, Adjusted R-squared: 0.01487
F-statistic: 6.028 on 3 and 996 DF, p-value: 0.0004546
这里的R方为:0.01783
也可以这样提取:
r$> x2 = summary(mod_2)$r.squared
summary(mod_2)$r.squared
[1] 0.01783253
那么SNP:M57157的PVE为:3.1%
r$> x1 - x2
[1] 0.03100801
对比我们GAPIT的PVE结果:
结果完全一致。
这里,一般线性模型中,可以针对显著性的SNP,进行单位点回归分析,计算PVE。对于混合线性模型,也可以将显著性位点提取,进行R语言的手动计算,这个也是PVE计算的一种方法。
混合线性模型中,还有其它的计算方法,我们后面进行介绍,欢迎继续关注我。