前往小程序,Get更优阅读体验!
立即前往
首页
学习
活动
专区
工具
TVP
发布
社区首页 >专栏 >GWAS分析中SNP解释百分比PVE | 第二篇,GLM模型中如何计算PVE?

GWAS分析中SNP解释百分比PVE | 第二篇,GLM模型中如何计算PVE?

作者头像
邓飞
发布2021-12-27 17:00:02
1.4K0
发布2021-12-27 17:00:02
举报
文章被收录于专栏:育种数据分析之放飞自我

GWAS分析中SNP解释百分比PVE | 第二篇,GLM模型中如何计算PVE? #2021.12.22

1. 数据描述

协变量是PCA的前三个,数据具体如下:

「表型数据:」

代码语言:javascript
复制
1641 1641 153.973423
1642 1642 113.119301
1643 1643 77.094801
1644 1644 89.293866
1645 1645 94.511433
1646 1646 134.410909
1647 1647 121.246759
1648 1648 45.699443
1649 1649 67.786229
1650 1650 102.225648

「协变量数据」

代码语言:javascript
复制
1641 1641 0.0633225 0.0285328 -0.00734127
1642 1642 0.0684039 0.0212648 -0.000664363
1643 1643 0.0609595 0.0242615 -0.00206211
1644 1644 0.0636631 0.0241012 -0.00275062
1645 1645 0.0741456 0.0293644 0.00068775
1646 1646 0.114417 0.0443451 -0.0408687
1647 1647 0.111599 0.0400401 -0.0320522
1648 1648 0.100862 0.0344925 -0.0386237
1649 1649 0.107986 0.028411 -0.0312307
1650 1650 0.108836 0.0377951 -0.0314308

「基因型数据:」

代码语言:javascript
复制

FID IID PAT MAT SEX PHENOTYPE M4_A M6_T M9_T M11_A M17_A M19_A M22_A M24_A M27_A M31_
1641 1641 0 0 0 -9 0 1 2 1 0 0 1 0 0 0 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 1 1
1642 1642 0 0 0 -9 1 1 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0
1643 1643 0 0 0 -9 1 0 1 1 0 0 1 1 1 1 0 0 0 0 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 1 1 1 1
1644 1644 0 0 0 -9 1 0 1 1 0 0 1 1 1 1 0 0 0 0 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 1 1 1 1
1645 1645 0 0 0 -9 0 1 2 1 0 0 1 0 0 0 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 1 1
1646 1646 0 0 0 -9 1 0 0 0 0 0 0 2 2 2 1 0 0 0 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 1 1 0 0
1647 1647 0 0 0 -9 0 1 1 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 0 0 0
1648 1648 0 0 0 -9 1 0 0 0 0 0 0 1 1 2 1 0 0 0 0 0 1 1 0 0 0 0 0 0 1 1 1 1 1 1 1 0 0
1649 1649 0 0 0 -9 0 1 1 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0
1650 1650 0 0 0 -9 1 0 0 0 0 0 0 1 1 2 1 0 0 0 0 0 1 1 0 0 0 0 0 0 1 1 1 1 1 1 1 0 0
1651 1651 0 0 0 -9 1 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 1 1 0 0 0 0 0 0 1 1 1 1 1 1 1 0 0
1652 1652 0 0 0 -9 0 1 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 0 0 0
1653 1653 0 0 0 -9 1 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 1 1 0 0 0 0 0 0 1 1 1 1 1 1 1 0 0
1654 1654 0 0 0 -9 0 1 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 0 0 0
1655 1655 0 0 0 -9 2 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
1656 1656 0 0 0 -9 2 0 0 0 1 1 2 2 2 2 2 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1657 1657 0 0 0 -9 2 0 0 0 1 1 2 2 2 2 2 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1658 1658 0 0 0 -9 2 0 0 0 1 1 2 2 2 2 2 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

2. GAPIT软件分析GLM模型

GAPIT软件安装,见:如何安装GAPIT软件:https://zhuanlan.zhihu.com/p/268327005

视频版:http://mpvideo.qpic.cn/0b786uaaaaaau4ak6rfcqrpvb5odad2qaaaa.f10002.mp4?dis_k=9ee6c61bca58d4e0b4967a7189633630&dis_t=1640595519&vid=wxv_1592241064646672385&format_id=10002&support_redirect=0&mmversion=false

「分析代码如下:」

代码语言:javascript
复制

library(data.table)
raw = fread("plink.raw",header=T)
raw[1:10,1:10]

map = fread("recode.map",header=F)
head(map)

rr = raw[,!c(1,3:6)]
names(rr) = c("taxa",map$V2)

mm = map[,c(2,1,4)]
names(mm) = c("Name","Chromosome","Position")
head(mm)
rr[1:10,1:10]

fwrite(rr,"mdp_numeric.txt",sep="\t")
fwrite(mm,"mdp_SNP_information.txt",sep="\t")

#

# GWAS 分析
library(data.table)
source("http://zzlab.net/GAPIT/GAPIT.library.R")
source("http://zzlab.net/GAPIT/gapit_functions.txt")

myGd = fread("mdp_numeric.txt",header=T,data.table = F)
myGM = fread("mdp_SNP_information.txt",header = T,data.table=F)

myY = fread("dat_plink.txt",data.table = F)
head(myY)

covar = fread("cov_plink.txt",data.table = F)[,-1]
names(covar)[1] = "Taxa"
head(covar)

myGAPIT = GAPIT(Y = myY[,c(1,3)],GD = myGd, GM = myGM,
                # PCA.total=3,
                CV = covar,
                model="GLM")

「注意:」

  • 这里使用的是plink格式的表型和PCA结果,使用的是plink.raw文件为基因型数据
  • 将其转化为gapit软件需要的格式
  • 定义基因型和位置信息,定义表型,定义协变量,定义模型为GLM
  • 结果文件为:GAPIT.GLM.V3.GWAS.Results.csv

3. GLM模型结果文件

结果文件如下:包括:

  • SNP名称
  • 染色体
  • 物理位置
  • P值 # 重点关注
  • maf
  • Rsquare.of.Model.without.SNP # 重点关注
  • Rsquare.of.Model.with.SNP # 重点关注
代码语言:javascript
复制
SNP,Chromosome,Position ,P.value,maf,nobs,Rsquare.of.Model.without.SNP,Rsquare.of.Model.with.SNP,FDR_Adjusted_P-values,effect
M57157,12,44,0.0000000265778189052299,0.375,1000,0.0178325320026741,0.0488405467802565,0.000799726570858368,7.5184135943861
M57155,12,44,0.000000149990719134015,0.3275,1000,0.0178325320026741,0.0454335956044116,0.00115674059672426,7.24232622801506
M57156,12,44,0.000000149990719134015,0.3275,1000,0.0178325320026741,0.0454335956044116,0.00115674059672426,7.24232622801506
M98663,20,73,0.00000015377076726145,0.222,1000,0.0178325320026741,0.0453847644903734,0.00115674059672426,-7.81213756957444
M73233,15,64,0.000000512284630336812,0.2255,1000,0.0178325320026741,0.0430299031398552,0.00290006393781271,-7.52596405927775
M8452,2,69,0.000000578277953701437,0.435,1000,0.0178325320026741,0.0427934752525868,0.00290006393781271,-6.51962588787404
M45998,10,20,0.000000768626039613293,0.411,1000,0.0178325320026741,0.0422387923854602,0.00307979999092427,-6.69120649416338
M82957,17,57,0.00000112940850557263,0.1685,1000,0.0178325320026741,0.0414897712489706,0.00307979999092427,8.48568078650835
M82958,17,57,0.00000112940850557263,0.1685,1000,0.0178325320026741,0.0414897712489706,0.00307979999092427,8.48568078650835

这里,

  • P值 为SNP的P值,我们根据它判断显著性,并根据它进行QQ图和曼哈顿图的绘制
  • Rsquare.of.Model.without.SNP # 这个是单位点不包括次SNP的解释百分比(R方)
  • Rsquare.of.Model.with.SNP # 这个是单位点包括此SNP的解释百分比(R方)

「上面两者之差,即为该SNP的解释百分比(PVE)」

代码语言:javascript
复制
$$SNP的PVE = Rsquare.of.Model.with.SNP - Rsquare.of.Model.without.SNP$$

4. 我们将结果读入R语言计算

首先读入R语言:

代码语言:javascript
复制
r$> re = fread("GAPIT.GLM.V3.GWAS.Results.csv")

r$> head(re)
      SNP Chromosome Position      P.value    maf nobs Rsquare.of.Model.without.SNP Rsquare.of.Model.with.SNP FDR_Adjusted_P-values    effect
1: M57157         12       44 2.657782e-08 0.3750 1000                   0.01783253                0.04884055          0.0007997266  7.518414
2: M57155         12       44 1.499907e-07 0.3275 1000                   0.01783253                0.04543360          0.0011567406  7.242326
3: M57156         12       44 1.499907e-07 0.3275 1000                   0.01783253                0.04543360          0.0011567406  7.242326
4: M98663         20       73 1.537708e-07 0.2220 1000                   0.01783253                0.04538476          0.0011567406 -7.812138
5: M73233         15       64 5.122846e-07 0.2255 1000                   0.01783253                0.04302990          0.0029000639 -7.525964
6:  M8452          2       69 5.782780e-07 0.4350 1000                   0.01783253                0.04279348          0.0029000639 -6.519626

然后我们根据上面的公式,手动计算PVE值:

代码语言:javascript
复制
r$> re$PVE = re$Rsquare.of.Model.with.SNP - re$Rsquare.of.Model.without.SNP

r$> head(re)
      SNP Chromosome Position      P.value    maf nobs Rsquare.of.Model.without.SNP Rsquare.of.Model.with.SNP FDR_Adjusted_P-values    effect        PVE
1: M57157         12       44 2.657782e-08 0.3750 1000                   0.01783253                0.04884055          0.0007997266  7.518414 0.03100801
2: M57155         12       44 1.499907e-07 0.3275 1000                   0.01783253                0.04543360          0.0011567406  7.242326 0.02760106
3: M57156         12       44 1.499907e-07 0.3275 1000                   0.01783253                0.04543360          0.0011567406  7.242326 0.02760106
4: M98663         20       73 1.537708e-07 0.2220 1000                   0.01783253                0.04538476          0.0011567406 -7.812138 0.02755223
5: M73233         15       64 5.122846e-07 0.2255 1000                   0.01783253                0.04302990          0.0029000639 -7.525964 0.02519737
6:  M8452          2       69 5.782780e-07 0.4350 1000                   0.01783253                0.04279348          0.0029000639 -6.519626 0.02496094

根据PVE的大小,进行排序,降序:

代码语言:javascript
复制
r$> re %>% arrange(-PVE)
          SNP Chromosome Position      P.value    maf nobs Rsquare.of.Model.without.SNP Rsquare.of.Model.with.SNP FDR_Adjusted_P-values        effect          PVE
    1: M57157         12       44 2.657782e-08 0.3750 1000                   0.01783253                0.04884055          0.0007997266  7.5184135944 3.100801e-02
    2: M57155         12       44 1.499907e-07 0.3275 1000                   0.01783253                0.04543360          0.0011567406  7.2423262280 2.760106e-02
    3: M57156         12       44 1.499907e-07 0.3275 1000                   0.01783253                0.04543360          0.0011567406  7.2423262280 2.760106e-02
    4: M98663         20       73 1.537708e-07 0.2220 1000                   0.01783253                0.04538476          0.0011567406 -7.8121375696 2.755223e-02
    5: M73233         15       64 5.122846e-07 0.2255 1000                   0.01783253                0.04302990          0.0029000639 -7.5259640593 2.519737e-02
   ---
30086: M58296         12       66 9.996606e-01 0.2700 1000                   0.01783253                0.01783253          0.9997934933  0.0006040638 1.785355e-10
30087:   M751          1       15 9.998043e-01 0.2435 1000                   0.01783253                0.01783253          0.9998960441  0.0003736382 5.935820e-11
30088: M21736          5       35 9.998296e-01 0.3020 1000                   0.01783253                0.01783253          0.9998960441  0.0002955665 4.500680e-11
30089: M80677         17       12 9.999131e-01 0.3830 1000                   0.01783253                0.01783253          0.9999300138 -0.0001448323 1.170910e-11
30090: M57593         12       52 9.999300e-01 0.2735 1000                   0.01783253                0.01783253          0.9999300138 -0.0001300807 7.589301e-12

可以看到,结果就给出了PVE从大到小的排序结果。

这里,我们注意到,前三个SNP的PVE分别是:

  • 3.1%
  • 2.76%
  • 2.76%
  • 而且,这三个SNP都在12号染色体的44个位置,因为这三个位点离得很近,都达到显著水平,PVE也很大,那我们可以认为附件有一个基因,显著影响表型。

这里,我们对PVE进行求和:

代码语言:javascript
复制
r$> sum(re$PVE)
[1] 57.6114

可以看到,总的PVE之和为57.611,远远大于1,是因为有些位点高度LD,所以PVE之和就很大。相关问题在 GWAS分析中SNP解释百分比PVE | 第一篇,SNP解释百分比之和为何大于1?中有过介绍。

5. 用R语言如何计算?

简单来说,就是单位点的回归分析,计算R方。

这里,我们用同样的数据,在R中进行GLM的GWAS分析。

代码如下:

代码语言:javascript
复制

library(data.table)
geno = fread("plink.raw")[,!c(1,3:6)]
map = fread("recode.map")
m012 = geno
names(m012) = c("IID",map$V2)
phe = fread("phe.txt")
cov1= fread("tcov.txt")

dat = left_join(phe,cov1,by = c("V1"="V1")) %>% dplyr::select(IID =1, y =2, pc1=3,pc2=4,pc3=5)
dat1 = left_join(dat,m012,by="IID")
# dat1$M57157 = as.factor(dat1$M57157)
mod_1 = lm(y ~  1+pc1 + pc2 + pc3 + M57157,data=dat1);summary(mod_1)
summary(mod_1)$r.squared

mod_2 = lm(y ~  1+pc1 + pc2 + pc3 ,data=dat1);summary(mod_2)
summary(mod_2)$r.squared

这里,我们将PCA的前三个作为协变量加到回归分析汇总。

包含SNP的回归分析:

代码语言:javascript
复制
r$> mod_1 = lm(y ~  1+pc1 + pc2 + pc3 + M57157,data=dat1);summary(mod_1)

Call:
lm(formula = y ~ 1 + pc1 + pc2 + pc3 + M57157, data = dat1)

Residuals:
    Min      1Q  Median      3Q     Max
-85.094 -19.174  -0.443  18.218  79.181

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   96.408      1.320  73.048  < 2e-16 ***
pc1         -108.543     27.603  -3.932 9.00e-05 ***
pc2           28.740     28.981   0.992   0.3216
pc3           58.098     27.890   2.083   0.0375 *
M57157         7.518      1.320   5.695 1.62e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 27.6 on 995 degrees of freedom
Multiple R-squared:  0.04884,	Adjusted R-squared:  0.04502
F-statistic: 12.77 on 4 and 995 DF,  p-value: 3.838e-10

可以看到,R方为0.04884,也可以这样提取:

代码语言:javascript
复制
r$> summary(mod_1)$r.squared
[1] 0.04884055

不包含SNP的回归分析:

代码语言:javascript
复制
r$> mod_2 = lm(y ~  1+pc1 + pc2 + pc3 ,data=dat1);summary(mod_2)

Call:
lm(formula = y ~ 1 + pc1 + pc2 + pc3, data = dat1)

Residuals:
    Min      1Q  Median      3Q     Max
-89.495 -19.318  -0.099  18.649  85.448

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  102.0467     0.8864 115.129  < 2e-16 ***
pc1         -111.8205    28.0295  -3.989 7.11e-05 ***
pc2          -21.6604    28.0295  -0.773     0.44
pc3           35.1350    28.0295   1.254     0.21
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 28.03 on 996 degrees of freedom
Multiple R-squared:  0.01783,	Adjusted R-squared:  0.01487
F-statistic: 6.028 on 3 and 996 DF,  p-value: 0.0004546

这里的R方为:0.01783

也可以这样提取:

代码语言:javascript
复制
r$> x2 = summary(mod_2)$r.squared
    summary(mod_2)$r.squared
[1] 0.01783253

那么SNP:M57157的PVE为:3.1%

代码语言:javascript
复制
r$> x1 - x2
[1] 0.03100801

对比我们GAPIT的PVE结果:

结果完全一致。

这里,一般线性模型中,可以针对显著性的SNP,进行单位点回归分析,计算PVE。对于混合线性模型,也可以将显著性位点提取,进行R语言的手动计算,这个也是PVE计算的一种方法。

混合线性模型中,还有其它的计算方法,我们后面进行介绍,欢迎继续关注我。

本文参与 腾讯云自媒体同步曝光计划,分享自微信公众号。
原始发表:2021-12-22,如有侵权请联系 cloudcommunity@tencent.com 删除

本文分享自 育种数据分析之放飞自我 微信公众号,前往查看

如有侵权,请联系 cloudcommunity@tencent.com 删除。

本文参与 腾讯云自媒体同步曝光计划  ,欢迎热爱写作的你一起参与!

评论
登录后参与评论
0 条评论
热度
最新
推荐阅读
目录
  • GWAS分析中SNP解释百分比PVE | 第二篇,GLM模型中如何计算PVE? #2021.12.22
    • 1. 数据描述
      • 2. GAPIT软件分析GLM模型
        • 3. GLM模型结果文件
          • 4. 我们将结果读入R语言计算
            • 5. 用R语言如何计算?
            领券
            问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档