首页
学习
活动
专区
圈层
工具
发布
社区首页 >专栏 >比较微生物组中的差异分析方法

比较微生物组中的差异分析方法

作者头像
生信菜鸟团
发布2021-12-13 17:32:50
发布2021-12-13 17:32:50
8K0
举报
文章被收录于专栏:生信菜鸟团生信菜鸟团

在微生物组研究中我们常常需要根据某些感兴趣的表型来找到与其相关的特征(比如菌群、OTU、基因家族等等)。但微生物组学的数据结构导致了这必然是一项相当艰巨的任务,因为他们:

•高维特征集(通常超过 100 到 10,000 个特征);•高度稀疏(许多特征仅在少数样本中被发现);•特征间复杂的相关性结构;•计数的组成性(即,观察到的计数受文库大小的限制);•不同的文库大小;•过度离散的计数值,等等。

那么应该如何选择不同的差异分析方法呢?其实这个问题并没有答案,(如果有时间的话)我一般都是尝试一些对手头数据来说看似合理的模型,然后优先考虑 overlap 的差异特征集。虽然这并不完美,但至少会证明一些结果的鲁棒性,增加我们对结果的信心。

下面我将基于一个用 MetaPhlAn2 注释的公共宏基因组数据,使用五种不同算法进行差异分析。这些方法也可以应用于(也许更适用于)扩增子测序得到的 ASV 或 OTU。选择这些方法的标准如下

•在一项或多项模拟研究中表现较好;•可以校正协变量,和多重假设检验;•包含多种标准化和建模方法;•应用相对广泛;•封装成 R 包。

符合以上条件的有很多算法,但这里我挑选了以下五个模型:

•Limma-Voom[1]•DESeq2[2]•corncob[3]•MaAsLin 2[4]•ANCOM-BC[5]

我们将使用由 curatedMetagenomicData[6] 包(关于这个包的教程可以参见我之前的笔记)提供的公共数据[7] 来识别从印度南部与印度中北部人群收集的粪便样本中的差异菌群。

相关文章:D B Dhakan, A Maji, A K Sharma, R Saxena, J Pulikkan, T Grace, A Gomez, J Scaria, K R Amato, V K Sharma, The unique composition of Indian gut microbiome, gene catalogue, and associated fecal metabolome deciphered using multi-omics approaches, GigaScience, Volume 8, Issue 3, March 2019, giz004, https://doi.org/10.1093/gigascience/giz004

安装并加载所需的 R 包

代码语言:javascript
复制
# if (!requireNamespace("BiocManager", quietly = TRUE))
#   install.packages("BiocManager")
# 
# BiocManager::install("curatedMetagenomicData")
# BiocManager::install("phyloseq")
# BiocManager::install("edgeR")
# BiocManager::install("limma")
# BiocManager::install("DEFormats")
# BiocManager::install("DESeq2")
# BiocManager::install("apeglm")
# BiocManager::install("ANCOMBC")
# BiocManager::install("Maaslin2")
# 
# install.packages("corncob")
# install.packages("tidyverse")
# install.packages("ggVennDiagram")

library(curatedMetagenomicData)
library(tidyverse)
library(phyloseq)
library(edgeR)
library(limma)
library(DEFormats)
library(DESeq2)
library(apeglm)
library(corncob)
library(ANCOMBC)
library(Maaslin2)
library(ggVennDiagram)

下载公共宏基因组数据

代码语言:javascript
复制
dhakan_eset <- DhakanDB_2019.metaphlan_bugs_list.stool()
代码语言:javascript
复制
## snapshotDate(): 2020-04-27
代码语言:javascript
复制
## see ?curatedMetagenomicData and browseVignettes('curatedMetagenomicData') for documentation
代码语言:javascript
复制
## loading from cache
代码语言:javascript
复制
(ps <- ExpressionSet2phyloseq(dhakan_eset,
                              simplify = TRUE,
                              relab = FALSE,
                              phylogenetictree = FALSE))
代码语言:javascript
复制
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 768 taxa and 110 samples ]
## sample_data() Sample Data:       [ 110 samples by 20 sample variables ]
## tax_table()   Taxonomy Table:    [ 768 taxa by 8 taxonomic ranks ]
代码语言:javascript
复制
(ps <- subset_taxa(ps, !is.na(Species) & is.na(Strain)))
代码语言:javascript
复制
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 286 taxa and 110 samples ]
## sample_data() Sample Data:       [ 110 samples by 20 sample variables ]
## tax_table()   Taxonomy Table:    [ 286 taxa by 8 taxonomic ranks ]

删除物种注释中的 “s__” 占位符:

代码语言:javascript
复制
taxa_names(ps) <- gsub("s__", "", taxa_names(ps))                        
head(taxa_names(ps))
代码语言:javascript
复制
## [1] "Prevotella_copri"             "Eubacterium_rectale"         
## [3] "Butyrivibrio_crossotus"       "Prevotella_stercorea"        
## [5] "Faecalibacterium_prausnitzii" "Subdoligranulum_unclassified"

剔除仅在 <10% 样本中出现的菌种:

代码语言:javascript
复制
(ps <- filter_taxa(ps, function(x) sum(x > 0) > (0.1*length(x)), TRUE))   
代码语言:javascript
复制
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 109 taxa and 110 samples ]
## sample_data() Sample Data:       [ 110 samples by 20 sample variables ]
## tax_table()   Taxonomy Table:    [ 109 taxa by 8 taxonomic ranks ]

查看数据中包括的 metadata 信息:

代码语言:javascript
复制
sample_variables(ps)
代码语言:javascript
复制
##  [1] "subjectID"               "body_site"              
##  [3] "antibiotics_current_use" "study_condition"        
##  [5] "disease"                 "age"                    
##  [7] "infant_age"              "age_category"           
##  [9] "gender"                  "BMI"                    
## [11] "country"                 "location"               
## [13] "non_westernized"         "DNA_extraction_kit"     
## [15] "number_reads"            "number_bases"           
## [17] "minimum_read_length"     "median_read_length"     
## [19] "curator"                 "NCBI_accession"
代码语言:javascript
复制
sample_data(ps)$location <- factor(sample_data(ps)$location, levels = c("Bhopal", "Kerala"))
table(sample_data(ps)$location) 
代码语言:javascript
复制
## 
## Bhopal Kerala 
##     53     57

在过滤了低流行率的分类群后,我们现在得到了一个包含 109 个菌种的 phyloseq 对象。我一般倾向于根据总数和流行率过滤掉仅在 10% 到 50% 的样本中观察到的特征,以更好地满足模型假设,同时限制计算 power 时所付出的 FDR 惩罚。(这里总共 109 个菌种肯定是偏低的,但本文仅作示例)

Limma-Voom

常用于基因表达矩阵分析的 Limma 包也可用于菌群矩阵的差异分析。

代码语言:javascript
复制
dds <- phyloseq_to_deseq2(ps, ~ location)      #convert to DESeq2 and DGEList objects
代码语言:javascript
复制
## converting counts to integer mode
代码语言:javascript
复制
dge <- as.DGEList(dds)
# 计算 TMM 归一化因子
dge <- calcNormFactors(dge, method = "TMM")
head(dge$samples$norm.factors)
代码语言:javascript
复制
## [1] 0.4529756 1.6721557 0.1509053 0.4569702 0.4646246 0.3362776
代码语言:javascript
复制
# 建立模型矩阵
mm <- model.matrix(~ group, dge$samples)
head(mm)
代码语言:javascript
复制
##          (Intercept) groupKerala
## DHAK_HAK           1           0
## DHAK_HAJ           1           0
## DHAK_HAI           1           0
## DHAK_HAH           1           0
## DHAK_HAG           1           0
## DHAK_HAF           1           0
代码语言:javascript
复制
table(mm[, 2])
代码语言:javascript
复制
## 
##  0  1 
## 53 57
代码语言:javascript
复制
# 得到 voom 权重
y <- voom(dge, mm, plot = T)

voom 的过程: 1.将计数转换为 log2 CPM(counts per million reads),其中 “per million reads” 是根据我们之前计算的归一化因子定义的;2.线性模型拟合每个特征的 log2 CPM,并计算残差;3.基于平均表达量拟合平滑曲线(见上图中的红线);4.获得每个特征和样本的权重。

使用 lmFit 拟合线性模型:

代码语言:javascript
复制
fit <- lmFit(y, mm)                                   #fit lm with limma
# 经验贝叶斯统计量计算(参见 https://www.degruyter.com/doi/10.2202/1544-6115.1027)
fit <- eBayes(fit)
head(coef(fit))
代码语言:javascript
复制
##                              (Intercept) groupKerala
## Prevotella_copri              16.1383696 -4.99436530
## Eubacterium_rectale           11.9161328 -0.08294369
## Butyrivibrio_crossotus        -0.2110634 -0.08650380
## Prevotella_stercorea           6.8200020 -6.22106379
## Faecalibacterium_prausnitzii  13.6093282  1.00607350
## Subdoligranulum_unclassified  10.9786786  1.16771790

提取计算结果:

代码语言:javascript
复制
limma_res_df <- data.frame(topTable(fit, coef = "groupKerala", number = Inf))    #extract results

fdr_limma <- limma_res_df %>%
    dplyr::filter(adj.P.Val < 0.05) %>%
    rownames_to_column(var = "Species")

dim(fdr_limma)
代码语言:javascript
复制
## [1] 15  7
代码语言:javascript
复制
head(fdr_limma)
代码语言:javascript
复制
##                    Species     logFC    AveExpr         t      P.Value
## 1 Megasphaera_unclassified -8.974360  6.4247739 -5.575994 1.059729e-07
## 2    Bacteroides_coprocola -5.124762 -0.8291721 -4.484538 1.409614e-05
## 3     Prevotella_stercorea -6.221064  3.2149583 -3.867134 1.613340e-04
## 4    Lactobacillus_ruminis -5.711738  3.7967638 -3.833832 1.826516e-04
## 5       Ruminococcus_obeum  4.659230  2.7788048  3.737572 2.603217e-04
## 6 Mitsuokella_unclassified -5.237836  0.7101123 -3.732304 2.653693e-04
##      adj.P.Val         B
## 1 1.155105e-05 6.9755057
## 2 7.682395e-04 2.8793949
## 3 4.820876e-03 0.6994391
## 4 4.820876e-03 0.5749799
## 5 4.820876e-03 0.2933741
## 6 4.820876e-03 0.2931618
代码语言:javascript
复制
ggplot(fdr_limma, aes(x = Species, y = logFC, color = Species)) +
  geom_point(size = 4) +
  labs(y = "\nLog2 Fold-Change for ACVD vs. Controls", x = "") +
  theme(axis.text.x = element_text(color = "black", size = 12),
        axis.text.y = element_text(color = "black", size = 12),
        axis.title.y = element_text(size = 14),
        axis.title.x = element_text(size = 14),
        legend.text = element_text(size = 12),
        legend.title = element_text(size = 12),
        legend.position = "none") +
  coord_flip() +
  geom_hline(yintercept = 0, linetype="dotted")

可以看到根据校正后的 P 值 < 0.05,limma-voom 找到了 15 个差异菌。

DESeq2

DESeq2 将对原始计数进行建模,使用标准化因子(scale factor)来解释库深度的差异。然后估计每条 OTU 的离散度,并缩小这些估计值以生成更准确的离散度估计。最后,DESeq2 拟合负二项分布的模型,并使用 Wald 检验或似然比检验进行假设检验。

代码语言:javascript
复制
dds <- DESeq(dds, test = "Wald", fitType = "local", sfType = "poscounts")
代码语言:javascript
复制
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
-- replacing outliers and refitting for 65 genes
-- DESeq argument 'minReplicatesForReplace' = 7 
-- original counts are preserved in counts(dds)
estimating dispersions
fitting model and testing
代码语言:javascript
复制
plotDispEsts(dds)

使用 apeglm 对计数值的离散度进行校正:

代码语言:javascript
复制
res <- lfcShrink(dds, coef=2, type="apeglm") 
代码语言:javascript
复制
## using 'apeglm' for LFC shrinkage. If used in published research, please cite:
##     Zhu, A., Ibrahim, J.G., Love, M.I. (2018) Heavy-tailed prior distributions for
##     sequence count data: removing the noise and preserving large differences.
##     Bioinformatics. https://doi.org/10.1093/bioinformatics/bty895
代码语言:javascript
复制
plotMA(dds)
代码语言:javascript
复制
deseq_res_df <- data.frame(res) %>%
  rownames_to_column(var = "Species") %>%
  dplyr::arrange(padj)                                 

fdr_deseq <- deseq_res_df %>%
    dplyr::filter(padj < 0.05)

dim(fdr_deseq)
代码语言:javascript
复制
## [1] 5 6
代码语言:javascript
复制
head(fdr_deseq)
代码语言:javascript
复制
##                       Species   baseMean log2FoldChange     lfcSE       pvalue
## 1    Bacteroides_massiliensis  2326.9291      0.1667126 0.9692252 9.311151e-32
## 2       Clostridium_symbiosum   346.6204      0.5656621 1.2730261 4.669930e-32
## 3       Clostridium_hathewayi   314.3198      0.4874255 1.2012990 1.764584e-28
## 4 Parabacteroides_goldsteinii   256.9873      0.2564174 1.0225083 5.847951e-25
## 5       Bacteroides_coprocola 19141.8394     -0.7224703 1.3784468 2.200809e-03
##           padj
## 1 4.934910e-30
## 2 4.934910e-30
## 3 6.234864e-27
## 4 1.549707e-23
## 5 4.665715e-02
代码语言:javascript
复制
ggplot(fdr_deseq, aes(x = Species, y = log2FoldChange, color = Species)) +
    geom_point(size = 4) +
    labs(y = "\nLog2 Fold-Change for ACVD vs. Controls", x = "") +
    theme(axis.text.x = element_text(color = "black", size = 12),
          axis.text.y = element_text(color = "black", size = 12),
          axis.title.y = element_text(size = 14),
          axis.title.x = element_text(size = 14),
          legend.text = element_text(size = 12),
          legend.title = element_text(size = 12),
          legend.position = "none") +
    coord_flip() +
    geom_hline(yintercept = 0, linetype="dotted")

根据校正后的 P 值 < 0.05,DESeq2 找到了 5 个差异菌。

Corncob

Corncob 则是基于相对丰度进行建模并检验协变量对相对丰度的影响。

•GitHub 地址:https://github.com/bryandmartin/corncob/

代码语言:javascript
复制
corn_da <- differentialTest(formula = ~ location,
                            phi.formula = ~ 1,
                            formula_null = ~ 1,
                            phi.formula_null = ~ 1,
                            data = ps,
                            test = "Wald", boot = FALSE,
                            fdr_cutoff = 0.05)

fdr_corncob <- corn_da$significant_taxa
dim(data.frame(fdr_corncob))
代码语言:javascript
复制
## [1] 28  1
代码语言:javascript
复制
head(sort(corn_da$p_fdr))                    
代码语言:javascript
复制
##     Megasphaera_unclassified           Ruminococcus_obeum 
##                 1.576463e-06                 1.041461e-04 
##         Prevotella_stercorea Bacteroides_thetaiotaomicron 
##                 3.697782e-04                 3.965007e-04 
##        Bacteroides_uniformis        Lactobacillus_ruminis 
##                 4.262191e-04                 5.608007e-04

Corncob 找到了 28 个差异特征。

MaAsLin 2

MaAsLin2 是 MaAsLin(Microbiome Multivariable Association with Linear Models) 的升级版,主要基于线性模型进行多元关联分析,分析表型和微生物特征之间的相关性。

•首页:https://huttenhower.sph.harvard.edu/maaslin/•GitHub 地址:https://github.com/biobakery/Maaslin2

代码语言:javascript
复制
mas_1 <- Maaslin2(
  input_data = data.frame(otu_table(ps)),
  input_metadata = data.frame(sample_data(ps)),
  output = "./Maaslin2_default_output",
  min_abundance = 0.0,
  min_prevalence = 0.0,
  normalization = "TSS",
  transform = "LOG",
  analysis_method = "LM",
  max_significance = 0.05,
  fixed_effects = "location",
  correction = "BH",
  standardize = FALSE,
  cores = 1)
代码语言:javascript
复制
mas_res_df <- mas_1$results

fdr_mas <- mas_res_df %>%
    dplyr::filter(qval < 0.05)

dim(fdr_mas)
代码语言:javascript
复制
## [1] 27 10
代码语言:javascript
复制
head(fdr_mas)
代码语言:javascript
复制
##                                       feature metadata  value       coef
## locationKerala32     Megasphaera_unclassified location Kerala -0.7713398
## locationKerala29           Ruminococcus_obeum location Kerala  0.6891505
## locationKerala14          Ruminococcus_bromii location Kerala  0.8985006
## locationKerala28 Bacteroides_thetaiotaomicron location Kerala  0.8813854
## locationKerala70     Streptococcus_salivarius location Kerala  0.5409692
## locationKerala3          Prevotella_stercorea location Kerala -1.0280049
##                     stderr         pval           name        qval   N
## locationKerala32 0.1583627 3.828477e-06 locationKerala 0.000417304 110
## locationKerala29 0.1577677 2.887477e-05 locationKerala 0.001573675 110
## locationKerala14 0.2122802 4.865394e-05 locationKerala 0.001767760 110
## locationKerala28 0.2271755 1.801616e-04 locationKerala 0.004347755 110
## locationKerala70 0.1404570 1.994383e-04 locationKerala 0.004347755 110
## locationKerala3  0.2774162 3.343767e-04 locationKerala 0.005206723 110
##                  N.not.zero
## locationKerala32          0
## locationKerala29          0
## locationKerala14          0
## locationKerala28          0
## locationKerala70          0
## locationKerala3           0

MaAsLin 2 找到了 27 个差异菌种。MaAsLin 2[8] 支持多种归一化方法和模型,我们可以用它来快速拟合不同的模型,看看这些模型对结果的影响。

ANCOM-BC

ANCOM-BC 引入了一种包含偏差校正的微生物组组成分析方法,该方法可以估计未知的抽样比例,并校正由样品之间的差异引起的偏差,绝对丰度数据使用线性回归框架建模。

•GitHub 地址:https://github.com/FrederickHuangLin/ANCOM-BC-Code-Archive

代码语言:javascript
复制
ancom_da <- ancombc(phyloseq = ps, formula = "location", 
              p_adj_method = "fdr", zero_cut = 0.90, lib_cut = 1000, 
              group = "location", struc_zero = TRUE, neg_lb = TRUE, tol = 1e-5, 
              max_iter = 100, conserve = TRUE, alpha = 0.05, global = FALSE)

ancom_res_df <- data.frame(
  Species = row.names(ancom_da$res$beta),
  beta = unlist(ancom_da$res$beta),
  se = unlist(ancom_da$res$se),
  W = unlist(ancom_da$res$W),
  p_val = unlist(ancom_da$res$p_val),
  q_val = unlist(ancom_da$res$q_val),
  diff_abn = unlist(ancom_da$res$diff_abn))

fdr_ancom <- ancom_res_df %>%
  dplyr::filter(q_val < 0.05)

dim(fdr_ancom)
代码语言:javascript
复制
## [1] 22  7
代码语言:javascript
复制
head(fdr_ancom)
代码语言:javascript
复制
##                                       Species      beta        se         W
## locationKerala4          Prevotella_stercorea -4.414224 1.1839905 -3.728259
## locationKerala9      Mitsuokella_unclassified -3.467118 1.0176606 -3.406950
## locationKerala15          Ruminococcus_bromii  3.682311 1.0497644  3.507750
## locationKerala18        Lactobacillus_ruminis -3.608518 1.0441049 -3.456087
## locationKerala22    Catenibacterium_mitsuokai -3.055131 0.9621105 -3.175447
## locationKerala29 Bacteroides_thetaiotaomicron  3.143050 0.8682526  3.619973
##                         p_val       q_val locationKerala
## locationKerala4  0.0001928069 0.006229387           TRUE
## locationKerala9  0.0006569326 0.007160566           TRUE
## locationKerala15 0.0004519128 0.007029239           TRUE
## locationKerala18 0.0005480782 0.007029239           TRUE
## locationKerala22 0.0014960575 0.013589189           TRUE
## locationKerala29 0.0002946343 0.006423028           TRUE

ANCOM-BC 找到了 22 个差异物种。

不同方法结果之间的 overlap

代码语言:javascript
复制
x <- list(limma = fdr_limma$Species,
          deseq2 = fdr_deseq$Species,
          corncob = fdr_corncob,
          maaslin2 = fdr_mas$feature,
          ancom = fdr_ancom$Species)
          
overlap_data <- process_region_data(Venn(x))
overlap_data[overlap_data$id == 12345,]$item
代码语言:javascript
复制
## [[1]]
## [1] "Bacteroides_coprocola"
代码语言:javascript
复制
ggVennDiagram(x,label_alpha=0) +
  scale_fill_gradient(low="gray100",high = "gray95",guide="none")

在这五种方法中,只有一种菌 Bacteroides coprocola 在所有结果中都显著, FDR p 值 < 0.05。除了考虑到丰度差异外,我们还可以进一步考虑效应的大小(即倍数变化或系数的大小),看看这些被多种方法同时证实的结果是否合理,同时可进一步尝试探究不同模型方法之间的结果差异是否有明确的原因(例如,数据是否过度稀疏等等)。比如,在图中我们可以看到有 11 个菌被除 DESeq2 外的其余四种方法证实,这些菌或许就是下一步需要探究的方向。

Reference

•https://www.nicholas-ollberding.com/post/identifying-differentially-abundant-features-in-microbiome-data/

引用链接

[1] Limma-Voom: https://genomebiology.biomedcentral.com/articles/10.1186/gb-2014-15-2-r29 [2] DESeq2: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-014-0550-8 [3] corncob: https://projecteuclid.org/journals/annals-of-applied-statistics/volume-14/issue-1/Modeling-microbial-abundances-and-dysbiosis-with-beta-binomial-regression/10.1214/19-AOAS1283.short [4] MaAsLin 2: https://www.biorxiv.org/content/10.1101/2021.01.20.427420v1 [5] ANCOM-BC: https://www.nature.com/articles/s41467-020-17041-7 [6] curatedMetagenomicData: https://bioconductor.org/packages/release/data/experiment/html/curatedMetagenomicData.html [7] 数据: https://bioconductor.org/packages/release/data/experiment/html/curatedMetagenomicData.html [8] MaAsLin 2: https://github.com/biobakery/Maaslin2

本文参与 腾讯云自媒体同步曝光计划,分享自微信公众号。
原始发表:2021-12-12,如有侵权请联系 cloudcommunity@tencent.com 删除

本文分享自 生信菜鸟团 微信公众号,前往查看

如有侵权,请联系 cloudcommunity@tencent.com 删除。

本文参与 腾讯云自媒体同步曝光计划  ,欢迎热爱写作的你一起参与!

评论
登录后参与评论
0 条评论
热度
最新
推荐阅读
目录
  • 安装并加载所需的 R 包
  • 下载公共宏基因组数据
  • Limma-Voom
  • DESeq2
  • Corncob
  • MaAsLin 2
  • ANCOM-BC
  • 不同方法结果之间的 overlap
    • 引用链接
领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档