在微生物组研究中我们常常需要根据某些感兴趣的表型来找到与其相关的特征(比如菌群、OTU、基因家族等等)。但微生物组学的数据结构导致了这必然是一项相当艰巨的任务,因为他们:
•高维特征集(通常超过 100 到 10,000 个特征);•高度稀疏(许多特征仅在少数样本中被发现);•特征间复杂的相关性结构;•计数的组成性(即,观察到的计数受文库大小的限制);•不同的文库大小;•过度离散的计数值,等等。
那么应该如何选择不同的差异分析方法呢?其实这个问题并没有答案,(如果有时间的话)我一般都是尝试一些对手头数据来说看似合理的模型,然后优先考虑 overlap 的差异特征集。虽然这并不完美,但至少会证明一些结果的鲁棒性,增加我们对结果的信心。
下面我将基于一个用 MetaPhlAn2 注释的公共宏基因组数据,使用五种不同算法进行差异分析。这些方法也可以应用于(也许更适用于)扩增子测序得到的 ASV 或 OTU。选择这些方法的标准如下:
•在一项或多项模拟研究中表现较好;•可以校正协变量,和多重假设检验;•包含多种标准化和建模方法;•应用相对广泛;•封装成 R 包。
符合以上条件的有很多算法,但这里我挑选了以下五个模型:
•Limma-Voom[1]•DESeq2[2]•corncob[3]•MaAsLin 2[4]•ANCOM-BC[5]
我们将使用由 curatedMetagenomicData[6] 包(关于这个包的教程可以参见我之前的笔记)提供的公共数据[7] 来识别从印度南部与印度中北部人群收集的粪便样本中的差异菌群。
相关文章:D B Dhakan, A Maji, A K Sharma, R Saxena, J Pulikkan, T Grace, A Gomez, J Scaria, K R Amato, V K Sharma, The unique composition of Indian gut microbiome, gene catalogue, and associated fecal metabolome deciphered using multi-omics approaches, GigaScience, Volume 8, Issue 3, March 2019, giz004, https://doi.org/10.1093/gigascience/giz004
# if (!requireNamespace("BiocManager", quietly = TRUE))
# install.packages("BiocManager")
#
# BiocManager::install("curatedMetagenomicData")
# BiocManager::install("phyloseq")
# BiocManager::install("edgeR")
# BiocManager::install("limma")
# BiocManager::install("DEFormats")
# BiocManager::install("DESeq2")
# BiocManager::install("apeglm")
# BiocManager::install("ANCOMBC")
# BiocManager::install("Maaslin2")
#
# install.packages("corncob")
# install.packages("tidyverse")
# install.packages("ggVennDiagram")
library(curatedMetagenomicData)
library(tidyverse)
library(phyloseq)
library(edgeR)
library(limma)
library(DEFormats)
library(DESeq2)
library(apeglm)
library(corncob)
library(ANCOMBC)
library(Maaslin2)
library(ggVennDiagram)dhakan_eset <- DhakanDB_2019.metaphlan_bugs_list.stool()## snapshotDate(): 2020-04-27## see ?curatedMetagenomicData and browseVignettes('curatedMetagenomicData') for documentation## loading from cache(ps <- ExpressionSet2phyloseq(dhakan_eset,
simplify = TRUE,
relab = FALSE,
phylogenetictree = FALSE))## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 768 taxa and 110 samples ]
## sample_data() Sample Data: [ 110 samples by 20 sample variables ]
## tax_table() Taxonomy Table: [ 768 taxa by 8 taxonomic ranks ](ps <- subset_taxa(ps, !is.na(Species) & is.na(Strain)))## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 286 taxa and 110 samples ]
## sample_data() Sample Data: [ 110 samples by 20 sample variables ]
## tax_table() Taxonomy Table: [ 286 taxa by 8 taxonomic ranks ]删除物种注释中的 “s__” 占位符:
taxa_names(ps) <- gsub("s__", "", taxa_names(ps))
head(taxa_names(ps))## [1] "Prevotella_copri" "Eubacterium_rectale"
## [3] "Butyrivibrio_crossotus" "Prevotella_stercorea"
## [5] "Faecalibacterium_prausnitzii" "Subdoligranulum_unclassified"剔除仅在 <10% 样本中出现的菌种:
(ps <- filter_taxa(ps, function(x) sum(x > 0) > (0.1*length(x)), TRUE)) ## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 109 taxa and 110 samples ]
## sample_data() Sample Data: [ 110 samples by 20 sample variables ]
## tax_table() Taxonomy Table: [ 109 taxa by 8 taxonomic ranks ]查看数据中包括的 metadata 信息:
sample_variables(ps)## [1] "subjectID" "body_site"
## [3] "antibiotics_current_use" "study_condition"
## [5] "disease" "age"
## [7] "infant_age" "age_category"
## [9] "gender" "BMI"
## [11] "country" "location"
## [13] "non_westernized" "DNA_extraction_kit"
## [15] "number_reads" "number_bases"
## [17] "minimum_read_length" "median_read_length"
## [19] "curator" "NCBI_accession"sample_data(ps)$location <- factor(sample_data(ps)$location, levels = c("Bhopal", "Kerala"))
table(sample_data(ps)$location) ##
## Bhopal Kerala
## 53 57在过滤了低流行率的分类群后,我们现在得到了一个包含 109 个菌种的 phyloseq 对象。我一般倾向于根据总数和流行率过滤掉仅在 10% 到 50% 的样本中观察到的特征,以更好地满足模型假设,同时限制计算 power 时所付出的 FDR 惩罚。(这里总共 109 个菌种肯定是偏低的,但本文仅作示例)
常用于基因表达矩阵分析的 Limma 包也可用于菌群矩阵的差异分析。
dds <- phyloseq_to_deseq2(ps, ~ location) #convert to DESeq2 and DGEList objects## converting counts to integer modedge <- as.DGEList(dds)
# 计算 TMM 归一化因子
dge <- calcNormFactors(dge, method = "TMM")
head(dge$samples$norm.factors)## [1] 0.4529756 1.6721557 0.1509053 0.4569702 0.4646246 0.3362776# 建立模型矩阵
mm <- model.matrix(~ group, dge$samples)
head(mm)## (Intercept) groupKerala
## DHAK_HAK 1 0
## DHAK_HAJ 1 0
## DHAK_HAI 1 0
## DHAK_HAH 1 0
## DHAK_HAG 1 0
## DHAK_HAF 1 0table(mm[, 2])##
## 0 1
## 53 57# 得到 voom 权重
y <- voom(dge, mm, plot = T)
voom 的过程: 1.将计数转换为 log2 CPM(counts per million reads),其中 “per million reads” 是根据我们之前计算的归一化因子定义的;2.线性模型拟合每个特征的 log2 CPM,并计算残差;3.基于平均表达量拟合平滑曲线(见上图中的红线);4.获得每个特征和样本的权重。
使用 lmFit 拟合线性模型:
fit <- lmFit(y, mm) #fit lm with limma
# 经验贝叶斯统计量计算(参见 https://www.degruyter.com/doi/10.2202/1544-6115.1027)
fit <- eBayes(fit)
head(coef(fit))## (Intercept) groupKerala
## Prevotella_copri 16.1383696 -4.99436530
## Eubacterium_rectale 11.9161328 -0.08294369
## Butyrivibrio_crossotus -0.2110634 -0.08650380
## Prevotella_stercorea 6.8200020 -6.22106379
## Faecalibacterium_prausnitzii 13.6093282 1.00607350
## Subdoligranulum_unclassified 10.9786786 1.16771790提取计算结果:
limma_res_df <- data.frame(topTable(fit, coef = "groupKerala", number = Inf)) #extract results
fdr_limma <- limma_res_df %>%
dplyr::filter(adj.P.Val < 0.05) %>%
rownames_to_column(var = "Species")
dim(fdr_limma)## [1] 15 7head(fdr_limma)## Species logFC AveExpr t P.Value
## 1 Megasphaera_unclassified -8.974360 6.4247739 -5.575994 1.059729e-07
## 2 Bacteroides_coprocola -5.124762 -0.8291721 -4.484538 1.409614e-05
## 3 Prevotella_stercorea -6.221064 3.2149583 -3.867134 1.613340e-04
## 4 Lactobacillus_ruminis -5.711738 3.7967638 -3.833832 1.826516e-04
## 5 Ruminococcus_obeum 4.659230 2.7788048 3.737572 2.603217e-04
## 6 Mitsuokella_unclassified -5.237836 0.7101123 -3.732304 2.653693e-04
## adj.P.Val B
## 1 1.155105e-05 6.9755057
## 2 7.682395e-04 2.8793949
## 3 4.820876e-03 0.6994391
## 4 4.820876e-03 0.5749799
## 5 4.820876e-03 0.2933741
## 6 4.820876e-03 0.2931618ggplot(fdr_limma, aes(x = Species, y = logFC, color = Species)) +
geom_point(size = 4) +
labs(y = "\nLog2 Fold-Change for ACVD vs. Controls", x = "") +
theme(axis.text.x = element_text(color = "black", size = 12),
axis.text.y = element_text(color = "black", size = 12),
axis.title.y = element_text(size = 14),
axis.title.x = element_text(size = 14),
legend.text = element_text(size = 12),
legend.title = element_text(size = 12),
legend.position = "none") +
coord_flip() +
geom_hline(yintercept = 0, linetype="dotted")
可以看到根据校正后的 P 值 < 0.05,limma-voom 找到了 15 个差异菌。
DESeq2 将对原始计数进行建模,使用标准化因子(scale factor)来解释库深度的差异。然后估计每条 OTU 的离散度,并缩小这些估计值以生成更准确的离散度估计。最后,DESeq2 拟合负二项分布的模型,并使用 Wald 检验或似然比检验进行假设检验。
dds <- DESeq(dds, test = "Wald", fitType = "local", sfType = "poscounts")estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
-- replacing outliers and refitting for 65 genes
-- DESeq argument 'minReplicatesForReplace' = 7
-- original counts are preserved in counts(dds)
estimating dispersions
fitting model and testingplotDispEsts(dds)
使用 apeglm 对计数值的离散度进行校正:
res <- lfcShrink(dds, coef=2, type="apeglm") ## using 'apeglm' for LFC shrinkage. If used in published research, please cite:
## Zhu, A., Ibrahim, J.G., Love, M.I. (2018) Heavy-tailed prior distributions for
## sequence count data: removing the noise and preserving large differences.
## Bioinformatics. https://doi.org/10.1093/bioinformatics/bty895plotMA(dds)
deseq_res_df <- data.frame(res) %>%
rownames_to_column(var = "Species") %>%
dplyr::arrange(padj)
fdr_deseq <- deseq_res_df %>%
dplyr::filter(padj < 0.05)
dim(fdr_deseq)## [1] 5 6head(fdr_deseq)## Species baseMean log2FoldChange lfcSE pvalue
## 1 Bacteroides_massiliensis 2326.9291 0.1667126 0.9692252 9.311151e-32
## 2 Clostridium_symbiosum 346.6204 0.5656621 1.2730261 4.669930e-32
## 3 Clostridium_hathewayi 314.3198 0.4874255 1.2012990 1.764584e-28
## 4 Parabacteroides_goldsteinii 256.9873 0.2564174 1.0225083 5.847951e-25
## 5 Bacteroides_coprocola 19141.8394 -0.7224703 1.3784468 2.200809e-03
## padj
## 1 4.934910e-30
## 2 4.934910e-30
## 3 6.234864e-27
## 4 1.549707e-23
## 5 4.665715e-02ggplot(fdr_deseq, aes(x = Species, y = log2FoldChange, color = Species)) +
geom_point(size = 4) +
labs(y = "\nLog2 Fold-Change for ACVD vs. Controls", x = "") +
theme(axis.text.x = element_text(color = "black", size = 12),
axis.text.y = element_text(color = "black", size = 12),
axis.title.y = element_text(size = 14),
axis.title.x = element_text(size = 14),
legend.text = element_text(size = 12),
legend.title = element_text(size = 12),
legend.position = "none") +
coord_flip() +
geom_hline(yintercept = 0, linetype="dotted")
根据校正后的 P 值 < 0.05,DESeq2 找到了 5 个差异菌。
Corncob 则是基于相对丰度进行建模并检验协变量对相对丰度的影响。
•GitHub 地址:https://github.com/bryandmartin/corncob/
corn_da <- differentialTest(formula = ~ location,
phi.formula = ~ 1,
formula_null = ~ 1,
phi.formula_null = ~ 1,
data = ps,
test = "Wald", boot = FALSE,
fdr_cutoff = 0.05)
fdr_corncob <- corn_da$significant_taxa
dim(data.frame(fdr_corncob))## [1] 28 1head(sort(corn_da$p_fdr)) ## Megasphaera_unclassified Ruminococcus_obeum
## 1.576463e-06 1.041461e-04
## Prevotella_stercorea Bacteroides_thetaiotaomicron
## 3.697782e-04 3.965007e-04
## Bacteroides_uniformis Lactobacillus_ruminis
## 4.262191e-04 5.608007e-04Corncob 找到了 28 个差异特征。
MaAsLin2 是 MaAsLin(Microbiome Multivariable Association with Linear Models) 的升级版,主要基于线性模型进行多元关联分析,分析表型和微生物特征之间的相关性。
•首页:https://huttenhower.sph.harvard.edu/maaslin/•GitHub 地址:https://github.com/biobakery/Maaslin2
mas_1 <- Maaslin2(
input_data = data.frame(otu_table(ps)),
input_metadata = data.frame(sample_data(ps)),
output = "./Maaslin2_default_output",
min_abundance = 0.0,
min_prevalence = 0.0,
normalization = "TSS",
transform = "LOG",
analysis_method = "LM",
max_significance = 0.05,
fixed_effects = "location",
correction = "BH",
standardize = FALSE,
cores = 1)mas_res_df <- mas_1$results
fdr_mas <- mas_res_df %>%
dplyr::filter(qval < 0.05)
dim(fdr_mas)## [1] 27 10head(fdr_mas)## feature metadata value coef
## locationKerala32 Megasphaera_unclassified location Kerala -0.7713398
## locationKerala29 Ruminococcus_obeum location Kerala 0.6891505
## locationKerala14 Ruminococcus_bromii location Kerala 0.8985006
## locationKerala28 Bacteroides_thetaiotaomicron location Kerala 0.8813854
## locationKerala70 Streptococcus_salivarius location Kerala 0.5409692
## locationKerala3 Prevotella_stercorea location Kerala -1.0280049
## stderr pval name qval N
## locationKerala32 0.1583627 3.828477e-06 locationKerala 0.000417304 110
## locationKerala29 0.1577677 2.887477e-05 locationKerala 0.001573675 110
## locationKerala14 0.2122802 4.865394e-05 locationKerala 0.001767760 110
## locationKerala28 0.2271755 1.801616e-04 locationKerala 0.004347755 110
## locationKerala70 0.1404570 1.994383e-04 locationKerala 0.004347755 110
## locationKerala3 0.2774162 3.343767e-04 locationKerala 0.005206723 110
## N.not.zero
## locationKerala32 0
## locationKerala29 0
## locationKerala14 0
## locationKerala28 0
## locationKerala70 0
## locationKerala3 0MaAsLin 2 找到了 27 个差异菌种。MaAsLin 2[8] 支持多种归一化方法和模型,我们可以用它来快速拟合不同的模型,看看这些模型对结果的影响。
ANCOM-BC 引入了一种包含偏差校正的微生物组组成分析方法,该方法可以估计未知的抽样比例,并校正由样品之间的差异引起的偏差,绝对丰度数据使用线性回归框架建模。
•GitHub 地址:https://github.com/FrederickHuangLin/ANCOM-BC-Code-Archive
ancom_da <- ancombc(phyloseq = ps, formula = "location",
p_adj_method = "fdr", zero_cut = 0.90, lib_cut = 1000,
group = "location", struc_zero = TRUE, neg_lb = TRUE, tol = 1e-5,
max_iter = 100, conserve = TRUE, alpha = 0.05, global = FALSE)
ancom_res_df <- data.frame(
Species = row.names(ancom_da$res$beta),
beta = unlist(ancom_da$res$beta),
se = unlist(ancom_da$res$se),
W = unlist(ancom_da$res$W),
p_val = unlist(ancom_da$res$p_val),
q_val = unlist(ancom_da$res$q_val),
diff_abn = unlist(ancom_da$res$diff_abn))
fdr_ancom <- ancom_res_df %>%
dplyr::filter(q_val < 0.05)
dim(fdr_ancom)## [1] 22 7head(fdr_ancom)## Species beta se W
## locationKerala4 Prevotella_stercorea -4.414224 1.1839905 -3.728259
## locationKerala9 Mitsuokella_unclassified -3.467118 1.0176606 -3.406950
## locationKerala15 Ruminococcus_bromii 3.682311 1.0497644 3.507750
## locationKerala18 Lactobacillus_ruminis -3.608518 1.0441049 -3.456087
## locationKerala22 Catenibacterium_mitsuokai -3.055131 0.9621105 -3.175447
## locationKerala29 Bacteroides_thetaiotaomicron 3.143050 0.8682526 3.619973
## p_val q_val locationKerala
## locationKerala4 0.0001928069 0.006229387 TRUE
## locationKerala9 0.0006569326 0.007160566 TRUE
## locationKerala15 0.0004519128 0.007029239 TRUE
## locationKerala18 0.0005480782 0.007029239 TRUE
## locationKerala22 0.0014960575 0.013589189 TRUE
## locationKerala29 0.0002946343 0.006423028 TRUEANCOM-BC 找到了 22 个差异物种。
x <- list(limma = fdr_limma$Species,
deseq2 = fdr_deseq$Species,
corncob = fdr_corncob,
maaslin2 = fdr_mas$feature,
ancom = fdr_ancom$Species)
overlap_data <- process_region_data(Venn(x))
overlap_data[overlap_data$id == 12345,]$item## [[1]]
## [1] "Bacteroides_coprocola"ggVennDiagram(x,label_alpha=0) +
scale_fill_gradient(low="gray100",high = "gray95",guide="none")
在这五种方法中,只有一种菌 Bacteroides coprocola 在所有结果中都显著, FDR p 值 < 0.05。除了考虑到丰度差异外,我们还可以进一步考虑效应的大小(即倍数变化或系数的大小),看看这些被多种方法同时证实的结果是否合理,同时可进一步尝试探究不同模型方法之间的结果差异是否有明确的原因(例如,数据是否过度稀疏等等)。比如,在图中我们可以看到有 11 个菌被除 DESeq2 外的其余四种方法证实,这些菌或许就是下一步需要探究的方向。
Reference
•https://www.nicholas-ollberding.com/post/identifying-differentially-abundant-features-in-microbiome-data/
[1] Limma-Voom: https://genomebiology.biomedcentral.com/articles/10.1186/gb-2014-15-2-r29
[2] DESeq2: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-014-0550-8
[3] corncob: https://projecteuclid.org/journals/annals-of-applied-statistics/volume-14/issue-1/Modeling-microbial-abundances-and-dysbiosis-with-beta-binomial-regression/10.1214/19-AOAS1283.short
[4] MaAsLin 2: https://www.biorxiv.org/content/10.1101/2021.01.20.427420v1
[5] ANCOM-BC: https://www.nature.com/articles/s41467-020-17041-7
[6] curatedMetagenomicData: https://bioconductor.org/packages/release/data/experiment/html/curatedMetagenomicData.html
[7] 数据: https://bioconductor.org/packages/release/data/experiment/html/curatedMetagenomicData.html
[8] MaAsLin 2: https://github.com/biobakery/Maaslin2