mulea 超越了传统工具,通过整合广泛的本体论,包括基因本体论、通路、调控元件、基因组位置和蛋白质域,涵盖27个模型生物,覆盖来自16个数据库的22种本体论类型和各种标识符,结果在 ELTEbioinformatics/GMT_files_for_mulea GitHub 存储库和通过 muleaData ExperimentData Bioconductor 包中有879个文件可用。
网址:https://github.com/ELTEbioinformatics/mulea
# 设置镜像:
options(BioC_mirror="https://mirrors.westlake.edu.cn/bioconductor")
options("repos"=c(CRAN="https://mirrors.westlake.edu.cn/CRAN/"))
# 安装依赖:
# Installing the fgsea package with the BiocManager
BiocManager::install("fgsea")
# Installing the devtools package if needed
if (!require("devtools", quietly = TRUE))
install.packages("devtools")
# Installing the mulea package from GitHub
devtools::install_github("https://github.com/ELTEbioinformatics/mulea")
加载R包
library(mulea)
library(tidyverse)
这部分展示如果导入各种资源的基因集,例如从Regulon数据库下载的转录因子及其靶基因,并确保研究元素之间的标识符类型(例如,UniProt蛋白质ID、Entrez ID、基因符号、Ensembl基因ID)相匹配。
gmt为tab键分割,一行为一个基因集,可以使用mule包中的 read_gmt 函数进行读取,这个包 中 https://github.com/ELTEbioinformatics/GMT_files_for_mulea 提供了27个物种,16个公开数据库的gmt格式文件。
物种列表:
我们涵盖的物种列表: 拟南芥 (Arabidopsis thaliana) 枯草杆菌 (Bacillus subtilis) 嗜粘蛋白-梭菌 (Bacteroides thetaiotaomicron VPI-5482) 长双歧杆菌 (Bifidobacterium longum) 牛 (Bos taurus) 秀丽隐杆线虫 (Caenorhabditis elegans) 莱茵衣藻 (Chlamydomonas reinhardtii) 斑马鱼 (Danio rerio) 水蚤 (Daphnia pulex) 黏菌 (Dictyostelium discoideum) 果蝇 (Drosophila melanogaster) 拟果蝇 (Drosophila simulans) 大肠杆菌 (Escherichia coli) 鸡 (Gallus gallus) 人类 (Homo sapiens) 猕猴 (Macaca mulatta) 小鼠 (Mus musculus) 结核分枝杆菌 (Mycobacterium tuberculosis) 粗糙链孢霉 (Neurospora crassa) 黑猩猩 (Pan troglodytes) 大鼠 (Rattus norvegicus) 酿酒酵母 (Saccharomyces cerevisiae) 鼠伤寒沙门氏菌 (Salmonella enterica subsp. enterica serovar Typhimurium str. LT2) 裂殖酵母 (Schizosaccharomyces pombe) 嗜热四膜虫 (Tetrahymena thermophila) 热带爪蟾 (Xenopus tropicalis) 玉米 (Zea mays)
直接从github上读取:
# Reading the GMT file from the GitHub repository
tf_ontology <- read_gmt("https://raw.githubusercontent.com/ELTEbioinformatics/GMT_files_for_mulea/main/GMT_files/Escherichia_coli_83333/Transcription_factor_RegulonDB_Escherichia_coli_GeneSymbol.gmt")
class(tf_ontology)
head(tf_ontology)
或者下载到本地再读取:
# Reading the mulea GMT file locally
tf_ontology <- read_gmt("Transcription_factor_RegulonDB_Escherichia_coli_GeneSymbol.gmt")
还有另外一种GMT File来自Enricher R包:
# Reading the Enrichr GMT file locally
tf_enrichr_ontology <- read_gmt("TRRUST_Transcription_Factors_2019.txt")
# The ontology_name is empty, therefore we need to fill it with the ontology_id
tf_enrichr_ontology$ontology_name <- tf_enrichr_ontology$ontology_id
还有来自大家都知道的 MsigDB 数据库的GMT File:
# 需要从官网下载下来,默认大家都会从msigdb数据库去下载gmt基因集
# Reading the MsigDB GMT file locally
tf_msigdb_ontology <- read_gmt("c3.tft.v2023.2.Hs.symbols.gmt")
muleaData
内置的基因集# Installing the ExperimentHub package from Bioconductor
# BiocManager::install("ExperimentHub")
# Calling the ExperimentHub library.
library(ExperimentHub)
# Downloading the metadata from ExperimentHub.
eh <- ExperimentHub()
# Creating the muleaData variable.
muleaData <- query(eh, "muleaData")
# Looking for the ExperimentalHub ID of the ontology.
EHID <- mcols(muleaData) %>%
as.data.frame() %>%
dplyr::filter(title == "Transcription_factor_RegulonDB_Escherichia_coli_GeneSymbol.rds") %>%
rownames()
# Reading the ontology from the muleaData package.
tf_ontology <- muleaData[[EHID]]
# Change the header
tf_ontology <- tf_ontology %>%
rename(ontology_id = "ontologyId",
ontology_name = "ontologyName",
list_of_values = "listOfValues")
富集分析的结果有时会因为过于具体或广泛的条目而产生偏差。mulea 允许自定义go term的大小——即属于一个术语的基因或蛋白质的数量——确保你的分析符合你期望的范围。
让我们排除基因少于3个或多于400个的go term:
# Filtering the ontology
tf_ontology_filtered <- filter_ontology(gmt = tf_ontology,
min_nr_of_elements = 3,
max_nr_of_elements = 400)
过滤前后:
使用 **write_gmt
**函数保存gmt数据:
# Saving the ontology to GMT file
write_gmt(gmt = tf_ontology_filtered, file = "Filtered.gmt")
mulea 包提供了 list_to_gmt
函数,用于将基因集列表转换为本体论数据框架。以下示例展示了如何使用这个函数:
# Creating a list of gene sets
ontology_list <- list(gene_set1 = c("gene1", "gene2", "gene3"),
gene_set2 = c("gene4", "gene5", "gene6"))
# Converting the list to a ontology (GMT) object
new_ontology_df <- list_to_gmt(ontology_list)
数据来自GEO:https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE55662
次数据的研究调查了大肠杆菌中抗生素抗性进化的情况。比较了环丙沙星ciprofloxacin抗生素处理的大肠杆菌与未处理对照组之间的基因表达变化。使用GEO2R工具比较了这些组的表达水平:
mulea 包使用超几何方法进行富集分析,通过 ora
函数实现,需要三个输入数据:
# Taget set 感兴趣基因列表
target_set <- readLines("mulea-master/inst/extdata/target_set.txt")
target_set
head(target_set)
# [1] "gnsB" "sulA" "recN" "c4435" "dinI" "c2757"
length(target_set)
# [1] 241
# 背景基因集
background_set <- readLines("mulea-master/inst/extdata/background_set.txt")
background_set
length(background_set)
# [1] 7381
运行 10,000 permutations 计算 eFDR
# Creating the ORA model using the GMT variable
ora_model <- ora(gmt = tf_ontology_filtered,
# Test set variable
element_names = target_set,
# Background set variable
background_element_names = background_set,
# p-value adjustment method
p_value_adjustment_method = "eFDR",
# Number of permutations
number_of_permutations = 10000,
# Number of processor threads to use
nthreads = 2,
# Setting a random seed for reproducibility
random_seed = 1)
# Running the ORA
ora_results <- run_test(ora_model)
head(ora_results)
得到的结果如下:
ora_results %>%
# Rows where the eFDR < 0.05
filter(eFDR < 0.05) %>%
# Number of such rows
nrow()
#> [1] 10
ora_results %>%
# Arrange the rows by the eFDR values
arrange(eFDR) %>%
# Rows where the eFDR < 0.05
filter(eFDR < 0.05)
10条term efdr<0.05
mulea 提供了多种可视化工具,包括棒棒糖图、条形图、网络图和热图。这些可视化有效地揭示了富集因子之间的模式和关系。
使用 reshape_results
函数初始化可视化:
棒棒糖图提供了富集转录因子分布的图形表示。y轴显示转录因子,而x轴代表它们对应的eFDR值。点的颜色根据它们的eFDR值进行着色。这种可视化帮助我们检查eFDR的分布,并识别超过常用的0.05显著性阈值的因子。
plot_lollipop(reshaped_results = ora_reshaped_results,
# Column containing the names we wish to plot
ontology_id_colname = "ontology_id",
# Upper threshold for the value indicating the significance
p_value_max_threshold = 0.05,
# Column that indicates the significance values
p_value_type_colname = "eFDR")
结果如下:
条形图提供了与棒棒糖图类似的图形表示。y轴显示富集的本体论类别(例如,转录因子),而x轴代表它们对应的eFDR值。条形的颜色根据它们的eFDR值进行着色,这有助于检查eFDR的分布,并识别超过0.05显著性阈值的因子。
plot_barplot(reshaped_results = ora_reshaped_results,
# Column containing the names we wish to plot
ontology_id_colname = "ontology_id",
# Upper threshold for the value indicating the significance
p_value_max_threshold = 0.05,
# Column that indicates the significance values
p_value_type_colname = "eFDR")
这个函数生成了一个富集本体论类别(例如,转录因子)的网络可视化。每个节点代表一个富集的本体论类别,根据其eFDR值进行着色。如果两个节点至少共享一个属于目标集合的共同基因,则它们之间会绘制一条边,表示共同调控。边的厚度反映了共享基因的数量。
plot_graph(reshaped_results = ora_reshaped_results,
# Column containing the names we wish to plot
ontology_id_colname = "ontology_id",
# Upper threshold for the value indicating the significance
p_value_max_threshold = 0.05,
# Column that indicates the significance values
p_value_type_colname = "eFDR")
热图显示了与富集本体论类别(例如,转录因子)相关的基因。每一行代表一个类别,根据其eFDR值进行着色。每一列代表一个来自目标集合的基因,该基因属于富集的本体论类别,表明可能受到一个或多个富集转录因子的调控。
plot_heatmap(reshaped_results = ora_reshaped_results,
# Column containing the names we wish to plot
ontology_id_colname = "ontology_id",
# Column that indicates the significance values
p_value_type_colname = "eFDR")
ora
函数允许选择不同的方法来计算FDR(假发现率)和调整p值:eFDR,以及stats::p.adjust
文档中提供的所有方法选项(holm, hochberg, hommel, bonferroni, BH, BY, 和 fdr)
# Creating the ORA model using the Benjamini-Hochberg p-value correction method
BH_ora_model <- ora(gmt = tf_ontology_filtered,
# Test set variable
element_names = target_set,
# Background set variable
background_element_names = background_set,
# p-value adjustment method
p_value_adjustment_method = "BH")
# Running the ORA
BH_results <- run_test(BH_ora_model)
# Creating the ORA model using the Bonferroni p-value correction method
Bonferroni_ora_model <- ora(gmt = tf_ontology_filtered,
# Test set variable
element_names = target_set,
# Background set variable
background_element_names = background_set,
# p-value adjustment method
p_value_adjustment_method = "bonferroni")
# Running the ORA
Bonferroni_results <- run_test(Bonferroni_ora_model)
使用阈值cutoff<0.05对三种pvalue矫正方法进行比较:
# Merging the Benjamini-Hochberg and eFDR results
merged_results <- BH_results %>%
# Renaming the column
rename(BH_adjusted_p_value = adjusted_p_value) %>%
# Selecting the necessary columns
select(ontology_id, BH_adjusted_p_value) %>%
# Joining with the eFDR results
left_join(ora_results, ., by = "ontology_id") %>%
# Converting the data.frame to a tibble
tibble()
# Merging the Bonferroni results with the merged results
merged_results <- Bonferroni_results %>%
# Renaming the column
rename(Bonferroni_adjusted_p_value = adjusted_p_value) %>%
# Selecting the necessary columns
select(ontology_id, Bonferroni_adjusted_p_value) %>%
# Joining with the eFDR results
left_join(merged_results, ., by = "ontology_id") %>%
# Arranging by the p-value
arrange(p_value)
# filter the p-value < 0.05 results
merged_results_filtered <- merged_results %>%
filter(p_value < 0.05) %>%
# remove the unnecessary columns
select(-ontology_id, -nr_common_with_tested_elements,
-nr_common_with_background_elements)
image-20241213012429398
对显著结果的比较显示,传统的p值校正方法(Benjamini-Hochberg和Bonferroni)往往过于保守,导致与eFDR相比,显著转录因子的数量减少。
正如下图所示,通过应用eFDR,我们能够识别出10个显著的转录因子,而使用Benjamini-Hochberg和Bonferroni校正分别只有7个和3个。这表明,在这种情况下,eFDR可能是控制假阳性的更合适的方法。
感觉此方法唯一的亮点是:资源整合,提供了16个数据库的27个物种的gmt 基因集合,而对于里面的ORA和GSEA分析就是前面两个非常经典的做功能富集分析方法的打包,eFDR就是一个置换检验。这个文章发了几分什么杂志我再看一眼啊:
Turek, Cezary, Márton Ölbei, Tamás Stirling, Gergely Fekete, Ervin Tasnádi, Leila Gul, Balázs Bohár, Balázs Papp, Wiktor Jurkowski, and Eszter Ari. 2024. “mulea: An R Package for Enrichment Analysis Using Multiple Ontologies and Empirical False Discovery Rate.” BMC Bioinformatics 25 (1): 334. https://doi.org/10.1186/s12859-024-05948-7.
BMC Bioinformatics的影响因子在2023年为2.9,在Web of Science的分类中,BMC Bioinformatics位于多个学科类别中。在"Biochemical Research Methods"类别下,位于Q3区。在"Biotechnology & Applied Microbiology"类别中,也处于Q3区。在"Mathematical & Computational Biology"类别中,位于Q2区
后面还可以做GSEA分析,大家快去试试看吧!多一个方法不多,可以扩宽见识。
如果大家打开 enrichKEGG {clusterProfiler} 的帮助文档,可以很明显的看到一个案例数据和代码:
library(clusterProfiler)
data(geneList, package='DOSE')
de <- names(geneList)[1:100]
yy <- enrichKEGG(de, pvalueCutoff=0.01)
head(yy)
这个代码就是加载y叔的成名作clusterProfiler包,然后使用DOSE里面的排序好的基因列表的前100个,然后去做kegg数据库的超几何分布检验。因为kegg数据库版权问题,所以第一次运行这个代码的时候是需要联网下载的:
Reading KEGG annotation online: "https://rest.kegg.jp/link/hsa/pathway"...
Reading KEGG annotation online: "https://rest.kegg.jp/list/pathway/hsa"...
一般来说,问题也不大。给大家一个作业,就是上面的基因列表的前100个,使用我们今天介绍的新秀mulea包试试看做同样的分析,然后对比两次分析的结果!