RcisTarget 是一个用于基因调控网络构建和转录因子分析的R包。它可以从一组基因中识别潜在的转录因子(Transcription Factors, TFs)调控网络,并通过 motif 分析 推测转录因子的作用机制。该包主要被应用于转录调控研究,特别是基因共表达网络或差异表达基因集的上游调控因素预测。
RcisTarget 的功能主要集中在以下几个方面:
rm(list = ls())
library(RcisTarget)
library(tidyverse)
load("GSE118719_DEG.Rdata")
head(DEG3)
# logFC AveExpr t P.Value adj.P.Val B change
# SOX4 3.348391 6.934971 13.50756 6.378815e-09 0.0002187806 10.790602 UP
# MPP3 3.256332 3.969735 12.04834 2.478566e-08 0.0003780535 9.063578 UP
# C12orf42 -3.361817 2.410388 -11.75646 3.306783e-08 0.0003780535 8.881726 DOWN
# BCAS4 -2.162093 3.827227 -10.65018 1.044466e-07 0.0005934485 8.106005 DOWN
# CHMP7 -1.346611 6.477619 -10.40073 1.372057e-07 0.0005934485 7.996820 DOWN
# C14orf182 -2.611292 3.660219 -10.52694 1.194378e-07 0.0005934485 7.972491 DOWN
motifRanking数据:hg38_10kbp_up_10kbp_down_full_tx_v10_clust.genes_vs_motifs.rankings.feather
官网提示是cisTarget(R)使用rankings,同时建议使用TSS+/-10kb范围的。
Motifs 注释数据:motifAnnotations_hgnc
# 纳入分析的genelists
upgenes <- sample(rownames(DEG3[DEG3$change=="UP",]),20)
downgenes <- sample(rownames(DEG3[DEG3$change=="DOWN",]),20)
genelists <- list("upgenes"= upgenes,"downgenes"=downgenes)
genelists
# $upgenes
# [1] "TNFRSF10B" "RP11-114F10.3" "RP11-274J7.2" "EXTL3-AS1" "RP11-460B17.3" "MEDAG"
# [7] "VSNL1" "AC005042.2" "MAMLD1" "RP11-402J6.1" "CTD-2523D13.2" "FBP2"
# [13] "HOMER1" "TMEM97" "RNU6-579P" "AC027612.4" "MTSS1L" "RP11-120D5.1"
# [19] "ANKRD50" "ADAM5"
#
# $downgenes
# [1] "KDELC1P1" "CTC-265F19.3" "SLC5A5" "PYHIN1" "CTC-782O7.1" "MIR361"
# [7] "RP11-113K21.1" "MIR4306" "CTD-2535I10.1" "RP4-594I10.3" "GRIN2B" "CUBNP2"
# [13] "CTD-2207O23.3" "SLC38A1" "HNRNPA1P17" "LINC00381" "IGHD4-11" "SELL"
# [19] "PPM1M" "RNU6-1245P"
# motif ranking文件
# Gene-motif rankings: 提供了每个motif的所有基因的排名/得分
motifRankings <- importRankings("./database/cisTarget_databases/hg38_10kbp_up_10kbp_down_full_tx_v10_clust.genes_vs_motifs.rankings.feather")
motifRankings
# Rankings for RcisTarget.
# Number of genes: 27090 (27090 available in the full DB)
# Number of MOTIFS: 5876
#
# [Source file: hg38_10kbp_up_10kbp_down_full_tx_v10_clust.genes_vs_motifs.rankings.feather]
# motif的注释
data(motifAnnotations_hgnc)
motifAnnotations[1:3,1:6]
# Key: <motif, TF>
# motif TF directAnnotation inferred_Orthology inferred_MotifSimil annotationSource
# <char> <char> <lgcl> <lgcl> <lgcl> <fctr>
# 1: bergman__Stat92E STAT5A FALSE TRUE FALSE inferredBy_Orthology
# 2: bergman__Stat92E STAT5B FALSE TRUE FALSE inferredBy_Orthology
# 3: bergman__Stat92E STAT6 FALSE TRUE FALSE inferredBy_Orthology
motifEnrichmentTable_wGenes <- cisTarget(genelists,
motifRankings,
motifAnnot=motifAnnotations,
highlightTFs = NULL,
nesThreshold = 3,
geneErnMethod = "iCisTarget", # aprox
geneErnMaxRank = 5000,
nCores = 4,
verbose = TRUE)
motif富集分析
motifs_AUC <- calcAUC(genelists, motifRankings, nCores=8)
motifs_AUC
# AUC for 2 gene-sets and 5876 motifs.
#
# AUC matrix preview:
# bergman__Su_H_ bergman__croc bergman__pho bergman__tll c2h2_zfs__M0369
# upgenes 0.04908867 0.0759253 0 0 0.00000000
# downgenes 0.00196802 0.0000000 0 0 0.04194342
# 识别显著富集的motif:AUC值高于阈值的 motif 通常被认为是显著的
par(mfrow = c(1,2))
pdf("auc.pdf",width = 9,height = 7)
for(i in c("upgenes",'downgenes')){
auc <- getAUC(motifs_AUC)[i,]
hist(auc, main=i, xlab="AUC histogram",
breaks=100, col = "#0000ff50", border = "darkblue")
nes3 <- (3*sd(auc)) + mean(auc) # 计算阈值,阈值设置为均值加上3倍标准差
abline(v=nes3, col="tomato")
}
dev.off()
查看motif的auc分布,超过阈值的motif通常被认识是显著的
motif-TF注释
# xSelect significant motifs and/or annotate motifs to genes or transcription factors. The motifs are considered significantly enriched if they pass the the Normalized Enrichment Score (NES) threshold.
# 显著性motif的选择是基于归一化富集评分( Normalized Enrichment Score,NES)进行的。每个motif的NES是根据基因集中所有基序的AUC分布[(x-mean)/sd]计算。那些通过给定阈值(默认为3.0)的motifs 被认为是重要的。
motifEnrichmentTable <- addMotifAnnotation(motifs_AUC, nesThreshold=3,
motifAnnot=motifAnnotations,
highlightTFs = NULL)
# highlightTFs=list(upgenes="RFX5",downgenes='ZNF274'))
head(motifEnrichmentTable[,-"TF_lowConf", with=FALSE])
# geneSet motif NES AUC highlightedTFs TFinDB TF_highConf
# <char> <char> <num> <num> <char> <char> <char>
# 1: upgenes metacluster_174.5 5.74 0.183 RFX5 ZNF671; ZNF671 (directAnnotation).
# 2: upgenes metacluster_155.27 5.46 0.175 RFX5 ZFX; ZNF667 (directAnnotation).
# 3: upgenes dbtfbs__ZNF660_HEK293_ENCSR283DOU_merged_N1 5.15 0.166 RFX5 ZNF660 (directAnnotation).
# 4: upgenes transfac_pro__M06089 5.08 0.164 RFX5 ZNF26 (directAnnotation).
# 5: upgenes taipale_tf_pairs__TEAD4_CLOCK_NCACGTGNNNNNNNCATWCC_CAP 5.07 0.163 RFX5 CLOCK; TEAD4 (directAnnotation).
# 6: upgenes transfac_pro__M05767 5.04 0.163 RFX5 ZNF77 (directAnnotation).
找出每个基序富集最佳的基因
motifEnrichmentTable_wGenes <- addSignificantGenes(motifEnrichmentTable,
rankings=motifRankings,
geneSets=genelists)
motifEnrichmentTable_wGenes[1:4,1:4]
# geneSet motif NES AUC
# <char> <char> <num> <num>
# 1: upgenes metacluster_174.5 5.74 0.183
# 2: upgenes metacluster_155.27 5.46 0.175
# 3: upgenes dbtfbs__ZNF660_HEK293_ENCSR283DOU_merged_N1 5.15 0.166
# 4: upgenes transfac_pro__M06089 5.08 0.164
上述两种方法都可以获得一个motifEnrichmentTable_wGenes文件
anotatedTfs <- lapply(split(motifEnrichmentTable_wGenes$TF_highConf,
motifEnrichmentTable$geneSet),
function(x) {
genes <- gsub(" \\(.*\\). ", "; ", x, fixed=FALSE)
genesSplit <- unique(unlist(strsplit(genes, "; ")))
return(genesSplit)
})
anotatedTfs
# $downgenes
# [1] "ZNF358" "HOMEZ" "MSX2" "NPAS2" "CLOCK" "MEF2A" "ZSCAN16" "ZNF135" "ZNF26"
# [10] "ZNF564" "ZNF623" "GCM2" "TBX21" "ZNF84" "ZNF776" "ZNF189" "ONECUT2" "TFAP2C"
# [19] "E2F1" "NR1I2" "RXRA" "HINFP" "ZNF880" "RFX1" "RFX2" "RFX3" "RFX4"
# [28] "RFX6" "RFX8" "ERF" "TEAD4" "ATF6" "ZNF787" "ZNF485" "ZNF721" "ETV5"
# [37] "HOXB13" "NKX2-8" "ETV2" "GCM1" "BHLHA15" "ZNF778" "GTF2B" "TBPL2" "GATA3"
# [46] "HOXD12" "LHX8" "STAT1" "ZNF454" "SOX10" "ZEB1" "WT1" "ZNF585B" "ZNF467"
# [55] "ZNF549" "TBX2" "TBX4" "TBX5" "TBXT" "ZNF616" "ZNF599" "YEATS4" "FOS"
# [64] "JUN" "PRDM15" "NR4A1" "ZNF534" "ZNF544" "FOXI1" "PRDM1" "EOMES" "MZF1"
# [73] "ELK1" "ZNF547" "ZNF470" "ELF3" "ZNF136" "ZNF573" "ZNF571" "ZNF384" "ZNF366"
# [82] "HOXA13" "SPI1" "ZBTB4" "ATF1" "DRGX" "ZNF382" "ZNF775" "ZNF697" "ZNF134"
# [91] "ZNF154" "ZNF256" "ZNF304" "ZNF418" "ZNF419" "ZNF480" "ZNF548" "ZNF552" "ZNF561"
# [100] "ZNF562" "ZNF584" "ZNF586" "ZNF587B" "ZNF610" "ZNF772" "ZNF773" "ZNF792" "ZNF793"
# [109] "ZNF837" "ZNF184" "CEBPA" "CEBPB" "CEBPD" "CEBPE" "CEBPG" "DBP" "EP300"
# [118] "GATAD2A" "HES2" "MAF" "ETS1" "ELK3" "ETV4"
#
# $upgenes
# [1] "ZNF319" "ZNF671" "ZFX" "ZNF667" "ZNF26" "ZNF260" "ZNF516" "ZNF791" "ZNF814"
# [10] "OSR1" "FOXH1" "ZNF8" "OVOL2" "CCNT2" "GATA1" "GATA2" "NCOA1" "TAL1"
# [19] "ZNF823" "GATA3" "ELF1" "FOXJ3" "TFE3" "ZNF660" "ZBTB5" "CLOCK" "TEAD4"
# [28] "ZNF77" "ZNF256" "GTF2B" "STAT6" "E2F1" "SOX17" "ZNF420" "ZNF251" "HESX1"
# [37] "MYNN" "CEBPA" "CEBPB" "CEBPD" "CEBPE" "CEBPG" "FOXL1" "ARID5A" "CDX1"
# [46] "CDX2" "ERF" "ONECUT2" "CTNNB1" "ZNF367" "TEAD3" "ZNF585A" "ZNF530" "ZNF2"
# [55] "ZFP1" "ZNF471" "ZNF568" "ZKSCAN2" "FOXO1" "POU5F1" "MAX" "GABPA" "ZNF398"
# [64] "ZNF34" "ZNF394" "ZNF781" "NFATC1" "NFATC2" "NFATC3" "NFATC4" "EOMES" "TBR1"
# [73] "TBX1" "TBX2" "TBX20" "TBX21" "TBX3" "TBXT" "ZNF211" "OVOL1" "GZF1"
# [82] "HSFY2" "SMAD4" "ZBTB20" "NR3C1" "PGR" "HNF4A" "HNF4G" "NR2F1" "NR2F2"
# [91] "BATF3" "FLI1" "ZNF586" "ZNF470" "ZNF205" "PARP1" "ARNTL" "ARNTL2" "ASCL1"
# [100] "BHLHE40" "BHLHE41" "HAND1" "HAND2" "MITF" "MNT" "MXD1" "MXD3" "MXD4"
# [109] "MXI1" "MYC" "MYCN" "MYF5" "MYF6" "MYOD1" "MYOG" "NHLH1" "NPAS2"
# [118] "TAL2" "TCF12" "TCF3" "TCF4" "TFEB" "TFEC" "USF1" "USF2" "DLX3"
# [127] "ZNF202" "HOXB13" "FOXC1" "FIGLA" "GATA4" "GATA6" "ZNF708" "RFX3" "SREBF2"
# [136] "ZNF875" "VENTX" "ZNF468" "ZNF675" "ELK1" "HOXA3" "ETV4" "WT1" "ZNF257"
# [145] "ZNF224" "MEIS1" "ZNF716" "ZNF443" "RBAK" "ZNF445"
motifEnrichmentTable_wGenes_wLogo <- addLogo(motifEnrichmentTable_wGenes)
# 可以把后面的[1:10]去掉
resultsSubset <- motifEnrichmentTable_wGenes_wLogo[1:10,]
library(DT)
datatable(resultsSubset[,-c("enrichedGenes", "TF_lowConf"), with=FALSE],
escape = FALSE, # To show the logo
filter="top", options=list(pageLength=5))
# ----network, cache=FALSE, eval=FALSE-----------------------------------------
signifMotifNames <- motifEnrichmentTable$motif[1:3]
par(mfrow=c(1,3))
incidenceMatrix <- getSignificantGenes(genelists$upgenes,
motifRankings,
signifRankingNames=signifMotifNames,
plotCurve=TRUE, maxRank=5000,
genesFormat="incidMatrix",
method="iCisTarget")$incidMatrix
# 网络图
library(reshape2)
edges <- melt(incidenceMatrix)
edges <- edges[which(edges[,3]==1),1:2]
colnames(edges) <- c("from","to")
library(visNetwork)
motifs <- unique(as.character(edges[,1]))
genes <- unique(as.character(edges[,2]))
nodes <- data.frame(id=c(motifs, genes),
label=c(motifs, genes),
title=c(motifs, genes), # tooltip
shape=c(rep("diamond", length(motifs)), rep("elypse", length(genes))),
color=c(rep("purple", length(motifs)), rep("skyblue", length(genes))))
visNetwork(nodes, edges) %>% visOptions(highlightNearest = TRUE,
nodesIdSelection = TRUE) %>%
visExport(type = "pdf", name = "network_image", label = "Export as PDF") # 添加导出按钮
这两者的差别在于数据分析的注释方式和分析的搜索范围,分别侧重于基因区域(Gene based)或基因组区域(Region based)。
Region based 数据库: 通常基于基因组坐标,注释区域包括增强子、启动子或其他非编码区域。数据库注释的重点在于区域内的 motif 和调控元素,而不依赖于明确的基因信息。
Gene based 数据库: 数据库基于基因注释(如 RefSeq 或 Ensembl),每个基因都关联到标准化的调控区域。数据库直接映射到特定基因,便于生成基因调控网络。
————————————————————————————
hg38_10kbp_up_10kbp_down_full_tx_v10_clust.genes_vs_motifs.rankings.feather等的motif ranking文件中存在7或者10物种的区别,一般会建议使用10物种的,这里面就涉及到多物种的保守性问题。
保守性指的是一个基因、序列、调控元件或结构在多个物种中保持相似或不发生显著变化的程度。这通常反映该序列在进化过程中受到了自然选择的保护,可能具有重要的生物学功能。
在本文段落中,保守性主要是指调控区域或motif在不同物种中是否存在相似性或一致性。具体来说:
————————————————————————————
这里的 Scoring/search space是指在分析基因调控区域时,围绕 转录起始位点(TSS, Transcription Start Site) 定义的基因组范围,用于确定在什么位置寻找潜在的 调控元件(如转录因子结合位点或其他 motif)。
致谢:感谢曾老师以及生信技能树团队全体成员。
注:若对内容有疑惑或者有发现明确错误的朋友,请联系后台(欢迎交流)。更多内容可关注公众号:生信方舟
- END -
原创声明:本文系作者授权腾讯云开发者社区发表,未经许可,不得转载。
如有侵权,请联系 cloudcommunity@tencent.com 删除。
原创声明:本文系作者授权腾讯云开发者社区发表,未经许可,不得转载。
如有侵权,请联系 cloudcommunity@tencent.com 删除。