芯片数据集是一片广袤的海洋,大多数时候我们做数据挖掘,相当于大海捞针。于是一些学者合并同类项,将相同芯片平台或是类似设计的数据集进行了包装,开发成为new package,我们可以在这样的包里直接找到想要的数据集,分析的步骤也被大大简化。
其实类似的包并不少,Jimmy老师推荐过几个乳腺癌的表达量数据集的包,都在生信菜鸟团的论坛里(bio-info-trainee.com)。分别是【了解5个乳腺癌表达数据集|你还缺乳腺癌表达量数据集吗?】后面又有同学推荐了这个【这里的乳腺癌表达量矩阵数据集更多】 ,做乳腺癌相关的同学可以去论坛找找看。
今天推荐的Fletcher2013,也是乳腺癌数据集相关的包,一起来看看吧
想不到吧,Fletcher2013是一个兄弟包,有a还有b,(●'◡'●),先看看a吧~
Package ‘Fletcher2013a’ Title Gene expression data from breast cancer cells under FGFR2 signalling perturbation 【 http://dx.doi.org/10.1038/ncomms3464】 Version 1.32.0 Author Mauro Castro, Michael Fletcher, Florian Markowetz and Kerstin Meyer. Description The package Fletcher2013a contains time-course gene expression data from MCF7 cells treated under different experimental systems in order to perturb FGFR2 signalling.
看描述,这个包把这篇文献自己的数据集包装进去了,可以作为内置数据直接使用
好不好用呢,那还得上手试试看才晓得
##安装包
BiocManager::install("Fletcher2013a")
browseVignettes("Fletcher2013a")##查看版本
library(Fletcher2013a)
library(GEOquery)
以Exp1为例,这是对它的描述⬇
The data consists of 46 microarray samples from MCF-7 cells treated under different conditions, at 3 time points (0, 6 and 24 h). The data have been normalized (RMA algorithm) and are presented in the form of an exprSet object. The experiment was carried out on 4 Humanv4 arrays using 12 samples per array. The original arrays contain 48324 features, with an average of 22 beads per feature (Standard Deviation of 5).
# 以Exp1为例 ---------------------------------------
data(Exp1)
gexp<-exprs(Exp1)
geneids<-fData(Exp1)
targets<-pData(Exp1)
mycontrasts<-notes(Exp1)$contrasts
运行完以上的代码,你会发现表达量矩阵,分组信息,探针注释都已尽在掌握,就看你拿它们怎么办了
具体的信息在官方的帮助文档里都有介绍,框起来的是组别信息:
但是如果加上不同时间序列的话,组别就会更多,那这个时候的多组别差异基因分析该如何可视化呢?
我找到了这两篇帖子,兜兜转转还是生信技能树
多分组的差异分析只需要合理设置design矩阵即可- https://cloud.tencent.com/developer/article/1853983 多次差异分析难道就需要多个火山图吗 -https://cloud.tencent.com/developer/article/1853989?from=article.detail.1853983
##这时我们已经获得了矩阵、探针注释和分组信息,应该是可以做差异分析
##构建desig矩阵
group_list<-targets$Target
design=model.matrix(~factor( group_list ))
qw=gsub("\\(.*\\)","",colnames(design)[2:ncol(design)])
qw1=gsub("factor","",qw)
colnames(design)[2:ncol(design)]=qw1
head(design)
#limma差异分析
glm = lmFit(gexp, design = design )
glm = eBayes(glm)
# 差异分析结果可视化 ---------------------------------
#参考:https://cloud.tencent.com/developer/article/1853989?from=article.detail.1853983
testResults <- decideTests(glm, method="hierarchical",adjust.method="BH", p.value=0.05)[,-1]
significantGenes <- sapply(1:ncol(testResults), function(j){
c <- glm$coefficients[testResults[,j]!=0,j+1]
table(cut(c, breaks=c(-2,seq(-1.5,1.5,l=7),2)))
})
colnames(significantGenes) <- colnames(testResults)
# Barplot
# author mg14 , https://rdrr.io/github/mg14/mg14/
rotatedLabel <- function(x0 = seq_along(labels), y0 = rep(par("usr")[3], length(labels)), labels, pos = 1, cex=1, srt=45, ...) {
w <- strwidth(labels, units="user", cex=cex)
h <- strheight(labels, units="user",cex=cex)
u <- par('usr')
p <- par('plt')
f <- par("fin")
xpd <- par("xpd")
par(xpd=NA)
text(x=x0 + ifelse(pos==1, -1,1) * w/2*cos(srt/360*2*base::pi), y = y0 + ifelse(pos==1, -1,1) * w/2 *sin(srt/360*2*base::pi) * (u[4]-u[3])/(u[2]-u[1]) / (p[4]-p[3]) * (p[2]-p[1])* f[1]/f[2] , labels, las=2, cex=cex, pos=pos, adj=1, srt=srt,...)
par(xpd=xpd)
}
par(bty="n", mgp = c(2.5,.33,0), mar=c(3,3.3,2,0)+.1, las=2, tcl=-.25)
b <- barplot(significantGenes, las=2, ylab = "Differentially expressed genes",
col=brewer.pal(8,"RdYlBu"),
legend.text=TRUE,args.legend = list(x = "topright", cex=0.5,title="log2 FC"),
border=0, xaxt="n")#, col = set1[simple.annot[names(n)]], border=NA)
print(b)
rotatedLabel(x0=b, y0=rep(10, ncol(significantGenes)),
labels=colnames(significantGenes), cex=.7, srt=45,
font=ifelse(grepl("[[:lower:]]", colnames(design))[-1], 1,3) )
clip(0,30,0,1000)
# text(b+0.2, colSums(n)+50, colSums(n), pos=3, cex=.7, srt=90)
x0 <- 21.5
image(x=x0+c(0,0.8), y=par("usr")[4]+seq(-100,100,l=9), z=matrix(1:8, ncol=8), col=brewer.pal(8,"RdYlBu"), add=TRUE)
text(x=x0+1.5, y=par("usr")[4]+seq(-50,50,l=3), format(seq(-1,1,l=3),2), cex=0.66)
lines(x=rep(x0+.8,2), y=par("usr")[4]+c(-75,75))
segments(x0+.8,par("usr")[4]+seq(-75,75,l=7),x0+.9,par("usr")[4]+seq(-75,75,l=7))
text(x0+.8, par("usr")[4]+125, "log2 FC", cex=.66)
rotatedLabel(b-0.1, colSums(significantGenes), colSums(significantGenes), pos=3, cex=, srt=45)
为什么是8个“column”呢?
其实前面的mycontrasts就告诉我们了
mycontrasts ##竟然有8次组间比较?
t6.E2-t6.UT | t6.E2FGF10-t6.E2 |
---|---|
t6.E2FGF10PD-t6.E2 | t24.E2-t24.UT |
t24.E2FGF10-t24.E2 | t24.E2FGF10PD-t24.E2 |
t6.UT-t0.UT | t24.UT-t6.UT |
这样的柱形图可以将比较复杂的组间差异分析结果更清楚地展现出来,值得借鉴~
最后,例行来看看分组情况吧
gexp=t(gexp)#画PCA图时要求是行名时样本名,列名时探针名,因此此时需要转换
gexp=as.data.frame(gexp)#将matrix转换为data.frame
library("FactoMineR")#画主成分分析图需要加载这两个包
library("factoextra")
dat.pca <- PCA(gexp , graph = FALSE)#现在dat最后一列是group_list,需要重新赋值给一个dat.pca,这个矩阵是不含有分组信息的
fviz_pca_ind(dat.pca,
geom.ind = "point", # show points only (nbut not "text")
col.ind = group_list, # color by groups
addEllipses = T,
legend.title = "Groups"
)
我的pca
下面的这个函数效果与之类似
Fletcher2013pipeline.pca(Exp1)
人家的pca
看起来还是人家的组间分得更清晰一些
如果想要特定的组间差异基因结果,可以按照
diff1=topTable(glm, coef=2, n=Inf)
diff2=topTable(glm, coef=3, n=Inf)
其中,coef这个参数的描述如下:
参数说明
得到差异结果后,其余流程和二分组数据是一样的,相信大家都不陌生啦!
接下来再看看Fletcher2013a的pro版本
This package reproduces the systems biology analysis for the data in package Fletcher2013a using RTN. 该软件包使用RTN对Fletcher2013a软件包中的数据进行了系统生物学分析。
分析步骤很简单
rm(list = ls())
# BiocManager::install("Fletcher2013b")
library(Fletcher2013b)
library(pheatmap)
##试试看
hits <- Fletcher2013pipeline.deg(what="Exp1")
mra1 <- Fletcher2013pipeline.mra1st(hits=hits$E2FGF10) ##选一组看看
#参考https://zhuanlan.zhihu.com/p/434990993
##挑几个感兴趣的转录因子
risk.tfs=c("PTTG1","SPDEF","ESR1","GATA3","FOXA1")
##对regulon评分
rtnlist=tni.gsea2(rtni1st,regulatoryElements = risk.tfs)
regact=tni.get(rtnlist,what = "regulonActivity")
anno=tni.get(rtnlist,what = "colAnnotation")
colnames(anno)
attribs=c("ER+","ER-","PR+","PR-","LumA", "LumB", "Basal","Her2","Normal")
annot=anno[,attribs]
##画图看看
pheatmap(t(regact$differential),
main = "samples of Exp1_E2FGF10",
annotation_col = annot,show_colnames = F,annotation_legend = FALSE,
clustering_method = "ward.D2",fontsize_row = 6,
clustering_distance_rows = "correlation",clustering_distance_cols = "correlation" )
可以直接得到一个转录因子调控热图,省了不少事啊!
A pipeline to reproduce results for Fletcher et al. 2013. 这个包的pipeline系列函数,把数据的全套流程都封装起来,看起来也是也非常方便,要是可以用来分析自己的数据就太好了,可惜不行。。。
Fletcher2013pipeline.mra1st(hits, minRegulonSize=20, idtype="probeid",
pAdjustMethod="holm",tnet="dpi", eps=0, pValueCutoff=1e-4, verbose=TRUE, ...)
Fletcher2013pipeline.mra2nd(hits, minRegulonSize=20, idtype="probeid",
pAdjustMethod="holm",tnet="dpi", eps=0, pValueCutoff=1e-4, verbose=TRUE, ...)
Fletcher2013pipeline.mraNormals(hits, minRegulonSize=20, idtype="probeid",
pAdjustMethod="holm",tnet="dpi", eps=0, pValueCutoff=1e-4, verbose=TRUE, ...)
Fletcher2013pipeline.mraTALL(hits, minRegulonSize=20, idtype="probeid",
pAdjustMethod="holm", tnet="dpi", eps=0, pValueCutoff=0.01, verbose=TRUE, ...)
Fletcher2013pipeline.consensusnet()
Fletcher2013pipeline.enrichmap()
总的来讲,Fletcher2013a和Fletcher2013b实现了对自身数据集的包装和分析,并且是基于MCF-7的人乳腺癌细胞株进行了不同时间序列下不同药物的干预,或许可作为你的数据挖掘验证的一部分哦!
- END -