在做项目时,曾有小伙伴对我用edgeR进行差异分析筛选出的具体显著差异基因表示质疑,因为发表的文章清楚的说明某个基因是差异基因,但是我edgeR的分析结果并没有表明。在小伙伴的质疑下,我认真看了下文章,发现文章用的是DEseq2进行差异分析。值得注意的是该小伙伴关注的差异基因是一个离散比较大的基因,此处的离散较大可以理解为假定对照组为5,6,7;实验组则为14,13,3的情况。那为什么这个基因在edgeR分析下不是显著差异基因,然而在DEseq2的分析下是差异基因呢?这应该很大程度源于算法判定显著差异基因的区别。接着,我看了关于DEseq2与edgeR区别的描述,发现「edgeR与Deseq2都是基于负二项分布模型做的,两者处理同一组数据时,相同阈值处理大部分基因是一样的,但是也会有一部分基因会因为离散度不同导致差异不同」,如刚刚示例的基因离散度被DEseq2识别为差异,但是不被edgeR识别,所以两种算法获取的差异基因与数目是存在细微区别的。
小伙伴们可能也会好奇,除了上述的edgeR与DEseq2,不是还有第三种差异分析方法imma吗,它分析的结果与前两种差异分析方法有什么区别?「具体分析时,我们应该从三种差异分析方法中选择何种方法进行分析呢?」
小编觉得,如果不是针对特定的基因去找,其实这三种差异分析方法都可以(虽然目前推DEseq2与edgeR的比较多些)。但是,如果说你要根据固定的基因去选择,你可以尝试一下三种差异分析的方法,看看效果再决定。
在本文中,我们拟通过三个「check上调基因的箱线图」说明三种差异分析方法没有造成上下调差异基因结果相反的情况;通过「Veen图」查看了差异基因在三种差异分析方法间的交集情况,通过「相关性分析」看看不同差异分析方法分析共同差异基因logFC的相关性。接下来就让我们探究一下三种差异分析方法(DEseq2、edgeR与limma)在转录组差异分析的方式与具体流程吧。
今天,我们使用标题为 Single base-pair resolution analysis of binding motif with diffMotif uncovers the oncogenic impact of CTCF mutations in breast cancer 的数据集GSE190114进行探究,数据集的介绍链接如下https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE190114,感兴趣的小伙伴可以点进去看看作者的总体设计。在此,小编对文章进行简单归纳,作者主要通过转录组测序探究了CTCF锌指结构的突变对于乳腺癌的影响,使用的是MCF10A乳腺癌细胞系。
GSE190114数据集的样本分组如下,三个分组三个重复样本,我们重点对前两个分组的重复样本进行差异分析
处理数据的话,作者上传了基因count矩阵,我们就可以直接走基因count矩阵的差异分析流程进行分析,链接见如何做单样本之间的差异分析。
以下为基因count矩阵下载链接:https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE190114&format=file&file=GSE190114%5FRNA%5FSeq%5FCountMatrix%2Etxt%2Egz;感兴趣的小伙伴可以下载试试。
rm(list = ls())
library(edgeR)
library(DESeq2)
library(FactoMineR)
library(factoextra)
library(clusterProfiler)
library(org.Hs.eg.db)
library(org.Mm.eg.db)
library(stringr)
library(stringi)
library(tidyverse)
library(ggplot2)
library(patchwork)
library(pheatmap)
library(VennDiagram)
library(RColorBrewer)
library(patchwork)
library(ggplotify)
rawcount <- read.table("./GSE190114_RNA_Seq_CountMatrix.txt",header = T,sep="\t",row.names = 1)
colnames(rawcount)
## [1] "WT" "WT.2" "WT.3" "KI2.1" "KI2.2" "KI2.3" "KIKI.1" "KIKI.2"
## [9] "KIKI.3"
rawcount[1:4,1:4]
## WT WT.2 WT.3 KI2.1
## A1BG 176 292 151 235
## A1BG-AS1 242 366 179 361
## A1CF 2 5 2 18
## A2M 5 1 3 18
# 本身就是基因表达矩阵(无需降重与ID转换);选择二分组的样本
rawcount=rawcount[,1:6]
keep <- rowSums(rawcount>0) >= floor(0.75*ncol(rawcount))
filter_count <- rawcount[keep,] #获得filter_count矩阵
express_cpm <- log2(cpm(filter_count)+ 1)
express_cpm[1:4,1:4] #获得cpm矩阵
## WT WT.2 WT.3 KI2.1
## A1BG 1.71773149 1.73333170 1.62106406 1.7477082
## A1BG-AS1 2.05228844 1.96869792 1.79117863 2.2087259
## A1CF 0.03704971 0.05632098 0.03913408 0.2395589
## A2M 0.09089915 0.01144147 0.05831012 0.2395589
#根据列的信息,提取分组信息
colnames(rawcount)
## [1] "WT" "WT.2" "WT.3" "KI2.1" "KI2.2" "KI2.3"
group=rep(c("WT","KI2"),each=3)
group
## [1] "WT" "WT" "WT" "KI2" "KI2" "KI2"
group_list=factor(group,levels = c("WT","KI2"))
table(group_list)#检查一下组别数量
## group_list
## WT KI2
## 3 3
## 01绘制整体表达的箱线图
exprSet <- express_cpm
class(express_cpm)
## [1] "matrix" "array"
data <- data.frame(expression=c(exprSet),
sample=rep(colnames(exprSet),each=nrow(exprSet)))
p1 <- ggplot(data = data,aes(x=sample,y=expression,fill=sample))+ geom_boxplot() + theme(axis.text.x = element_text(angle = 90)) +
xlab(NULL) + ylab("log2(CPM+1)")+theme_bw()
## 02绘制PCA图
dat <- express_cpm
dat <- as.data.frame(t(dat))
dat <- na.omit(dat)
dat$group_list <- group_list
dat_pca <- PCA(dat[,-ncol(dat)], graph = FALSE)#画图仅需数值型数据,去掉最后一列的分组信息
p2 <- fviz_pca_ind(dat_pca,
geom.ind = "point", # 只显示点,不显示文字
col.ind = dat$group_list, # 用不同颜色表示分组
palette = c("#00AFBB", "#E7B800"),
addEllipses = T, # 是否圈起来,少于4个样圈不起来
legend.title = "Groups") + theme_bw()
p1+p2
exprSet <- filter_count
design <- model.matrix(~0+rev(factor(group_list)))
rownames(design) <- colnames(exprSet)
colnames(design) <- levels(factor(group_list))
DEG <- DGEList(counts=exprSet, #构建edgeR的DGEList对象
group=factor(group_list))
DEG$samples$lib.size
## [1] 76883155 125594216 72735392 99650302 71125317 76094379
DEG <- calcNormFactors(DEG)
DEG$samples$norm.factors
## [1] 0.9866259 0.9777153 0.9815402 1.0129029 1.0155286 1.0267555
# 计算线性模型的参数
DEG <- estimateGLMCommonDisp(DEG,design)
DEG <- estimateGLMTrendedDisp(DEG, design)
DEG <- estimateGLMTagwiseDisp(DEG, design)
# 拟合线性模型
fit <- glmFit(DEG, design)
lrt <- glmLRT(fit, contrast=c(1,-1))
# 提取过滤差异分析结果
DEG_edgeR <- as.data.frame(topTags(lrt, n=nrow(DEG)))
head(DEG_edgeR)
## logFC logCPM LR PValue FDR
## POSTN -11.242714 5.157621 7274.735 0 0
## PDE3B -10.528824 2.818691 2202.143 0 0
## SNURF -8.407920 4.325858 6381.568 0 0
## SNRPN -8.407659 4.325598 6382.522 0 0
## CADM3 -7.784327 5.237820 8566.661 0 0
## CADM3-AS1 -7.618800 4.317397 5508.679 0 0
# 设定阈值,筛选显著上下调差异基因
fc <- 2.0
p <- 0.05
DEG_edgeR$regulated <- ifelse(DEG_edgeR$logFC>log2(fc)&DEG_edgeR$PValue<p,
"up",ifelse(DEG_edgeR$logFC<(-log2(fc))&DEG_edgeR$PValue<p,"down","normal"))
## 检查edgeR的顺序有没有弄反(易错点哈)
up=head(DEG_edgeR[DEG_edgeR$regulated=="up",])[1,] #直接选取第一列上调基因
upgene <- rownames(up)
exp <- c(t(express_cpm[match(upgene,rownames(express_cpm)),]))
exp
## [1] 0.5794278 0.6242658 0.4698412 3.9551960 3.8294548 3.9683540
test <- data.frame(value=exp, group=group_list)
library(ggplot2)
check_edgeR=ggplot(data=test,aes(x=group,y=value,fill=group)) + geom_boxplot()
# 查看分组信息和表达矩阵数据
exprSet <- filter_count
dim(exprSet)
## [1] 18206 6
table(group_list)
## group_list
## WT KI2
## 3 3
# 加载包
library(DESeq2)
# 第一步,构建DESeq2的DESeq对象(建立DEseq数据矩阵)
colData <- data.frame(row.names=colnames(exprSet),group_list=group_list)
dds <- DESeqDataSetFromMatrix(countData = exprSet,colData = colData,design = ~ group_list)
# 第二步,进行差异表达分析
dds2 <- DESeq(dds)
# 提取差异分析结果,trt组对untrt组的差异分析结果
tmp <- results(dds2,contrast=c("group_list","KI2","WT"))
DEG_DESeq2 <- as.data.frame(tmp[order(tmp$padj),])
head(DEG_DESeq2)
## baseMean log2FoldChange lfcSE stat pvalue padj
## ABCC2 8345.615 2.813938 0.03718825 75.66738 0 0
## ADGRL1 2542.280 -2.059138 0.05003988 -41.14994 0 0
## AFAP1L1 2330.490 -2.138101 0.05388025 -39.68246 0 0
## ALDH3A1 11106.359 2.713628 0.03726390 72.82189 0 0
## CADM3 3188.967 -7.792886 0.16262614 -47.91903 0 0
## CD99L2 7041.834 -1.545442 0.03461663 -44.64450 0 0
# 去除差异分析结果中包含NA值的行
DEG_DESeq2 = na.omit(DEG_DESeq2)
# 筛选上下调差异基因,设定阈值
fc_cutoff <-2.0
fdr <- 0.05 #p值
DEG_DESeq2$regulated <- "normal"
loc_up <- intersect(which(DEG_DESeq2$log2FoldChange>log2(fc_cutoff)),
which(DEG_DESeq2$pvalue<fdr))
loc_down <- intersect(which(DEG_DESeq2$log2FoldChange< (-log2(fc_cutoff))),
which(DEG_DESeq2$pvalue<fdr))
DEG_DESeq2$regulated[loc_up] <- "up"
DEG_DESeq2$regulated[loc_down] <- "down"
table(DEG_DESeq2$regulated)
##
## down normal up
## 694 15679 1127
DEG_DESeq2$genenames <- rownames(DEG_DESeq2)
DEG_DESeq2 <- select(DEG_DESeq2,genenames,everything())
## 取一个显著上调基因,看看其在标准化的数据中是否上调
up=head(DEG_DESeq2[DEG_DESeq2$regulated=="up",])[2,] #直接选取第一列上调基因
upgene <- rownames(up)
exp <- c(t(express_cpm[match(upgene,rownames(express_cpm)),]))
test <- data.frame(value=exp, group=group_list)
# 绘制检查箱线图
library(ggplot2)
check_DEseq2=ggplot(data=test,aes(x=group,y=value,fill=group)) + geom_boxplot()
###limma----voom差异分析
library(limma)
exp=filter_count
Group=group_list
dge <- edgeR::DGEList(counts=exp)
dge <- edgeR::calcNormFactors(dge)
design <- model.matrix(~Group)
v <- voom(dge,design, normalize="quantile")
fit <- lmFit(v, design)
fit= eBayes(fit)
DEG_limma = topTable(fit, coef=2, n=Inf)
DEG_limma = na.omit(DEG_limma)
pvalue_t=0.05
logFC_t=1
k1 = (DEG_limma$P.Value < pvalue_t)&(DEG_limma$logFC < -logFC_t)
k2 = (DEG_limma$P.Value < pvalue_t)&(DEG_limma$logFC > logFC_t)
DEG_limma$regulated = ifelse(k1,"down",ifelse(k2,"up","not"))
table(DEG_limma$regulated)
##
## down not up
## 896 16211 1099
head(DEG_limma[DEG_limma$regulated=="up",])
## logFC AveExpr t P.Value adj.P.Val B regulated
## ALDH3A1 2.763111 6.481185 77.56984 1.139142e-21 3.963795e-18 39.88181 up
## ABCC2 2.861579 6.032458 76.89266 1.306315e-21 3.963795e-18 39.61690 up
## FTL 1.248969 9.154473 63.03928 2.901312e-20 5.869031e-17 36.85718 up
## EPGN 1.399917 8.304812 60.89337 4.979348e-20 9.065401e-17 36.33538 up
## NQO1 1.557891 9.092813 58.42708 9.486858e-20 1.570161e-16 35.65063 up
## TP53I3 1.662681 6.665246 55.92285 1.877716e-19 2.848808e-16 35.01935 up
up=head(DEG_limma[DEG_limma$regulated=="up",])[1,] #直接选取第一列上调基因
upgene <- rownames(up)
exp <- c(t(express_cpm[match(upgene,rownames(express_cpm)),]))
exp
## [1] 5.129804 5.091980 5.164549 7.864878 7.901136 7.812083
test <- data.frame(value=exp, group=group_list)
# 绘制检查箱线图
library(ggplot2)
check_limma=ggplot(data=test,aes(x=group,y=value,fill=group)) + geom_boxplot()
check_DEseq2+check_edgeR+check_limma
## 01绘制DEG_edgeR的普通版火山图
data <- DEG_edgeR
colnames(data)
## [1] "logFC" "logCPM" "LR" "PValue" "FDR" "regulated"
library(ggplot2)
p1.1 <- ggplot(data=data, aes(x=logFC, y=-log10(PValue),color=regulated)) +
geom_point(alpha=0.5, size=1.8) +
theme_set(theme_set(theme_bw(base_size=20))) +
xlab("log2FC") + ylab("-log10(Pvalue)") +
scale_colour_manual(values = c('blue','black','red'))+theme_bw()
## 02绘制DEG_DESeq2的普通版火山图
data <- DEG_DESeq2
colnames(data)
## [1] "genenames" "baseMean" "log2FoldChange" "lfcSE"
## [5] "stat" "pvalue" "padj" "regulated"
library(ggplot2)
p1.2 <- ggplot(data=data, aes(x=log2FoldChange, y=-log10(pvalue),color=regulated)) +
geom_point(alpha=0.5, size=1.8) +
theme_set(theme_set(theme_bw(base_size=20))) +
xlab("log2FC") + ylab("-log10(Pvalue)") +
scale_colour_manual(values = c('blue','black','red'))+theme_bw()
## 03绘制DEG_limma的普通版火山图
data <- DEG_limma
colnames(data)
## [1] "logFC" "AveExpr" "t" "P.Value" "adj.P.Val" "B"
## [7] "regulated"
library(ggplot2)
p1.3 <- ggplot(data=data, aes(x=logFC, y=-log10(P.Value),color=regulated)) +
geom_point(alpha=0.5, size=1.8) +
theme_set(theme_set(theme_bw(base_size=20))) +
xlab("log2FC") + ylab("-log10(Pvalue)") +
scale_colour_manual(values = c('blue','black','red'))+theme_bw()
p1.1+p1.2+p1.3
## 由于图表实在太多,Veen图主看下差异基因,感兴趣的可以自身看下上下调差异基因
deg_edger= rownames(DEG_edgeR[DEG_edgeR$regulated!="normal",])
deg_deseq2= rownames(DEG_DESeq2[DEG_DESeq2$regulated!="normal",])
deg_limma= rownames(DEG_limma[DEG_limma$regulated!="not",])
library(ggvenn)
deg<-list(`deg_edger`=deg_edger,
`deg_deseq2`=deg_deseq2,
`deg_limma`=deg_limma)
p2 <- ggvenn(deg,
show_percentage = F,
stroke_color = "white",
fill_color = c("#b2d4ec","#ffb2b2","#d3c0e2"),
set_name_color = c("#ff0000","#4a9b83","#b2e7cb"))
p2
##### 1.先比较edger与DEseq2差异分析的相关性
# 01看一看分别比较的整体相关性(此处就看整体差异基因吧)
library(ggstatsplot)
library(ggpmisc)
ids1 = intersect(deg_edger,deg_deseq2) #只有40条,所以得换用富集方法
# 02构建可视化所需的矩阵(相关性就两行值)
df1=data.frame(
deg1=DEG_edgeR[ids1,"logFC"],
deg2=DEG_DESeq2[ids1,"log2FoldChange"] #就是两列logFC的值(散点图)
)
colnames(DEG_DESeq2)
## [1] "genenames" "baseMean" "log2FoldChange" "lfcSE"
## [5] "stat" "pvalue" "padj" "regulated"
# 03绘制相关性图形
p3.1 <- ggplot(df1, aes(x=df1$deg1,
df1$deg2))+
geom_point()+
labs(x = "DEG_edgeR",
y = "DEG_DESeq2")+
scale_color_manual(values=c("blue", "grey","red"))+
geom_vline(xintercept = c( -1.5,0,1.5),lty=4,col="#ec183b",lwd=0.8) +
geom_hline(yintercept = c( -1.5,0,1.5),lty=4,col="#18a5ec",lwd=0.8) +
#xlim(-5,5)+
#ylim(-3,3)+
theme_ggstatsplot()+
theme(panel.grid.minor = element_blank(),
panel.grid.major = element_blank())
p3.1=p3.1+stat_poly_eq(aes(label=paste(..eq.label..,..adj.rr.label..,..p.value.label..,sep = "~~~~")),formula = y~x,parse=T,size=3.0)
p3.1
##### 2.比较edger与limma差异分析的相关性
ids2 = intersect(deg_edger,deg_limma)
# 02构建可视化所需的矩阵(相关性就两行值)
df2=data.frame(
deg1=DEG_edgeR[ids2,"logFC"],
deg2=DEG_limma[ids2,"logFC"] #就是两列logFC的值(散点图)
)
colnames(DEG_limma)
## [1] "logFC" "AveExpr" "t" "P.Value" "adj.P.Val" "B"
## [7] "regulated"
# 03绘制相关性图形
p3.2 <- ggplot(df2, aes(x=df2$deg1,
df2$deg2))+
geom_point()+
labs(x = "DEG_edgeR",
y = "DEG_limma")+
scale_color_manual(values=c("blue", "grey","red"))+
geom_vline(xintercept = c( -1.5,0,1.5),lty=4,col="#ec183b",lwd=0.8) +
geom_hline(yintercept = c( -1.5,0,1.5),lty=4,col="#18a5ec",lwd=0.8) +
#xlim(-5,5)+
#ylim(-3,3)+
theme_ggstatsplot()+
theme(panel.grid.minor = element_blank(),
panel.grid.major = element_blank())
p3.2=p3.2+stat_poly_eq(aes(label=paste(..eq.label..,..adj.rr.label..,..p.value.label..,sep = "~~~~")),formula = y~x,parse=T,size=3.0)
p3.2
##### 3.比较DEseq2与limma差异分析的相关性
ids3 = intersect(deg_deseq2,deg_limma)
# 02构建可视化所需的矩阵(相关性就两行值)
df3=data.frame(
deg1=DEG_DESeq2[ids3,"log2FoldChange"],
deg2=DEG_limma[ids3,"logFC"] #就是两列logFC的值(散点图)
)
colnames(DEG_DESeq2)
## [1] "genenames" "baseMean" "log2FoldChange" "lfcSE"
## [5] "stat" "pvalue" "padj" "regulated"
# 03绘制相关性图形
p3.3 <- ggplot(df3, aes(x=df3$deg1,
df3$deg2))+
geom_point()+
labs(x = "DEG_DESeq2",
y = "DEG_limma")+
scale_color_manual(values=c("blue", "grey","red"))+
geom_vline(xintercept = c( -1.5,0,1.5),lty=4,col="#ec183b",lwd=0.8) +
geom_hline(yintercept = c( -1.5,0,1.5),lty=4,col="#18a5ec",lwd=0.8) +
#xlim(-5,5)+
#ylim(-3,3)+
theme_ggstatsplot()+
theme(panel.grid.minor = element_blank(),
panel.grid.major = element_blank())
p3.3=p3.3+stat_poly_eq(aes(label=paste(..eq.label..,..adj.rr.label..,..p.value.label..,sep = "~~~~")),formula = y~x,parse=T,size=3.0)
p3.3
## 拼图看一下,样本之间的相关性
p3.1+p3.2+p3.3
以上就是转录组常见的三种差异分析方法以及三种差异分析方法的区别。
「总结:」从韦恩图中可见,三种差异分析的差异基因大部分一样,但是因为判定的标准不同,有些差异基因在某些方法中是差异基因,在某些方法中不是差异基因。相关性分析结果表明,三种差异分析方法中两种差异分析获得的共同差异基因的logFC判定具有非常强的相关性,表明它们的趋势基本完全相同。
感兴趣的小伙伴可以尝试分析一下这个数据集哈。在分析自身课题的转录组结果时,可以试试三种方式去分析,多探究探究自己的数据,看看能否获得自身感兴趣的结果。