0.背景知识
在医学研究中,ROC曲线是一种常用的工具,用于评估分类模型的性能,诊断模型就是分类模型的一种。
这是一篇25分的文献,不过已经是多年前的了。
https://www.atsjournals.org/doi/10.1164/rccm.201502-0355OC
文中的fig.2C:
Dot plot and receiver operating characteristics of the FAIM3:PLAC8 gene expression ratio in discriminating CAP and no-CAP patients. Horizontal red line in dot plot denotes the threshold value (0.757). Area under curve (AUC) and 95% confidence interval (CI) analysis was performed by bootstrap resampling (2,000 stratified replicates). Horizontal black bars denote medians.
与平常的ROC曲线不同的有两个点:
1.预测值不是用机器学习模型预测出来的,也不是一个基因的表达量,而是用两个基因表达量的比值。
2.用”bootstrap resampling (2,000 stratified replicates)” 这个方法来计算置信区间,翻译成中文是:“自助重采样(2,000个分层重复)”。看起来很高级,但是其实这是ROC计算时的一个默认参数,没错默认就是这样计算的
if(!require(pROC))install.packages("pROC")
if(!require(ggplot2))install.packages("ggplot2")
if(!require(devtools))install.packages("devtools")
if(!require(tinyarray))devtools::install_github("xjsun1221/tinyarray",upgrade = FALSE,dependencies = TRUE)
#恼火,最近(2024.5.4)tinyarray因为一些形式主义的问题被移除了,暂时只能用github的方式装了,已经重新投,等待回复ing。
library(pROC)
library(ggplot2)
library(tinyarray)
加载TCGA-KIRC.Rdata
其中包含了基因表达矩阵。
在生信星球公众号对话框回复“953roc”获取我的示例数据
rm(list = ls())
load("TCGA-KIRC.Rdata")
ls()
## [1] "clinical" "exp" "Group" "proj"
#只需用到exp(表达矩阵)
exp[1:4,1:4]
## TCGA-CW-5584-01A-01R-1541-07 TCGA-BP-5195-01A-02R-1426-07
## WASH7P 14 38
## MIR6859-1 2 1
## CICP27 0 1
## AL627309.6 4 23
## TCGA-AK-3455-01A-01R-0864-07 TCGA-BP-4347-01A-01R-1289-07
## WASH7P 120 120
## MIR6859-1 1 9
## CICP27 1 1
## AL627309.6 19 48
library(tinyarray)
g = as.numeric(make_tcga_group(exp))-1
g
## [1] 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1
## [38] 1 0 1 1 1 0 1 1 0 1 1 1 1 1 1 1 0 1 1 0 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 0 1
## [75] 1 1 1 0 1 1 1 1 0 1 0 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1
## [112] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [149] 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [186] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1
## [223] 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 1 1 1 1 0 1 1 1 1 1 1
## [260] 0 1 1 1 1 1 1 1 1 0 1 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 1 0 1 0 1
## [297] 1 1 1 1 0 1 1 1 1 1 1 0 1 1 0 1 1 1 1 0 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 0
## [334] 1 1 0 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1
## [371] 1 1 0 1 1 1 1 0 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1
## [408] 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 0
## [445] 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 1 0 1 1 1 1
## [482] 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 0 1 1 1 1 1 1
## [519] 1 1 0 1 0 1 0 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 0
## [556] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0
## [593] 1 0 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
生成分组,表示样本是肿瘤(1)还是正常(0):
以两个基因的表达值比值作为预测值,以PLAC8和TP53为例
predicted = exp["PLAC8",]/exp["TP53",]
使用pROC包中的roc
函数计算ROC曲线对象,并计算AUC及其95%置信区间:
roc_obj <- roc(g, predicted, ci = TRUE);roc_obj
##
## Call:
## roc.default(response = g, predictor = predicted, ci = TRUE)
##
## Data: predicted in 72 controls (g 0) < 541 cases (g 1).
## Area under the curve: 0.785
## 95% CI: 0.7208-0.8493 (DeLong)
aucs <- round(ci.auc(roc_obj),3);aucs
## [1] 0.721 0.785 0.849
使用ggplot2包和pROC包的ggroc
函数来绘制ROC曲线,并添加AUC和95%置信区间的注释:
lb = paste0("AUC:", aucs[2], "\n",
"95%CI:", aucs[1], "-", aucs[3])
ggroc(roc_obj, color = "darkred", legacy.axes = T, linewidth = 1) +
annotate("segment", x = 0, y = 0, xend = 1, yend = 1, linetype = "dashed") +
coord_cartesian(xlim = c(-0.01, 1.01), ylim = c(-0.01, 1.01), expand = F) +
annotate("text", x = 0.9, y = 0.25, label = lb, hjust = 1) +
theme_bw() +
theme(panel.grid = element_blank())
搞定咯