文章图表复现：两个基因的比值做分类模型并画ROC曲线

用户11414625

发布于 2024-12-20 15:27:47

9000

代码可运行

文章被收录于专栏：生信星球520生信星球520

运行总次数：0

代码可运行

0.背景知识

在医学研究中，ROC曲线是一种常用的工具，用于评估分类模型的性能，诊断模型就是分类模型的一种。

这是一篇25分的文献，不过已经是多年前的了。

https://www.atsjournals.org/doi/10.1164/rccm.201502-0355OC

文中的fig.2C：

Dot plot and receiver operating characteristics of the FAIM3:PLAC8 gene expression ratio in discriminating CAP and no-CAP patients. Horizontal red line in dot plot denotes the threshold value (0.757). Area under curve (AUC) and 95% confidence interval (CI) analysis was performed by bootstrap resampling (2,000 stratified replicates). Horizontal black bars denote medians.

与平常的ROC曲线不同的有两个点：

1.预测值不是用机器学习模型预测出来的，也不是一个基因的表达量，而是用两个基因表达量的比值。

2.用”bootstrap resampling (2,000 stratified replicates)” 这个方法来计算置信区间，翻译成中文是：“自助重采样（2,000个分层重复）”。看起来很高级，但是其实这是ROC计算时的一个默认参数，没错默认就是这样计算的

1.安装和加载R包

if(!require(pROC))install.packages("pROC")
if(!require(ggplot2))install.packages("ggplot2")
if(!require(devtools))install.packages("devtools")
if(!require(tinyarray))devtools::install_github("xjsun1221/tinyarray",upgrade = FALSE,dependencies = TRUE)
#恼火，最近(2024.5.4)tinyarray因为一些形式主义的问题被移除了，暂时只能用github的方式装了，已经重新投，等待回复ing。
library(pROC)
library(ggplot2)
library(tinyarray)

2.加载数据

加载TCGA-KIRC.Rdata其中包含了基因表达矩阵。

在生信星球公众号对话框回复“953roc”获取我的示例数据

rm(list = ls())
load("TCGA-KIRC.Rdata")
ls()

## [1] "clinical" "exp"      "Group"    "proj"

#只需用到exp（表达矩阵）
exp[1:4,1:4]

##            TCGA-CW-5584-01A-01R-1541-07 TCGA-BP-5195-01A-02R-1426-07
## WASH7P                               14                           38
## MIR6859-1                             2                            1
## CICP27                                0                            1
## AL627309.6                            4                           23
##            TCGA-AK-3455-01A-01R-0864-07 TCGA-BP-4347-01A-01R-1289-07
## WASH7P                              120                          120
## MIR6859-1                             1                            9
## CICP27                                1                            1
## AL627309.6                           19                           48

library(tinyarray)
g = as.numeric(make_tcga_group(exp))-1
g

##   [1] 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1
##  [38] 1 0 1 1 1 0 1 1 0 1 1 1 1 1 1 1 0 1 1 0 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 0 1
##  [75] 1 1 1 0 1 1 1 1 0 1 0 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1
## [112] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [149] 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [186] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1
## [223] 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 1 1 1 1 0 1 1 1 1 1 1
## [260] 0 1 1 1 1 1 1 1 1 0 1 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 1 0 1 0 1
## [297] 1 1 1 1 0 1 1 1 1 1 1 0 1 1 0 1 1 1 1 0 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 0
## [334] 1 1 0 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1
## [371] 1 1 0 1 1 1 1 0 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1
## [408] 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 0
## [445] 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 1 0 1 1 1 1
## [482] 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 0 1 1 1 1 1 1
## [519] 1 1 0 1 0 1 0 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 0
## [556] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0
## [593] 1 0 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

生成分组，表示样本是肿瘤（1）还是正常（0）：

3.计算预测值

以两个基因的表达值比值作为预测值，以PLAC8和TP53为例

predicted = exp["PLAC8",]/exp["TP53",]

4.计算ROC曲线和AUC

使用pROC包中的roc函数计算ROC曲线对象，并计算AUC及其95%置信区间：

roc_obj <- roc(g, predicted, ci = TRUE);roc_obj

## 
## Call:
## roc.default(response = g, predictor = predicted, ci = TRUE)
## 
## Data: predicted in 72 controls (g 0) < 541 cases (g 1).
## Area under the curve: 0.785
## 95% CI: 0.7208-0.8493 (DeLong)

aucs <- round(ci.auc(roc_obj),3);aucs

## [1] 0.721 0.785 0.849

5.绘制ROC曲线

使用ggplot2包和pROC包的ggroc函数来绘制ROC曲线，并添加AUC和95%置信区间的注释：

lb = paste0("AUC:", aucs[2], "\n",
            "95%CI:", aucs[1], "-", aucs[3])
ggroc(roc_obj, color = "darkred", legacy.axes = T, linewidth = 1) +
  annotate("segment", x = 0, y = 0, xend = 1, yend = 1, linetype = "dashed") +
  coord_cartesian(xlim = c(-0.01, 1.01), ylim = c(-0.01, 1.01), expand = F) +
  annotate("text", x = 0.9, y = 0.25, label = lb, hjust = 1) +
  theme_bw() +
  theme(panel.grid = element_blank())