上期展示了ESITMATE
(基于转录组数据)计算免疫得分和肿瘤纯度的一个例子,详见ggplot2实现分半小提琴图绘制基因表达谱和免疫得分。实际上计算肿瘤纯度的方法还有InfiniumPurify
(基于甲基化数据)、ABSOLUTE
(基于体细胞拷贝数变异)、PurityEst
(基于突变数据)等等,而计算免疫浸润的有Cibersort
、ssGSEA
、TIMER
等算法。
本期我们介绍一下简单易上手的Cibersort
算法。CIBERSORTx
(https://cibersortx.stanford.edu/)是Newman等人开发的一种分析工具,可基于基因表达数据估算免疫细胞的丰度:
本文主要从以下两个模块进行介绍:
一. Cibersort
算法的纯代码实现及结果绘图展示;
二. Cibersort
算法的网页版实现。
在模块一绘图部分,笔者还根据一篇文献,将得到的免疫细胞分为4大类:Total lymphocytes
,Total dendritic cell
,Total macrophage
及Total mast cell
。
沿用前几期的示例数据,本期使用的还是TCGA头颈癌的43对癌和癌旁FPKM数据。参考文献《Profiling tumor infiltrating immune cells with CIBERSORT》
,可知Cibersort
对于输入数据的要求如下:
值得注意的是,虽然文献建议输入数据要求是non-log linear space
,但是检查Cibersort
可看到有一段函数if(max(Y) < 50) {Y <- 2^Y}
(见模块一的源代码),专门用于检测输入数据是否为log化后的输入数据,如果是(max < 50)
,则会自动还原为log前的数据。因此,用户不必纠结于log
与否。
总结来说,输入数据是芯片数据还是RNA-seq数据都可以,RNA-seq数据不建议直接使用Count
,应使用FPKM
和TPM
或DESEq2标准化后的矩阵
为宜。
但输入数据的基因名字需要为Gene symbol
注意不能有重复的基因名和缺失/负值数据,其他具体的一些细节,还是需要用户去浏览上面那篇文献。
如何处理重复的基因名,如何得到TCGA的FPKM数据,可以参考前几期的推文,利用R代码从UCSC XENA下载mRNA, lncRNA, miRNA表达数据并匹配临床信息。
Cibersort
算法的纯代码实现及结果绘图展示remove(list = ls()) #一键清空
#加载包
library(ggplot2)
library(reshape2)
library(ggpubr)
library(dplyr)
加载cibersort
的官方提供的源码,指定基准数据库文件 (LM22.txt
,这是22种免疫细胞的marker基因,下载自Cibersort官网)。
source('./assist/Cibersort.R')
# 设置分析依赖的基础表达文件
# 每类免疫细胞的标志性基因及其表达
# 基因名字为Gene symbol
LM22.file <- "./database/LM22.txt"
加载自己的数据用于分析计算免疫细胞
# 1. Cibersort
TCGA_exp.file <- "./Rawdata/TCGA_HNSC_mRNA_fpkm_paired_43vs43.txt"
TCGA_TME.results <- CIBERSORT(LM22.file ,TCGA_exp.file, perm = 50, QN = F)
# perm置换次数=1000
# QN如果是芯片设置为T,如果是测序就设置为F
write.csv(TCGA_TME.results, "./Output/TCGA_CIBERSORT_Results_fromRcode.csv")
Permutations for significance analysis是用来计算单个样本估算免疫浸润的p
值,大多数文章会采用1000次。数值越大,运行时间越久,这里笔者为了加速运行的速度,选择了50次 (笔记本运行耗时5
分钟)。
## 2. 分组信息
# TCGA的数据还可以从名字获取
# group_list <- ifelse(as.numeric(substring(rownames(TCGA_TME.results),14,15)) < 10,
# "Tumor","Normal") %>%
# factor(.,levels = c("Normal","Tumor"))
phenotype = read.csv("./Rawdata/TCGA_HNSC_paired_metadata.csv",header = T,row.names = 1)
group_list <- phenotype$group %>%
factor(.,levels = c("Nontumor","Tumor"))
table(group_list) # Normal 43 Tumor 43
## group_list
## Nontumor Tumor
## 43 43
3.1 数据转换预处理,取前22列,忽略掉后面计算出的P-value
,Correlation
, RMSE
单列信息。
## 3. 绘图
# 3.1 数据粗处理
TME_data <- as.data.frame(TCGA_TME.results[,1:22])
TME_data$group <- group_list
TME_data$sample <- row.names(TME_data)
# 2.2 融合数据
TME_New = melt(TME_data)
## Using group, sample as id variables
colnames(TME_New)=c("Group","Sample","Celltype","Composition") #设置行名
head(TME_New)
## Group Sample Celltype Composition
## 1 Tumor TCGA.CV.6943.01 B cells naive 0.007651678
## 2 Tumor TCGA.CV.6959.01 B cells naive 0.019549031
## 3 Nontumor TCGA.CV.7438.11 B cells naive 0.025349204
## 4 Nontumor TCGA.CV.7242.11 B cells naive 0.032583659
## 5 Tumor TCGA.CV.7432.01 B cells naive 0.000000000
## 6 Nontumor TCGA.CV.6939.11 B cells naive 0.074282293
3.2 按免疫细胞占比中位数排序绘图(可选)
# 3.3 按免疫细胞占比中位数排序绘图(可选)
plot_order = TME_New[TME_New$Group=="Tumor",] %>%
group_by(Celltype) %>%
summarise(m = median(Composition)) %>%
arrange(desc(m)) %>%
pull(Celltype)
## `summarise()` ungrouping output (override with `.groups` argument)
TME_New$Celltype = factor(TME_New$Celltype,levels = plot_order)
3.3 绘制箱线图
# 3.3 出图
if(T){
mytheme <- theme(plot.title = element_text(size = 12,color="black",hjust = 0.5),
axis.title = element_text(size = 12,color ="black"),
axis.text = element_text(size= 12,color = "black"),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.text.x = element_text(angle = 45, hjust = 1 ),
panel.grid=element_blank(),
legend.position = "top",
legend.text = element_text(size= 12),
legend.title= element_text(size= 12)
) }
box_TME <- ggplot(TME_New, aes(x = Celltype, y = Composition))+
labs(y="Cell composition",x= NULL,title = "TME Cell composition")+
geom_boxplot(aes(fill = Group),position=position_dodge(0.5),width=0.5,outlier.alpha = 0)+
scale_fill_manual(values = c("#1CB4B8", "#EB7369"))+
theme_classic() + mytheme +
stat_compare_means(aes(group = Group),
label = "p.signif",
method = "wilcox.test",
hide.ns = T)
box_TME;ggsave("./Output/TCGA_HNSCC_TME.pdf",box_TME,height=15,width=25,unit="cm")
除了常规的结果展示,笔者还看到有一篇文献《The Immune Subtypes and Landscape of Squamous Cell Carcinoma》
,将Cibersort计算得到的20类免疫细胞大致分为4大类:Total lymphocytes
,Total dendritic cell
,Total macrophage
及Total mast cell
。此外,还计算了M1/M2巨噬细胞的比例。方法部分的描述如下:
4.1. 提取文献用到的前20种免疫细胞
##
# 4.1 取20种免疫细胞
TCGA_TME_four = as.data.frame(TCGA_TME.results[,1:20])
head(TCGA_TME_four,3)
## B cells naive B cells memory Plasma cells T cells CD8 T cells CD4 naive
## TCGA.CV.6943.01 0.007651678 0.00000000 0.04249975 0.33831313 0
## TCGA.CV.6959.01 0.019549031 0.00000000 0.05950866 0.03842026 0
## TCGA.CV.7438.11 0.025349204 0.01107915 0.06419163 0.14099744 0
## T cells CD4 memory resting T cells CD4 memory activated T cells follicular helper
## TCGA.CV.6943.01 0.00000000 0.06796614 0.03913214
## TCGA.CV.6959.01 0.11798455 0.07211551 0.00000000
## TCGA.CV.7438.11 0.06486108 0.03390592 0.07309813
## T cells regulatory (Tregs) T cells gamma delta NK cells resting NK cells activated
## TCGA.CV.6943.01 0.05035414 0.00000000 0.007625709 0.000000000
## TCGA.CV.6959.01 0.00000000 0.00000000 0.037617016 0.001607423
## TCGA.CV.7438.11 0.04039972 0.02312377 0.000000000 0.030996791
## Monocytes Macrophages M0 Macrophages M1 Macrophages M2 Dendritic cells resting
## TCGA.CV.6943.01 0 0.04310127 0.2272938 0.08760478 0.010476052
## TCGA.CV.6959.01 0 0.29903927 0.1140220 0.08606996 0.008655281
## TCGA.CV.7438.11 0 0.12215035 0.1099711 0.05154016 0.104453345
## Dendritic cells activated Mast cells resting Mast cells activated
## TCGA.CV.6943.01 0.00000000 0.07769942 0.0000000
## TCGA.CV.6959.01 0.03261665 0.00000000 0.0919163
## TCGA.CV.7438.11 0.03317884 0.06982843 0.0000000
4.2 根据文献整理得到的免疫细胞分类
# 4.2 根据文献整理得到的免疫细胞分类
immCell_four_type <- read.table("./database/Cibersort_four_types.txt", header = T, row.names = NULL, sep = "\t")
colnames(TCGA_TME_four) == immCell_four_type$Immune.cells #T
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
head(immCell_four_type)
## Immune.cells Types
## 1 B cells naive Lymphocytes
## 2 B cells memory Lymphocytes
## 3 Plasma cells Lymphocytes
## 4 T cells CD8 Lymphocytes
## 5 T cells CD4 naive Lymphocytes
## 6 T cells CD4 memory resting Lymphocytes
4.3 计算每一个大类的免疫得分
# 4.3 数据预处理
TCGA_TME_four$group = group_list
TCGA_TME_four$sample <- row.names(TCGA_TME_four)
TME_four_new = melt(TCGA_TME_four)
## Using group, sample as id variables
colnames(TME_four_new) = c("Group","Sample","Immune.cells","Composition")
TCGA_TME_four_new2 = left_join(TME_four_new, immCell_four_type, by = "Immune.cells") %>%
group_by(Sample,Group,Types) %>%
summarize(Sum = sum(Composition))
## `summarise()` regrouping output by 'Sample', 'Group' (override with `.groups` argument)
# 出图
box_four_immtypes <- ggplot(TCGA_TME_four_new2, aes(x = Group, y = Sum))+
labs(y="Cell composition",x= NULL,title = "TCGA")+
geom_boxplot(aes(fill = Group),position=position_dodge(0.5),width=0.5,size=0.4,
outlier.alpha = 1, outlier.size = 0.5)+
theme_bw() + mytheme +
scale_fill_manual(values = c("#1CB4B8","#EB7369"))+
scale_y_continuous(labels = scales::percent)+
facet_wrap(~ Types,scales = "free",ncol = 4) +
stat_compare_means(aes(group = Group),
label = "p.format",
method = "wilcox.test",
size = 3.5,
hide.ns = T)
box_four_immtypes;ggsave("./Output/TCHA_HNSC_Cibersort_four_immune_cell_types.pdf",
box_four_immtypes ,height= 10,width=25,unit="cm")
Cibersort
网页版实现网页版和代码版输出的结果是等价的。区别在于用R代码运行Cibersort
非常耗时,但胜在比较自由方便;而网页版的好处在于在线运行数据,上传和运行之后,即使关闭网页也能拿到数据,缺点在于网页版不太稳定,网络不好的时候很难登录和使用。
笔者再介绍一下网页版的使用步骤:
首先是注册和登录,登录网址是https://cibersortx.stanford.edu/,注册需要edu的邮箱。
第二步是上传数据,如下图所示,点击Menu—Upload files—Add files上传txt数据,数据格式详见示例数据。
第三步,配置参数,准备运行,点击Menu—Run CIBERSORTx—2.Impute Cell Fractions,具体的配置如下:
和代码版的一样,为了加速运行的速度,Permutations for significance analysis这里选择了50次:
第四步,运行一段时间之后,可以看到结果,Menu—Job Results,点击CSV或者XLSX可得到预测的结果,即为模块一的输出数据。
sessionInfo()
## R version 4.0.4 (2021-02-15)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=Chinese (Simplified)_People's Republic of China.936
## [2] LC_CTYPE=Chinese (Simplified)_People's Republic of China.936
## [3] LC_MONETARY=Chinese (Simplified)_People's Republic of China.936
## [4] LC_NUMERIC=C
## [5] LC_TIME=Chinese (Simplified)_People's Republic of China.936
##
## attached base packages:
## [1] parallel stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] preprocessCore_1.50.0 e1071_1.7-3 dplyr_1.0.0 ggpubr_0.4.0
## [5] reshape2_1.4.4 ggplot2_3.3.2
##
## loaded via a namespace (and not attached):
## [1] tinytex_0.24 tidyselect_1.1.0 xfun_0.15 purrr_0.3.4 haven_2.3.1
## [6] carData_3.0-4 colorspace_1.4-1 vctrs_0.3.1 generics_0.1.0 htmltools_0.5.0
## [11] yaml_2.2.1 rlang_0.4.6 pillar_1.4.6 foreign_0.8-80 glue_1.4.1
## [16] withr_2.2.0 readxl_1.3.1 lifecycle_0.2.0 plyr_1.8.6 stringr_1.4.0
## [21] munsell_0.5.0 ggsignif_0.6.0 gtable_0.3.0 cellranger_1.1.0 zip_2.1.1
## [26] evaluate_0.14 labeling_0.3 knitr_1.29 rio_0.5.16 forcats_0.5.0
## [31] class_7.3-17 curl_4.3 fansi_0.4.1 broom_0.7.0 Rcpp_1.0.5
## [36] scales_1.1.1 backports_1.1.7 abind_1.4-5 farver_2.0.3 hms_0.5.3
## [41] digest_0.6.25 stringi_1.4.6 openxlsx_4.1.5 rstatix_0.6.0 grid_4.0.2
## [46] cli_2.0.2 tools_4.0.2 magrittr_1.5 tibble_3.0.2 crayon_1.3.4
## [51] tidyr_1.1.0 car_3.0-8 pkgconfig_2.0.3 ellipsis_0.3.1 data.table_1.12.8
## [56] assertthat_0.2.1 rmarkdown_2.7 rstudioapi_0.11 R6_2.4.1 igraph_1.2.5
## [61] compiler_4.0.2