文章/答案/技术大牛

发布

社区首页 >专栏 >Cibersort免疫浸润的在线分析及R语言代码实现

Cibersort免疫浸润的在线分析及R语言代码实现

生信宝典

发布于 2022-01-18 11:41:11

6.3K02

代码可运行

文章被收录于专栏：生信宝典生信宝典

运行总次数：2

代码可运行

上期展示了ESITMATE（基于转录组数据）计算免疫得分和肿瘤纯度的一个例子，详见ggplot2实现分半小提琴图绘制基因表达谱和免疫得分。实际上计算肿瘤纯度的方法还有InfiniumPurify(基于甲基化数据)、ABSOLUTE（基于体细胞拷贝数变异）、PurityEst（基于突变数据）等等，而计算免疫浸润的有Cibersort、ssGSEA、TIMER等算法。

本期我们介绍一下简单易上手的Cibersort算法。CIBERSORTx（https://cibersortx.stanford.edu/）是Newman等人开发的一种分析工具，可基于基因表达数据估算免疫细胞的丰度：

本文主要从以下两个模块进行介绍：

一. Cibersort算法的纯代码实现及结果绘图展示；

二. Cibersort算法的网页版实现。

在模块一绘图部分，笔者还根据一篇文献，将得到的免疫细胞分为4大类：Total lymphocytes，Total dendritic cell，Total macrophage及Total mast cell。

沿用前几期的示例数据，本期使用的还是TCGA头颈癌的43对癌和癌旁FPKM数据。参考文献《Profiling tumor infiltrating immune cells with CIBERSORT》，可知Cibersort对于输入数据的要求如下：

值得注意的是，虽然文献建议输入数据要求是non-log linear space，但是检查Cibersort可看到有一段函数if(max(Y) < 50) {Y <- 2^Y}(见模块一的源代码)，专门用于检测输入数据是否为log化后的输入数据，如果是（max < 50），则会自动还原为log前的数据。因此，用户不必纠结于log与否。

总结来说，输入数据是芯片数据还是RNA-seq数据都可以，RNA-seq数据不建议直接使用Count，应使用FPKM和TPM或DESEq2标准化后的矩阵为宜。

但输入数据的基因名字需要为Gene symbol

注意不能有重复的基因名和缺失/负值数据，其他具体的一些细节，还是需要用户去浏览上面那篇文献。

如何处理重复的基因名，如何得到TCGA的FPKM数据，可以参考前几期的推文，利用R代码从UCSC XENA下载mRNA, lncRNA, miRNA表达数据并匹配临床信息。

模块一. `Cibersort`算法的纯代码实现及结果绘图展示

remove(list = ls()) #一键清空
#加载包
library(ggplot2)
library(reshape2)
library(ggpubr)
library(dplyr)

1. Cibersort计算免疫细胞

加载cibersort的官方提供的源码，指定基准数据库文件 (LM22.txt，这是22种免疫细胞的marker基因，下载自Cibersort官网)。

source('./assist/Cibersort.R')

# 设置分析依赖的基础表达文件
# 每类免疫细胞的标志性基因及其表达
# 基因名字为Gene symbol
LM22.file <- "./database/LM22.txt"

加载自己的数据用于分析计算免疫细胞

# 1. Cibersort

TCGA_exp.file <- "./Rawdata/TCGA_HNSC_mRNA_fpkm_paired_43vs43.txt"

TCGA_TME.results <- CIBERSORT(LM22.file ,TCGA_exp.file, perm = 50, QN = F)  
# perm置换次数=1000
# QN如果是芯片设置为T，如果是测序就设置为F

write.csv(TCGA_TME.results, "./Output/TCGA_CIBERSORT_Results_fromRcode.csv")

Permutations for significance analysis是用来计算单个样本估算免疫浸润的p值，大多数文章会采用1000次。数值越大，运行时间越久，这里笔者为了加速运行的速度，选择了50次（笔记本运行耗时5分钟）。

2. 分组信息

## 2. 分组信息

# TCGA的数据还可以从名字获取
# group_list <- ifelse(as.numeric(substring(rownames(TCGA_TME.results),14,15)) < 10,
#                    "Tumor","Normal") %>% 
#  factor(.,levels = c("Normal","Tumor"))

phenotype = read.csv("./Rawdata/TCGA_HNSC_paired_metadata.csv",header = T,row.names = 1)

group_list <- phenotype$group %>% 
  factor(.,levels = c("Nontumor","Tumor"))

table(group_list) # Normal 43   Tumor 43 

## group_list
## Nontumor    Tumor 
##       43       43

3. 绘图

3.1 数据转换预处理，取前22列，忽略掉后面计算出的P-value,Correlation, RMSE单列信息。

## 3. 绘图
  # 3.1 数据粗处理
TME_data <- as.data.frame(TCGA_TME.results[,1:22])

TME_data$group <- group_list
TME_data$sample <- row.names(TME_data)

  # 2.2 融合数据
TME_New = melt(TME_data)

## Using group, sample as id variables

colnames(TME_New)=c("Group","Sample","Celltype","Composition")  #设置行名
head(TME_New)

##      Group          Sample      Celltype Composition
## 1    Tumor TCGA.CV.6943.01 B cells naive 0.007651678
## 2    Tumor TCGA.CV.6959.01 B cells naive 0.019549031
## 3 Nontumor TCGA.CV.7438.11 B cells naive 0.025349204
## 4 Nontumor TCGA.CV.7242.11 B cells naive 0.032583659
## 5    Tumor TCGA.CV.7432.01 B cells naive 0.000000000
## 6 Nontumor TCGA.CV.6939.11 B cells naive 0.074282293

3.2 按免疫细胞占比中位数排序绘图（可选）

  # 3.3 按免疫细胞占比中位数排序绘图（可选）
plot_order = TME_New[TME_New$Group=="Tumor",] %>% 
  group_by(Celltype) %>% 
  summarise(m = median(Composition)) %>% 
  arrange(desc(m)) %>% 
  pull(Celltype)

## `summarise()` ungrouping output (override with `.groups` argument)

TME_New$Celltype = factor(TME_New$Celltype,levels = plot_order)

3.3 绘制箱线图

# 3.3 出图
if(T){
  mytheme <- theme(plot.title = element_text(size = 12,color="black",hjust = 0.5),
                   axis.title = element_text(size = 12,color ="black"), 
                   axis.text = element_text(size= 12,color = "black"),
                   panel.grid.minor.y = element_blank(),
                   panel.grid.minor.x = element_blank(),
                   axis.text.x = element_text(angle = 45, hjust = 1 ),
                   panel.grid=element_blank(),
                   legend.position = "top",
                   legend.text = element_text(size= 12),
                   legend.title= element_text(size= 12)
  ) }

box_TME <- ggplot(TME_New, aes(x = Celltype, y = Composition))+ 
  labs(y="Cell composition",x= NULL,title = "TME Cell composition")+  
  geom_boxplot(aes(fill = Group),position=position_dodge(0.5),width=0.5,outlier.alpha = 0)+ 
  scale_fill_manual(values = c("#1CB4B8", "#EB7369"))+
  theme_classic() + mytheme + 
  stat_compare_means(aes(group =  Group),
                     label = "p.signif",
                     method = "wilcox.test",
                     hide.ns = T)

box_TME;ggsave("./Output/TCGA_HNSCC_TME.pdf",box_TME,height=15,width=25,unit="cm")

除了常规的结果展示，笔者还看到有一篇文献《The Immune Subtypes and Landscape of Squamous Cell Carcinoma》，将Cibersort计算得到的20类免疫细胞大致分为4大类：Total lymphocytes，Total dendritic cell，Total macrophage及Total mast cell。此外，还计算了M1/M2巨噬细胞的比例。方法部分的描述如下：

4.1. 提取文献用到的前20种免疫细胞

##
  # 4.1  取20种免疫细胞
TCGA_TME_four = as.data.frame(TCGA_TME.results[,1:20])
head(TCGA_TME_four,3)

##                 B cells naive B cells memory Plasma cells T cells CD8 T cells CD4 naive
## TCGA.CV.6943.01   0.007651678     0.00000000   0.04249975  0.33831313                 0
## TCGA.CV.6959.01   0.019549031     0.00000000   0.05950866  0.03842026                 0
## TCGA.CV.7438.11   0.025349204     0.01107915   0.06419163  0.14099744                 0
##                 T cells CD4 memory resting T cells CD4 memory activated T cells follicular helper
## TCGA.CV.6943.01                 0.00000000                   0.06796614                0.03913214
## TCGA.CV.6959.01                 0.11798455                   0.07211551                0.00000000
## TCGA.CV.7438.11                 0.06486108                   0.03390592                0.07309813
##                 T cells regulatory (Tregs) T cells gamma delta NK cells resting NK cells activated
## TCGA.CV.6943.01                 0.05035414          0.00000000      0.007625709        0.000000000
## TCGA.CV.6959.01                 0.00000000          0.00000000      0.037617016        0.001607423
## TCGA.CV.7438.11                 0.04039972          0.02312377      0.000000000        0.030996791
##                 Monocytes Macrophages M0 Macrophages M1 Macrophages M2 Dendritic cells resting
## TCGA.CV.6943.01         0     0.04310127      0.2272938     0.08760478             0.010476052
## TCGA.CV.6959.01         0     0.29903927      0.1140220     0.08606996             0.008655281
## TCGA.CV.7438.11         0     0.12215035      0.1099711     0.05154016             0.104453345
##                 Dendritic cells activated Mast cells resting Mast cells activated
## TCGA.CV.6943.01                0.00000000         0.07769942            0.0000000
## TCGA.CV.6959.01                0.03261665         0.00000000            0.0919163
## TCGA.CV.7438.11                0.03317884         0.06982843            0.0000000

4.2 根据文献整理得到的免疫细胞分类

  # 4.2 根据文献整理得到的免疫细胞分类
immCell_four_type <- read.table("./database/Cibersort_four_types.txt", header = T, row.names = NULL, sep = "\t")
colnames(TCGA_TME_four) == immCell_four_type$Immune.cells #T

##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

head(immCell_four_type)

##                 Immune.cells       Types
## 1              B cells naive Lymphocytes
## 2             B cells memory Lymphocytes
## 3               Plasma cells Lymphocytes
## 4                T cells CD8 Lymphocytes
## 5          T cells CD4 naive Lymphocytes
## 6 T cells CD4 memory resting Lymphocytes

4.3 计算每一个大类的免疫得分

  # 4.3 数据预处理
TCGA_TME_four$group = group_list
TCGA_TME_four$sample <- row.names(TCGA_TME_four)
TME_four_new = melt(TCGA_TME_four)

## Using group, sample as id variables

colnames(TME_four_new) = c("Group","Sample","Immune.cells","Composition")

TCGA_TME_four_new2 = left_join(TME_four_new, immCell_four_type, by = "Immune.cells") %>% 
                group_by(Sample,Group,Types) %>%
                summarize(Sum = sum(Composition))

## `summarise()` regrouping output by 'Sample', 'Group' (override with `.groups` argument)

  # 出图
box_four_immtypes <- ggplot(TCGA_TME_four_new2, aes(x = Group, y = Sum))+ 
  labs(y="Cell composition",x= NULL,title = "TCGA")+  
  geom_boxplot(aes(fill = Group),position=position_dodge(0.5),width=0.5,size=0.4,
               outlier.alpha = 1, outlier.size = 0.5)+ 
  theme_bw() + mytheme + 
  scale_fill_manual(values = c("#1CB4B8","#EB7369"))+ 
  scale_y_continuous(labels = scales::percent)+
  facet_wrap(~ Types,scales = "free",ncol = 4) + 
  stat_compare_means(aes(group =  Group),
                     label = "p.format",
                     method = "wilcox.test",
                     size = 3.5,
                     hide.ns = T)
box_four_immtypes;ggsave("./Output/TCHA_HNSC_Cibersort_four_immune_cell_types.pdf",
                         box_four_immtypes ,height= 10,width=25,unit="cm")

模块二. `Cibersort`网页版实现

网页版和代码版输出的结果是等价的。区别在于用R代码运行Cibersort非常耗时，但胜在比较自由方便；而网页版的好处在于在线运行数据，上传和运行之后，即使关闭网页也能拿到数据，缺点在于网页版不太稳定，网络不好的时候很难登录和使用。

笔者再介绍一下网页版的使用步骤：

首先是注册和登录，登录网址是https://cibersortx.stanford.edu/，注册需要edu的邮箱。

第二步是上传数据，如下图所示，点击Menu—Upload files—Add files上传txt数据，数据格式详见示例数据。

第三步，配置参数，准备运行，点击Menu—Run CIBERSORTx—2.Impute Cell Fractions，具体的配置如下：

和代码版的一样，为了加速运行的速度，Permutations for significance analysis这里选择了50次：

第四步，运行一段时间之后，可以看到结果，Menu—Job Results，点击CSV或者XLSX可得到预测的结果，即为模块一的输出数据。

sessionInfo()

## R version 4.0.4 (2021-02-15)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=Chinese (Simplified)_People's Republic of China.936 
## [2] LC_CTYPE=Chinese (Simplified)_People's Republic of China.936   
## [3] LC_MONETARY=Chinese (Simplified)_People's Republic of China.936
## [4] LC_NUMERIC=C                                                   
## [5] LC_TIME=Chinese (Simplified)_People's Republic of China.936    
## 
## attached base packages:
## [1] parallel  stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] preprocessCore_1.50.0 e1071_1.7-3           dplyr_1.0.0           ggpubr_0.4.0         
## [5] reshape2_1.4.4        ggplot2_3.3.2        
## 
## loaded via a namespace (and not attached):
##  [1] tinytex_0.24      tidyselect_1.1.0  xfun_0.15         purrr_0.3.4       haven_2.3.1      
##  [6] carData_3.0-4     colorspace_1.4-1  vctrs_0.3.1       generics_0.1.0    htmltools_0.5.0  
## [11] yaml_2.2.1        rlang_0.4.6       pillar_1.4.6      foreign_0.8-80    glue_1.4.1       
## [16] withr_2.2.0       readxl_1.3.1      lifecycle_0.2.0   plyr_1.8.6        stringr_1.4.0    
## [21] munsell_0.5.0     ggsignif_0.6.0    gtable_0.3.0      cellranger_1.1.0  zip_2.1.1        
## [26] evaluate_0.14     labeling_0.3      knitr_1.29        rio_0.5.16        forcats_0.5.0    
## [31] class_7.3-17      curl_4.3          fansi_0.4.1       broom_0.7.0       Rcpp_1.0.5       
## [36] scales_1.1.1      backports_1.1.7   abind_1.4-5       farver_2.0.3      hms_0.5.3        
## [41] digest_0.6.25     stringi_1.4.6     openxlsx_4.1.5    rstatix_0.6.0     grid_4.0.2       
## [46] cli_2.0.2         tools_4.0.2       magrittr_1.5      tibble_3.0.2      crayon_1.3.4     
## [51] tidyr_1.1.0       car_3.0-8         pkgconfig_2.0.3   ellipsis_0.3.1    data.table_1.12.8
## [56] assertthat_0.2.1  rmarkdown_2.7     rstudioapi_0.11   R6_2.4.1          igraph_1.2.5     
## [61] compiler_4.0.2

参考资料：

Newman AM, Steen CB, Liu CL, Gentles AJ, Chaudhuri AA, Scherer F, Khodadoust MS, Esfahani MS, Luca BA, Steiner D, Diehn M, Alizadeh AA. Determining cell type abundance and expression from bulk tissues with digital cytometry. Nat Biotechnol. 2019 Jul;37(7):773-782. doi: 10.1038/s41587-019-0114-2. Epub 2019 May 6. PMID: 31061481; PMCID: PMC6610714.
Chen B, Khodadoust MS, Liu CL, Newman AM, Alizadeh AA. Profiling Tumor Infiltrating Immune Cells with CIBERSORT. Methods Mol Biol. 2018;1711:243-259. doi: 10.1007/978-1-4939-7493-1_12. PMID: 29344893; PMCID: PMC5895181.
Li B, Cui Y, Nambiar DK, Sunwoo JB, Li R. The Immune Subtypes and Landscape of Squamous Cell Carcinoma. Clin Cancer Res. 2019 Jun 15;25(12):3528-3537. doi: 10.1158/1078-0432.CCR-18-4085. Epub 2019 Mar 4. PMID: 30833271; PMCID: PMC6571041.
数据和代码下载： https://gitee.com/ct5869/shengxin-baodian/tree/master/TCGA
作者：赵法明编辑：生信宝典

本文参与腾讯云自媒体同步曝光计划，分享自微信公众号。

原始发表：2021-05-13，如有侵权请联系 cloudcommunity@tencent.com 删除

编程算法