肿瘤单细胞转录组拷贝数分析结果解读和应用

生信技能树jimmy

发布于 2023-10-30 15:37:11

1.8K0

发布于 2023-10-30 15:37:11

文章被收录于专栏：单细胞天地

恰好这个月的学徒介绍了一个肺癌的单细胞转录组和空间转录组联合分析的文章，详见：换一个分析策略会导致文章的全部论点都得推倒重来吗，它里面的一个比较重要的结论上皮细胞细分了8个亚群后，可以根据肿瘤单细胞转录组拷贝数分析结果区分出来里面的4个恶性亚群和4个正常亚群，如下所示：

4个恶性亚群和4个正常亚群

这个肿瘤单细胞转录组拷贝数分析里面最经典的方法就是inferCNV啦，差不多五六年前我研究过它的玩法，就分享后再也没有修改过。但是最近听说参考我教程的小伙伴反馈说这个inferCNV做了一个非常大的更新，导致我前面的教程里面的对inferCNV的结果的解析代码是失效的。正好这次系统性更新一下它！

现在我们就带领大家跑一遍这个分析过程，前面的第一层次降维聚类分群，然后提取上皮细胞后的细分亚群代码就略过了！

首先从前面的第一层次降维聚类分群取不是上皮细胞的任意亚群

第一层次降维聚类分群我们会有全部的细胞的Seurat对象，以及里面的细胞的人工命名信息，然后就可以从取不是上皮细胞的任意亚群，全部的代码如下所示：

  sce.all.int = readRDS('../../2-harmony/sce.all_int.rds')
  colnames(sce.all.int@meta.data) 
  table(sce.all.int$RNA_snn_res.0.8)
  load('../../phe.Rdata')
  sce.all.int@meta.data  =phe 
  table(sce.all.int$celltype)
  
  set.seed(123456789)
  sce=sce.all.int[,phe$celltype=='endo']
  endo.cells  <-  row.names(sce@meta.data) 
  endo.cells=sample(endo.cells,800)
  endoMat=as.data.frame(GetAssayData(subset(sce, cells=endo.cells),
                                     slot='counts',assay='RNA'))
  
  table(phe$celltype)
  sce=sce.all.int[,phe$celltype=='mast']
  mast.cells  <-  row.names(sce@meta.data) 
  mast.cells=sample(mast.cells,800)
  mastMat=as.data.frame(GetAssayData(subset(sce, cells=mast.cells),
                                     slot='counts',assay='RNA'))
  spike_mat1 = endoMat
  spike_mat2 = mastMat
  save(spike_mat1,spike_mat2,file = 'reference_mat.Rdata')

接下来就可以针对上皮细胞亚群细分结果进行矩阵组合

上皮细胞亚群细分结果里面是一万多个细胞，它们虽然也有生物学名字，但是在还没有运行inferCNV之前是不知道恶性与否的状态的：

load(file = 'reference_mat.Rdata')
load(file = 'phe.Rdata')
dim(phe)
sce.all.int = readRDS('2-harmony/sce.all_int.rds')
sce.all.int
sce.all.int=sce.all.int[,colnames(sce.all.int) %in% rownames(phe)]
sce.all.int
identical(colnames(sce.all.int), rownames(phe))
epiMat=as.data.frame(GetAssayData(sce.all.int,
                                  slot='counts',assay='RNA'))
ids = intersect(rownames(epiMat),rownames(spike_mat1))
this_dat=cbind(epiMat[ids,],spike_mat1[ids,],spike_mat2[ids,])
groupinfo=data.frame(v1=colnames(this_dat),
                     v2=c( phe$celltype ,
                           rep('spike-1',300),
                           rep('ref-1',500),
                           rep('spike-2',300),
                           rep('ref-2',500)))
head(groupinfo) 
groupFiles='groupFiles.txt'
write.table(groupinfo,file = groupFiles,
            sep = '\t',quote = F,col.names = F,row.names = F)
print(dim(this_dat))

最后运行inferCNV即可

我这里写了一个简单的函数，myInferCNV(this_dat)，它会自动根据里面的this_dat矩阵信息去拿到它们的坐标后进行排序：

myInferCNV <- function(dat){
  print(dim(dat))
  library(AnnoProbe)
  geneInfor=annoGene(rownames(dat),"SYMBOL",'human')
  colnames(geneInfor)
  geneInfor=geneInfor[with(geneInfor, order(chr, start)),c(1,4:6)]
  geneInfor=geneInfor[!duplicated(geneInfor[,1]),]
  length(unique(geneInfor[,1]))
  head(geneInfor)
  ## 这里可以去除性染色体
  # 也可以把染色体排序方式改变
  dat=dat[rownames(dat) %in% geneInfor[,1],]
  dat=dat[match( geneInfor[,1], rownames(dat) ),]
  dim(dat)
  expFile='expFile.txt'
  #write.table(dat,file = expFile,sep = '\t',quote = F)
  library(data.table)
  fwrite(dat, file = expFile,  row.names = T,sep = "\t", quote = FALSE)
  
  head(geneInfor)
  geneFile='geneFile.txt'
  write.table(geneInfor,file = geneFile,sep = '\t',quote = F,col.names = F,row.names = F)
  
  expFile='expFile.txt'
  groupFiles='groupFiles.txt'
  geneFile='geneFile.txt'
  
  # duplicate 'row.names' are not allowed
  library(infercnv)
  infercnv_obj = CreateInfercnvObject(raw_counts_matrix=expFile,
                                      annotations_file=groupFiles,
                                      delim="\t",
                                      gene_order_file= geneFile,
                                      ref_group_names=c('ref-1',
                                                        'ref-2'))  ## 这个取决于自己的分组信息里面的
  
  
  # cutoff=1 works well for Smart-seq2, and cutoff=0.1 works well for 10x Genomics
  infercnv_obj2 = infercnv::run(infercnv_obj,
                                cutoff=0.1, # cutoff=1 works well for Smart-seq2, and cutoff=0.1 works well for 10x Genomics
                                out_dir= "infercnv_output",  # dir is auto-created for storing outputs
                                cluster_by_groups=T ,   # cluster
                                hclust_method="ward.D2", plot_steps=F)
  
  
  
}

myInferCNV(this_dat)

真正运行infercnv的是 library(infercnv) 里面的 infercnv::run 函数哦！

解读infercnv的结果

运行完毕后很简单的看看infercnv的结果是拷贝数变异热图，非常容易理解，大概是里面的0,1,6,7是可能的恶性肿瘤细胞亚群，而cycle亚群肯定是恶性的，其它目前不好说。如果你对前面的前面的第一层次降维聚类分群，然后提取上皮细胞后的细分亚群代码有疑惑，可以看：换一个分析策略会导致文章的全部论点都得推倒重来吗

大概是里面的0,1,6,7是可能的恶性肿瘤细胞亚群

应用infercnv的结果

上面的肉眼查看拷贝数，会有一点点主观了，去判断细胞亚群恶性与否。更好的方法是计算具体的每个细胞的拷贝数打分，我这里给出来一个自己的方法，需要读取infercnv_output文件夹里面的 run.final.infercnv_obj对象文件：

infer_CNV_obj<-readRDS('./infercnv_output/run.final.infercnv_obj')
expr<-infer_CNV_obj@expr.data
expr[1:4,1:4]
data_cnv<-as.data.frame(expr)
dim(expr)
colnames(data_cnv)
rownames(data_cnv)

load(file = 'phe.Rdata')
dim(phe)
sce.all.int = readRDS('2-harmony/sce.all_int.rds')
sce.all.int
sce.all.int=sce.all.int[,colnames(sce.all.int) %in% rownames(phe)]
sce.all.int
identical(colnames(sce.all.int), rownames(phe))
sce.all.int$celltype = phe$celltype
table(sce.all.int$celltype )
meta <- sce.all.int@meta.data

接下来就是解析读取好的读取infercnv_output文件夹里面的 run.final.infercnv_obj对象文件里面的信息啦：

if(T){
  tmp1 = expr[,infer_CNV_obj@reference_grouped_cell_indices$`ref-1`]
  tmp2 = expr[,infer_CNV_obj@reference_grouped_cell_indices$`ref-2`]
  tmp= cbind(tmp1,tmp2)
  down=mean(rowMeans(tmp)) - 2 * mean( apply(tmp, 1, sd))
  up=mean(rowMeans(tmp)) + 2 * mean( apply(tmp, 1, sd))
  oneCopy=up-down
  oneCopy
  a1= down- 2*oneCopy
  a2= down- 1*oneCopy
  down;up
  a3= up +  1*oneCopy
  a4= up + 2*oneCopy 
  
  cnv_score_table<-infer_CNV_obj@expr.data
  cnv_score_table[1:4,1:4]
  cnv_score_mat <- as.matrix(cnv_score_table)
  
  # Scoring
  cnv_score_table[cnv_score_mat > 0 & cnv_score_mat < a2] <- "A" #complete loss. 2pts
  cnv_score_table[cnv_score_mat >= a2 & cnv_score_mat < down] <- "B" #loss of one copy. 1pts
  cnv_score_table[cnv_score_mat >= down & cnv_score_mat <  up ] <- "C" #Neutral. 0pts
  cnv_score_table[cnv_score_mat >= up  & cnv_score_mat <= a3] <- "D" #addition of one copy. 1pts
  cnv_score_table[cnv_score_mat > a3  & cnv_score_mat <= a4 ] <- "E" #addition of two copies. 2pts
  cnv_score_table[cnv_score_mat > a4] <- "F" #addition of more than two copies. 2pts
  
  # Check
  table(cnv_score_table[,1])
  # Replace with score 
  cnv_score_table_pts <- cnv_score_mat
  rm(cnv_score_mat)
  # 
  cnv_score_table_pts[cnv_score_table == "A"] <- 2
  cnv_score_table_pts[cnv_score_table == "B"] <- 1
  cnv_score_table_pts[cnv_score_table == "C"] <- 0
  cnv_score_table_pts[cnv_score_table == "D"] <- 1
  cnv_score_table_pts[cnv_score_table == "E"] <- 2
  cnv_score_table_pts[cnv_score_table == "F"] <- 2
   
  cnv_score_table_pts[1:4,1:4]
  str(  as.data.frame(cnv_score_table_pts[1:4,1:4])) 
  cell_scores_CNV <- as.data.frame(colSums(cnv_score_table_pts))
  
  colnames(cell_scores_CNV) <- "cnv_score" 
}

上面的代码有一点点复杂，可能是需要大家自行慢慢理解了，有统计学知识这里面。量化的具体的每个细胞的拷贝数打分，进行简单的可视化：

head(cell_scores_CNV) 
score=cell_scores_CNV
head(score)
meta$totalCNV = score[match(colnames(sce.all.int),
                            rownames(score)),1] 
ggplot(meta, aes(x=celltype  , y=totalCNV, fill=celltype  )) +
  geom_boxplot()

如下所示可以看到，其中0,1,6还有cycle是恶性细胞亚群啦，恰好是4群，但是7这个亚群就很迷，前面的肉眼看拷贝数热图似乎是一个恶性肿瘤细胞亚群，但是这里看打分呢又没那么恶性：

7这个亚群就很迷

前面的前面的第一层次降维聚类分群，然后提取上皮细胞后的细分亚群代码有疑惑，可以看：换一个分析策略会导致文章的全部论点都得推倒重来吗，整体来说这个复现的代码在百度云分享给大家：

链接：https://pan.baidu.com/s/1niFqyAiUU3yXK1W26b8RvQ?pwd=nbmj

本文参与腾讯云自媒体同步曝光计划，分享自微信公众号。

原始发表：2023-10-29，如有侵权请联系 cloudcommunity@tencent.com 删除

data

本文分享自单细胞天地微信公众号，前往查看

如有侵权，请联系 cloudcommunity@tencent.com 删除。

本文参与腾讯云自媒体同步曝光计划，欢迎热爱写作的你一起参与！

登录后参与评论

0 条评论

热度