Comparing clustering methods for single-cell transcriptome data

生信技能树

Published 2018-03-09 10:49:19

Collected in the column: 生信技能树

Background

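Before clustering, the expression matrix must be normalized, and external factors such as batch effects should be removed. Clustering the expression matrix partitions the cell population into distinct states, which helps explain why different subpopulations exist. Computationally, though, clustering is hard: the cells carry no labels in advance, and the number of clusters is not known beforehand. Single-cell transcriptome data make this harder still, because they are very noisy and contain a very large number of genes, which means the dimensionality is very high.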

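Such high-dimensional data first need dimensionality reduction, for which PCA or t-SNE are common choices. The clustering itself is generally unsupervised, for example hierarchical clustering, k-means clustering, and graph-based clustering. The algorithms are somewhat involved, so their details are skipped here.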
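To make the "reduce dimensions first, then cluster without labels" idea concrete, here is a minimal base-R sketch. It does not use the pollen data: the matrix, the three groups, and the choice of 5 principal components are all assumptions made up for illustration.

```r
# Simulated stand-in for a normalized expression matrix (an assumption,
# not the pollen data): 60 cells (rows) x 200 genes (columns), 3 groups.
set.seed(1)
n_per_group <- 20
n_genes <- 200
group <- rep(1:3, each = n_per_group)
expr <- matrix(rnorm(3 * n_per_group * n_genes), ncol = n_genes)
# Shift a different block of 30 genes upward in each group.
for (g in 1:3) {
  block <- ((g - 1) * 30 + 1):(g * 30)
  expr[group == g, block] <- expr[group == g, block] + 3
}

# Step 1: reduce dimensionality with PCA and keep the first 5 components.
pca <- prcomp(expr, center = TRUE, scale. = FALSE)
pcs <- pca$x[, 1:5]

# Step 2a: k-means on the reduced space (k has to be supplied).
km <- kmeans(pcs, centers = 3, nstart = 25)

# Step 2b: hierarchical clustering on the same space, cut into 3 groups.
hc <- hclust(dist(pcs), method = "average")
hc_clusters <- cutree(hc, k = 3)

# Cross-tabulate each result against the simulated groups.
print(table(truth = group, kmeans = km$cluster))
print(table(truth = group, hclust = hc_clusters))
```

Both methods need the number of clusters as input here, which is exactly the difficulty noted above: with real data that number is unknown.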

This post mainly compares six commonly used clustering tools for single-cell transcriptome data:

SINCERA
pcaReduce
SC3
tSNE + k-means
SEURAT
SNN-Cliq

Several packages therefore need to be installed and loaded. The installation code is as follows:

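## Note: biocLite() was the Bioconductor installer when this post was written;
## later Bioconductor releases use BiocManager::install() instead.
## try http:// if https:// URLs are not supported
source("https://bioconductor.org/biocLite.R")
biocLite("SC3")
biocLite("Seurat")
## scater and pcaMethods are loaded below, so install them as well
biocLite(c("scater", "pcaMethods"))
install.packages(c("devtools", "pheatmap"))
library("devtools")
install_github("nghiavtr/BPSC")
install_github("hemberg-lab/scRNA.seq.funcs")
## pcaReduce is installed from GitHub rather than from CRAN
install_github("JustinaZ/pcaReduce")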

The loading code is as follows:

library(pcaMethods)
library(pcaReduce)
library(SC3)
library(scater)
library(pheatmap)
set.seed(1234567)

Loading the test data

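The test data used here is the pollen dataset, loaded as a SCESet object from the scater package. It contains an expression matrix with 23730 features and 301 samples.

There are 11 known cell types in total, so the clustering results can be compared against this known information to see how well each method performs.

plotPCA can be used directly to run a simple PCA and visualize it.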

pollen <- readRDS("../pollen/pollen.rds")
pollen

## SCESet (storageMode: lockedEnvironment)
## assayData: 23730 features, 301 samples 
##   element names: exprs, is_exprs, tpm 
## protocolData: none
## phenoData
##   rowNames: Hi_2338_1 Hi_2338_2 ... Hi_GW16_26 (301 total)
##   varLabels: cell_type1 cell_type2 ... is_cell_control (33 total)
##   varMetadata: labelDescription
## featureData
##   featureNames: A1BG A1BG-AS1 ... ZZZ3 (23730 total)
##   fvarLabels: mean_exprs exprs_rank ... feature_symbol (11 total)
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:

head(fData(pollen))

##          mean_exprs exprs_rank n_cells_exprs total_feature_exprs
## A1BG     0.56418762      12460            79           169.82048
## A1BG-AS1 0.31265010      10621            37            94.10768
## A1CF     0.05453986       6796            59            16.41650
## A2LD1    0.22572953       9781            28            67.94459
## A2M      0.15087563       8855            21            45.41356
## A2M-AS1  0.02428046       5366             3             7.30842
##          pct_total_exprs pct_dropout total_feature_tpm
## A1BG        1.841606e-03    73.75415            481.37
## A1BG-AS1    1.020544e-03    87.70764            538.18
## A1CF        1.780276e-04    80.39867             13.99
## A2LD1       7.368203e-04    90.69767            350.65
## A2M         4.924842e-04    93.02326           1356.63
## A2M-AS1     7.925564e-05    99.00332             88.61
##          log10_total_feature_tpm pct_total_tpm is_feature_control
## A1BG                    2.683380  1.599256e-04              FALSE
## A1BG-AS1                2.731734  1.787996e-04              FALSE
## A1CF                    1.175802  4.647900e-06              FALSE
## A2LD1                   2.546111  1.164965e-04              FALSE
## A2M                     3.132781  4.507134e-04              FALSE
## A2M-AS1                 1.952356  2.943891e-05              FALSE
##          feature_symbol
## A1BG               A1BG
## A1BG-AS1       A1BG-AS1
## A1CF               A1CF
## A2LD1             A2LD1
## A2M                 A2M
## A2M-AS1         A2M-AS1

table(pData(pollen)$cell_type1)

## 
##   2338   2339     BJ   GW16   GW21 GW21+3  hiPSC   HL60   K562   Kera 
##     22     17     37     26      7     17     24     54     42     40 
##    NPC 
##     15

plotPCA(pollen, colour_by = "cell_type1")

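A simple PCA can already separate some of the cell types, but it does not discriminate well among groups of highly similar cells, so new algorithms are needed to solve this clustering problem.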

SC3 clustering

pollen <- sc3_prepare(pollen, ks = 2:5)
pollen <- sc3_estimate_k(pollen)
pollen@sc3$k_estimation

## [1] 11

## Prepare the SCESet object for SC3 and first estimate how many clusters there are; the estimate happens to be exactly 11.

## This step runs in parallel, so it is reasonably fast
pollen <- sc3(pollen, ks = 11, biology = TRUE)

pollen

## SCESet (storageMode: lockedEnvironment)
## assayData: 23730 features, 301 samples 
##   element names: exprs, is_exprs, tpm 
## protocolData: none
## phenoData
##   rowNames: Hi_2338_1 Hi_2338_2 ... Hi_GW16_26 (301 total)
##   varLabels: cell_type1 cell_type2 ... sc3_11_log2_outlier_score
##     (35 total)
##   varMetadata: labelDescription
## featureData
##   featureNames: A1BG A1BG-AS1 ... ZZZ3 (23730 total)
##   fvarLabels: mean_exprs exprs_rank ... sc3_11_de_padj (16 total)
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:

head(fData(pollen))

##          mean_exprs exprs_rank n_cells_exprs total_feature_exprs
## A1BG     0.56418762      12460            79           169.82048
## A1BG-AS1 0.31265010      10621            37            94.10768
## A1CF     0.05453986       6796            59            16.41650
## A2LD1    0.22572953       9781            28            67.94459
## A2M      0.15087563       8855            21            45.41356
## A2M-AS1  0.02428046       5366             3             7.30842
##          pct_total_exprs pct_dropout total_feature_tpm
## A1BG        1.841606e-03    73.75415            481.37
## A1BG-AS1    1.020544e-03    87.70764            538.18
## A1CF        1.780276e-04    80.39867             13.99
## A2LD1       7.368203e-04    90.69767            350.65
## A2M         4.924842e-04    93.02326           1356.63
## A2M-AS1     7.925564e-05    99.00332             88.61
##          log10_total_feature_tpm pct_total_tpm is_feature_control
## A1BG                    2.683380  1.599256e-04              FALSE
## A1BG-AS1                2.731734  1.787996e-04              FALSE
## A1CF                    1.175802  4.647900e-06              FALSE
## A2LD1                   2.546111  1.164965e-04              FALSE
## A2M                     3.132781  4.507134e-04              FALSE
## A2M-AS1                 1.952356  2.943891e-05              FALSE
##          feature_symbol sc3_gene_filter sc3_11_markers_clusts
## A1BG               A1BG            TRUE                     5
## A1BG-AS1       A1BG-AS1            TRUE                     4
## A1CF               A1CF            TRUE                     2
## A2LD1             A2LD1           FALSE                    NA
## A2M                 A2M           FALSE                    NA
## A2M-AS1         A2M-AS1           FALSE                    NA
##          sc3_11_markers_padj sc3_11_markers_auroc sc3_11_de_padj
## A1BG            7.740802e-10            0.8554452   1.648352e-18
## A1BG-AS1        1.120284e-03            0.6507985   5.575777e-03
## A1CF            5.007946e-23            0.8592113   1.162843e-17
## A2LD1                     NA                   NA             NA
## A2M                       NA                   NA             NA
## A2M-AS1                   NA                   NA             NA

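## After SC3, the gene information of the SCESet object has 5 more columns. The most important is sc3_gene_filter, which decides whether a gene is used for clustering; there are too many genes, so a subset has to be selected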
table(fData(pollen)$sc3_gene_filter)

## 
## FALSE  TRUE 
## 11902 11828

### Only about half of the genes (11828 of 23730) were selected for clustering
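The relationship between `sc3_gene_filter` and `pct_dropout` can be illustrated with the six genes printed by head(fData(pollen)). This is only a sketch: the 10% and 90% thresholds are an assumption that happens to reproduce the TRUE/FALSE pattern of those six genes. The post does not state which rule SC3 applies, so check the documentation of your SC3 version before relying on these numbers.

```r
# pct_dropout values copied from the head(fData(pollen)) output.
pct_dropout <- c(
  "A1BG" = 73.75415, "A1BG-AS1" = 87.70764, "A1CF" = 80.39867,
  "A2LD1" = 90.69767, "A2M" = 93.02326, "A2M-AS1" = 99.00332
)

# Assumed thresholds (hypothetical, chosen to match the six genes shown).
min_dropout <- 10
max_dropout <- 90

# Keep genes that are neither nearly ubiquitous nor nearly absent.
keep <- pct_dropout > min_dropout & pct_dropout < max_dropout
print(keep)
```

The idea behind such a filter is that genes detected in almost every cell, or in almost none, carry little information for telling groups of cells apart.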

## Some visualizations follow
sc3_plot_consensus(pollen, k = 11, show_pdata = "cell_type1")

sc3_plot_silhouette(pollen, k = 11)

sc3_plot_expression(pollen, k = 11, show_pdata = "cell_type1")

sc3_plot_markers(pollen, k = 11, show_pdata = "cell_type1")

plotPCA(pollen, colour_by = "sc3_11_clusters")

## Interactive clustering through shiny is also supported; not shown here
# sc3_interactive(pollen)

SC3 clustering clearly separates the cell types better than plain PCA does.
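Since the 11 cell types are known, the visual impression can also be put into a number. One common choice is the adjusted Rand index (ARI), which is 1 when two labelings describe the same partition and close to 0 for random agreement. The post itself does not compute it; the helper `adjusted_rand_index` below is a hypothetical base-R sketch, shown on made-up toy labels.

```r
# Adjusted Rand index between two labelings of the same cells.
# Undefined (NaN) when both labelings put all cells into a single cluster.
adjusted_rand_index <- function(labels_a, labels_b) {
  stopifnot(length(labels_a) == length(labels_b))
  tab <- table(labels_a, labels_b)
  n <- sum(tab)
  sum_cells <- sum(choose(tab, 2))           # pairs together in both labelings
  sum_rows <- sum(choose(rowSums(tab), 2))   # pairs together in labels_a
  sum_cols <- sum(choose(colSums(tab), 2))   # pairs together in labels_b
  expected <- sum_rows * sum_cols / choose(n, 2)
  maximum <- (sum_rows + sum_cols) / 2
  (sum_cells - expected) / (maximum - expected)
}

# Toy labels: the label names differ, only the partition matters.
truth <- c("A", "A", "A", "B", "B", "B", "C", "C", "C")
same_partition <- c(2, 2, 2, 3, 3, 3, 1, 1, 1)
merged <- c(1, 1, 1, 1, 1, 1, 2, 2, 2)   # groups A and B merged

ari_same <- adjusted_rand_index(truth, same_partition)
ari_merged <- adjusted_rand_index(truth, merged)
print(c(same = ari_same, merged = ari_merged))
```

With the objects from this post, the same function could be called as adjusted_rand_index(pData(pollen)$cell_type1, pData(pollen)$sc3_11_clusters). The mclust package provides an equivalent function, adjustedRandIndex.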

pcaReduce

# use the same gene filter as in SC3
input <- exprs(pollen[fData(pollen)$sc3_gene_filter, ])
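# run pcaReduce once (nbt = 1) with q = 30, which builds a whole hierarchy of partitions
pca.red <- PCAreduce(t(input), nbt = 1, q = 30, method = 'S')[[1]]
## Each column of the result holds the cell assignments for one number of clusters.
## Here we take the column for 11 clusters (column 32 - 11) and visualize how the cells are grouped.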
pData(pollen)$pcaReduce <- as.character(pca.red[,32 - 11])
plotPCA(pollen, colour_by = "pcaReduce")

tSNE + kmeans

The scater package wraps Rtsne and ggplot2 to run tSNE and visualize the result.

pollen <- plotTSNE(pollen, rand_seed = 1, return_SCESet = TRUE)

## Above is the tSNE result. Below, kmeans is run on the tSNE coordinates, assuming 8 cell types.
pData(pollen)$tSNE_kmeans <- as.character(kmeans(pollen@reducedDimension, centers = 8)$cluster)
plotTSNE(pollen, rand_seed = 1, colour_by = "tSNE_kmeans")
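The k-means step above fixes the number of clusters by assumption. One common heuristic for checking such a choice, not used in the post, is to look at how the total within-cluster sum of squares falls as k grows. The sketch below uses a made-up 2-D embedding instead of the real tSNE coordinates.

```r
set.seed(42)
# Hypothetical 2-D embedding: 4 well-separated groups of 50 points each.
centers <- rbind(c(0, 0), c(10, 0), c(0, 10), c(10, 10))
embedding <- do.call(rbind, lapply(1:4, function(i) {
  cbind(rnorm(50, centers[i, 1]), rnorm(50, centers[i, 2]))
}))

# Total within-cluster sum of squares for k = 1..8.
# nstart = 20 reruns k-means from 20 random starts and keeps the best fit.
ks <- 1:8
wss <- sapply(ks, function(k) {
  kmeans(embedding, centers = k, nstart = 20)$tot.withinss
})
print(round(wss))

# Relative improvement gained by each additional cluster.
rel_gain <- -diff(wss) / wss[-length(wss)]
print(round(rel_gain, 2))
```

The gain should flatten once k passes the number of groups in the data. Keep in mind that both tSNE and k-means depend on random initialization, so results on a tSNE embedding can change from run to run unless the seeds are fixed.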

SNN-Cliq

This one is a bit awkward to use, so it is not explored further; the code is listed for completeness.

distan <- "euclidean"
par.k <- 3
par.r <- 0.7
par.m <- 0.5
# construct a graph
scRNA.seq.funcs::SNN(
    data = t(input),
    outfile = "snn-cliq.txt",
    k = par.k,
    distance = distan
)
# find clusters in the graph
snn.res <- 
    system(
        paste0(
            "python snn-cliq/Cliq.py ", 
            "-i snn-cliq.txt ",
            "-o res-snn-cliq.txt ",
            "-r ", par.r,
            " -m ", par.m
        ),
        intern = TRUE
    )
cat(paste(snn.res, collapse = "\n"))
snn.res <- read.table("res-snn-cliq.txt")
# remove files that were created during the analysis
system("rm snn-cliq.txt res-snn-cliq.txt")
pData(pollen)$SNNCliq <- as.character(snn.res[,1])
plotPCA(pollen, colour_by = "SNNCliq")
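The graph built in the first step is based on shared nearest neighbours: two cells are considered similar when their nearest-neighbour lists overlap. The base-R sketch below shows only that idea, on made-up 2-D points. It uses a plain count of shared neighbours and includes each point in its own list. Both choices are simplifying assumptions: this is not SNN-Cliq's own similarity definition, and the clique-finding step done by Cliq.py is not reproduced.

```r
set.seed(3)
# Toy data: 12 points in 2-D forming two groups of 6.
pts <- rbind(
  matrix(rnorm(12, mean = 0), ncol = 2),
  matrix(rnorm(12, mean = 8), ncol = 2)
)
k <- 3  # same value as par.k in the code above

# k nearest neighbours of each point from Euclidean distances.
# order(row)[1] is the point itself (distance 0), so it is skipped.
dmat <- as.matrix(dist(pts, method = "euclidean"))
knn <- t(apply(dmat, 1, function(row) order(row)[2:(k + 1)]))

# Shared-neighbour count for every pair of points.
n <- nrow(pts)
shared <- matrix(0L, n, n)
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    if (i != j) {
      shared[i, j] <- length(intersect(c(i, knn[i, ]), c(j, knn[j, ])))
    }
  }
}
print(shared)
```

Pairs with a positive count would be connected by an edge in the graph; dense groups of connected points are then candidates for clusters.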

SINCERA

At least on this dataset, it does not perform well.

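# perform gene-by-gene per-sample z-score transformation
# (apply() over rows returns the transpose, so dat is cells x genes)
dat <- apply(input, 1, function(y) scRNA.seq.funcs::z.transform.helper(y))
# hierarchical clustering of the cells on a correlation-based distance
dd <- as.dist((1 - cor(t(dat), method = "pearson"))/2)
hc <- hclust(dd, method = "average")
# find the largest k for which no cluster consists of a single cell
num.singleton <- 0
kk <- 1
# loop over possible numbers of clusters; dat has one row per cell
for (i in 2:dim(dat)[1]) {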
    clusters <- cutree(hc, k = i)
    clustersizes <- as.data.frame(table(clusters))
    singleton.clusters <- which(clustersizes$Freq < 2)
    if (length(singleton.clusters) <= num.singleton) {
        kk <- i
    } else {
        break;
    }
}
cat(kk)

## 14

pheatmap(
    t(dat),
    cluster_cols = hc,
    cutree_cols = 14,
    kmeans_k = 100,
    show_rownames = FALSE
)
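The SINCERA-style procedure above (z-score each gene, cluster cells on a correlation distance, stop before the first singleton cluster) can be tried without the pollen data or scRNA.seq.funcs. In the sketch below the toy matrix is made up, and `z_score` is an assumed stand-in for z.transform.helper, whose exact behaviour the post does not show.

```r
set.seed(7)
# Toy matrix: 40 genes (rows) x 30 cells (columns), three groups of 10 cells.
group <- rep(1:3, each = 10)
toy <- matrix(rnorm(40 * 30), nrow = 40)
for (g in 1:3) {
  rows <- ((g - 1) * 10 + 1):(g * 10)
  toy[rows, group == g] <- toy[rows, group == g] + 4
}

# Per-gene z-score (assumed stand-in for z.transform.helper).
z_score <- function(y) (y - mean(y)) / sd(y)
# apply() over rows returns cells x genes, so transpose back to genes x cells.
z <- t(apply(toy, 1, z_score))

# Correlation-based distance between cells, average-linkage clustering.
d <- as.dist((1 - cor(z, method = "pearson")) / 2)
hc <- hclust(d, method = "average")

# Largest k reached before the first singleton cluster appears.
best_k <- 1
for (k in 2:ncol(z)) {
  sizes <- table(cutree(hc, k = k))
  if (any(sizes < 2)) break
  best_k <- k
}
cat("chosen k:", best_k, "\n")
print(table(truth = group, cluster = cutree(hc, k = best_k)))
```

The stopping rule makes the chosen k sensitive to outlier cells: a single unusual cell can split off early and end the search, or the search can run past the true number of groups, which may be part of why the method reports 14 clusters on the pollen data.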

SEURAT

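## Note: the calls below follow an early Seurat API (new("seurat", ...), Setup,
## MeanVarPlot, RegressOut, PCAFast); later Seurat releases renamed these functions.
library(Seurat)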
pollen_seurat <- new("seurat", raw.data = get_exprs(pollen, exprs_values = "tpm"))
pollen_seurat <- Setup(pollen_seurat, project = "Pollen")
pollen_seurat <- MeanVarPlot(pollen_seurat)
pollen_seurat <- RegressOut(pollen_seurat, latent.vars = c("nUMI"), 
                            genes.regress = pollen_seurat@var.genes)
pollen_seurat <- PCAFast(pollen_seurat)
pollen_seurat <- RunTSNE(pollen_seurat)
pollen_seurat <- FindClusters(pollen_seurat)
TSNEPlot(pollen_seurat, do.label = TRUE)
pData(pollen)$SEURAT <- as.character(pollen_seurat@ident)
sc3_plot_expression(pollen, k = 11, show_pdata = "SEURAT")
markers <- FindMarkers(pollen_seurat, 2)
FeaturePlot(pollen_seurat, 
            head(rownames(markers)), 
            cols.use = c("lightgrey", "blue"), 
            nCol = 3)

Originally published 2018-01-19; shared from the 生信技能树 WeChat official account.
