端到端的单细胞管道SCP-快速开始

生信技能树jimmy

发布于 2023-09-22 14:35:06

1.1K0

发布于 2023-09-22 14:35:06

春风得意马蹄疾，一日看尽长安花

作为快速开始，本章将简单展示SCP三个模块（前处理、下游分析、可视化）的大致功能，各模块各函数的使用细节将在后续教程中详细说明。

目录：
1. 数据探索	9. RNA速率分析
2. 细胞质控	10. 差异表达分析
3. 标准的单细胞处理流程	11. 富集分析（ORA）
4. 批次矫正的单细胞整合处理流程	12. 富集分析（GSEA）
5. 单细胞数据集间的映射	13. 轨迹推断
6. 使用Bulk RNA-seq数据进行细胞注释	14. 动态特征鉴定
7. 使用scRNA-seq数据进行细胞注释	15. 交互式的单细胞数据查询网页（SCExplorer）
8. PAGA分析	16. 一些函数的可视化示例

1、数据探索

本章主要使用下采样后的小鼠胚胎E15.5天的胰腺上皮单细胞数据进行分析，可以通过在R中运行?pancreas_sub来查看该示例数据相关信息。

该数据中的细胞类型已经被注释，其中pancreas_sub[["CellType"]]为粗分细胞类型，分为胰腺导管细胞（Ductal），内分泌前体细胞（ Ngn3 low EP，Ngn3 high EP），前内分泌细胞（Pre-endocrine）和内分泌细胞（Endocrine）五种细胞类型。pancreas_sub[["SubCellType"]]则进一步细分了终末分化的内分泌细胞类型，包括Beta，Alpha，Delta，Epsilon。

另外该数据已含有PCA和UMAP坐标信息，以及spliced和unspliced矩阵（可用于RNA速率分析）。

library(SCP)
library(BiocParallel)
register(MulticoreParam(workers = 8, progressbar = TRUE))

data("pancreas_sub")
print(pancreas_sub)
#> An object of class Seurat
#> 47874 features across 1000 samples within 3 assays
#> Active assay: RNA (15958 features, 3467 variable features)
#>  2 other assays present: spliced, unspliced
#>  2 dimensional reductions calculated: PCA, UMAP

注意，SCP的多线程策略与Seurat不同，使用BiocParallel[1]而不是future[2]。建议使用SCP时不要启用future，否则在一些函数运行上会变慢。

这里使用BiocParallel设置了线程数为8，也可以在运行相关函数时手动设置，SCP支持多线程的函数有：

RunDEtest
RunEnrichment
RunGSEA
RunDynamicFeatures
RunDynamicEnrichment
CellScoring
panel_fix

然后可以使用一些函数对数据进行探索性的可视化：

CellDimPlot(
  srt = pancreas_sub, group.by = c("CellType", "SubCellType"),
  reduction = "UMAP", theme_use = "theme_blank"
)

CellDimPlot(
  srt = pancreas_sub, group.by = "SubCellType", stat.by = "Phase",
  reduction = "UMAP", theme_use = "theme_blank"
)

FeatureDimPlot(
  srt = pancreas_sub, features = c("Sox9", "Neurog3", "Fev", "Rbp4"),
  reduction = "UMAP", theme_use = "theme_blank"
)

FeatureDimPlot(
  srt = pancreas_sub, features = c("Ins1", "Gcg", "Sst", "Ghrl"),
  compare_features = TRUE, label = TRUE, label_insitu = TRUE,
  reduction = "UMAP", theme_use = "theme_blank"
)

ht <- GroupHeatmap(
  srt = pancreas_sub,
  features = c(
    "Sox9", "Anxa2", # Ductal
    "Neurog3", "Hes6", # EPs
    "Fev", "Neurod1", # Pre-endocrine
    "Rbp4", "Pyy", # Endocrine
    "Ins1", "Gcg", "Sst", "Ghrl" # Beta, Alpha, Delta, Epsilon
  ),
  group.by = c("CellType", "SubCellType"),
  heatmap_palette = "YlOrRd",
  cell_annotation = c("Phase", "G2M_score", "Cdh2"),
  cell_annotation_palette = c("Dark2", "Paired", "Paired"),
  show_row_names = TRUE, row_names_side = "left",
  add_dot = TRUE, add_reticle = TRUE
)
print(ht$plot)

2、细胞质控

RunCellQC 将会采用默认质控策略对各细胞进行质控指标的计算和分类。

因为示例数据已无需质控，这里仅作演示：

pancreas_sub <- RunCellQC(srt = pancreas_sub)
CellDimPlot(srt = pancreas_sub, group.by = "CellQC", reduction = "UMAP")

CellStatPlot(srt = pancreas_sub, stat.by = "CellQC", group.by = "CellType", label = TRUE)

CellStatPlot(
  srt = pancreas_sub,
  stat.by = c(
    "db_qc", "outlier_qc", "umi_qc", "gene_qc",
    "mito_qc", "ribo_qc", "ribo_mito_ratio_qc", "species_qc"
  ),
  plot_type = "upset", stat_level = "Fail"
)

3、标准的单细胞数据处理流程

对于单个单细胞样本，可以直接运行Standard_SCP 来进行normalization, feature selection, dimension reduction, clustering等标准的单细胞数据处理流程，最终获得细胞分群，降维后坐标等：

pancreas_sub <- Standard_SCP(srt = pancreas_sub)
CellDimPlot(
  srt = pancreas_sub, group.by = c("CellType", "SubCellType"),
  reduction = "StandardUMAP2D", theme_use = "theme_blank"
)

Standard_SCP默认还会计算UMAP的三维空间嵌入坐标，可进行3D可视化：

CellDimPlot3D(srt = pancreas_sub, group.by = "SubCellType")

4、批次矫正的单细胞数据整合处理流程

对于具有批次效应的多批次数据，可以使用Integration_SCP 进行批次矫正。

Integration_SCP 需要两个设置额外参数：

batch Seurat对象中用于区分批次信息的meta.data属性
integration_method 用于矫正的算法，支持Seurat, scVI, MNN, fastMNN, Harmony, Scanorama, BBKNN, CSS, LIGER, Conos, Combat，不作矫正的话可以设置为Uncorrected

接下以一个下采样后的8个成人胰腺单细胞数据集为例，进行不同算法的批次效应矫正，可以通过在R中运行?panc8_sub来查看该示例数据相关信息：

data("panc8_sub")
panc8_sub <- Integration_SCP(srtMerge = panc8_sub, batch = "tech", integration_method = "Seurat")
CellDimPlot(
  srt = panc8_sub, group.by = c("celltype", "tech"), reduction = "SeuratUMAP2D",
  title = "Seurat", theme_use = "theme_blank"
)

不矫正以及其余批次矫正算法的结果：

5、单细胞数据集间的映射

到这里已经有了两个胰腺数据集，虽然是分别小鼠和人两个物种，且一个是胚胎胰腺另一个是成人胰腺，但是许多细胞类型是类似的，理论上相同的细胞类型会有近邻关系。

所以可以对这两个数据集进行细胞映射，映射前首先需要将小鼠基因名称转换成人基因名称（这里为了简单起见直接首字母大写，但推荐使用GeneConvert进行物种间的基因的ID转换）：

panc8_rename <- RenameFeatures(srt = panc8_sub, newnames = make.unique(capitalize(rownames(panc8_sub), force_tolower = TRUE)), assays = "RNA")
srt_query <- RunKNNMap(srt_query = pancreas_sub, srt_ref = panc8_rename, ref_umap = "SeuratUMAP2D")
ProjectionPlot(
  srt_query = srt_query, srt_ref = panc8_rename,
  query_group = "SubCellType", ref_group = "celltype"
)

当然，相同物种、相同组织的数据集作细胞映射是最佳的：

srt_ref <- panc8_sub[, panc8_sub$tech != "smartseq2"]
srt_query <- panc8_sub[, panc8_sub$tech == "smartseq2"]
srt_query <- RunKNNMap(srt_query = srt_query, srt_ref = srt_ref, ref_umap = "SeuratUMAP2D")
ProjectionPlot(
  srt_query = srt_query, srt_ref = srt_ref,
  query_group = "celltype", ref_group = "celltype"
)

显然映射结果基本上可以匹配正确的细胞类型，因此很多时候可以利用相同组织已注释的数据集，映射手头的数据，来直观的进行细胞注释。

6、使用Bulk RNA-seq数据进行细胞注释

细胞注释和细胞映射在背后的计算上是类似的，区别在于前者的目标是明确细胞类型（或其他属性）间的对应关系，而不是空间坐标的邻近关系。

SCP准备了人和鼠的参考集scHCL[3]和scMCA[4]（行为基因，列为各细胞类型）。我们将scMCA[5]其视作一个Bulk rna-seq来进行细胞类型注释，并过滤掉小于20个细胞的细胞类型并归为unreliable：

data("ref_scMCA")
dim(ref_scMCA)
#> [1] 3028  894
pancreas_sub <- RunKNNPredict(srt_query = pancreas_sub, bulk_ref = ref_scMCA, filter_lowfreq = 20)
CellDimPlot(srt = pancreas_sub, group.by = "KNNPredict_classification", reduction = "UMAP", label = TRUE)

7、使用scRNA-seq数据进行细胞注释

利用之前成人胰腺的scRNA-seq数据也可以直接进行注释，注释可在单个细胞层面或整个细胞类型层面进行：

pancreas_sub <- RunKNNPredict(
  srt_query = pancreas_sub, srt_ref = panc8_rename,
  ref_group = "celltype", filter_lowfreq = 20
)
CellDimPlot(srt = pancreas_sub, group.by = "KNNPredict_classification", reduction = "UMAP", label = TRUE)

pancreas_sub <- RunKNNPredict(
  srt_query = pancreas_sub, srt_ref = panc8_rename,
  query_group = "SubCellType", ref_group = "celltype",
  return_full_distance_matrix = TRUE
)
CellDimPlot(srt = pancreas_sub, group.by = "KNNPredict_classification", reduction = "UMAP", label = TRUE)

在细胞类型注释时，RunKNNPredict会使用距离或相似性指标进行两个数据集的细胞间比较或细胞类型间比较，最终确定细胞类型，这里也可以使用CellCorHeatmap来直接观察细胞类型间的距离或相似性，并标记出前三个最高的相似性值：

ht <- CellCorHeatmap(
  srt_query = pancreas_sub, srt_ref = panc8_rename,
  query_group = "SubCellType", ref_group = "celltype",
  nlabel = 3, label_by = "row",
  show_row_names = TRUE, show_column_names = TRUE
)
print(ht$plot)

8、PAGA分析

PAGA分析可以推断出细胞群之间的邻近关系或轨迹，建立好SCP的python环境后，直接运行：

pancreas_sub <- RunPAGA(
  srt = pancreas_sub, group_by = "SubCellType",
  linear_reduction = "PCA", nonlinear_reduction = "UMAP"
)
PAGAPlot(srt = pancreas_sub, reduction = "UMAP", label = TRUE, label_insitu = TRUE, label_repel = TRUE)

注意，PAGA的结果存储在了pancreas_sub@misc$paga中。

9、RNA速率分析

RNA速率的估计需要Seurat对象含有spliced和unspliced两个assay，可以通过names(pancreas_sub@assays)查看，如果不存在需要使用velocyto[6],bustools[7]或者alevin[8]来生成矩阵并存入Seurat对象中。

在运行RunPAGA或RunSCVELO的过程中，如果使用的是Rstudio，则可以在其plot窗口看到相关的输出结果，运行结束后相关的结果将保存在Seurat对象中。

pancreas_sub <- RunSCVELO(
  srt = pancreas_sub, group_by = "SubCellType",
  linear_reduction = "PCA", nonlinear_reduction = "UMAP"
)
VelocityPlot(srt = pancreas_sub, reduction = "UMAP", group_by = "SubCellType")

VelocityPlot(srt = pancreas_sub, reduction = "UMAP", plot_type = "stream")

10、差异表达分析

SCP中的差异分析使用的是RunDEtest，其中主要调用了Seurat的FindMarkers函数，并使用了多线程来完成计算：

pancreas_sub <- RunDEtest(srt = pancreas_sub, group_by = "CellType", fc.threshold = 1, only.pos = FALSE)
VolcanoPlot(srt = pancreas_sub, group_by = "CellType")

所有的差异分析结果存储在pancreas_sub@tools中，可以进行进一步的筛选出差异具有显著性的基因，并对这些基因注释上转录因子和表面蛋白，最后可视化：

DEGs <- pancreas_sub@tools$DEtest_CellType$AllMarkers_wilcox
DEGs <- DEGs[with(DEGs, avg_log2FC > 1 & p_val_adj < 0.05), ]
# Annotate features with transcription factors and surface proteins
pancreas_sub <- AnnotateFeatures(pancreas_sub, species = "Mus_musculus", db = c("TF", "CSPA"))
ht <- FeatureHeatmap(
  srt = pancreas_sub, group.by = "CellType", features = DEGs$gene, feature_split = DEGs$group1,
  species = "Mus_musculus", db = c("GO_BP", "KEGG", "WikiPathway"), anno_terms = TRUE,
  feature_annotation = c("TF", "CSPA"), feature_annotation_palcolor = list(c("gold", "steelblue"), c("forestgreen")),
  height = 5, width = 4
)
print(ht$plot)

11、富集分析（ORA）

上面的热图中同时也做了富集分析，也可以使用RunEnrichment单独进行计算，这里我们筛选出1.5倍foldchange且p_val_adj<0.05的差异基因做富集和不同类型的可视化（默认只显示p.adjust<0.05的显著性富集的事件）：

pancreas_sub <- RunEnrichment(
  srt = pancreas_sub, group_by = "CellType", db = "GO_BP", species = "Mus_musculus",
  DE_threshold = "avg_log2FC > log2(1.5) & p_val_adj < 0.05"
)

柱形图：

EnrichmentPlot(
  srt = pancreas_sub, group_by = "CellType", group_use = c("Ductal", "Endocrine"),
  plot_type = "bar"
)

词云图（事件中的关键词）：

EnrichmentPlot(
  srt = pancreas_sub, group_by = "CellType", group_use = c("Ductal", "Endocrine"),
  plot_type = "wordcloud"
)

词云图（事件中的关键基因）：

EnrichmentPlot(
  srt = pancreas_sub, group_by = "CellType", group_use = c("Ductal", "Endocrine"),
  plot_type = "wordcloud", word_type = "feature"
)

事件-基因网络图：

EnrichmentPlot(
  srt = pancreas_sub, group_by = "CellType", group_use = "Ductal",
  plot_type = "network"
)

所有富集事件的enrichmap

注意：需要调整plot窗口，来显示出所有的事件标签

EnrichmentPlot(
  srt = pancreas_sub, group_by = "CellType", group_use = "Ductal",
  plot_type = "enrichmap"
)

各组富集事件比较的点图：

EnrichmentPlot(srt = pancreas_sub, group_by = "CellType", plot_type = "comparison")

12、富集分析（GSEA）

之前运行RunDEtest时fc.threshold为1，也就是foldchange不作为基因的筛选标准，只使用其余标准进行筛选，例如min.pct>0.1。差异检验后可以将p_val_adj < 0.05的所有基因取出，作为候选基因集进行GSEA分析：

pancreas_sub <- RunGSEA(
  srt = pancreas_sub, group_by = "CellType", db = "GO_BP", species = "Mus_musculus",
  DE_threshold = "p_val_adj < 0.05"
)
GSEAPlot(srt = pancreas_sub, group_by = "CellType", group_use = "Endocrine", id_use = "GO:0007186")

GSEAPlot(srt = pancreas_sub, group_by = "CellType", plot_type = "comparison")

ORA富集分析和GSEA富集分析的结果也均存于pancreas_sub@tools中

13、轨迹推断

SCP支持Slingshot, PAGA, scVelo, Palantir, Monocle2, Monocle3, WOT等轨迹推断方法，不过不论是什么方法都需要基于生物学知识库对算法进行合理的限制，对结果进行合理的裁剪。

以slingshot为例进行轨迹推断：

pancreas_sub <- RunSlingshot(srt = pancreas_sub, group.by = "SubCellType", reduction = "UMAP")

在这里Slingshot推断出了三条发育轨迹/谱系，起点均为Ductal/Ngn3 low EP细胞群，终点分别指向了三种内分泌细胞群Alpha, Beta以及Delta。分别查看三条轨迹的pseudotime：

FeatureDimPlot(pancreas_sub,
  features = paste0("Lineage", 1:3),
  reduction = "UMAP", theme_use = "theme_blank"
)

灰色的细胞表示不在对应的发育轨迹，其对应轨迹的pesudotime=NA。

轨迹的可视化可以根据这些pesudotime调整：

CellDimPlot(pancreas_sub,
  group.by = "SubCellType", reduction = "UMAP",
  lineages = paste0("Lineage", 1:3), lineages_span = 0.1
)

14、动态特征鉴定

对于任意算法得到的轨迹pseudotime，都可以使用RunDynamicFeatures计算出与之相关的动态基因（驱动细胞沿轨迹分化发育的基因）。

对于多条轨迹也可以同时计算动态基因并比较：

pancreas_sub <- RunDynamicFeatures(srt = pancreas_sub, lineages = c("Lineage1", "Lineage2"), n_candidates = 200)
ht <- DynamicHeatmap(
  srt = pancreas_sub, lineages = c("Lineage1", "Lineage2"),
  use_fitted = TRUE, n_split = 6, reverse_ht = "Lineage1",
  species = "Mus_musculus", db = "GO_BP", anno_terms = TRUE, anno_keys = TRUE, anno_features = TRUE,
  heatmap_palette = "viridis", cell_annotation = "SubCellType",
  separate_annotation = list("SubCellType", c("Nnat", "Irx1")), separate_annotation_palette = c("Paired", "Set1"),
  feature_annotation = c("TF", "CSPA"), feature_annotation_palcolor = list(c("gold", "steelblue"), c("forestgreen")),
  pseudotime_label = 25, pseudotime_label_color = "red",
  height = 5, width = 2
)
print(ht$plot)

或逐一查询基因在轨迹中的动态变化：

DynamicPlot(
  srt = pancreas_sub, lineages = c("Lineage1", "Lineage2"), group.by = "SubCellType",
  features = c("Plk1", "Hes1", "Neurod2", "Ghrl", "Gcg", "Ins2"),
  compare_lineages = TRUE, compare_features = FALSE
)

FeatureStatPlot(
  srt = pancreas_sub, group.by = "SubCellType", bg.by = "CellType",
  stat.by = c("Sox9", "Neurod2", "Isl1", "Rbp4"), add_box = TRUE,
  comparisons = list(
    c("Ductal", "Ngn3 low EP"),
    c("Ngn3 high EP", "Pre-endocrine"),
    c("Alpha", "Beta")
  )
)

15、交互式的单细胞数据查询网页（SCExplorer）

最后通过SCExplorer，可以快速的共享单细胞数据到云端，实现交互式的数据查询：

PrepareSCExplorer(list(mouse_pancreas = pancreas_sub, human_pancreas = panc8_sub), base_dir = "./SCExplorer")
app <- RunSCExplorer(base_dir = "./SCExplorer")
list.files("./SCExplorer") # This directory can be used as site directory for Shiny Server.

if (interactive()) {
  shiny::runApp(app)
}