scater 这个R包很强大,是McCarthy et al. 2017 发表的,包含的功能有:
主要是基于 SCESet 对象来进行下游分析,跟ExpressionSet对象类似,也是常见的3个组成:
主要就是读取scRNA上游分析处理得到的表达矩阵,加上每个样本的描述信息,形成矩阵之后。对样本进行过滤,然后对基因进行过滤。针对过滤后的表达矩阵进行各种分类的可视化。
最新的文档如下:
HTML | R Script | An introduction to the scater package |
---|---|---|
HTML | R Script | Data visualisation methods in scater |
HTML | R Script | Expression quantification and import |
HTML | R Script | Quality control with scater |
HTML | R Script | Transition from SCESet to SingleCellExperiment |
其GitHub上面专门开设了介绍它用法的课程:http://hemberg-lab.github.io/scRNA.seq.course/
suppressPackageStartupMessages(library(scater))
data("sc_example_counts")
data("sc_example_cell_info")
example_sce <- SingleCellExperiment(
assays = list(counts = sc_example_counts),
colData = sc_example_cell_info
)
exprs(example_sce) <- log2(
calculateCPM(example_sce, use.size.factors = FALSE) + 1
)
keep_feature <- rowSums(exprs(example_sce) > 0) > 0
example_sce <- example_sce[keep_feature,]
example_sce <- calculateQCMetrics(example_sce,
feature_controls = list(eg = 1:40))
#scater_gui(example_sce)
但是真的非常好用,所有的可视化都集中在了 scater_gui
这个函数产生的shiny
网页里面:
plotScater
: a plot method exists for SingleCellExperiment
objects, which gives an overview of expression across cells.plotQC
: various methods are available for producing QC diagnostic plots.plotPCA
: produce a principal components plot for the cells.plotTSNE
: produce a t-distributed stochastic neighbour embedding (reduced dimension) plot for the cells.plotDiffusionMap
: produce a diffusion map (reduced dimension) plot for the cells.plotMDS
: produce a multi-dimensional scaling plot for the cells.plotReducedDim
: plot a reduced-dimension representation of the cells.plotExpression
: plot expression levels for a defined set of features.plotPlatePosition
: plot cells in their position on a plate, coloured by cell metadata and QC metrics or feature expression level.plotColData
: plot cell metadata and QC metrics.plotRowData
: plot feature metadata and QC metrics.可以充分的探索自己的数据,随便看一个可视化函数的结果:
## ----plot-expression, eval=TRUE--------------------------------------------
plotExpression(example_sce, rownames(example_sce)[1:6],
x = "Mutation_Status", exprs_values = "exprs",
colour = "Treatment")
做QC要结合上面的可视化步骤,所有没办法自动化,只能先可视化,肉眼分辨一下哪些样本或者基因数据是需要舍弃的。
library(knitr)
opts_chunk$set(fig.align = 'center', fig.width = 6, fig.height = 5, dev = 'png')
library(ggplot2)
theme_set(theme_bw(12))
## ----quickstart-load-data, message=FALSE, warning=FALSE--------------------
suppressPackageStartupMessages(library(scater))
data("sc_example_counts")
data("sc_example_cell_info")
## ----quickstart-make-sce, results='hide'-----------------------------------
gene_df <- DataFrame(Gene = rownames(sc_example_counts))
rownames(gene_df) <- gene_df$Gene
example_sce <- SingleCellExperiment(assays = list(counts = sc_example_counts),
colData = sc_example_cell_info,
rowData = gene_df)
example_sce <- normalise(example_sce)
## Warning in .local(object, ...): using library sizes as size factors
## ----quickstart-add-exprs, results='hide'----------------------------------
exprs(example_sce) <- log2(
calculateCPM(example_sce, use.size.factors = FALSE) + 1)
## ----filter-no-exprs-------------------------------------------------------
keep_feature <- rowSums(exprs(example_sce) > 0) > 0
example_sce <- example_sce[keep_feature,]
example_sceset <- calculateQCMetrics(example_sce, feature_controls = list(eg = 1:40))
colnames(colData(example_sceset))
## [1] "Cell"
## [2] "Mutation_Status"
## [3] "Cell_Cycle"
## [4] "Treatment"
## [5] "total_features"
## [6] "log10_total_features"
## [7] "total_counts"
## [8] "log10_total_counts"
## [9] "pct_counts_top_50_features"
## [10] "pct_counts_top_100_features"
## [11] "pct_counts_top_200_features"
## [12] "pct_counts_top_500_features"
## [13] "total_features_endogenous"
## [14] "log10_total_features_endogenous"
## [15] "total_counts_endogenous"
## [16] "log10_total_counts_endogenous"
## [17] "pct_counts_endogenous"
## [18] "pct_counts_top_50_features_endogenous"
## [19] "pct_counts_top_100_features_endogenous"
## [20] "pct_counts_top_200_features_endogenous"
## [21] "pct_counts_top_500_features_endogenous"
## [22] "total_features_feature_control"
## [23] "log10_total_features_feature_control"
## [24] "total_counts_feature_control"
## [25] "log10_total_counts_feature_control"
## [26] "pct_counts_feature_control"
## [27] "total_features_eg"
## [28] "log10_total_features_eg"
## [29] "total_counts_eg"
## [30] "log10_total_counts_eg"
## [31] "pct_counts_eg"
## [32] "is_cell_control"
colnames(rowData(example_sceset))
## [1] "Gene" "is_feature_control"
## [3] "is_feature_control_eg" "mean_counts"
## [5] "log10_mean_counts" "rank_counts"
## [7] "n_cells_counts" "pct_dropout_counts"
## [9] "total_counts" "log10_total_counts"
首先是基于样本的过滤,用 colData(object)
可以查看各个样本统计情况
total_counts
: total number of counts for the cell (aka ‘library size’)log10_total_counts
: total_counts on the log10-scaletotal_features
: the number of features for the cell that have expression above the detection limit (default detection limit is zero)filter_on_total_counts
: would this cell be filtered out based on its log10-total_counts being (by default) more than 5 median absolute deviations from the median log10-total_counts for the dataset?filter_on_total_features
: would this cell be filtered out based on its total_features being (by default) more than 5 median absolute deviations from the median total_features for the dataset?counts_feature_controls
: total number of counts for the cell that come from (a set of user-defined) control features. Defaults to zero if no control features are indicated.counts_endogenous_features
: total number of counts for the cell that come from endogenous features (i.e. not control features). Defaults to total_counts
if no control features are indicated.log10_counts_feature_controls
: total number of counts from control features on the log10-scale. Defaults to zero (i.e. log10(0 + 1), offset to avoid infinite values) if no control features are indicated.log10_counts_endogenous_features
: total number of counts from endogenous features on the log10-scale. Defaults to zero (i.e. log10(0 + 1), offset to avoid infinite values) if no control features are indicated.n_detected_feature_controls
: number of defined feature controls that have expression greater than the threshold defined in the object. *pct_counts_feature_controls
: percentage of all counts that come from the defined control features. Defaults to zero if no control features are defined.然后是基于基因的过滤,用 rowData(object)
可以查看各个基因统计情况
mean_exprs
: the mean expression level of the gene/feature.exprs_rank
: the rank of the feature’s expression level in the cell.total_feature_counts
: the total number of counts mapped to that feature across all cells.log10_total_feature_counts
: total feature counts on the log10-scale.pct_total_counts
: the percentage of all counts that are accounted for by the counts mapping to the feature.is_feature_control
: is the feature a control feature? Default is FALSE
unless control features are defined by the user.n_cells_exprs
: the number of cells for which the expression level of the feature is above the detection limit (default detection limit is zero).scater包自己提供了一个基于PCA的QC标准,不需要自己根据文库大小,覆盖的基因数量,外源的ERCC spike-ins 含量以及线粒体DNA含量来进行人工过滤。
默认的筛选条件如下:
一站式QC函数如下:
dat_pca <- scater::plotPCA(dat_qc,
size_by = "total_features",
shape_by = "use",
pca_data_input = "pdata",
detect_outliers = TRUE,
return_SCESet = TRUE)
还有更详细的教程,需要看
sessionInfo()
过滤只是它最基本的工具,它作为单细胞转录组3大R包,功能肯定是非常全面的,比如前面我们讲解的normalization,DEG, features selection,cluster,它都手到擒来,只不过是包装的是其它R包的函数。