本系列讲解 使用 Scanpy
分析单细胞(scRNA-seq)数据 教程[1],持续更新,欢迎关注,转发!
scanpy
在 experimental.pp
模块中引入了基于 Pearson 残差的新预处理函数。
本文的第一部分通过在两个示例数据集上演示新核心函数的用法。第二部分,我们简要解释了可选参数及其默认设置。最后,简要讨论了一次性运行整个 Pearson 残差流程的两个包装函数。
简而言之,Pearson 残差将原始 UMI 计数转换为一种表示,其中实现了三个目标:
因此,计算 Pearson 残差取代了常见的显式按测序深度归一化和为稳定方差而对数据进行 log 转换的步骤。
这里提出的分析 Pearson 残差与 Seurat
的 scTransform
模型类似,但使用了允许解析解的简化模型。
import numpy as np
import matplotlib.pyplot as plt
import scanpy as sc
sc.settings.verbosity = 3 # verbosity: errors (0), warnings (1), info (2), hints (3)
sc.logging.print_header()
sc.settings.set_figure_params(dpi=80, facecolor="white")
本文处理两个 10X 数据集:
创建目录、下载并解压数据:
mkdir tutorial_data
mkdir tutorial_data/pbmc3k_v1
mkdir tutorial_data/pbmc10k_v3
wget http://cf.10xgenomics.com/samples/cell-exp/1.1.0/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz -O tutorial_data/pbmc3k_v1.tar.gz
cd tutorial_data; tar -xzf pbmc3k_v1.tar.gz -C pbmc3k_v1 --strip-components 2
wget https://cf.10xgenomics.com/samples/cell-exp/3.0.0/pbmc_10k_v3/pbmc_10k_v3_filtered_feature_bc_matrix.tar.gz -O tutorial_data/pbmc10k_v3.tar.gz
cd tutorial_data; tar -xzf pbmc10k_v3.tar.gz -C pbmc10k_v3 --strip-components 1
在这里,我们从磁盘加载两个下载的数据集,并为它们创建AnnData
对象。
adata_pbmc3k = sc.read_10x_mtx("tutorial_data/pbmc3k_v1/", cache=True)
adata_pbmc10k = sc.read_10x_mtx("tutorial_data/pbmc10k_v3/", cache=True)
adata_pbmc3k.uns["name"] = "PBMC 3k (v1)"
adata_pbmc10k.uns["name"] = "PBMC 10k (v3)"
为了证明在这些 PBMC 数据集上,Pearson 残差能够挑选出有意义的基因,我们将把选出的基因与 PBMC3k 教程中鉴定的一组 marker genes 进行比较。它们与 PBMC 细胞类型的对应关系如下:
['IL7R', # CD4 T cells
'LYZ', 'CD14', # CD14+ Monocytes
'MS4A1', # B cells
'CD8A', # CD8 T cells
'GNLY', 'NKG7', # NK cells
'FCGR3A', 'MS4A7', # FCGR3A+ Monocytes
'FCER1A', 'CST3', # Dendritic Cells
'PPBP'] # Megakaryocytes
好的基因选择应包括这些差异表达的基因。
# marker genes from table in pbmc3k tutorial
markers = [
"IL7R",
"LYZ",
"CD14",
"MS4A1",
"CD8A",
"GNLY",
"NKG7",
"FCGR3A",
"MS4A7",
"FCER1A",
"CST3",
"PPBP",
]
首先,我们移除计数过少的细胞和基因,然后剔除离群细胞。
for adata in [adata_pbmc3k, adata_pbmc10k]:
adata.var_names_make_unique()
print(adata.uns["name"], ": data shape:", adata.shape)
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
我们计算每个细胞检测到的基因数目、每个细胞的总计数以及每个细胞中线粒体基因的百分比。
for adata in [adata_pbmc3k, adata_pbmc10k]:
adata.var["mt"] = adata.var_names.str.startswith("MT-")
sc.pp.calculate_qc_metrics(
adata, qc_vars=["mt"], percent_top=None, log1p=False, inplace=True
)
我们绘制所有指标,并观察到两个数据集均存在一些异常细胞。
for adata in [adata_pbmc3k, adata_pbmc10k]:
print(adata.uns["name"], ":")
sc.pl.violin(
adata,
["n_genes_by_counts", "total_counts", "pct_counts_mt"],
jitter=0.4,
multi_panel=True,
)
根据这些指标,我们定义异常细胞并将其剔除。随后,确保所有基因在剩余的细胞中至少被检测到一次。
# define outliers and do the filtering for the 3k dataset
adata_pbmc3k.obs["outlier_mt"] = adata_pbmc3k.obs.pct_counts_mt > 5
adata_pbmc3k.obs["outlier_total"] = adata_pbmc3k.obs.total_counts > 5000
adata_pbmc3k.obs["outlier_ngenes"] = adata_pbmc3k.obs.n_genes_by_counts > 2500
print(
"%u cells with high %% of mitochondrial genes"
% (sum(adata_pbmc3k.obs["outlier_mt"]))
)
print("%u cells with large total counts" % (sum(adata_pbmc3k.obs["outlier_total"])))
print("%u cells with large number of genes" % (sum(adata_pbmc3k.obs["outlier_ngenes"])))
adata_pbmc3k = adata_pbmc3k[~adata_pbmc3k.obs["outlier_mt"], :]
adata_pbmc3k = adata_pbmc3k[~adata_pbmc3k.obs["outlier_total"], :]
adata_pbmc3k = adata_pbmc3k[~adata_pbmc3k.obs["outlier_ngenes"], :]
sc.pp.filter_genes(adata_pbmc3k, min_cells=1)
# define outliers and do the filtering for the 10k dataset
adata_pbmc10k.obs["outlier_mt"] = adata_pbmc10k.obs.pct_counts_mt > 20
adata_pbmc10k.obs["outlier_total"] = adata_pbmc10k.obs.total_counts > 25000
adata_pbmc10k.obs["outlier_ngenes"] = adata_pbmc10k.obs.n_genes_by_counts > 6000
print(
"%u cells with high %% of mitochondrial genes"
% (sum(adata_pbmc10k.obs["outlier_mt"]))
)
print("%u cells with large total counts" % (sum(adata_pbmc10k.obs["outlier_total"])))
print(
"%u cells with large number of genes" % (sum(adata_pbmc10k.obs["outlier_ngenes"]))
)
adata_pbmc10k = adata_pbmc10k[~adata_pbmc10k.obs["outlier_mt"], :]
adata_pbmc10k = adata_pbmc10k[~adata_pbmc10k.obs["outlier_total"], :]
adata_pbmc10k = adata_pbmc10k[~adata_pbmc10k.obs["outlier_ngenes"], :]
sc.pp.filter_genes(adata_pbmc10k, min_cells=1)
分析型 Pearson 残差可用于识别具有生物学变异的基因。为此,将观测到的计数与“零模型”下的期望计数进行比较。该零模型假设细胞间没有生物学变异。Pearson 残差的定义确保:未被差异表达的基因其残差方差接近 1;相反,若某基因被差异表达,则会偏离零模型,导致残差更大且残差方差 >1。
调用 highly_variable_genes(flavor='pearson_residuals', n_top_genes=2000)
将计算残差方差,并据此选出 2000 个基因。如下方的图所示,已知细胞类型 marker genes 均被成功选中。
这一步会生成 highly_variable
字段,标记出 Pearson 残差变异最大的 2000 个基因。
for adata in [adata_pbmc3k, adata_pbmc10k]:
sc.experimental.pp.highly_variable_genes(
adata, flavor="pearson_residuals", n_top_genes=2000
)
为了展示筛选流程,我们绘制每个基因的均值与残差方差,并将被选中的基因用红色高亮。在其上方,我们再绘制先前定义的已知 marker genes(黑色)。可以看到,所有已知 marker genes 均被成功选中,符合预期。
fig, axes = plt.subplots(1, 2, figsize=(12, 6))
for ax, adata in zip(axes, [adata_pbmc3k, adata_pbmc10k]):
hvgs = adata.var["highly_variable"]
ax.scatter(
adata.var["mean_counts"], adata.var["residual_variances"], s=3, edgecolor="none"
)
ax.scatter(
adata.var["mean_counts"][hvgs],
adata.var["residual_variances"][hvgs],
c="tab:red",
label="selected genes",
s=3,
edgecolor="none",
)
ax.scatter(
adata.var["mean_counts"][np.isin(adata.var_names, markers)],
adata.var["residual_variances"][np.isin(adata.var_names, markers)],
c="k",
label="known marker genes",
s=10,
edgecolor="none",
)
ax.set_xscale("log")
ax.set_xlabel("mean expression")
ax.set_yscale("log")
ax.set_ylabel("residual variance")
ax.set_title(adata.uns["name"])
ax.spines["right"].set_visible(False)
ax.spines["top"].set_visible(False)
ax.yaxis.set_ticks_position("left")
ax.xaxis.set_ticks_position("bottom")
plt.legend()
我们将两个数据集都切到只保留高变异基因。
adata_pbmc3k = adata_pbmc3k[:, adata_pbmc3k.var["highly_variable"]]
adata_pbmc10k = adata_pbmc10k[:, adata_pbmc10k.var["highly_variable"]]
adata_pbmc3k
# View of AnnData object with n_obs × n_vars = 2574 × 2000
# obs: 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'outlier_mt', 'outlier_total', 'outlier_ngenes'
# var: 'gene_ids', 'n_cells', 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'means', 'variances', 'residual_variances', 'highly_variable_rank', 'highly_variable'
# uns: 'name', 'hvg'
adata_pbmc10k
# View of AnnData object with n_obs × n_vars = 10968 × 2000
# obs: 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'outlier_mt', 'outlier_total', 'outlier_ngenes'
# var: 'gene_ids', 'feature_types', 'n_cells', 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'means', 'variances', 'residual_variances', 'highly_variable_rank', 'highly_variable'
# uns: 'name', 'hvg'
Reference
[1]
Source: https://scanpy.readthedocs.io/en/stable/tutorials/basics/clustering-2017.html