cell2location的原理图:cell2location是一个分层贝叶斯模型,需要使用单细胞数据作为参考,对空间转录组数据进行解卷积。
# Loading packages
import sys
import scanpy as sc
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
import cell2location
from matplotlib import rcParams
rcParams['pdf.fonttype'] = 42 # enables correct plotting of text for PDFs
空间数据来自10X Genomics的人类淋巴结Visium dataset,单细胞数据来自人的次级淋巴器官包括34种细胞类型。
空间数据为10X Space Ranger软件的标准输出文件,来自 Kleshchevnikov et al (section 4, Fig 4)
下载链接:https://support.10xgenomics.com/spatial-gene-expression/datasets/1.0.0/V1_Human_Lymph_Node
Visium空间数据格式为10X Space Ranger软件的输出,来自Kleshchevnikov et al (section 4, Fig 4),可以使用scanpy进行下载。
# 使用包下载
#adata_vis = sc.datasets.visium_sge(sample_id="V1_Human_Lymph_Node")
#adata_vis.obs['sample'] = list(adata_vis.uns['spatial'].keys())[0]
这个文章中使用的其实来自10X Genomics 官网: https://support.10xgenomics.com/spatial-gene-expression/datasets/1.0.0/V1_Human_Lymph_Node。去网站下载到本地后导入:
# 去网站下载到本地后导入
# Set paths to data and results used through the document:
sp_data_folder = './data/V1_Human_Lymph_Node/1.0.0/'
results_folder = './Cell2location/Result_Test/'
adata_vis = sc.read_visium(sp_data_folder, count_file='filtered_feature_bc_matrix.h5', load_images=True)
adata_vis.obs['sample'] = list(adata_vis.uns['spatial'].keys())[0]
adata_vis.var['SYMBOL'] = adata_vis.var_names
adata_vis.var.set_index('gene_ids', drop=True, inplace=True)
注意,数据 data/V1_Human_Lymph_Node/1.0.0/ 目录结构为:
线粒体编码的基因(基因名称以前缀 MT-或 MT-开头)与空间mapping无关,因为它们的表达代表单细胞和细胞核数据中的技术artifacts,而不是线粒体的生物丰度。然而,这些基因在每个spot组成15-40% 的 mRNA 。因此,为了避免绘制artifacts,强烈建议去除线粒体基因。
# find mitochondria-encoded (MT) genes
adata_vis.var['MT_gene'] = [gene.startswith('MT-') for gene in adata_vis.var['SYMBOL']]
# remove MT genes for spatial mapping (keeping their counts in the object)
adata_vis.obsm['MT'] = adata_vis[:, adata_vis.var['MT_gene'].values].X.toarray()
adata_vis = adata_vis[:, ~adata_vis.var['MT_gene'].values]
由于患者供体年龄的原因,已发表的淋巴结scRNA-seq数据集通常缺乏生发中心相关免疫细胞群的充分代表。因此,我们将综合淋巴结、脾脏和扁桃体的scRNA-seq数据集纳入我们的单细胞参考,以确保我们捕获了空间转录组数据集中可能存在的免疫细胞状态的全部多样性。
下载地址:https://cell2location.cog.sanger.ac.uk/paper/integrated_lymphoid_organ_scrna/RegressionNBV4Torch_57covariates_73260cells_10237genes/sc.h5ad
# Read data
reg_path = './data/V1_Human_Lymph_Node/1.0.0/'
adata_ref = sc.read(f'{reg_path}sc.h5ad')
# 将数据的genes转为ENSEMBL ID,与空间数据对应同样的id
adata_ref.var['SYMBOL'] = adata_ref.var.index
# rename 'GeneID-2' as necessary for your data
adata_ref.var.set_index('GeneID-2', drop=True, inplace=True)
在我们估计参考细胞类型特征之前,我们建议进行very permissive genes选择。与标准的高可变基因选择相比,我们更喜欢这种方法,因为我们的程序保留了罕见基因的标记,同时删除了大多数无信息的基因。
from cell2location.utils.filtering import filter_genes
selected = filter_genes(adata_ref, cell_count_cutoff=5, cell_percentage_cutoff2=0.03, nonz_mean_cutoff=1.12)
# filter the object
adata_ref = adata_ref[:, selected].copy()
过滤掉的基因示意图:
每一个点表示一个基因
x轴:检测到该基因的细胞中的平均 RNA count值
y轴:表达该基因的细胞数量
橙色矩形突出显示了基于x轴和y轴的组合而排除的基因。
根据scRNA-seq数据估计特征,考虑批效应,使用负二项回归模型
首先,为回归模型准备一个数据对象:
# prepare anndata for the regression model
cell2location.models.RegressionModel.setup_anndata(adata=adata_ref,
# 10X reaction / sample / batch
batch_key='Sample',
# cell type, covariate used for constructing signatures
labels_key='Subset',
# multiplicative technical effects (platform, 3' vs 5', donor effect)
categorical_covariate_keys=['Method']
)
# create the regression model
from cell2location.models import RegressionModel
mod = RegressionModel(adata_ref)
# view anndata_setup as a sanity check
mod.view_anndata_setup()
接下来训练模型来estimate the reference cell type signatures
这一步骤没有GPU将非常耗时
## 如果你有GPU超算,可以设置use_gpu=True
mod.train(max_epochs=250, use_gpu=False)
检查模型是否需要更多的训练
在这里,我们绘制了训练期间的ELBO损失历史,从图中删除了前20个epoch。这个图在训练结束时应该有一个下降的趋势并趋于平稳。如果max_epochs仍在减小,则增加max_epochs
mod.plot_history(20)
plt.savefig(results_folder+'01-mod.plot_history.png') # 保存为PNG格式
训练结果图:
横轴为epochs,纵轴为ELBO loss,这个图在训练结束时应该有一个下降的趋势并趋于平稳。
保存训练后估计的细胞丰度,输出训练的结果
# In this section, we export the estimated cell abundance (summary of the posterior distribution).
adata_ref = mod.export_posterior(
adata_ref, sample_kwargs={'num_samples': 1000, 'batch_size': 2500, 'use_gpu': False}
)
# Save model
ref_run_name = './data/V1_Human_Lymph_Node/reference_signatures/'
mod.save(f"{ref_run_name}", overwrite=True)
# Save anndata object with results
adata_file = f"{ref_run_name}/sc.h5ad"
adata_ref.write(adata_file)
adata_file
保存的数据后续还可以重新加载回来:
adata_file = f"{ref_run_name}/sc.h5ad"
adata_ref = sc.read_h5ad(adata_file)
mod = cell2location.models.RegressionModel.load(f"{ref_run_name}", adata_ref)
评估标准:
1.Reconstruction accuracy图:这个二维直方图的大部分观测值应该沿着有噪声的对角线
2.图2:由于批处理效应,估计的表达特征与每个聚类的平均表达特征不同。对于不受批处理影响的scRNA-seq数据集(本数据集),可以使用聚类平均表达来代替使用模型估计signature特征。当这个图与对角线图非常不同时(例如,y轴上的值非常低,到处都是密度),这表明特征估计存在问题。
mod.plot_QC()
plt.savefig(results_folder+'02-mod.plot_QC.png') # 保存为PNG格式
评估图:
提取参考细胞类型特征为pd.DataFrame
# export estimated expression in each cluster
if 'means_per_cluster_mu_fg' in adata_ref.varm.keys():
inf_aver = adata_ref.varm['means_per_cluster_mu_fg'][[f'means_per_cluster_mu_fg_{i}'
for i in adata_ref.uns['mod']['factor_names']]].copy()
else:
inf_aver = adata_ref.var[[f'means_per_cluster_mu_fg_{i}'
for i in adata_ref.uns['mod']['factor_names']]].copy()
inf_aver.columns = adata_ref.uns['mod']['factor_names']
inf_aver.iloc[0:12, 0:12]
这个就是用单细胞数据得到的细胞类型signature了,用于空转数据的解卷积。
利用单细胞signature矩阵,对空转数据进行注释,首先提取单细胞与空转数据共同基因:
# find shared genes and subset both anndata and reference signatures
intersect = np.intersect1d(adata_vis.var_names, inf_aver.index)
adata_vis = adata_vis[:, intersect].copy()
inf_aver = inf_aver.loc[intersect, :].copy()
# prepare anndata for cell2location model
cell2location.models.Cell2location.setup_anndata(adata=adata_vis, batch_key="sample")
在进行空转数据注释之前,需要定义两个比较重要的参数
N_cells_per_location:每个spot 的细胞数,这个可以通过配对的 histology images进行评估,或者 通过squdipy 进行细胞分割,然后估计细胞数、或者通过 10X loupe browser 手动数
detection_alpha:建议用 20,因为从已发表的研究来看,人体组织在实验中受到的技术影响较大。如果是使用的小鼠等受技术影响低的数据集,那可以用 200。当然了,最好是两个都试试。
更多详细的说明参考:the flow diagram and the note
本教程N_cells_per_location=30,detection_alpha=20
# create and train the model
mod = cell2location.models.Cell2location(
adata_vis, cell_state_df=inf_aver,
# the expected average cell abundance: tissue-dependent
# hyper-prior which can be estimated from paired histology:
N_cells_per_location=30,
# hyperparameter controlling normalisation of
# within-experiment variation in RNA detection:
detection_alpha=20
)
mod.view_anndata_setup()
Training cell2location:此步骤也非常耗时
mod.train(max_epochs=30000,
# train using full data (batch_size=None)
batch_size=None,
# use all data points in training because
# we need to estimate cell abundance at all locations
train_size=1,
# use_gpu=True,
use_gpu=False,
)
# plot ELBO loss history during training, removing first 100 epochs from the plot
mod.plot_history(1000)
plt.legend(labels=['full data training']);
plt.savefig(results_folder+'03-mod.plot_history.png') # 保存为PNG格式
横轴为epochs,纵轴为ELBO loss,这个图在训练结束时应该有一个下降的趋势并趋于平稳。
训练完之后,将每个 spot 估计的细胞丰度,保存到空转数据中。并且保存模型:
# In this section, we export the estimated cell abundance (summary of the posterior distribution).
adata_vis = mod.export_posterior(
adata_vis, sample_kwargs={'num_samples': 1000, 'batch_size': mod.adata.n_obs, 'use_gpu': True}
)
# Save model
run_name = './data/V1_Human_Lymph_Node/'
mod.save(f"{run_name}", overwrite=True)
# mod = cell2location.models.Cell2location.load(f"{run_name}", adata_vis)
# Save anndata object with results
adata_file = f"{run_name}/sp.h5ad"
adata_vis.write(adata_file)
adata_file
数据后续又可以被加载回来:
adata_file = f"{run_name}/sp.h5ad"
adata_vis = sc.read_h5ad(adata_file)
mod = cell2location.models.Cell2location.load(f"{run_name}", adata_vis)
mapping质量评估:
检查重建精度以评估映射是否存在任何问题。图应该大致是对角线的。
mod.plot_QC()
plt.savefig(results_folder+'04-mod.plot_QC.png') # 保存为PNG格式
结果图:
查看具体的每个spot上的细胞丰度:
adata_vis.obsm['q05_cell_abundance_w_sf']
在空间上可视化:
# add 5% quantile, representing confident cell abundance, 'at least this amount is present',
# to adata.obs with nice names for plotting
adata_vis.obs[adata_vis.uns['mod']['factor_names']] = adata_vis.obsm['q05_cell_abundance_w_sf']
# select one slide
from cell2location.utils import select_slide
slide = select_slide(adata_vis, 'V1_Human_Lymph_Node')
# plot in spatial coordinates
with mpl.rc_context({'axes.facecolor': 'black','figure.figsize': [4.5, 5]}):
sc.pl.spatial(slide, cmap='magma',
# show first 8 cell types
color=['B_Cycling', 'B_GC_LZ', 'T_CD4+_TfH_GC', 'FDC','B_naive', 'T_CD4+_naive', 'B_plasma', 'Endo'],
ncols=4, size=1.3,img_key='hires',
# limit color scale at 99.2% quantile of cell abundance
vmin=0, vmax='p99.2'
)
plt.savefig(results_folder+'05-sc.pl.spatial.png') # 保存为PNG格式
这里选择了8种细胞进行可视化:
这里选择了三种细胞T_CD4+_naive,B_naive,FDC 进行共定位
# Now we use cell2location plotter that allows showing multiple cell types in one panel
from cell2location.plt import plot_spatial
# select up to 6 clusters
clust_labels = ['T_CD4+_naive', 'B_naive', 'FDC']
clust_col = ['' + str(i) for i in clust_labels] # in case column names differ from labels
slide = select_slide(adata_vis, 'V1_Human_Lymph_Node')
with mpl.rc_context({'figure.figsize': (15, 15)}):
fig = plot_spatial(
adata=slide,
# labels to show on a plot
color=clust_col, labels=clust_labels,
show_img=True,
# 'fast' (white background) or 'dark_background'
style='fast',
# limit color scale at 99.2% quantile of cell abundance
max_color_quantile=0.992,
# size of locations (adjust depending on figure size)
circle_diameter=6,
colorbar_position='right'
)
plt.savefig(results_folder+'06-sc.pl.spatial.png') # 保存为PNG格式
结果:这个跟其他空间工具解卷积结果来看,对于同一个spot如果有多种细胞类型,这里展示的是一个 颜色的叠加而不是饼图,嗯,我感觉还是饼图可能会更直观一点,这里如果颜色选取有强弱很容易造成视觉上的错觉吧。
下次见~