
为什么要介绍python版本的空转分析流程?
两个方面的原因:第一是有部分小伙伴的数据分析生态是在python中进行的。第二是类似于scRNA分析的时候,有些功能分析可能R和python不能兼顾,所以必须认识两种常见的分析方式,便于互相转化,以及方便拓展到其他的分析。
关于空转数据的分析,是scanpy(表达矩阵处理流程)+squidpy(空间图像处理)结合。关于spots矩阵的处理与scanpy对于scRNA的处理是类似的。scanpy对于空转数据不再添加相应的功能,而是将功能转移到更加专注空转数据的Scverse ecosystem。SquidPy是一个用于空间分子数据分析与可视化的工具。它构建在Scanpy和AnnData 之上,继承了它们的模块化设计和可扩展性。SquidPy提供了能够利用数据中空间坐标的分析工具,并在可用的情况下结合组织图像进行分析。它支持多个平台空转数据的处理,参照https://squidpy.readthedocs.io/en/stable/index.html。这里我们首先介绍visium数据的处理。关于R版的visium数据处理参考:跟着Nature学空转分析(一):10x visium多样本基础分析的再次回顾---解析生孩子的困境,以及seurat官网!
conda create -n squidpy python=3.10 -y
conda activate squidpy
pip install squidpy -i https://pypi.tuna.tsinghua.edu.cn/simpleimport numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import anndata as ad
import scanpy as sc
import squidpy as sq数据的读入使用sq.read.visium函数,类似于seurat的Load10X_Spatial,要求的数据格式也是类似的,需要有表达矩阵filtered_feature_bc_matrix.h5 or raw_feature_bc_matrix.h5,还有一个spatial文件夹。标准的情况下,sq.read.visium只需要提供sample所在路径即可,就会直接读取。
adata_test = sq.read.visium(path='./visium_data/',#表达矩阵及spitial文件所在目录
counts_file='filtered_feature_bc_matrix.h5',#表达矩阵,可以是filter或者raw
library_id='A1_brain',#设定的用于设别sample的id,在多样本整合的时候区分sample
load_images=True)#是否加载图片,False则只加载坐标,不加载image
adata_test
AnnData object with n_obs × n_vars = 10878 × 18085
obs: 'in_tissue', 'array_row', 'array_col'
var: 'gene_ids', 'feature_types', 'genome'
uns: 'spatial'
obsm: 'spatial'#读入结果结构解析
anndata.AnnData.obsm ['spatial'] - spatial spot coordinates.
anndata.AnnData.uns ['spatial']['{library_id}']['images'] - hires and lowres images.
anndata.AnnData.uns ['spatial']['{library_id}']['scalefactors'] - scale factors for the spots.
anndata.AnnData.uns ['spatial']['{library_id}']['metadata'] - various metadata.sq.pl.spatial_scatter(adata_test)
非常糟糕的是,事实并非总如意,公共数据库提供的表达矩阵基本都是10x标准的三个文件,不能直接读取,只能采取迂回路线:先一步步解析读取,之后建立一个通用函数!
#先读入表达矩阵
adata = sc.read_10x_mtx(
path="./CTR4_processed_data/filtered_feature_bc_matrix/",
var_names="gene_symbols",
make_unique=True)
adata.var_names_make_unique()
adata#读入tissue_positions文件并过滤在组织上的spots
pos = pd.read_csv('./CTR4_processed_data/spatial/tissue_positions.csv')
pos = pos.set_index("barcode")
pos = pos.loc[pos.index.isin(adata.obs_names)]#添加坐标
adata.obs = adata.obs.merge(pos, left_index=True, right_index=True, how="left")
pxr, pxc = "pxl_row_in_fullres", "pxl_col_in_fullres"
adata.obsm["spatial"] = adata.obs[[pxc, pxr]].to_numpy()#添加image
import json
import os
from PIL import Image
with open(os.path.join('./CTR4_processed_data/spatial/scalefactors_json.json'), "r") as fh:
scales = json.load(fh)
img_hires = np.array(Image.open(os.path.join('./CTR4_processed_data/spatial/tissue_hires_image.png')))
img_low = np.array(Image.open(os.path.join('./CTR4_processed_data/spatial/tissue_lowres_image.png')))
#加入adata
lib_id = 'CTR4'
adata.uns["spatial"] = {
lib_id: {"images": {"hires": img_hires, "lowres": img_low}, "scalefactors": scales, "metadata": {}}
}adataAnnData object with n_obs × n_vars = 2018 × 36601
obs: 'in_tissue', 'array_row', 'array_col', 'pxl_row_in_fullres', 'pxl_col_in_fullres'
var: 'gene_ids', 'feature_types'
uns: 'spatial'
obsm: 'spatial'sq.pl.spatial_scatter(adata)
很显然,上述步骤很多,对于一个文件还行,如果有多个样本重复上述步骤将会是一个很麻烦的过程,所以我们将其整理为一个简单的函数,适用于读取visium数据在表达矩阵只有10x 3个标准数据的情况,参数只需要提取矩阵所在路径,空间图像spatial文件路径,以及设置sample id即可,返回的anndata.AnnData数据结构与sq.read.visium函数读取的一致
adata1 = ks_read_visium_mtx(count_dir='./CTR4_processed_data/filtered_feature_bc_matrix/',
spatial_dir='./CTR4_processed_data/spatial/',
lib_id='CTR4')
adata1AnnData object with n_obs × n_vars = 2018 × 36601
obs: 'in_tissue', 'array_row', 'array_col', 'pxl_row_in_fullres', 'pxl_col_in_fullres'
var: 'gene_ids', 'feature_types'
uns: 'spatial'
obsm: 'spatial'#读取数据没有问题
sq.pl.spatial_scatter(adata1)
关于QC还是对spots矩阵的QC,流程和方式与scRNA的处理类似!
#计算线粒体基因
adata.var["mt"] = adata.var_names.str.startswith("MT-")
sc.pp.calculate_qc_metrics(adata, qc_vars=["mt"], inplace=True)#直方图可视化质控前数据情况
import seaborn as sns
fig, axs = plt.subplots(1, 4, figsize=(15, 4))
sns.histplot(adata.obs["total_counts"], kde=False, ax=axs[0])
sns.histplot(
adata.obs["total_counts"][adata.obs["total_counts"] > 500],
kde=False,
bins=40,
ax=axs[1],
)
sns.histplot(adata.obs["n_genes_by_counts"], kde=False, bins=60, ax=axs[2])
sns.histplot(
adata.obs["n_genes_by_counts"][adata.obs["n_genes_by_counts"] < 6000],
kde=False,
bins=60,
ax=axs[3],
)
#QC,过滤基因及spots
#质控指标这里是我们自定设定,实际需要按照自己的数据进行设置调整,非统一标准
sc.pp.filter_cells(adata, min_counts=500)
adata = adata[adata.obs["pct_counts_mt"] < 20].copy()
sc.pp.filter_genes(adata, min_cells=10)#对比seurat读取的情况,这个结果与seurat运行结果一致
sq.pl.spatial_scatter(
adata,
color="total_counts",
cmap='Spectral_r',
size=1,
)
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=3000, flavor="seurat")# Run PCA
sc.pp.pca(adata, n_comps=50)
# Calculate neighbors
sc.pp.neighbors(adata, n_neighbors=15, n_pcs=30)
# Leiden clustering
sc.tl.leiden(adata, resolution=0.8, key_added="clusters")
# Computing UMAP
sc.tl.umap(adata)sc.pl.umap(adata, color=["clusters"], wspace=0.4)
sc.pl.spatial(adata, img_key="lowres", color="clusters", size=1.5)
#可视化clsuter
from matplotlib.colors import ListedColormap
sq.pl.spatial_scatter(adata,color="clusters",size=1.5,
palette=ListedColormap(["#11A579", "#F2B701", "#66C5CC", "#80BA5A", "#F6CF71", "#7F3C8D"]))
#可视化gene expression
sq.pl.spatial_scatter(adata, color=["PAEP", "SPP1"])
#选择性展示部分clusters
sq.pl.spatial_scatter(
adata,
color="clusters",
groups=["0", "1"],
size=1.3,
)
总体来说最终的目的一致,不论是seurat还是python都有其长处,不过对于大数据python更具优势,一般分析哪个顺手使用哪个!
sc.tl.rank_genes_groups(adata, groupby = "clusters", method="wilcoxon")
sc.pl.rank_genes_groups(adata, n_genes=25, sharey=False)
# 将结果转换为 DataFrame.
marker_genes_df = sc.get.rank_genes_groups_df(adata, group=None, key='rank_genes_groups')
print(marker_genes_df.head())#筛选显著性的结果
significant_markers = marker_genes_df[(marker_genes_df['pvals_adj'] < 0.05) & (marker_genes_df['logfoldchanges'] > 1)]
significant_markers#保存结果
adata.write_h5ad('./adata_spatial_endometrium.h5ad')觉得分享有用的点个赞再走呗!