
singler: the Python version of the SingleR single-cell annotation tool


You can find plenty of code and tutorials for the R package SingleR online, but material on the Python version is scarce. Congratulations on finding this one.

1. Reading the data

The input is the standard set of three 10X files.

import singlecellexperiment as sce   # BiocPy's SingleCellExperiment (used in the quick-start quoted below)
import scanpy as sc
import os

print(os.listdir("01_data"))         # the folder holding the three 10X files
['barcodes.tsv', 'genes.tsv', 'matrix.mtx']

Read it with read_10x_mtx:

adata = sc.read_10x_mtx("01_data/")
print(adata.shape)
(2700, 32738)

2. Quality control

sc.pp.filter_cells(adata, min_genes=200)                  # drop cells with fewer than 200 detected genes
sc.pp.filter_genes(adata, min_cells=3)                    # drop genes detected in fewer than 3 cells
adata.var['mt'] = adata.var_names.str.startswith('MT-')   # flag mitochondrial genes
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], log1p=False, percent_top=None, inplace=True)
sc.pl.violin(adata, ["n_genes_by_counts", "total_counts", "pct_counts_mt"], jitter=0.4, multi_panel=True)

# keep cells with 200-2500 detected genes and < 20% mitochondrial counts
adata = adata[adata.obs.n_genes_by_counts > 200]
adata = adata[adata.obs.n_genes_by_counts < 2500]
adata = adata[adata.obs.pct_counts_mt < 20]

print(adata.shape)
(2693, 13714)

3. Dimensionality reduction and clustering

sc.pp.normalize_total(adata, target_sum=1e4)   # library-size normalization
sc.pp.log1p(adata)
adata.raw = adata                              # keep the log-normalized data; singler will use it later

sc.pp.highly_variable_genes(adata, n_top_genes=2000)
sc.pp.scale(adata)
sc.pp.pca(adata)
sc.pp.neighbors(adata, n_pcs=15)
sc.tl.leiden(adata, flavor="igraph", n_iterations=2, resolution=0.5)
sc.tl.umap(adata)
sc.pl.umap(adata, color='leiden')

4. Automatic annotation with singler

There is very little material on singler and its documentation is terse. When I got to this point I asked the package author two questions:

1. How do I annotate by cluster?

The author's answer: use scranpy's aggregate_across_cells function to aggregate by cluster.

Q: In the R package SingleR, I am able to utilize the cluster parameter; however, it appears that this parameter does not exist in the Python version of singler. Did I miss anything?

A: scranpy has an aggregate_across_cells() function that you can use to get the aggregated matrix that can be used in classify_single_reference(). That should be the same as what SingleR::SingleR() does under the hood. I suppose we could add this argument, but to be honest, the only reason that cluster= still exists in SingleR() is for back-compatibility purposes. It's easy enough to do the aggregation outside of the function and I don't want to add more responsibilities to the singler package.

2. Should I use raw counts, log-normalized data, or scaled data?

The author's answer: any of them works.

Q: Thank you. I've been learning singler recently. According to the quick start guide on PyPI, the test_data parameter seems to require the original count data:

data = sce.read_tenx_h5("pbmc4k-tenx.h5", realize_assays=True)
mat = data.assay("counts")

However, the R version of SingleR typically uses log-normalized data. The documentation also mentions, "or if you are coming from scverse ecosystem, i.e. AnnData, simply read the object as SingleCellExperiment and extract the matrix and the features.", but data processed with Scanpy could be extracted as scaled data. Could you advise which matrix I should use, or whether either would be suitable?

A: For the test dataset, it doesn't matter. Only the ranks of the values are used by SingleR itself, so it will give the same results for any monotonic transformation within each cell.

IIRC the only place where the log/normalization-status makes a difference is in SingleR::plotMarkerHeatmap() (R package only, not in the Python package yet) which computes log-fold changes in the test dataset to prioritize the markers to be visualized in the heatmap. This is for diagnostic purposes only.

Of course, the reference dataset should always be some kind of log-normalized value, as log-fold changes are computed via the difference of means, e.g., with getClassicMarkers().
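
A quick illustration of the author's point about ranks (my own toy example, not from the singler docs): within a single cell, a monotonic transformation such as log-normalization leaves the gene ranking unchanged, which is all SingleR looks at.

import numpy as np
from scipy.stats import rankdata

counts = np.array([0.0, 3.0, 10.0, 1.0, 250.0])    # toy raw counts for one cell
lognorm = np.log1p(counts / counts.sum() * 1e4)    # the usual CPM-style log-normalization

print(rankdata(counts))   # [1. 3. 4. 2. 5.]
print(rankdata(lognorm))  # identical ranks, so the classification is unaffected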

In practice the choice of matrix does still make some difference, presumably because we aggregate cells per cluster first, and summed raw counts versus summed log-normalized values are no longer related by a simple monotonic transformation. We'll stick with the log-normalized data here (the others would also work).

mat = adata.raw.X.T                     # genes x cells matrix (the log-normalized data saved in adata.raw)
features = list(adata.raw.var.index)    # row names of the matrix, i.e. the gene names
import scranpy
m2 = scranpy.aggregate_across_cells(mat, adata.obs['leiden'])   # aggregate the single-cell matrix by cluster
m2
SummarizedExperiment(number_of_rows=13714, number_of_columns=8, assays=['sums', 'detected'], row_data=BiocFrame(data={}, number_of_rows=13714, column_names=[]), column_data=BiocFrame(data={'factor_1': StringList(data=['0', '2', '3', '4', '1', '5', '6', '7']), 'counts': array([452, 350, 226, 252, 713, 226, 450,  24], dtype=int32)}, number_of_rows=8, column_names=['factor_1', 'counts']), column_names=['0', '2', '3', '4', '1', '5', '6', '7'])
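
If you want to peek at what the aggregation produced (my own quick check, not from the original post), the assays of this SummarizedExperiment can be pulled out by name, using the same accessor the quick start above applies to a SingleCellExperiment:

agg = m2.assay("sums")     # genes x clusters matrix of per-cluster summed values
print(agg.shape)           # expect (13714, 8) here
print(m2.column_names)     # the leiden cluster ids, one column per cluster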

Check which reference datasets are available:

import celldex
refs = celldex.list_references()   # may fail due to network issues; optional, it only tells you which reference names and versions can be used below
print(refs[["name", "version"]])
                        name     version
0                       dice  2024-02-26
1           blueprint_encode  2024-02-26
2                     immgen  2024-02-26
3               mouse_rnaseq  2024-02-26
4                       hpca  2024-02-26
5  novershtern_hematopoietic  2024-02-26
6              monaco_immune  2024-02-26

celldex downloads its reference data on demand, and network problems often make the download (and therefore the run) fail. You can cache the reference in a local file so it is only downloaded the first time; note that if you switch to a different reference, it must be changed in two places: the fr filename and the fetch_reference() call.

import os
import pickle

fr = "ref_blueprint_encode_data.pkl"   # local cache file for the reference
if not os.path.exists(fr):
    # first run: download the reference and cache it locally
    ref_data = celldex.fetch_reference("blueprint_encode", "2024-02-26", realize_assays=True)
    with open(fr, 'wb') as file:
        pickle.dump(ref_data, file)
else:
    # later runs: load the cached reference from disk
    with open(fr, 'rb') as file:
        ref_data = pickle.load(file)

Run the annotation:

import singler
results = singler.annotate_single(
    test_data = m2,                 # the per-cluster aggregated matrix
    test_features = features,       # gene names of the test data
    ref_data = ref_data,
    ref_labels = "label.main"       # use the reference's broad cell-type labels
)
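
Before mapping the labels back, it can be worth a quick look at what came out (my own check, not from the original post): results behaves like a table whose 'best' column holds one predicted label per cluster, in the same column order as m2.

print(results['best'])        # one predicted label per aggregated cluster
print(len(results['best']))   # should equal the number of clusters, 8 here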

Add the annotation results to the AnnData object and plot:

dd = dict(zip(list(m2.column_data.row_names), results['best']))   # cluster id -> predicted label
dd
{'0': 'CD8+ T-cells',
 '2': 'B-cells',
 '3': 'Monocytes',
 '4': 'NK cells',
 '1': 'CD4+ T-cells',
 '5': 'CD8+ T-cells',
 '6': 'Monocytes',
 '7': 'Monocytes'}
adata.obs['singler'] = adata.obs['leiden'].map(dd)   # map each cell's cluster to its predicted label

sc.pl.umap(adata, color='singler')

Automatic annotation is not guaranteed to be accurate, and you will find that the results change when you switch to a different reference. If something looks wrong, check it against background knowledge, for example marker genes.
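
For instance, a quick marker check could look like the sketch below (my own addition; the marker list is a small set of commonly used PBMC markers, adjust it to your own tissue):

# canonical PBMC markers (assumed for illustration) vs. the singler labels
marker_genes = {
    "T cells": ["CD3D", "CD3E"],
    "B cells": ["MS4A1", "CD79A"],
    "NK cells": ["NKG7", "GNLY"],
    "Monocytes": ["CD14", "LYZ", "FCGR3A"],
}
sc.pl.dotplot(adata, marker_genes, groupby="singler")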
