
singler: the Python version of the SingleR single-cell annotation tool


You can find plenty of code and tutorials for the R package SingleR online, but material on the Python version is scarce. Congratulations on finding this one.

1. Reading the data

The input is the standard set of three 10X files.

import singlecellexperiment as sce   # BiocPy's SingleCellExperiment (used in the quick-start quoted below)
import scanpy as sc
import os

print(os.listdir("01_data"))         # the folder holding the three 10X files
['barcodes.tsv', 'genes.tsv', 'matrix.mtx']

Read it with read_10x_mtx:

adata = sc.read_10x_mtx("01_data/")
print(adata.shape)
(2700, 32738)

2. Quality control

sc.pp.filter_cells(adata, min_genes=200)                  # drop cells with fewer than 200 detected genes
sc.pp.filter_genes(adata, min_cells=3)                    # drop genes detected in fewer than 3 cells
adata.var['mt'] = adata.var_names.str.startswith('MT-')   # flag mitochondrial genes
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], log1p=False, percent_top=None, inplace=True)
sc.pl.violin(adata, ["n_genes_by_counts", "total_counts", "pct_counts_mt"], jitter=0.4, multi_panel=True)

# keep cells with 200-2500 detected genes and < 20% mitochondrial counts
adata = adata[adata.obs.n_genes_by_counts > 200]
adata = adata[adata.obs.n_genes_by_counts < 2500]
adata = adata[adata.obs.pct_counts_mt < 20]

print(adata.shape)
(2693, 13714)

3. Dimensionality reduction and clustering

sc.pp.normalize_total(adata, target_sum=1e4)   # library-size normalization
sc.pp.log1p(adata)
adata.raw = adata                              # keep the log-normalized data; singler will use it later

sc.pp.highly_variable_genes(adata, n_top_genes=2000)
sc.pp.scale(adata)
sc.pp.pca(adata)
sc.pp.neighbors(adata, n_pcs=15)
sc.tl.leiden(adata, flavor="igraph", n_iterations=2, resolution=0.5)
sc.tl.umap(adata)
sc.pl.umap(adata, color='leiden')

4. Automatic annotation with singler

There is very little material on singler and its documentation is terse. When I got to this point I asked the package author two questions:

1. How do I annotate by cluster?

The author's answer: use scranpy's aggregate_across_cells function to aggregate by cluster.

Q: In the R package SingleR, I am able to utilize the cluster parameter; however, it appears that this parameter does not exist in the Python version of singler. Did I miss anything?

A: scranpy has an aggregate_across_cells() function that you can use to get the aggregated matrix that can be used in classify_single_reference(). That should be the same as what SingleR::SingleR() does under the hood. I suppose we could add this argument, but to be honest, the only reason that cluster= still exists in SingleR() is for back-compatibility purposes. It's easy enough to do the aggregation outside of the function and I don't want to add more responsibilities to the singler package.

2. Should I use raw counts, log-normalized data, or scaled data?

The author's answer: any of them works.

Q: Thank you. I've been learning singler recently. According to the quick start guide on PyPI, the test_data parameter seems to require the original count data:

data = sce.read_tenx_h5("pbmc4k-tenx.h5", realize_assays=True)
mat = data.assay("counts")

However, the R version of SingleR typically uses log-normalized data. The documentation also mentions, "or if you are coming from scverse ecosystem, i.e. AnnData, simply read the object as SingleCellExperiment and extract the matrix and the features.", but data processed with Scanpy could be extracted as scaled data. Could you advise which matrix I should use, or whether either would be suitable?

A: For the test dataset, it doesn't matter. Only the ranks of the values are used by SingleR itself, so it will give the same results for any monotonic transformation within each cell.

IIRC the only place where the log/normalization-status makes a difference is in SingleR::plotMarkerHeatmap() (R package only, not in the Python package yet) which computes log-fold changes in the test dataset to prioritize the markers to be visualized in the heatmap. This is for diagnostic purposes only.

Of course, the reference dataset should always be some kind of log-normalized value, as log-fold changes are computed via the difference of means, e.g., with getClassicMarkers().
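
A quick illustration of the author's point about ranks (my own toy example, not from the singler docs): within a single cell, a monotonic transformation such as log-normalization leaves the gene ranking unchanged, which is all SingleR looks at.

import numpy as np
from scipy.stats import rankdata

counts = np.array([0.0, 3.0, 10.0, 1.0, 250.0])    # toy raw counts for one cell
lognorm = np.log1p(counts / counts.sum() * 1e4)    # the usual CPM-style log-normalization

print(rankdata(counts))   # [1. 3. 4. 2. 5.]
print(rankdata(lognorm))  # identical ranks, so the classification is unaffected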

In practice the choice of matrix does still make some difference, presumably because we aggregate cells per cluster first, and summed raw counts versus summed log-normalized values are no longer related by a simple monotonic transformation. We'll stick with the log-normalized data here (the others would also work).

mat = adata.raw.X.T                     # genes x cells matrix (the log-normalized data saved in adata.raw)
features = list(adata.raw.var.index)    # row names of the matrix, i.e. the gene names
import scranpy
m2 = scranpy.aggregate_across_cells(mat, adata.obs['leiden'])   # aggregate the single-cell matrix by cluster
m2
SummarizedExperiment(number_of_rows=13714, number_of_columns=8, assays=['sums', 'detected'], row_data=BiocFrame(data={}, number_of_rows=13714, column_names=[]), column_data=BiocFrame(data={'factor_1': StringList(data=['0', '2', '3', '4', '1', '5', '6', '7']), 'counts': array([452, 350, 226, 252, 713, 226, 450,  24], dtype=int32)}, number_of_rows=8, column_names=['factor_1', 'counts']), column_names=['0', '2', '3', '4', '1', '5', '6', '7'])
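
If you want to peek at what the aggregation produced (my own quick check, not from the original post), the assays of this SummarizedExperiment can be pulled out by name, using the same accessor the quick start above applies to a SingleCellExperiment:

agg = m2.assay("sums")     # genes x clusters matrix of per-cluster summed values
print(agg.shape)           # expect (13714, 8) here
print(m2.column_names)     # the leiden cluster ids, one column per cluster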

Check which reference datasets are available:

import celldex
refs = celldex.list_references()   # may fail due to network issues; optional, it only tells you which reference names and versions can be used below
print(refs[["name", "version"]])
                        name     version
0                       dice  2024-02-26
1           blueprint_encode  2024-02-26
2                     immgen  2024-02-26
3               mouse_rnaseq  2024-02-26
4                       hpca  2024-02-26
5  novershtern_hematopoietic  2024-02-26
6              monaco_immune  2024-02-26

celldex downloads its reference data on demand, and network problems often make the download (and therefore the run) fail. You can cache the reference in a local file so it is only downloaded the first time; note that if you switch to a different reference, it must be changed in two places: the fr filename and the fetch_reference() call.

import os
import pickle

fr = "ref_blueprint_encode_data.pkl"   # local cache file for the reference
if not os.path.exists(fr):
    # first run: download the reference and cache it locally
    ref_data = celldex.fetch_reference("blueprint_encode", "2024-02-26", realize_assays=True)
    with open(fr, 'wb') as file:
        pickle.dump(ref_data, file)
else:
    # later runs: load the cached reference from disk
    with open(fr, 'rb') as file:
        ref_data = pickle.load(file)

Run the annotation:

import singler
results = singler.annotate_single(
    test_data = m2,                 # the per-cluster aggregated matrix
    test_features = features,       # gene names of the test data
    ref_data = ref_data,
    ref_labels = "label.main"       # use the reference's broad cell-type labels
)
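
Before mapping the labels back, it can be worth a quick look at what came out (my own check, not from the original post): results behaves like a table whose 'best' column holds one predicted label per cluster, in the same column order as m2.

print(results['best'])        # one predicted label per aggregated cluster
print(len(results['best']))   # should equal the number of clusters, 8 here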

Add the annotation results to the AnnData object and plot:

dd = dict(zip(list(m2.column_data.row_names), results['best']))   # cluster id -> predicted label
dd
{'0': 'CD8+ T-cells',
 '2': 'B-cells',
 '3': 'Monocytes',
 '4': 'NK cells',
 '1': 'CD4+ T-cells',
 '5': 'CD8+ T-cells',
 '6': 'Monocytes',
 '7': 'Monocytes'}
adata.obs['singler'] = adata.obs['leiden'].map(dd)   # map each cell's cluster to its predicted label

sc.pl.umap(adata, color='singler')

Automatic annotation is not guaranteed to be accurate, and you will find that the results change when you switch to a different reference. If something looks wrong, check it against background knowledge, for example marker genes.
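
For instance, a quick marker check could look like the sketch below (my own addition; the marker list is a small set of commonly used PBMC markers, adjust it to your own tissue):

# canonical PBMC markers (assumed for illustration) vs. the singler labels
marker_genes = {
    "T cells": ["CD3D", "CD3E"],
    "B cells": ["MS4A1", "CD79A"],
    "NK cells": ["NKG7", "GNLY"],
    "Monocytes": ["CD14", "LYZ", "FCGR3A"],
}
sc.pl.dotplot(adata, marker_genes, groupby="singler")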
