文章/答案/技术大牛

发布

社区首页 >专栏 >sc-DRS: 连接疾病GWAS与单细胞数据的分析方式

sc-DRS: 连接疾病GWAS与单细胞数据的分析方式

原创

凑齐六个字吧

发布于 2025-04-05 02:08:48

13400

代码可运行

文章被收录于专栏：基因组基因组单细胞

运行总次数：0

代码可运行

前言

近几年，孟德尔随机化（Mendelian Randomization, MR）方法在疾病因果推断研究中备受关注。MR利用基因变异作为工具变量，推断暴露因子与疾病之间的因果关系，避免传统观察性研究中的混杂因素影响。然而，MR通常在群体水平上进行分析，难以解析疾病在不同细胞类型中的作用机制。

为此，将疾病GWAS数据与单细胞RNA测序（scRNA-seq）数据相结合，可以帮助我们更精细地探究疾病的细胞类型特异性关联，揭示潜在的致病细胞群及分子机制。sc-DRS（Single-cell Disease Relevance Score）正是这样一个强大的分析工具，能够基于GWAS基因集计算单细胞的疾病相关性评分（DRS），为探索疾病的细胞生物学基础提供新的思路。

scDRS的计算原理

图片来源:Zhang MJ, Hou K, Dey KK, et al. Polygenic enrichment distinguishes disease associations of individual cells in single-cell RNA-seq data. Nat Genet. 2022;54(10):1572-1580. doi:10.1038/s41588-022-01167-z

sc-DRS分析能得到什么结果？

识别与特定疾病相关的细胞群，揭示疾病的潜在细胞靶点

解析细胞亚型与疾病的相关性，探索特定细胞亚型在疾病进程中的作用

如何使用scDRS

1.下载scDRS包并进行安装

git clone https://github.com/martinjzhang/scDRS.git
cd scDRS
pip install -e .

快速测试是否安装成功

python -m pytest tests/test_CLI.py -p no:warnings

2.下载使用测试数据

2.1 我们使用开发者提供的数据进行测试分析

wget https://figshare.com/ndownloader/files/34300925 -O data.zip
unzip data.zip && \
mkdir -p data/ && \
mv single_cell_data/test/* data/ && \
rm data.zip && rm -r single_cell_data/test/

加载需要的组件

import scdrs
import scanpy as sc
sc.set_figure_params(dpi=125)
from anndata import AnnData
from scipy import stats
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import os
import warnings

warnings.filterwarnings("ignore")

3.数据加载和展示

使用sc-DRS进行分析需要三个文件：单细胞数据（h5ad）；GWAS统计数据；协变量文件（.h5ad 单细胞数据的scDRS协变量文件）。具体文件格式和内容参考：https://martinjzhang.github.io/scDRS/file_format.html

3.1 这里，我们使用作者提供的单细胞数据（小鼠来源的单细胞测序数据）

adata =  sc.read_h5ad("scDRS/data/expr.h5ad")

#改成自己下载的.h5ad文件所在路径

3.2 使用以下3个基因组数据作为GWAS统计数据

SCZ：从 GWAS 汇总统计数据中获得的精神分裂症 (SCZ) 基因集，这是本演示中感兴趣的疾病。

Dorsal：从 Cembrowski et al.2016 等人获得的背侧 CA1 锥体中的差异表达基因，我们将用它来构建背部评分，表明与背侧 CA1 的接近度，我们发现这有助于了解 SCZ 疾病富集的空间分布

Height：从 GWAS 汇总统计数据中获得的身高基因集，我们将其用作负控制特征。

df_gs = pd.read_csv("single_cell_data/test/geneset.gs", sep="\t", index_col=0)`#改成自己下载的geneset.gs文件所在路径

df_gs = df_gs.loc[
    [
        "PASS_Schizophrenia_Pardinas2018",
        "spatial_dorsal",
        "UKB_460K.body_HEIGHTz",
    ],
    :,
].rename(
    {
        "PASS_Schizophrenia_Pardinas2018": "SCZ",
        "spatial_dorsal": "Dorsal",
        "UKB_460K.body_HEIGHTz": "Height",
    }
)
display(df_gs)

df_gs.to_csv("data/processed_geneset.gs", sep="\t")

作者在geneset.gs中提供了29个GWAS数据集，有兴趣的也可以选择其他数据集进行分析（只需要根据数据集中TRAIT的名字对 "PASS_Schizophrenia_Pardinas2018", "spatial_dorsal", "UKB_460K.body_HEIGHTz", 进行修改即可）。

此外，我们也可以根据自己想要研究的疾病准备GWAS数据集，我们会在下一篇章中进行讲解。

4.使用scdrs计算分数并评估疾病对不同细胞的富集程度

4.1 进行评分

scdrs compute-score \
    --h5ad-file /Users/han/scDRS/data/expr.h5ad \
    --h5ad-species mouse \
    --gs-file /Users/han/single_cell_data/test/processed_geneset_capitalized.gs \
    --gs-species mouse \
    --cov-file /Users/han/scDRS/data/cov.tsv \
    --flag-filter-data True \
    --flag-raw-count True \
    --flag-return-ctrl-raw-score False \
    --flag-return-ctrl-norm-score True \
    --out-folder /Users/han/single_cell_data/test/

#h5ad-file；gs-file；cov-file 需要更改相关文件所在路径。此外作者所使用的单细胞数据为小鼠来源的，因此h5ad-species为mouse，相关gs-species也需要相对应的改为mouse。大家需要根据自己所使用的数据样本来源进行更改。

4.2 可视化scDRS结果

    trait: pd.read_csv(f"single_cell_data/test/{trait}.full_score.gz", sep="\t", index_col=0)
    for trait in df_gs.index
}

for trait in dict_score:
    adata.obs[trait] = dict_score[trait]["norm_score"]

sc.set_figure_params(figsize=[3, 3], dpi=300)
sc.pl.umap(
    adata,
    color="level1class",
    ncols=1,
    color_map="RdBu_r",
    vmin=-5,
    vmax=5,
)

sc.pl.umap(
    adata,
    color=dict_score.keys(),
    color_map="RdBu_r",
    ncols=3,
    vmin=-5,
    vmax=5,
    s=20,
)

我们可以看到SCZ疾病主要富集在Pyramidal CA1细胞

如果需要直接保存图片可以在sc.pl.umap命令的最后一行中加上save="***.png"

5.分析疾病在不同细胞类型中的表达情况

5.1 对SCZ和Height在不同细胞间的表达进行评分

for trait in ["SCZ", "Height"]:
    !scdrs perform-downstream \
        --h5ad-file scDRS/data/expr.h5ad \
        --score-file single_cell_data/test/{trait}.full_score.gz \
        --out-folder single_cell_data/test/ \
        --group-analysis level1class \
        --flag-filter-data True \
        --flag-raw-count True

*同样的，需要更改数据所在的位置

5.2 查看SCZ在不同细胞中的表达情况：

!cat single_cell_data/test/SCZ.scdrs_group.level1class | column -t -s $'\t'

5.3可视化疾病在不同细胞类型中的表达情况：

1.每种细胞类型-疾病对的热图颜色表示显著相关细胞的比例。

2.方块表示显著的细胞类型-疾病关联（FDR所有细胞类型和疾病/特征对中均为 0.05。

3.十字符号表示特定细胞类型内单个细胞之间与疾病相关的显著异质性。

dict_df_stats = {
    trait: pd.read_csv(f"/Users/han/single_cell_data/test/{trait}.scdrs_group.level1class", sep="\t", index_col=0)
    for trait in ["SCZ", "Height"]
}

dict_celltype_display_name = {
    "pyramidal_CA1": "Pyramidal CA1",
    "oligodendrocytes": "Oligodendrocyte",
    "pyramidal_SS": "Pyramidal SS",
    "interneurons": "Interneuron",
    "endothelial-mural": "Endothelial",
    "astrocytes_ependymal": "Astrocyte",
    "microglia": "Microglia",
}

fig, ax = scdrs.util.plot_group_stats(
    dict_df_stats={
        trait: df_stats.rename(index=dict_celltype_display_name)
        for trait, df_stats in dict_df_stats.items()
    },
    plot_kws={
        "vmax": 0.2,
        "cb_fraction":0.12
    }
)


ax.grid(False) 

plt.show()

可以看到SCZ和Height在 CA1 pyramidal中的差异是最明显的。接下来，可以提取CA1 pyramidal细胞进行重新聚类，并进一步探究异质性的来源。

6.探索不同疾病或表型与某种细胞亚型的关系

#提取 CA1 pyramidal并进行重新聚类
adata_ca1 = adata[adata.obs["level2class"].isin(["CA1Pyr1", "CA1Pyr2"])].copy()
sc.pp.filter_cells(adata_ca1, min_genes=0)
sc.pp.filter_genes(adata_ca1, min_cells=1)
sc.pp.normalize_total(adata_ca1, target_sum=1e4)
sc.pp.log1p(adata_ca1)

sc.pp.highly_variable_genes(adata_ca1, min_mean=0.0125, max_mean=3, min_disp=0.5)
adata_ca1 = adata_ca1[:, adata_ca1.var.highly_variable]
sc.pp.scale(adata_ca1, max_value=10)
sc.tl.pca(adata_ca1, svd_solver="arpack")

sc.pp.neighbors(adata_ca1, n_neighbors=10, n_pcs=40)
sc.tl.umap(adata_ca1, n_components=2)

# 匹配 scDRS 分数
for trait in dict_score:
    adata_ca1.obs[trait] = dict_score[trait]["norm_score"]
sc.pl.umap(
    adata_ca1,
    color=dict_score.keys(),
    color_map="RdBu_r",
    vmin=-5,
    vmax=5,
    ncols=3,
    s=20,
    save="Subgroup-Disease.png"
)

7.评估疾病与某种细胞之间的相关性

与之前的观察一致，SCZ 似乎在 CA1 的背侧部分富集。当我们绘制 SCZ 疾病平均评分与背部五分位数的关系图时，这一点更加明显。

df_plot = adata_ca1.obs[["Dorsal", "SCZ", "Height"]].copy()
df_plot["Dorsal quintile"] = pd.qcut(df_plot["Dorsal"], 5, labels=np.arange(5))

fig, ax = plt.subplots(figsize=(3.5, 3.5))
for trait in ["SCZ", "Height"]:
    sns.lineplot(
        data=df_plot,
        x="Dorsal quintile",
        y=trait,
        label=trait,
        err_style="bars",
        marker="o",
        ax=ax,
    )
ax.set_xticks(np.arange(5))
ax.set_xlabel("Dorsal quintile")
ax.set_ylabel("Mean scDRS disease score")
fig.show()

此外，我们可以将疾病评分和对照评分与背部评分进行相关性分析（Pearson）。

spatial_col = "Dorsal"
for trait in ["SCZ", "Height"]:
    df_score = dict_score[trait].reindex(adata_ca1.obs.index)
    ctrl_cols = [col for col in df_score.columns if col.startswith("ctrl_norm_score")]
    n_ctrl = len(ctrl_cols)

    # Pearson's r between trait score and spatial score
    data_r = stats.pearsonr(df_score["norm_score"], adata_ca1.obs[spatial_col])[0]

    # Regression: control score ~ spatial score
    ctrl_r = np.zeros(len(ctrl_cols))
    for ctrl_i, ctrl_col in enumerate(ctrl_cols):
        ctrl_r[ctrl_i] = stats.pearsonr(df_score[ctrl_col], adata_ca1.obs[spatial_col])[
            0
        ]
    pval = (np.sum(data_r <= ctrl_r) + 1) / (n_ctrl + 1)

    print(f"{trait} v.s. {spatial_col}, Pearson's r={data_r:.2g} (p={pval:.2g})")

SCZ v.s. Dorsal, Pearson's r=0.43 (p=0.001) Height v.s. Dorsal, Pearson's r=0.04 (p=0.37)

总结

sc-DRS 提供了一种系统化的方法，将GWAS结果与单细胞转录组数据相结合，评估细胞与疾病的相关性。其主要功能包括：

•	基于GWAS导出的基因集，为每个单细胞计算疾病相关性评分（Disease Relevance Score, DRS）；
•	鉴定疾病富集的细胞亚群，揭示特定细胞在疾病发生发展中的作用；
•	支持不同疾病、多种组织、复杂细胞群体之间的系统性比较；
•	辅助发现潜在治疗靶点或生物标志物，加速疾病机制研究。

通过 sc-DRS，研究者能够更深入地理解复杂疾病的细胞类型特异性机制，为后续的功能验证、干预设计提供精准方向。这不仅拓展了GWAS研究的解释力，也推动了从遗传关联到细胞机制的研究转化。