scATAC-seq数据分析除了经典的 Signac软件,还有一款也适用差多的软件 ArchR,官网给到你:
https://www.archrproject.com/index.html
这个软件于2021年2月25号发表在 Nature Genetics 杂志上,文献标题为《ArchR is a scalable software package for integrative single-cell chromatin accessibility analysis》。作者认为 ArchR 是目前最全面且用户友好的单细胞ATAC-seq和多组学数据分析工具。开发团队将继续改进和扩展ArchR的功能,以保持其在该领域的领先地位。
大家看文章的通讯作者就知道是 ATAC-seq 这个领域的大佬了:William J. Greenleaf
ArchR分析流程:
续上一篇:ATAC-seq领域超级大佬William J. Greenleaf团队开发的scATAC-seq分析软件:ArchR
rm(list=ls())
library(ArchR)
set.seed(1)
addArchRThreads(threads = 60)
## Setting default number of Parallel threads to 1.
addArchRGenome("hg19")
## Setting default genome to Hg19.
# addArchRGenome("hg38")
####### 示例数据下载
library(parallel)
inputFiles <- getTutorialData("Hematopoiesis")
inputFiles
names(inputFiles)
将上一期保存的数据加载进来:
list.files(path = "./Save-ProjHeme1")
# load
projHeme1 <- loadArchRProject(path = "./Save-ProjHeme1")
projHeme1
# 过滤
projHeme2 <- filterDoublets(projHeme1,filterRatio = 1.5)
# Filtering 614 cells from ArchRProject!
# scATAC_BMMC_R1 : 364 of 4932 (7.4%)
# scATAC_CD34_BMMC_R1 : 160 of 3275 (4.9%)
# scATAC_PBMC_R1 : 90 of 2453 (3.7%)
# https://www.archrproject.com/bookdown/filtering-doublets-from-an-archrproject.html
scATAC-seq数据的特点:
低信息含量使得scATAC-seq数据稀疏,增加了降维分析的难度。
对稀疏的插入counts matrix 进行标准降维(如PCA)会导致细胞间高度相似,因为大多数位置的插入counts为0。为了解决这个问题,作者采用分层降维方法。首先,使用Latent Semantic Indexing (LSI)算法,这是一种最初为自然语言处理设计的用于基于词频评估文档相似性的方法。LSI最初由Cusanovich等人(Science 2015)引入用于scATAC-seq。在scATAC-seq的情况下,不同的样本是文档,不同的区域/峰是词。
首先,通过深度归一化计算每个单细胞的词频。然后,这些值通过逆文档频率进行归一化,逆文档频率通过特征出现的频率来加权特征,以识别更“特定”的特征,而不是常见的可接近区域。结果得到的词频-逆文档频率(term frequency-inverse document frequency,TF-IDF)矩阵反映了词(即区域/峰)对文档(即样本)的重要性。
最后,通过奇异值分解(SVD),识别并表示跨样本的最有价值的信息到一个低维空间中。LSI允许你将稀疏插入counts matrix 的维度从数千降低到数十或数百。然后,使用UMAP 或 t-SNE 来可视化数据。
这里作者引入的是迭代LSI方法,迭代LSI方法通过逐步细化特征选择,提高了scATAC-seq数据降维的准确性和可重复性,减少了批次效应的影响。
在ArchR中,addIterativeLSI()
函数用于执行迭代潜在语义索引(LSI)。默认参数适用于大多数情况,但用户可以根据需要进行调整。
参数调整建议:鼓励用户探索addIterativeLSI()
函数的可用参数,以了解它们如何影响特定数据集。
iterations
:指定迭代次数,决定了LSI过程的精细程度;varFeatures
:指定用于LSI的可变特征数量,影响特征选择的严格性;resolution
:影响聚类的分辨率,决定了聚类的细致程度。在本教程中,创建一个名为“IterativeLSI”的降维对象:
projHeme2 <- addIterativeLSI(
ArchRProj = projHeme2,
useMatrix = "TileMatrix",
name = "IterativeLSI",
iterations = 2,
clusterParams = list( #See Seurat::FindClusters
resolution = c(0.2),
sampleCells = 10000,
n.start = 10
),
varFeatures = 25000,
dimsToUse = 1:30
)
# list the available reducedDims objects in an ArchRProject
projHeme2@reducedDims
大概运行3mins,日志如下:
Checking Inputs...
ArchR logging to : ArchRLogs/ArchR-addIterativeLSI-16f1e31648168-Date-2025-05-10_Time-22-37-14.964377.log
If there is an issue, please report to github with logFile!
2025-05-10 22:37:15.584472 : Computing Total Across All Features, 0.002 mins elapsed.
2025-05-10 22:37:20.395726 : Computing Top Features, 0.082 mins elapsed.
###########
2025-05-10 22:37:21.528287 : Running LSI (1 of 2) on Top Features, 0.101 mins elapsed.
###########
2025-05-10 22:37:21.620972 : Sampling Cells (N = 10002) for Estimated LSI, 0.103 mins elapsed.
2025-05-10 22:37:21.621883 : Creating Sampled Partial Matrix, 0.103 mins elapsed.
2025-05-10 22:37:27.816726 : Computing Estimated LSI (projectAll = FALSE), 0.206 mins elapsed.
2025-05-10 22:38:51.100529 : Identifying Clusters, 1.594 mins elapsed.
2025-05-10 22:39:06.291588 : Identified 6 Clusters, 1.847 mins elapsed.
2025-05-10 22:39:06.309361 : Saving LSI Iteration, 1.848 mins elapsed.
2025-05-10 22:39:19.073252 : Creating Cluster Matrix on the total Group Features, 2.06 mins elapsed.
2025-05-10 22:39:26.023781 : Computing Variable Features, 2.176 mins elapsed.
###########
2025-05-10 22:39:26.210733 : Running LSI (2 of 2) on Variable Features, 2.179 mins elapsed.
###########
2025-05-10 22:39:26.229907 : Creating Partial Matrix, 2.18 mins elapsed.
2025-05-10 22:39:31.657187 : Computing LSI, 2.27 mins elapsed.
2025-05-10 22:40:26.851166 : Finished Running IterativeLSI, 3.19 mins elapsed.
运行过程中:
corCutOff
参数设置了相关性的阈值,超过该阈值的维度将被排除。特殊情况处理:
有时,迭代LSI方法对于强烈的批次效应差异来说校正得还不够。因此,ArchR实现了一个常用于校正批次效应的工具,名为Harmony,
可以直接将一个降维对象从ArchR传递给HarmonyMatrix()函数。
projHeme2 <- addHarmony(
ArchRProj = projHeme2,
reducedDims = "IterativeLSI",
name = "Harmony",
groupBy = "Sample"
)
## Transposing data matrix
## Initializing state using k-means centroids initialization
## Harmony 1/10
## Harmony 2/10
## Harmony 3/10
## Harmony converged after 3 iterations
这个过程在projHeme2
对象中创建了一个名为“Harmony”的新的降维对象:
# list the available reducedDims objects in an ArchRProject
projHeme2@reducedDims
# names(2): IterativeLSI Harmony
大多数单细胞聚类方法侧重于在降维空间中计算最近邻图,然后识别“社区”或细胞聚类。这些方法效果非常好,已成为scRNA-seq的标准实践。因此,ArchR使用来自scRNA-seq软件包的现有的聚类方法来进行聚类。
projHeme2 <- addClusters(
input = projHeme2,
reducedDims = "IterativeLSI",
method = "Seurat",
name = "Clusters",
resolution = 0.8
)
# 获取聚类结果
head(projHeme2$Clusters)
table(projHeme2$Clusters)
为了更好地了解哪些样本位于哪些聚类中,我们可以使用confusionMatrix()
函数为每个样本创建一个聚类混淆矩阵:
cM <- confusionMatrix(paste0(projHeme2$Clusters), paste0(projHeme2$Sample))
cM
library(pheatmap)
cM <- cM / Matrix::rowSums(cM)
p <- pheatmap::pheatmap(
mat = as.matrix(cM),
color = paletteContinuous("whiteBlue"),
border_color = "black"
)
p
这个可以用于展示样本与聚类之间的对应关系:帮助研究人员评估聚类结果的准确性和样本的分布情况,有助于识别哪些样本可能被错误分类,或者哪些聚类可能包含多个不同的样本类型。
projHeme2 <- addClusters(
input = projHeme2,
reducedDims = "IterativeLSI",
method = "scran",
name = "ScranClusters",
k = 15
)
在ArchR中,UMAP和t-SNE等嵌入方法用于在降维空间中可视化单细胞。选择使用UMAP还是t-SNE取决于具体应用,但UMAP在多种应用中表现良好,是scATAC-seq数据的标准选择。
此外,输入参数对嵌入结果有显著影响。ArchR提供了一套默认的输入参数,适用于大多数应用,但没有一套参数能够满足所有数据集的需求。用户需要根据自己的数据特点调整输入参数,以获得最佳的可视化效果。
projHeme2 <- addUMAP(
ArchRProj = projHeme2,
reducedDims = "IterativeLSI",
name = "UMAP",
nNeighbors = 30,
minDist = 0.5,
metric = "cosine"
)
# list the available embeddings objects
projHeme2@embeddings
# 可视化
p1 <- plotEmbedding(ArchRProj = projHeme2, colorBy = "cellColData", name = "Sample", embedding = "UMAP")
p1
p2 <- plotEmbedding(ArchRProj = projHeme2, colorBy = "cellColData", name = "Clusters", embedding = "UMAP")
p2
ggAlignPlots(p1, p2, type = "h")
# 保存
plotPDF(p1,p2, name = "Plot-UMAP-Sample-Clusters.pdf", ArchRProj = projHeme2, addDOC = FALSE, width = 5, height = 5)
结果如下:
projHeme2 <- addTSNE(
ArchRProj = projHeme2,
reducedDims = "IterativeLSI",
name = "TSNE",
perplexity = 30
)
# list the available embeddings objects
projHeme2@embeddings
# 可视化
p1 <- plotEmbedding(ArchRProj = projHeme2, colorBy = "cellColData", name = "Sample", embedding = "TSNE")
p1
p2 <- plotEmbedding(ArchRProj = projHeme2, colorBy = "cellColData", name = "Clusters", embedding = "TSNE")
p2
ggAlignPlots(p1, p2, type = "h")
# 保存
plotPDF(p1,p2, name = "Plot-TSNE-Sample-Clusters.pdf", ArchRProj = projHeme2, addDOC = FALSE, width = 5, height = 5)
结果如下:
如果是运行了harmony,则降维聚类应该要选择 reducedDims 参数为 Harmony
projHeme2 <- addUMAP(
ArchRProj = projHeme2,
reducedDims = "Harmony",
name = "UMAPHarmony",
nNeighbors = 30,
minDist = 0.5,
metric = "cosine"
)
p3 <- plotEmbedding(ArchRProj = projHeme2, colorBy = "cellColData", name = "Sample", embedding = "UMAPHarmony")
p4 <- plotEmbedding(ArchRProj = projHeme2, colorBy = "cellColData", name = "Clusters", embedding = "UMAPHarmony")
ggAlignPlots(p3, p4, type = "h")
1
ArchR提供了一个highlightCells
参数,用于在嵌入图中突出显示特定的细胞:
plotEmbedding(
ArchRProj = projHeme2,
embedding = "UMAP",
colorBy = "cellColData",
name = "Clusters",
size = 1,
sampleCells = NULL,
highlightCells = getCellNames(ArchRProj = projHeme2)[which(projHeme2@cellColData$Clusters == "C1")],
baseSize = 10,
plotAs = "points")
高亮C1:
ArchR允许用户将其他软件包生成的嵌入坐标导入到ArchR项目中,通过正确命名嵌入坐标列,可以在ArchRProject中像使用其他嵌入一样使用自定义嵌入:
# simulate mock UMAP data by random sampling
umap1 <- (sample(x = -100000:100000, size = length(getCellNames(projHeme2)), replace = FALSE))/10000
umap2 <- (sample(x = -100000:100000, size = length(getCellNames(projHeme2)), replace = FALSE))/10000
# add embedding to projHeme2
umap_df <- DataFrame(row.names = getCellNames(projHeme2), "custom#UMAP1" = umap1, "custom#UMAP2" = umap2, check.names = FALSE)
head(umap_df)
projHeme2@embeddings$customUMAP <- SimpleList(df = umap_df, params = list())
# 获取坐标
getEmbedding(ArchRProj = projHeme2, embedding = "customUMAP")
本次分享到这,未完待续~
扫码关注腾讯云开发者
领取腾讯云代金券
Copyright © 2013 - 2025 Tencent Cloud. All Rights Reserved. 腾讯云 版权所有
深圳市腾讯计算机系统有限公司 ICP备案/许可证号:粤B2-20090059 深公网安备号 44030502008569
腾讯云计算(北京)有限责任公司 京ICP证150476号 | 京ICP备11018762号 | 京公网安备号11010802020287
Copyright © 2013 - 2025 Tencent Cloud.
All Rights Reserved. 腾讯云 版权所有