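There is a saying in bioinformatics: the further upstream you go, the freer your data analysis becomes! What does that mean? For any omics data, the more we know about where the data came from and how the experiment was done, the easier the downstream analysis gets, because we can explain why one metric or another in our results looks abnormal. Whenever an analysis runs into a quality problem, the fix is always a bottom-up trace back to the experiment, and sometimes all the way back to how the sample was collected.
Previously we covered the standard downstream analysis that follows quantification.
Now let's look at the upstream step, quantification with cellranger-atac. Next time we will cover the experimental principle, and after that its applications in the literature!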
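Software download page: https://support.10xgenomics.com/single-cell-atac/software/downloads/latest
Latest version at the time of writing: Cell Ranger ATAC - 2.1.0 (April 4, 2022)
# Download (the signed URL below expires, see its Expires parameter; copy a fresh link from the download page if it fails)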
wget -O cellranger-atac-2.1.0.tar.gz "https://cf.10xgenomics.com/releases/cell-atac/cellranger-atac-2.1.0.tar.gz?Expires=1744846272&Policy=eyJTdGF0ZW1lbnQiOlt7IlJlc291cmNlIjoiaHR0cHM6Ly9jZi4xMHhnZW5vbWljcy5jb20vcmVsZWFzZXMvY2VsbC1hdGFjL2NlbGxyYW5nZXItYXRhYy0yLjEuMC50YXIuZ3oiLCJDb25kaXRpb24iOnsiRGF0ZUxlc3NUaGFuIjp7IkFXUzpFcG9jaFRpbWUiOjE3NDQ4NDYyNzJ9fX1dfQ__&Signature=Ney1Tx7XEyH3hnRSn1xeOIWFazbRu6KUy7fEf087WB1qA4rM9aXUdX1x9wi3d48nTFeGap66CwjpG3m2ECL66upEj1bX1uDn6hMkA9c8g49xR7H4lfyhCS5OLQaJyvxSei9v37Pbjwi66Che0obpEcdQM0d1pE06w7T607u1o3lR6pdjR2A0Dwf-NI01xVwH8Ex9QGarXiX4v97xy3YOdGmVjLPew~v3pmy5UQd8uX0JmEBR5tUyGlCRbteb98hCpqgcNlPAU18BgMGR92Mw3-fUammLm3szpsBBAjl29tqhKNDIoBfW8YDDyfrJtwgE4sgiHSSsgnVBPXhJBiEX~w__&Key-Pair-Id=APKAI7S6A5RYOXBWRPDA"
# Extract
tar -zxvf cellranger-atac-2.1.0.tar.gz
cd cellranger-atac-2.1.0/
# Check that the installation works
# /nas1/zhangj/biosoft/cellranger-atac-2.1.0
./cellranger-atac -h
If the help text is printed, the installation works.
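The later `cellranger-atac count` call uses the bare command name, so the install directory needs to be on `PATH`. A minimal sketch, assuming the tarball was extracted under `$HOME/biosoft` (a hypothetical location; the variable name `CELLRANGER_ATAC_HOME` is also just an illustration):

```shell
# Hypothetical install location: adjust to wherever you extracted the tarball.
CELLRANGER_ATAC_HOME="$HOME/biosoft/cellranger-atac-2.1.0"
export PATH="$CELLRANGER_ATAC_HOME:$PATH"

# Report whether the shell can now find the executable.
if command -v cellranger-atac >/dev/null 2>&1; then
    echo "cellranger-atac found at: $(command -v cellranger-atac)"
else
    echo "cellranger-atac not found; check CELLRANGER_ATAC_HOME" >&2
fi
```

Putting the `export` line in `~/.bashrc` makes the setting persist across sessions.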

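This time we practice on FASTQ data from the 10x Genomics website. The species is human, so we download the human reference genome.
Download link: https://support.10xgenomics.com/single-cell-atac/software/downloads/latest
# Download the reference genome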
wget https://cf.10xgenomics.com/supp/cell-atac/refdata-cellranger-arc-GRCh38-2020-A-2.0.0.tar.gz
# Extract
tar -zxvf refdata-cellranger-arc-GRCh38-2020-A-2.0.0.tar.gz
# Take a quick look at the directory layout
tree -L 2

Pick a peripheral blood (PBMC) dataset: https://www.10xgenomics.com/datasets/10k-human-pbmcs-atac-v2-chromium-x-2-standard
# Download
mkdir 10k-human-pbmcs-atac
cd 10k-human-pbmcs-atac
# Input Files
axel -n 100 https://s3-us-west-2.amazonaws.com/10x.files/samples/cell-atac/2.1.0/10k_pbmc_ATACv2_nextgem_Chromium_X/10k_pbmc_ATACv2_nextgem_Chromium_X_fastqs.tar
# Extract
tar -xvf 10k_pbmc_ATACv2_nextgem_Chromium_X_fastqs.tar
The FASTQ files are:
atac_pbmc_10k_nextgem_S1_L001_I1_001.fastq.gz
atac_pbmc_10k_nextgem_S1_L001_R1_001.fastq.gz
atac_pbmc_10k_nextgem_S1_L001_R2_001.fastq.gz
atac_pbmc_10k_nextgem_S1_L001_R3_001.fastq.gz
atac_pbmc_10k_nextgem_S1_L002_I1_001.fastq.gz
atac_pbmc_10k_nextgem_S1_L002_R1_001.fastq.gz
atac_pbmc_10k_nextgem_S1_L002_R2_001.fastq.gz
atac_pbmc_10k_nextgem_S1_L002_R3_001.fastq.gz
cellranger-atac count \
--id 10k_pbmc \
--reference refdata-cellranger-arc-GRCh38-2020-A-2.0.0 \
--fastqs 10k-human-pbmcs-atac \
--sample atac_pbmc_10k_nextgem \
--localcores 64 \
--localmem 128
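The value given to `--sample` is the part of each FASTQ filename that precedes `_S1_L001_...`. As a rough sketch, the helper below (the function name `list_samples` is made up for illustration) strips the Illumina-style suffix from a list of filenames so the sample prefix can be checked before a long run is started:

```shell
# Read FASTQ filenames on stdin and print the unique sample prefixes,
# i.e. everything before _S<n>_L<lane>_<I1|R1|R2|R3>_001.fastq.gz
list_samples() {
    sed -E 's/_S[0-9]+_L[0-9]{3}_[IR][0-9]_001\.fastq\.gz$//' | sort -u
}

# Toy input using two of the filenames listed above.
printf '%s\n' \
    atac_pbmc_10k_nextgem_S1_L001_I1_001.fastq.gz \
    atac_pbmc_10k_nextgem_S1_L002_R3_001.fastq.gz | list_samples
# prints: atac_pbmc_10k_nextgem
```

On real data, something like `ls /path/to/fastqs | list_samples` (path hypothetical) gives the same check.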
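My run has not finished yet, so let's look at the official results first. If you do not have the resources to run the upstream step yourself, you can also download the results provided on the 10x website directly: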
# Output Files
wget https://cf.10xgenomics.com/samples/cell-atac/2.1.0/10k_pbmc_ATACv2_nextgem_Chromium_X/10k_pbmc_ATACv2_nextgem_Chromium_X_analysis.tar.gz
wget https://cf.10xgenomics.com/samples/cell-atac/2.1.0/10k_pbmc_ATACv2_nextgem_Chromium_X/10k_pbmc_ATACv2_nextgem_Chromium_X_filtered_peak_bc_matrix.tar.gz
wget https://cf.10xgenomics.com/samples/cell-atac/2.1.0/10k_pbmc_ATACv2_nextgem_Chromium_X/10k_pbmc_ATACv2_nextgem_Chromium_X_filtered_peak_bc_matrix.h5
wget https://cf.10xgenomics.com/samples/cell-atac/2.1.0/10k_pbmc_ATACv2_nextgem_Chromium_X/10k_pbmc_ATACv2_nextgem_Chromium_X_filtered_tf_bc_matrix.tar.gz
wget https://cf.10xgenomics.com/samples/cell-atac/2.1.0/10k_pbmc_ATACv2_nextgem_Chromium_X/10k_pbmc_ATACv2_nextgem_Chromium_X_filtered_tf_bc_matrix.h5
wget https://cf.10xgenomics.com/samples/cell-atac/2.1.0/10k_pbmc_ATACv2_nextgem_Chromium_X/10k_pbmc_ATACv2_nextgem_Chromium_X_fragments.tsv.gz
wget https://cf.10xgenomics.com/samples/cell-atac/2.1.0/10k_pbmc_ATACv2_nextgem_Chromium_X/10k_pbmc_ATACv2_nextgem_Chromium_X_fragments.tsv.gz.tbi
wget https://cf.10xgenomics.com/samples/cell-atac/2.1.0/10k_pbmc_ATACv2_nextgem_Chromium_X/10k_pbmc_ATACv2_nextgem_Chromium_X_peak_annotation.tsv
wget https://cf.10xgenomics.com/samples/cell-atac/2.1.0/10k_pbmc_ATACv2_nextgem_Chromium_X/10k_pbmc_ATACv2_nextgem_Chromium_X_peak_motif_mapping.bed
wget https://cf.10xgenomics.com/samples/cell-atac/2.1.0/10k_pbmc_ATACv2_nextgem_Chromium_X/10k_pbmc_ATACv2_nextgem_Chromium_X_peaks.bed
wget https://s3-us-west-2.amazonaws.com/10x.files/samples/cell-atac/2.1.0/10k_pbmc_ATACv2_nextgem_Chromium_X/10k_pbmc_ATACv2_nextgem_Chromium_X_possorted_bam.bam
wget https://cf.10xgenomics.com/samples/cell-atac/2.1.0/10k_pbmc_ATACv2_nextgem_Chromium_X/10k_pbmc_ATACv2_nextgem_Chromium_X_possorted_bam.bam.bai
wget https://cf.10xgenomics.com/samples/cell-atac/2.1.0/10k_pbmc_ATACv2_nextgem_Chromium_X/10k_pbmc_ATACv2_nextgem_Chromium_X_raw_peak_bc_matrix.tar.gz
wget https://cf.10xgenomics.com/samples/cell-atac/2.1.0/10k_pbmc_ATACv2_nextgem_Chromium_X/10k_pbmc_ATACv2_nextgem_Chromium_X_raw_peak_bc_matrix.h5
wget https://cf.10xgenomics.com/samples/cell-atac/2.1.0/10k_pbmc_ATACv2_nextgem_Chromium_X/10k_pbmc_ATACv2_nextgem_Chromium_X_singlecell.csv
wget https://cf.10xgenomics.com/samples/cell-atac/2.1.0/10k_pbmc_ATACv2_nextgem_Chromium_X/10k_pbmc_ATACv2_nextgem_Chromium_X_summary.csv
wget https://cf.10xgenomics.com/samples/cell-atac/2.1.0/10k_pbmc_ATACv2_nextgem_Chromium_X/10k_pbmc_ATACv2_nextgem_Chromium_X_summary.json
wget https://cf.10xgenomics.com/samples/cell-atac/2.1.0/10k_pbmc_ATACv2_nextgem_Chromium_X/10k_pbmc_ATACv2_nextgem_Chromium_X_web_summary.html
wget https://cf.10xgenomics.com/samples/cell-atac/2.1.0/10k_pbmc_ATACv2_nextgem_Chromium_X/10k_pbmc_ATACv2_nextgem_Chromium_X_cloupe.cloupe
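As with the output of cellranger count, the web_summary.html file is a very, very, very important data quality report:
For an explanation of this file, see the official documentation, which is very detailed: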
https://www.10xgenomics.com/support/epi-atac/documentation/steps/sequencing/interpreting-cell-ranger-atac-web-summary-files-for-single-cell-atac-assay
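Sequenced read pairs: the amount of sequencing data for the sample, counted as read pairs. It reflects the sequencing depth the user chose; in this example it is 466,894,746 / 1,000,000 ≈ 466.89 M.
Valid barcodes: the fraction of all read pairs whose barcode matches the barcode whitelist after error correction. Expected >75%. A low percentage suggests a sequencing quality problem or a problem with library construction.
Q30 bases in barcode: the fraction of bases in the barcode read (i2) with quality Q>=30. Expected >65%. Low Q values suggest a sequencing problem, such as a suboptimal loading concentration of the library.
Q30 bases in read 1: the fraction of read 1 bases with Q>=30. Expected >65%.
Q30 bases in read 2: the fraction of read 2 bases with Q>=30. Expected >65%.
Q30 bases in sample index i1: the fraction of sample index read (i1) bases with Q>=30. Expected >90%.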
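The conversion quoted for Sequenced read pairs is a plain division by one million. A small sketch of it (the function name `to_millions` is made up for illustration):

```shell
# Convert a raw read-pair count into millions (M), rounded to two decimals.
to_millions() {
    awk -v n="$1" 'BEGIN { printf "%.2f\n", n / 1000000 }'
}

to_millions 466894746
# prints: 466.89
```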
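Estimated number of cells: the number of barcodes identified as cells. The ideal range is 500-10,000, and the value may vary within ±20% of the target. A number well above or below that suggests inaccurate nuclei counting, nuclei lysis, or a failure during GEM generation.
Mean raw read pairs per cell: the total number of read pairs divided by the number of cell barcodes. This value depends on sequencing depth.
Fraction of high-quality fragments in cells: the fraction of high-quality fragments associated with cell-containing partitions. High-quality fragments are read pairs with a valid barcode that align to the nuclear genome with mapping quality (MAPQ) ≥30 and are neither chimeric nor duplicates. Expected >40%.
Fraction of transposition events in peaks in cells: expected >15%.
Median high-quality fragments per cell: the median number of high-quality fragments per cell barcode. It depends on sequencing depth and cell type.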
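Mean raw read pairs per cell, as defined above, is just the total read pairs divided by the number of cell barcodes. A minimal sketch follows; the cell count of 10,000 is a made-up round number for illustration, not the value from this dataset, and `mean_pairs_per_cell` is a hypothetical helper name:

```shell
# usage: mean_pairs_per_cell TOTAL_READ_PAIRS N_CELLS
mean_pairs_per_cell() {
    awk -v total="$1" -v cells="$2" 'BEGIN {
        if (cells <= 0) { print "NA"; exit }   # guard against division by zero
        printf "%.0f\n", total / cells
    }'
}

# 466,894,746 read pairs spread over a hypothetical 10,000 cells
mean_pairs_per_cell 466894746 10000
# prints: 46689
```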
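Library complexity metrics.
The duplicate-rate metric measures the fraction of high-quality read pairs that are considered PCR duplicates, that is, read pairs with a valid barcode that align to the same genomic position.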

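Details of read alignment to the reference genome:
Confidently mapped read pairs: the fraction of read pairs confidently aligned to the reference genome. Expected >80%.
Unmapped read pairs: the fraction of read pairs that did not align to the reference genome. Expected <5%.
Non-nuclear read pairs: the fraction of read pairs that align to non-nuclear parts of the reference (for example, the mitochondrial genome). Expected <20%.
Fragments in nucleosome-free regions: expected >40%.
Fragments flanking a single nucleosome: the fraction of fragments that pass all filters and are between 124 and 296 bp long. A mononucleosome fragment carries roughly the 147 bp of DNA wrapped around one nucleosome. A markedly higher fraction of these fragments may point to one of two problems: the cells were already dead or dying during collection or processing, which releases nucleosome structures; or the sample is contaminated with granulocytes (a type of white blood cell), whose DNA also shows up in nucleosome-like form.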
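The fragment-size classes above can be approximated from a fragments file, whose tab-separated columns are chromosome, start, end, barcode and read support. The sketch below uses the 124-296 bp window from the text for mononucleosome fragments and assumes that anything shorter than 124 bp counts as nucleosome-free. Because the fragments file is already deduplicated and filtered, the fractions will not exactly reproduce the web summary values. The function name `classify_fragments` is hypothetical.

```shell
# Read fragment records on stdin and print the fraction in each size class.
classify_fragments() {
    awk -F '\t' '
        /^#/ { next }                      # skip header/comment lines
        {
            len = $3 - $2                  # fragment length = end - start
            total++
            if (len < 124)       nfr++     # assumed nucleosome-free cutoff
            else if (len <= 296) mono++    # 124-296 bp: single nucleosome
            else                 other++
        }
        END {
            if (total == 0) { print "no fragments"; exit }
            printf "nucleosome_free\t%.3f\n", nfr / total
            printf "mononucleosome\t%.3f\n", mono / total
            printf "longer\t%.3f\n", other / total
        }'
}

# Toy input: four fragments of length 80, 200, 400 and 100 bp.
printf 'chr1\t100\t180\tAAAC-1\t1\nchr1\t500\t700\tAAAC-1\t2\nchr2\t1000\t1400\tAAAG-1\t1\nchr2\t50\t150\tAAAG-1\t1\n' \
    | classify_fragments
```

On the downloaded data this would be run as `zcat 10k_pbmc_ATACv2_nextgem_Chromium_X_fragments.tsv.gz | classify_fragments`.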
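Number of peaks: the number of peaks detected. Expected >45,000.
Fraction of genome in peaks: the fraction of bases in the primary contigs that are called as peaks. Expected >2% and <20%.
TSS enrichment score: the enrichment score at transcription start sites. Expected >5.
Fraction of high-quality fragments overlapping TSS: among high-quality fragments with a cell barcode, the fraction that overlap a transcription start site (TSS). Expected >15%.
Fraction of high-quality fragments overlapping peaks: among high-quality fragments with a cell barcode, the fraction that overlap the called peaks. It reflects how much of the data falls in identified biological signal such as regulatory elements. A higher fraction usually means the data better captures the chromatin accessibility landscape of the cells. A low percentage indicates that fragments come from random regions of the genome rather than from the called peaks; possible causes include dead cells or extremely low sequencing depth. Expected >15%.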
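The lower bounds listed above lend themselves to a quick scripted check. The sketch below is only an illustration: `check_gt` is a made-up helper, the thresholds are the ones quoted in this post, and the metric values passed in are invented numbers rather than results from this dataset.

```shell
# usage: check_gt LABEL VALUE THRESHOLD
# Prints PASS when VALUE > THRESHOLD, otherwise WARN.
check_gt() {
    awk -v label="$1" -v v="$2" -v t="$3" 'BEGIN {
        status = (v > t) ? "PASS" : "WARN"
        printf "%s\t%s\t(expected > %s)\t%s\n", label, v, t, status
    }'
}

# Invented example values checked against the thresholds quoted above.
check_gt "Number of peaks" 30000 45000
check_gt "TSS enrichment score" 7.2 5
check_gt "Fraction of high-quality fragments overlapping peaks" 0.48 0.15
```

In practice the values would be read from the summary.csv produced by the pipeline; its column names are not shown here, so that step is left out.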