To do a good job, one must first sharpen one's tools.
Single-cell immune repertoire sequencing is a high-resolution method built on single-cell sequencing technology for resolving the antigen-receptor (TCR/BCR) sequences and transcriptomic profiles of immune cells such as T and B cells. Bulk sequencing cannot reveal cellular heterogeneity, whereas single-cell technology precisely identifies clonotype distributions, differentiation trajectories, and functional states, giving studies of immune mechanisms molecular-level resolution.
10x Genomics offers a dedicated immune-repertoire solution on its Chromium platform (the 10x Chromium Single Cell V(D)J Assay), which captures full-length paired V(D)J sequences, gene expression, cell-surface protein expression, and antigen-specificity data at single-cell resolution. The assay partitions thousands of cells into Gel Beads-in-Emulsion (GEMs); all cDNA generated inside a single droplet carries the same 10x Barcode. Libraries are then built from the cDNA and sequenced, and each read is traced back to its cell of origin via the 10x Barcode. This enables deep, large-scale profiling of immune receptor repertoires at single-cell resolution, supporting studies of clonal evolution, antigen-specific responses, and immunotherapy target discovery.
As with 10x single-cell RNA-seq, 10x Genomics provides a complete analysis pipeline.
The first tool is still cellranger, introduced several times before. The shared server also provides a ready-to-use build of the software and the reference genomes, so you can call them directly. [Server details: 满足你生信分析计算需求的低价解决方案]
Shared server
If you need to download it yourself, you can get it directly from the official website (note that the signed link below carries an Expires parameter, so copy a fresh link from the 10x download page):
wget -O cellranger-9.0.1.tar.gz "https://cf.10xgenomics.com/releases/cell-exp/cellranger-9.0.1.tar.gz?Expires=1742333081&Key-Pair-Id=APKAI7S6A5RYOXBWRPDA&Signature=aSXlUageSwcEjNHFjw6JGYuf6ccW7AAe5uVmOJ4YrH~fX28OCPGGb4q5pJ7tFYSSvz6f4vv-35D0G4tOLIvG7ptV8k8EQhjVFnVzU9OEY4yZ1qPNVcB0UR~qnRvego4~diZrKz0covGULCVNf~l4lmnPtlNnbM7Z2O4A2~x~2be3peRTchZ4eLC7PUJhVFFHPABx4fVqlMabd6g2wabJQuuxg37r1uXuxoOim3YlOs2lbN4Nm~LTpZ6BdK3M7-Ai9AOMCmwmut41mi-h32rJgZT9M0Ga~LpKatCgOysCHkwCQNe-CNyv1X~h44ipI18NCpvxoSN3VVJbb8SHzVQgBg__"
Next come the required reference genome files:
# human
wget "https://cf.10xgenomics.com/supp/cell-exp/refdata-gex-GRCh38-2024-A.tar.gz"
wget "https://cf.10xgenomics.com/supp/cell-vdj/refdata-cellranger-vdj-GRCh38-alts-ensembl-7.1.0.tar.gz"
Following the official documentation, we can easily write the quantification script we need:
$cat ~/scRNAseq_pipeline/st3_cellranger9_vdj_multi.sh
#!/bin/bash -xe
# Wrapper around cellranger multi: $1 = sample ID, $2 = path to the multi config CSV
bin=/home/username/software/cellranger-9.0.1/cellranger
# /usr/bin/time -v additionally records peak memory and wall time of the run
/usr/bin/time -v $bin multi --id=${1} \
--csv=${2} \
--localcores=4 \
--localmem=50
As the quantification script shows, the key to the whole pipeline is preparing a CSV file of sample information. From the documentation:
With only a few samples, we can simply edit the example format by hand. For example:
$cat C1008587.csv
[gene-expression]
reference,/home/username/reference/human/refdata-gex-GRCh38-2024-A
create-bam,false
[vdj]
reference,/home/username/reference/human/refdata-cellranger-vdj-GRCh38-alts-ensembl-7.1.0
[libraries]
fastq_id,fastqs,feature_types
C1008587,/home/username/project/b_scrna,Gene Expression
C1008587_BCR,/home/username/project/b_rawdata/C1008587_BCR,VDJ-B
C1008587_TCR,/home/username/project/b_rawdata/C1008587_TCR,VDJ-T
## Submit a single sample in the background
nohup bash /home/username/scRNAseq_pipeline/st3_cellranger9_vdj_multi.sh C1008587 /home/username/project/cellrgr9/C1008587.csv 1>log_C1008587.txt 2>&1 &
With many samples, editing the CSV files one by one by hand becomes inefficient, so we need a way to generate the per-sample CSV files in batch.
Here is one approach:
First, create a head.txt containing your own reference genome and V(D)J reference paths:
$cat head.txt
[gene-expression]
reference,/home/username/reference/human/refdata-gex-GRCh38-2024-A
create-bam,false
[vdj]
reference,/home/username/reference/human/refdata-cellranger-vdj-GRCh38-alts-ensembl-7.1.0
[libraries]
fastq_id,fastqs,feature_types
Then generate the sample-information half of each CSV file in batch.
Prerequisites:
Example GEX data directory structure:
GEX data
Example BCR/TCR data directory structure:
BCR and TCR data
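The generation script keys each sample ID off the `_S1` sample-number token of the standard Illumina FASTQ naming scheme (`<sample>_S1_L001_R1_001.fastq.gz`); if your files use a different suffix (e.g. `_S2`), adjust the split pattern accordingly. A quick check of that extraction logic on a hypothetical filename:

```shell
# Extract the sample ID from an Illumina-style FASTQ name by splitting on "_S1"
echo "C1008587_S1_L001_R1_001.fastq.gz" | awk '{split($0, parts, "_S1"); print parts[1]}'
# → C1008587
```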
Based on the directory structure of the .gz files, the script below, named generate_csv.sh, creates one CSV per sample:
$cat generate_csv.sh
#!/bin/bash
#
# Define variables
gex_gz=~/project/b_scrna/*gz
bcr_gz=~/project/b_rawdata/*BCR*/*BCR*.gz
tcr_gz=~/project/b_rawdata/*TCR*/*TCR*.gz
headfile=~/project/b_scrna/head.txt
out_dir=~/project/c_multi_outs
mkdir -p ${out_dir}   # make sure the output directory exists
# GEX data
ls ${gex_gz}|grep -E -v "BCR|TCR"|awk -F'/' '
{
full_dir_path = $0
sub(/\/[^\/]+$/, "", full_dir_path)   # strip the filename, keeping the directory
split($NF, parts, "_S1")              # sample ID = everything before "_S1"
print parts[1] "," full_dir_path
}'|sort -u|awk '{print $0 ",Gene Expression"}' > ge_data.txt
# BCR data
ls ${bcr_gz} | awk -F'/' '
{
full_dir_path = $0
sub(/\/[^\/]+$/, "", full_dir_path)
split($NF, parts, "_S1")
print parts[1] "," full_dir_path
}'|sort -u|awk '{print $0 ",VDJ-B"}' > bcr_data.txt
# TCR data
ls ${tcr_gz} | awk -F'/' '
{
full_dir_path = $0
sub(/\/[^\/]+$/, "", full_dir_path)
split($NF, parts, "_S1")
print parts[1] "," full_dir_path
}'|sort -u|awk '{print $0 ",VDJ-T"}' > tcr_data.txt
# Combine into one tab-separated line per sample: GEX, BCR, TCR
paste ge_data.txt bcr_data.txt tcr_data.txt > data_list.txt
# Check whether the head file exists
if [ ! -f ${headfile} ]; then
echo "head file does not exist."
exit 1
fi
# Create a CSV file for each sample
while IFS= read -r id; do
sample=$(echo "$id" | cut -d "," -f 1)
cp ${headfile} ${out_dir}/${sample}.csv
# data_list.txt is tab-separated; put each field on its own line in the CSV
printf "%s\n" "$id" | tr "\t" "\n" >> ${out_dir}/${sample}.csv
done < data_list.txt
## Delete temporary files
rm data_list.txt ge_data.txt bcr_data.txt tcr_data.txt
echo "All CSV files have been generated."
Note: this approach will not fit every situation; adapt the script to your own directory structure. The code above is not the optimal solution either, and you are welcome to rewrite it.
One CSV file per sample
Check the CSV files to make sure they meet the program's requirements.
$cat C1008587.csv
[gene-expression]
reference,/home/username/reference/human/refdata-gex-GRCh38-2024-A
create-bam,false
[vdj]
reference,/home/username/reference/human/refdata-cellranger-vdj-GRCh38-alts-ensembl-7.1.0
[libraries]
fastq_id,fastqs,feature_types
C1008587,/home/username/project/b_scrna,Gene Expression
C1008587_BCR,/home/username/project/b_rawdata/C1008587_BCR,VDJ-B
C1008587_TCR,/home/username/project/b_rawdata/C1008587_TCR,VDJ-T
$cat C1010319.csv
[gene-expression]
reference,/home/username/reference/human/refdata-gex-GRCh38-2024-A
create-bam,false
[vdj]
reference,/home/username/reference/human/refdata-cellranger-vdj-GRCh38-alts-ensembl-7.1.0
[libraries]
fastq_id,fastqs,feature_types
C1010319,/home/username/project/b_scrna,Gene Expression
C1010319_BCR,/home/username/project/b_rawdata/C1010319_BCR,VDJ-B
C1010319_TCR,/home/username/project/b_rawdata/C1010319_TCR,VDJ-T
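Rather than eyeballing every file, a small helper can confirm that each generated CSV contains the three section headers `cellranger multi` requires (a sketch; `check_csv` is a hypothetical name):

```shell
# Verify that a multi config CSV contains the three required section headers
check_csv() {
    for section in '[gene-expression]' '[vdj]' '[libraries]'; do
        grep -qF "$section" "$1" || { echo "missing $section in $1"; return 1; }
    done
    echo "$1 looks OK"
}
```

Run it over the output directory with, for example, `for f in *.csv; do check_csv "$f"; done`.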
When submitting jobs, adjust the requested memory and threads to the server's available resources, and decide whether to submit in batches or all at once.
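As a rough sizing aid: with the wrapper script requesting --localcores=4 and --localmem=50, both CPU and memory cap the number of concurrent samples. A minimal sketch for Linux (`max_parallel_jobs` is a hypothetical helper that reads /proc/meminfo):

```shell
# Bound the number of parallel cellranger jobs by available RAM and CPU cores.
# Assumes Linux; defaults match the wrapper (--localmem=50 GB, --localcores=4).
max_parallel_jobs() {
    mem_per_job_gb=${1:-50}
    cores_per_job=${2:-4}
    avail_mem_gb=$(awk '/^MemAvailable/ {print int($2 / 1024 / 1024)}' /proc/meminfo)
    avail_cores=$(nproc)
    jobs_by_mem=$(( avail_mem_gb / mem_per_job_gb ))
    jobs_by_cpu=$(( avail_cores / cores_per_job ))
    # the tighter of the two limits wins
    if [ "$jobs_by_mem" -lt "$jobs_by_cpu" ]; then
        echo "$jobs_by_mem"
    else
        echo "$jobs_by_cpu"
    fi
}
max_parallel_jobs 50 4   # use the result as the -P value below
```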
## Run multiple samples in batches
ls *.csv|cut -d "." -f 1 |awk '{print "bash /home/username/scRNAseq_pipeline/st3_cellranger9_vdj_multi.sh " $1 " /home/username/project/c_multi_outs/"$1".csv 1>log_"$1".txt 2>&1"}' > screen_run.sh
screen -R vdj
cat screen_run.sh | xargs -I F -P 3 sh -c "F"
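To spot failures without opening every log, grep for the line Cell Ranger prints at the end of a successful run (a sketch; `check_logs` is a hypothetical helper):

```shell
# A successful Cell Ranger run ends its log with "Pipestance completed successfully!"
check_logs() {
    for log in "$@"; do
        [ -e "$log" ] || continue
        if grep -q "Pipestance completed successfully" "$log"; then
            echo "OK    $log"
        else
            echo "FAIL  $log"
        fi
    done
}
check_logs log_*.txt
```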
When a run finishes normally, each sample-ID directory contains an outs directory, and the result files we need are inside it.
$tree -L 5
.
├── config.csv
├── multi
│ ├── count
│ │ ├── raw_cloupe.cloupe
│ │ ├── raw_feature_bc_matrix
│ │ │ ├── barcodes.tsv.gz
│ │ │ ├── features.tsv.gz
│ │ │ └── matrix.mtx.gz
│ │ ├── raw_feature_bc_matrix.h5
│ │ └── raw_molecule_info.h5
│ ├── vdj_b
│ │ ├── all_contig_annotations.bed
│ │ ├── all_contig_annotations.csv
│ │ ├── all_contig_annotations.json
│ │ ├── all_contig.bam
│ │ ├── all_contig.bam.bai
│ │ ├── all_contig.fasta
│ │ ├── all_contig.fasta.fai
│ │ └── all_contig.fastq
│ └── vdj_t
│ ├── all_contig_annotations.bed
│ ├── all_contig_annotations.csv
│ ├── all_contig_annotations.json
│ ├── all_contig.bam
│ ├── all_contig.bam.bai
│ ├── all_contig.fasta
│ ├── all_contig.fasta.fai
│ └── all_contig.fastq
├── per_sample_outs
│ └── C1008587
│ ├── count
│ │ ├── analysis
│ │ │ ├── clustering
│ │ │ ├── diffexp
│ │ │ ├── pca
│ │ │ ├── tsne
│ │ │ └── umap
│ │ ├── sample_cloupe.cloupe
│ │ ├── sample_filtered_barcodes.csv
│ │ ├── sample_filtered_feature_bc_matrix
│ │ │ ├── barcodes.tsv.gz
│ │ │ ├── features.tsv.gz
│ │ │ └── matrix.mtx.gz
│ │ ├── sample_filtered_feature_bc_matrix.h5
│ │ └── sample_molecule_info.h5
│ ├── metrics_summary.csv
│ ├── vdj_b
│ │ ├── airr_rearrangement.tsv
│ │ ├── cell_barcodes.json
│ │ ├── clonotypes.csv
│ │ ├── concat_ref.bam
│ │ ├── concat_ref.bam.bai
│ │ ├── concat_ref.fasta
│ │ ├── concat_ref.fasta.fai
│ │ ├── consensus_annotations.csv
│ │ ├── consensus.bam
│ │ ├── consensus.bam.bai
│ │ ├── consensus.fasta
│ │ ├── consensus.fasta.fai
│ │ ├── donor_regions.fa
│ │ ├── filtered_contig_annotations.csv
│ │ ├── filtered_contig.fasta
│ │ ├── filtered_contig.fastq
│ │ ├── vdj_contig_info.pb
│ │ └── vloupe.vloupe
│ ├── vdj_t
│ │ ├── airr_rearrangement.tsv
│ │ ├── cell_barcodes.json
│ │ ├── clonotypes.csv
│ │ ├── concat_ref.bam
│ │ ├── concat_ref.bam.bai
│ │ ├── concat_ref.fasta
│ │ ├── concat_ref.fasta.fai
│ │ ├── consensus_annotations.csv
│ │ ├── consensus.bam
│ │ ├── consensus.bam.bai
│ │ ├── consensus.fasta
│ │ ├── consensus.fasta.fai
│ │ ├── donor_regions.fa
│ │ ├── filtered_contig_annotations.csv
│ │ ├── filtered_contig.fasta
│ │ ├── filtered_contig.fastq
│ │ ├── vdj_contig_info.pb
│ │ └── vloupe.vloupe
│ └── web_summary.html
└── vdj_reference
├── fasta
│ └── regions.fa
└── reference.json
19 directories, 70 files
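As a next step, it is often handy to pool every sample's metrics_summary.csv into one table for QC review. A sketch, assuming the per_sample_outs layout shown above and that cellranger multi writes the file in long format (one metric per row, header on the first line); `merge_metrics` is a hypothetical helper:

```shell
# Pool per-sample metrics_summary.csv files, prefixing each row with the sample ID
merge_metrics() {
    for f in "$@"; do
        [ -e "$f" ] || continue
        sample=$(basename "$(dirname "$f")")              # parent dir is the sample ID
        awk -v s="$sample" 'NR > 1 {print s "," $0}' "$f" # skip the header row
    done
}
merge_metrics */outs/per_sample_outs/*/metrics_summary.csv > all_metrics.csv
```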