STAR较CellRanger有着更快的运行速度和更广泛的运用场景。单细胞转录组的比对可以通过STAR-solo来实现,在solo Features 可以同时获取Gene expression和RNA velocity 信息,帮助我们进行拟时序分析。
STAR构建检索同样需要下载特定物种的FASTQ基因组文件和GTF注释文件,这里以mm10为例
STAR --runThreadN 6 \ #设置线程
--runMode genomeGenerate \ #构建参考index
--genomeDir /data/users/Alignment/index/mm10 \ #index输出路径
--genomeFastaFiles /data/users/Alignment/mm10/mm10.fa \ #参考基因组Fasta文件路径
--sjdbGTFfile /data/users/Alignment/mm10/mm10.gtf 、 #参考基因组注释GTF文件路径
--sjdbOverhang 149 #reads长度最大值减一,默认100,我已知我的reads最大长度150 该参数也可以不设置
构建index后就可以放心比对了 哈哈!
whitelist 是10X 单细胞测序中barcodes 的白名单需要在后续比对中添加,以免识别到异常的细胞,根据不同版本的建库检测方法有着不同的白名单,要明确对应,切记!!
UMILEN 和 STR也是非常重要的参数后续进行比对时,也要一一设定。
1. whitelist 可以在10XGENOMICS和CellRanger下载
2.在CellRanger的本地文件中也可以找到whitelist文件
cd /software/cellranger-7.1.0/lib/python/cellranger/barcodes
ll *.gz
cp *.gz STAR/BC
gunzip *.gz
STAR 较Cell ranger有着更加个性化的参数,根据需要自行调整
# 官网例子
# Set the number of threads (CPUs) to be used
STAR --runThreadN $CPUS
# Specify the directory containing the indexed genome files
--genomeDir $REF
# Specify the input read files for mapping
--readFilesIn $R1 $R2
# Set directory permissions for output files (All_RWX) and specify file formats (GZIP, BAM)
--runDirPerm All_RWX $GZIP $BAM
# Enable solo (single-cell) analysis options
--soloBarcodeMate 1
--clip5pNbases 39 0
--soloType CB_UMI_Simple
--soloCBwhitelist $BC
--soloCBstart 1
--soloCBlen $CBLEN
--soloUMIstart $((CBLEN+1))
--soloUMIlen $UMILEN
--soloStrand Forward
--soloUMIdedup 1MM_CR
--soloCBmatchWLtype 1MM_multi_Nbase_pseudocounts
--soloUMIfiltering MultiGeneUMI_CR
--soloCellFilter EmptyDrops_CR
# Set minimum alignment score threshold
--outFilterScoreMin 30
# Specify the features to be included in the output (Gene, GeneFull, Velocyto)
--soloFeatures Gene Velocyto
# Specify output file names for solo analysis
--soloOutFileNames output/ features.tsv barcodes.tsv matrix.mtx
# Enable handling of multiple mappers using the Expectation-Maximization (EM) algorithm
--soloMultiMappers EM
# Specify how unmapped reads should be output (Fastx format)
--outReadsUnmapped Fastx
STAR --runThreadN 10 --soloType CB_UMI_Simple --soloFeatures Gene Velocyto \
--soloCBstart 1 \
--soloCBlen 16 \
--soloUMIstart 17 \
--soloUMIlen 12 \
--soloBarcodeReadLength 0 \
--readFilesCommand zcat \
--outSAMtype BAM SortedByCoordinate \
--outSAMattributes NH HI AS nM CR CY UR UY \
--soloOutFileNames Solo.out/ genes.tsv barcodes.tsv matrix.mtx \
--outFileNamePrefix ./out \
--genomeDir index/mm10 \
--soloCBwhitelist /software/STAR/BC/3M-february-2018.txt \
--readFilesIn CRRXXX_r2.fastq.gz CRRXXX_f1.fastq.gz
得到结果文件
基于Python scVelo进行拟时序分析
Velocity的信息存储在Velocyto/raw/matrix.mtx中需要提取整合
# Load required libraries
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy
from scipy import sparse
import anndata
import scanpy as sc
import scvelo as scv
import diopy
scv.set_figure_params()
os.chdir('/path/to/work/dir')
def buildAnndataFromStar(path):
"""Generate an anndata object from the STAR aligner output folder"""
path=path
# Load Read Counts
X = sc.read_mtx(path+'Gene/raw/matrix.mtx')
# Transpose counts matrix to have Cells as rows and Genes as cols as expected by AnnData objects
X = X.X.transpose()
# This matrix is organized as a sparse matrix with Row, Cols and 3 values columns for
# Spliced, Unspliced and Ambigous reads
mtx = np.loadtxt(path+'Velocyto/raw/matrix.mtx', skiprows=3, delimiter=' ')
# Extract sparse matrix shape informations from the third row
shape = np.loadtxt(path+'Velocyto/raw/matrix.mtx', skiprows=2, max_rows = 1 ,delimiter=' ')[0:2].astype(int)
# Read the sparse matrix with csr_matrix((data, (row_ind, col_ind)), shape=(M, N))
# Subract -1 to rows and cols index because csr_matrix expects a 0 based index
# Traspose counts matrix to have Cells as rows and Genes as cols as expected by AnnData objects
spliced = sparse.csr_matrix((mtx[:,2], (mtx[:,0]-1, mtx[:,1]-1)), shape = shape).transpose()
unspliced = sparse.csr_matrix((mtx[:,3], (mtx[:,0]-1, mtx[:,1]-1)), shape = shape).transpose()
ambiguous = sparse.csr_matrix((mtx[:,4], (mtx[:,0]-1, mtx[:,1]-1)), shape = shape).transpose()
# Load Genes and Cells identifiers
obs = pd.read_csv(path+'Velocyto/raw/barcodes.tsv',
header = None, index_col = 0)
# Remove index column name to make it compliant with the anndata format
obs.index.name = None
var = pd.read_csv(path+'Velocyto/raw/features.tsv', sep='\t',
names = ('gene_ids', 'feature_types'), index_col = 1)
# Build AnnData object to be used with ScanPy and ScVelo
adata = anndata.AnnData(X = X, obs = obs, var = var,
layers = {'spliced': spliced, 'unspliced': unspliced, 'ambiguous': ambiguous})
adata.var_names_make_unique()
# Subset Cells based on STAR filtering 根据自己需要调整该参数
# selected_barcodes = pd.read_csv(path+'Gene/filtered/barcodes.tsv', header = None)
# adata = adata[selected_barcodes[0]]
return adata.copy()
adata = buildAnndataFromStar('/sc-seq/CRRXXX/outSolo.out/')
# QC 步骤省略一万步
diopy.output.write_h5(adata,'data1.h5') #输出adata对象
scv.pl.proportions(adata)
就可以愉快的分析了 哈哈!
https://github.com/alexdobin/STAR/blob/master/docs/STARsolo.md(STAR-solo github) https://zhuanlan.zhihu.com/p/362727395(构建STATR-index)
https://github.com/alexdobin/STAR/issues/774(Velocity整合)
原创声明:本文系作者授权腾讯云开发者社区发表,未经许可,不得转载。
如有侵权,请联系 cloudcommunity@tencent.com 删除。
原创声明:本文系作者授权腾讯云开发者社区发表,未经许可,不得转载。
如有侵权,请联系 cloudcommunity@tencent.com 删除。