```bash
fastp -w ${threads} \
-i ${data}/${pn}/${sn}_S*_R1_*.fastq.gz \
-I ${data}/${pn}/${sn}_S*_R2_*.fastq.gz \
-o ${result}/${sn}/trimmed/${sn}_trimmed_R1.fastq.gz \
-O ${result}/${sn}/trimmed/${sn}_trimmed_R2.fastq.gz \
--adapter_sequence=AGATCGGAAGAGCACACGTCTGAACTCCAGTCA \
--adapter_sequence_r2=AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT \
--length_required 15 \
--json ${result}/${sn}/trimmed/${sn}_fastp.json \
--html ${result}/${sn}/trimmed/${sn}_fastp.html
```
From the output file `${sn}_fastp.json`, extract the Q20 and Q30 rates and the total read counts before and after filtering.
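These fields sit under the `summary` key of the fastp JSON. A minimal sketch of pulling them out (the numbers in the sample JSON are invented for illustration):

```python
import json

# A trimmed-down fastp JSON; the values are invented for illustration.
fastp_json = """
{
  "summary": {
    "before_filtering": {"total_reads": 1540000, "q20_rate": 0.9692, "q30_rate": 0.9225},
    "after_filtering":  {"total_reads": 1500000, "q20_rate": 0.9810, "q30_rate": 0.9401}
  }
}
"""

summary = json.loads(fastp_json)["summary"]
for stage in ("before_filtering", "after_filtering"):
    s = summary[stage]
    # report reads in millions and the quality rates as percentages
    print(stage, f'{s["total_reads"] / 1e6:.2f}M',
          f'Q20 {s["q20_rate"]:.2%}', f'Q30 {s["q30_rate"]:.2%}')
```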
```bash
samtools flagstat --threads ${threads} \
${result}/${sn}/aligned/${sn}_sorted.bam > \
${result}/${sn}/stat/${sn}_sorted.flagstat
```
From the output file `${sn}_sorted.flagstat`, extract mapping statistics such as the overall mapping rate, i.e. the fraction of reads aligned to the reference genome.
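A minimal sketch of parsing those numbers out of the flagstat text (the sample output below is invented and abbreviated; real flagstat output has more lines):

```python
import re

# Sample `samtools flagstat` output; the numbers are invented.
flagstat = """\
1523456 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
1523152 + 0 mapped (99.98% : N/A)
"""

# first number of the "in total" line and of the "mapped" line
total = int(re.search(r"^(\d+) \+ \d+ in total", flagstat, re.M).group(1))
mapped = int(re.search(r"^(\d+) \+ \d+ mapped \(", flagstat, re.M).group(1))
print(f"mapping rate: {mapped / total:.2%}")
```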
```bash
samtools depth -a -b ${ref.bed} --threads ${threads} \
${result}/${sn}/aligned/${sn}_marked.bam > \
${result}/${sn}/stat/${sn}_marked.depth
```
This writes the sequencing depth at every position in the region file `${ref.bed}`; from it we can derive the overall depth statistics: coverage at depth cutoffs such as 1×, 10× and 20×, plus the mean and median sequencing depth.
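The per-site depths in column 3 of that file can then be summarized along these lines (toy depth values for illustration):

```python
import statistics

# Column 3 of `samtools depth -a` output: one depth per target position
# (toy values for illustration).
depths = [0, 5, 12, 25, 30, 8, 0, 40, 22, 18]

n = len(depths)
# fraction of target positions at or above each depth cutoff
for cutoff in (1, 10, 20):
    covered = sum(d >= cutoff for d in depths)
    print(f"{cutoff}x coverage: {covered / n:.2%}")
print(f"mean depth:   {sum(depths) / n:.2f}")
print(f"median depth: {statistics.median(depths):.1f}")
```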
```bash
gatk CollectInsertSizeMetrics \
-I ${result}/${sn}/aligned/${sn}_normal_marked.bam \
-O ${result}/${sn}/stat/${sn}_normal_insertsize_metrics.txt \
-H ${result}/${sn}/stat/${sn}_normal_insertsize_histogram.pdf
```
One required metric is capture specificity: after duplicate removal of the aligned data, the ratio of the number of bases aligned to the target regions to the number of bases aligned anywhere on the genome.
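In other words the metric is a single division (the base counts below are invented for illustration):

```python
# Capture specificity after duplicate removal:
# bases aligned to target regions / bases aligned anywhere on the genome.
# Both counts are invented for illustration.
on_target_bases = 8_660_000
mapped_bases = 48_950_000

specificity = on_target_bases / mapped_bases
print(f"capture specificity: {specificity:.2%}")
```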
### After some thought, and preferring to reuse an existing wheel rather than write my own, I found bamdst and used it to replace the samtools output. Usage:
```bash
# download the source from GitHub
git clone https://github.com/shiquan/bamdst.git
# build
cd bamdst && make
# run
/opt/ref/bamdst/bamdst \
-p /opt/ref/projects/lang_cancer_hg38.bed \
-o ./bamqc \
--cutoffdepth 20 \
./aligned/RD1703007FFP_marked.bam
# Note: unfortunately, --cutoffdepth accepts only a single value, not multiple ones
```
When it finishes, bamdst writes seven files:
- chromosomes.report
- coverage.report
- depth_distribution.plot
- depth.tsv.gz
- insertsize.plot
- region.tsv.gz
- uncover.bed
```
[Total] Raw Reads (All reads) // All reads in the bam file(s).
[Total] QC Fail reads // Number of reads that failed QC; this flag is set by other software, like bwa. See the flag field in the BAM structure.
[Total] Raw Data(Mb) // Total reads data in the bam file(s).
[Total] Paired Reads // Paired reads numbers.
[Total] Mapped Reads // Mapped reads numbers.
[Total] Fraction of Mapped Reads // Ratio of mapped reads against raw reads.
[Total] Mapped Data(Mb) // Mapped data in the bam file(s).
[Total] Fraction of Mapped Data(Mb) // Ratio of mapped data against raw data.
[Total] Properly paired // Paired reads with properly insert size. See bam format protocol for details.
[Total] Fraction of Properly paired // Ratio of properly paired reads against mapped reads
[Total] Read and mate paired // Read (read1) and mate read (read2) paired.
[Total] Fraction of Read and mate paired // Ratio of read and mate paired against mapped reads
[Total] Singletons // Read mapped but mate read unmapped, and vice versa.
[Total] Read and mate map to diff chr // Read and mate read mapped to different chromosomes, usually due to mapping errors or structural variants.
[Total] Read1 // First reads in mate paired sequencing
[Total] Read2 // Mate reads
[Total] Read1(rmdup) // First reads after remove duplications.
[Total] Read2(rmdup) // Mate reads after remove duplications.
[Total] forward strand reads // Number of forward strand reads.
[Total] backward strand reads // Number of backward strand reads.
[Total] PCR duplicate reads // PCR duplications.
[Total] Fraction of PCR duplicate reads // Ratio of PCR duplications.
[Total] Map quality cutoff value // Cutoff map quality score, this value can be set by -q. default is 20, because some variants caller like GATK only consider high quality reads.
[Total] MapQuality above cutoff reads // Number of reads with higher or equal quality score than cutoff value.
[Total] Fraction of MapQ reads in all reads // Ratio of reads with higher or equal Q score against raw reads.
[Total] Fraction of MapQ reads in mapped reads // Ratio of reads with higher or equal Q score against mapped reads.
[Target] Target Reads // Number of reads covered target region (specified by bed file).
[Target] Fraction of Target Reads in all reads // Ratio of target reads against raw reads.
[Target] Fraction of Target Reads in mapped reads // Ratio of target reads against mapped reads.
[Target] Target Data(Mb) // Total bases covered target region. If a read covered target region partly, only the covered bases will be counted.
[Target] Target Data Rmdup(Mb) // Total bases covered target region after remove PCR duplications.
[Target] Fraction of Target Data in all data // Ratio of target bases against raw bases.
[Target] Fraction of Target Data in mapped data // Ratio of target bases against mapped bases.
[Target] Len of region // The length of target regions.
[Target] Average depth // Average depth of target regions. Calculated by "target bases / length of regions".
[Target] Average depth(rmdup) // Average depth of target regions after remove PCR duplications.
[Target] Coverage (>0x) // Ratio of bases with depth greater than 0x in target regions, which also means the ratio of covered regions in target regions.
[Target] Coverage (>=4x) // Ratio of bases with depth greater than or equal to 4x in target regions.
[Target] Coverage (>=10x) // Ratio of bases with depth greater than or equal to 10x in target regions.
[Target] Coverage (>=30x) // Ratio of bases with depth greater than or equal to 30x in target regions.
[Target] Coverage (>=100x) // Ratio of bases with depth greater than or equal to 100x in target regions.
[Target] Coverage (>=Nx) // This is an additional line for a user-defined cutoff value, see --cutoffdepth
[Target] Target Region Count // Number of target regions. In normal practice, it is the total number of exomes.
[Target] Region covered > 0x // The number of these regions with average depth greater than 0x.
[Target] Fraction Region covered > 0x // Ratio of these regions with average depth greater than 0x.
[Target] Fraction Region covered >= 4x // Ratio of these regions with average depth greater than or equal to 4x.
[Target] Fraction Region covered >= 10x // Ratio of these regions with average depth greater than or equal to 10x.
[Target] Fraction Region covered >= 30x // Ratio of these regions with average depth greater than or equal to 30x.
[Target] Fraction Region covered >= 100x // Ratio of these regions with average depth greater than or equal to 100x.
[flank] flank size // The flank size to be counted. 200 bp by default. Oligos can also capture the regions adjacent to target regions.
[flank] Len of region (not include target region) // The length of flank regions (target regions are not counted).
[flank] Average depth // Average depth of flank regions.
[flank] flank Reads // The total number of reads covering the flank regions. Note: reads covering the edge of target regions are counted in flank regions as well.
[flank] Fraction of flank Reads in all reads // Ratio of reads covered in flank regions against raw reads.
[flank] Fraction of flank Reads in mapped reads // Ratio of reads covered in flank regions against mapped reads.
[flank] flank Data(Mb) // Total bases in the flank regions.
[flank] Fraction of flank Data in all data // Ratio of total bases in the flank regions against raw data.
[flank] Fraction of flank Data in mapped data // Ratio of total bases in the flank regions against mapped data.
[flank] Coverage (>0x) // Ratio of flank bases with depth greater than 0x.
[flank] Coverage (>=4x) // Ratio of flank bases with depth greater than or equal to 4x.
[flank] Coverage (>=10x) // Ratio of flank bases with depth greater than or equal to 10x.
[flank] Coverage (>=30x) // Ratio of flank bases with depth greater than or equal to 30x.
```
```python
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import collections
import getopt
import json
import logging
import os
import sys

import pandas
__author__ = '豆浆包子'
__all__ = [
'SingleQcUtil',
'process',
'doProcess',
'collectFastqInfo',
'collectBamInfo',
    'collectInsertSize',
    'usage'
]
class SingleQcUtil(object):
"""
根据以下软件输出结果,生成最终Qc文件
fastp fastp.json
bamdst coverage.report
gatk CollectInsertSizeMetrics(Picard)
"""
def __init__(self):
        '''
        Initialization: configure logging, parse command-line arguments,
        print usage information, etc.
        '''
logging.basicConfig(
level = logging.INFO,
format = '%(asctime)s - %(filename)s[line:%(lineno)d] - %(levelname)s: %(message)s'
)
self.log = logging.getLogger(__name__)
self.sn ='Sample'
self.bed = None
self.out = None
self.sample_fastp = None
self.sample_bamdst = None
self.sample_insertsize = None
try:
self.opts = getopt.getopt(
sys.argv[1:],
"o:dh",
[
"out=",
"sample-fastp=",
"sample-bamdst=",
"sample-insertsize=",
"document",
"help"
]
)
except getopt.GetoptError:
            print('Error: please check that your input arguments are correct\n\n')
self.usage()
sys.exit()
if len(self.opts[0])==0:
self.usage()
sys.exit()
        # parse command-line argument values
for arg, value in self.opts[0]:
if arg=="-o" or arg == "--out":
if value.endswith('.xls'):
self.out = value
if os.path.isfile(value) :
self.log.warn('[-o]或[--out]参数值 %s 文件已经存在,输出结果将覆盖该文件' %(value))
else:
print('[-o]或[--out]参数值 %s 不是一个有效的文件(以%s结尾)' %(value,'.xls'))
elif arg == "--sample-fastp":
if os.path.isfile(value) and value.endswith('.json'):
self.sample_fastp = value
fullpath = os.path.realpath(value)
fileOnly = os.path.basename(fullpath)
self.sn = fileOnly[0:fileOnly.rindex('_fastp.json')]
else:
                    print('[--sample-fastp] value %s is not a valid file' % value)
sys.exit()
elif arg == "--sample-bamdst":
if os.path.isfile(value) :
self.sample_bamdst=value
else:
                    print('[--sample-bamdst] value %s is not a valid file' % value)
sys.exit()
elif arg == "--sample-insertsize":
if os.path.isfile(value):
self.sample_insertsize=value
else:
                    print('[--sample-insertsize] value %s is not a valid file' % value)
sys.exit()
elif arg=="-h" or arg == "--help":
self.usage()
sys.exit()
elif arg=="-d" or arg=="--document":
import SingleQcUtil
help(SingleQcUtil)
sys.exit()
        ps_stats = False
        if self.out is None:
            print('[-o]/[--out] must not be empty')
            ps_stats = True
        if self.sample_fastp is None:
            print('[--sample-fastp] must not be empty')
            ps_stats = True
        if self.sample_bamdst is None:
            print('[--sample-bamdst] must not be empty')
            ps_stats = True
        if self.sample_insertsize is None:
            print('[--sample-insertsize] must not be empty')
            ps_stats = True
        if ps_stats:
            sys.exit()
self.cpath = os.getcwd()
self.log.info('WorkingDir is : %s' % self.cpath)
def doProcess(self):
"""
处理传入的vcf文件和annovar注释文件,输出结果
args:
fastp fastp 输出文件
sample_bamdst bamdst 输出文件
sample_insertsize gatk CollectInsertSizeMetrics 输出文件
return:
无返回值
"""
dic = collections.OrderedDict()
dic['Sample Number'] = self.sn
dic['Total Reads (M) Before Filtering']='NAN'
dic['Q20 Rate Before Filtering'] ='NAN'
dic['Q30 Rate Before Filtering'] ='NAN'
dic['GC Rate Before Filtering'] ='NAN'
dic['Total Reads (M) After Filtering']='NAN'
dic['Q20 Rate After Filtering'] ='NAN'
dic['Q30 Rate After Filtering'] ='NAN'
dic['GC Rate After Filtering'] ='NAN'
dic['Mapping Rate'] ='NAN'
dic['Duplicate Rate'] ='NAN'
dic['Target in Mapping Rate'] ='NAN'
dic['Region Length'] ='NAN'
dic['Average Depth'] ='NAN'
dic['1× Coverage'] ='NAN'
dic['4× Coverage'] ='NAN'
dic['10× Coverage'] ='NAN'
dic['20× Coverage'] ='NAN'
dic['30× Coverage'] ='NAN'
dic['100× Coverage'] ='NAN'
#dic['500× Coverage'] ='NAN'
dic['mean insert size'] ='NAN'
dic = self.collectFastqInfo(self.sample_fastp,dic)
dic = self.collectBamInfo(self.sample_bamdst,dic)
dic = self.collectInsertSize(self.sample_insertsize,dic)
result = pandas.DataFrame([dic],index=[0])
#result.drop(['covered','total'],axis=1,inplace=True)
        self.log.info('writing result to file %s', self.out)
print(result.stack())
result.to_csv(self.out,index=False,encoding='utf-8',sep='\t')
def collectFastqInfo(self,filename,dic):
        '''
        Read the total reads and the Q20/Q30/GC rates before and after
        filtering from the fastp JSON output.
        '''
if os.path.isfile(filename):
try:
with open(filename,'r') as load_f:
load_dict = json.load(load_f)
dic['Total Reads (M) Before Filtering'] = format(float(load_dict['summary']['before_filtering']['total_reads'])/1000000,'.2f')
dic['Q20 Rate Before Filtering'] = format(load_dict['summary']['before_filtering']['q20_rate'],'.2%')
dic['Q30 Rate Before Filtering'] = format(load_dict['summary']['before_filtering']['q30_rate'],'.2%')
dic['GC Rate Before Filtering'] = format(load_dict['summary']['before_filtering']['gc_content'],'.2%')
dic['Total Reads (M) After Filtering'] = format(float(load_dict['summary']['after_filtering']['total_reads'])/1000000,'.2f')
                    dic['Q20 Rate After Filtering'] = format(load_dict['summary']['after_filtering']['q20_rate'],'.2%')
                    dic['Q30 Rate After Filtering'] = format(load_dict['summary']['after_filtering']['q30_rate'],'.2%')
                    dic['GC Rate After Filtering'] = format(load_dict['summary']['after_filtering']['gc_content'],'.2%')
            except Exception as e:
self.log.error(e)
return dic
def collectBamInfo(self,filename,dic):
        '''
        Read the bamdst coverage.report file and extract
        Mapping Rate, Duplicate Rate, Target in Mapping Rate, Region Length,
        Average Depth, and the 1x/4x/10x/20x/30x/100x coverage fields.
        '''
if os.path.isfile(filename):
            f = open(filename)
            for line in f:
                # skip comment lines and blank lines
                if line.startswith('#') or len(line.strip()) == 0:
                    continue
                arrs = line.split('\t')
                if len(arrs) < 2:
                    continue
                key = arrs[0].strip()
                val = arrs[1].strip()
if key=='[Total] Fraction of Mapped Reads':
dic['Mapping Rate']=val
elif key=='[Total] Fraction of PCR duplicate reads':
dic['Duplicate Rate']=val
elif key=='[Target] Fraction of Target Reads in mapped reads':
dic['Target in Mapping Rate']=val
elif key=='[Target] Len of region':
dic['Region Length']=val
elif key=='[Target] Average depth(rmdup)':
dic['Average Depth']=val
elif key=='[Target] Coverage (>0x)':
dic['1× Coverage']=val
elif key=='[Target] Coverage (>=4x)':
dic['4× Coverage']=val
elif key=='[Target] Coverage (>=10x)':
dic['10× Coverage']=val
elif key=='[Target] Coverage (>=20x)':
dic['20× Coverage']=val
elif key=='[Target] Coverage (>=30x)':
dic['30× Coverage']=val
elif key=='[Target] Coverage (>=100x)':
dic['100× Coverage']=val
#elif key=='[Target] Coverage (>=500x)':
# dic['500× Coverage']=val
f.close()
return dic
def collectInsertSize(self,filename,dic):
        '''
        Extract the mean insert size from the gatk CollectInsertSizeMetrics output.
        '''
if os.path.isfile(filename):
temp_data = pandas.read_csv(
filename,
names=[
'MEDIAN_INSERT_SIZE',
'MODE_INSERT_SIZE',
'MEDIAN_ABSOLUTE_DEVIATION',
'MIN_INSERT_SIZE',
'MAX_INSERT_SIZE',
'MEAN_INSERT_SIZE',
'STANDARD_DEVIATION',
'READ_PAIRS',
'PAIR_ORIENTATION',
'WIDTH_OF_10_PERCENT',
'WIDTH_OF_20_PERCENT',
'WIDTH_OF_30_PERCENT',
'WIDTH_OF_40_PERCENT',
'WIDTH_OF_50_PERCENT',
'WIDTH_OF_60_PERCENT',
'WIDTH_OF_70_PERCENT',
'WIDTH_OF_80_PERCENT',
'WIDTH_OF_90_PERCENT',
'WIDTH_OF_95_PERCENT',
'WIDTH_OF_99_PERCENT',
'SAMPLE',
'LIBRARY',
'READ_GROUP'
],
sep='\t',
header=0,
nrows=1,
comment='#'
)
dic['mean insert size']=int(temp_data['MEAN_INSERT_SIZE'][0])
return dic
def usage(self):
        '''
        Print usage information.
        '''
        print('Usage : ./SingleQcUtil.py [OPTION]... or: python SingleQcUtil.py [OPTION]')
        print('''
    Build the QC analysis result file (extension .xls) from the outputs of
    fastp, bamdst and gatk CollectInsertSizeMetrics (Picard).
Example:
./SingleQcUtil.py \\
-o /opt/result/2019.result.QC.xls \\
--sample-fastp=/opt/result/2019_fastp.json \\
--sample-bamdst=/opt/result/coverage.report \\
--sample-insertsize=/opt/result/2019_insertsize_metrics.txt
    or:
python SingleQcUtil.py \\
--out=/opt/result/2019.result.QC.xls \\
--sample-fastp=/opt/result/2019_fastp.json \\
--sample-bamdst=/opt/result/coverage.report \\
--sample-insertsize=/opt/result/2019_insertsize_metrics.txt
''')
        print('''Options:\n''')
        print('-o, --out=\t\toutput result file')
        print('--sample-fastp=\t\tfastp JSON output file')
        print('--sample-bamdst=\tbamdst coverage.report output file')
        print('--sample-insertsize=\tgatk CollectInsertSizeMetrics (Picard) output file')
        print('-h, --help\t\tshow this help')
        print('-d, --document\t\tshow the development docs')
        print('\n')
        print('Report bugs to <6041738@qq.com>.\n')
if __name__ == '__main__':
f = SingleQcUtil()
    f.doProcess()
```
### Usage:
```bash
python SingleQcUtil.py \
--out=RD1703007FFP.QC.xls \
--sample-fastp=RD1703007FFP_fastp.json \
--sample-bamdst=coverage.report \
--sample-insertsize=RD1703007FFP_insertsize_metrics.txt
```
| Item                             | Value        |
| -------------------------------- | -----------: |
| Sample Number                    | RD1703007FFP |
| Total Reads (M) Before Filtering | 1.54         |
| Q20 Rate Before Filtering        | 96.92%       |
| Q30 Rate Before Filtering        | 92.25%       |
| GC Rate Before Filtering         | 48.99%       |
| Total Reads (M) After Filtering  | 1.50         |
| Q20 Rate After Filtering         | 96.92%       |
| Q30 Rate After Filtering         | 92.25%       |
| GC Rate After Filtering          | 48.99%       |
| Mapping Rate                     | 99.98%       |
| Duplicate Rate                   | 45.00%       |
| Target in Mapping Rate           | 17.69%       |
| Region Length                    | 55953        |
| Average Depth                    | 154.77       |
| 1× Coverage                      | 25.06%       |
| 4× Coverage                      | 21.93%       |
| 10× Coverage                     | 21.16%       |
| 20× Coverage                     | 20.71%       |
| 30× Coverage                     | 20.36%       |
| 100× Coverage                    | 19.00%       |
| mean insert size                 | 252          |
Original-work statement: This article was published on the Tencent Cloud Developer Community with the author's authorization and may not be reproduced without permission.
In case of infringement, please contact cloudcommunity@tencent.com for removal.