这个一个肿瘤外显子项目的文章发表并且公布的公共数据,我这里给出全套分析流程代码。只需要你肯实践,就可以运行成功。
PS:有些后起之秀自己运营公众号或者博客喜欢批评我们这些老人,一味的堆砌代码不给解释,恶意揣测我们是因为不懂代码的原理。我表示很无语,我写了3000多篇教程,要求我一篇篇都重复提到基础知识,我真的做不到。比如下面的流程,包括软件的用法,软件安装,注释数据库的下载,我博客都说过好几次了,直播我的基因组系列也详细解读过,我告诉你去哪里学,你却不珍惜,不当回事,呵呵。
paper;https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4812767/ a whole-exome sequencing (WES) study was performed in 161 NPC cases and 895 controls of Southern Chinese descent
We sequenced the blood samples from 161 NPC cases, including 39 EAO cases, 63 FH+ cases from 52 independent families, and 59 sporadic cases by WES and achieved an average coverage of 49-fold on target (range of 32- to 76-fold)
An additional 2,160 NPC cases and 2,433 healthy controls from Chinese Hong Kong were further examined for the selected candidate variants.
Whole-exome sequencing data for the early-age onset cases have been deposited in the Sequence Read Archive (SRA) database (accession ID SRA291701).
了解WES数据分析步骤: 2016-A Survey of Computational Tools to Analyze and Interpret Whole Exome Sequencing Data
随便选择了5个样本,其ID号及描述如下,
SRX445405 MALE NPC15 SRR1139956 NPC15F NO SRS540548 NPC15F-TSRX445406 MALE NPC15 SRR1139958 NPC15F NO SRS540549 NPC15F-NSRX445407 MALE NPC29 SRR1139966 NPC29F YES SRS540550 NPC29F-TSRX445408 MALE NPC29 SRR1139973 NPC29F YES SRS540551 NPC29F-NSRX445409 FEMALE NPC10 SRR1139999 NPC10F NO SRS540552 NPC10F-TSRX445410 FEMALE NPC10 SRR1140007 NPC10F NO SRS540553 NPC10F-NSRX445411 FEMALE NPC34 SRR1140015 NPC34F NO SRS540554 NPC34F-TSRX445412 FEMALE NPC34 SRR1140023 NPC34F NO SRS540555 NPC34F-NSRX445413 MALE NPC37 SRR1140044 NPC37F YES SRS540556 NPC37F-TSRX445414 MALE NPC37 SRR1140045 NPC37F YES SRS540557 NPC37F-N
把上面的描述文本存为文件npc.sra.txt下载脚本如下:
cat npc.sra.txt | cut -f 4|while read iddo echo $idwget -c ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP/SRP035/SRP035573/$id/$id.sradone
转换脚本如下:
cat npc.sra.txt | while read iddoarray=($id)echo ${array[3]}.sra ${array[7]} ~/biosoft/sratoolkit/sratoolkit.2.8.2-1-centos_linux64/bin/fastq-dump --gzip --split-3 -A \ ${array[7]} ${array[3]}.sra done
得到的fastq文件如下:
3.5G Aug 26 09:48 NPC10F-N_1.fastq.gz3.6G Aug 26 09:48 NPC10F-N_2.fastq.gz3.2G Aug 26 09:48 NPC10F-T_1.fastq.gz3.3G Aug 26 09:48 NPC10F-T_2.fastq.gz2.7G Aug 26 09:47 NPC15F-N_1.fastq.gz2.8G Aug 26 09:47 NPC15F-N_2.fastq.gz2.8G Aug 26 09:47 NPC15F-T_1.fastq.gz2.9G Aug 26 09:47 NPC15F-T_2.fastq.gz2.8G Aug 26 09:47 NPC29F-N_1.fastq.gz2.9G Aug 26 09:47 NPC29F-N_2.fastq.gz2.4G Aug 26 09:47 NPC29F-T_1.fastq.gz2.5G Aug 26 09:47 NPC29F-T_2.fastq.gz2.0G Aug 26 09:47 NPC34F-N_1.fastq.gz2.0G Aug 26 09:47 NPC34F-N_2.fastq.gz2.2G Aug 26 09:48 NPC34F-T_1.fastq.gz2.3G Aug 26 09:48 NPC34F-T_2.fastq.gz2.1G Aug 26 09:47 NPC37F-N_1.fastq.gz2.1G Aug 26 09:47 NPC37F-N_2.fastq.gz2.2G Aug 26 09:46 NPC37F-T_1.fastq.gz2.2G Aug 26 09:46 NPC37F-T_2.fastq.gz
简单的走一下fastqc+multiqc 看看数据质量,一般都会很不错的,这个数据也不例外。
ls *.gz |xargs ~/biosoft/fastqc/FastQC/fastqc -o ./ -t 5
但是值得注意的是本文章中的测序reads的编码格式是phred+64 而不是传统的33
选用的是经典的GATK best practice的流程,代码如下:
/*
* 提示:该行代码过长,系统自动注释不进行高亮。一键复制会移除系统注释
* module load java/1.8.0_91GENOME=/home/jianmingzeng/reference/genome/human_g1k_v37/human_g1k_v37.fastaINDEX=/home/jianmingzeng/reference/index/bwa/human_g1k_v37GATK=/home/jianmingzeng/biosoft/GATK/GenomeAnalysisTK.jarPICARD=/home/jianmingzeng/biosoft/picardtools/2.9.2/picard.jarDBSNP=/home/jianmingzeng/annotation/variation/human/dbSNP/All_20160601.vcf.gzSNP=/home/jianmingzeng/biosoft/GATK/resources/bundle/b37/1000G_phase1.snps.high_confidence.b37.vcf.gzINDEL=/home/jianmingzeng/biosoft/GATK/resources/bundle/b37/Mills_and_1000G_gold_standard.indels.b37.vcf.gzKG_SNP=/home/jianmingzeng/biosoft/GATK/resources/bundle/b37/1000G_phase1.snps.high_confidence.b37.vcf.gzMills_indels=/home/jianmingzeng/biosoft/GATK/resources/bundle/b37/Mills_and_1000G_gold_standard.indels.b37.vcfKG_indels=/home/jianmingzeng/biosoft/GATK/resources/bundle/b37/1000G_phase1.indels.b37.vcfTMPDIR=/home/jianmingzeng/tmp/software## samtools and bwa are in the environment## samtools Version: 1.3.1 (using htslib 1.3.1)## bwa Version: 0.7.15-r1140#arr=($1)#fq1=${arr[0]}#fq2=${arr[1]}#sample=${arr[2]}fq1=$1fq2=$2sample=$3##################################################################### Step 1 : Alignment ######################################################################start=$(date +%s.%N)echo bwa `date`bwa mem -t 5 -M -R "@RG\tID:$sample\tSM:$sample\tLB:WES\tPL:Illumina" $INDEX $fq1 $fq2 1>$sample.sam 2>/dev/null echo bwa `date`dur=$(echo "$(date +%s.%N) - $start" | bc)printf "Execution time for BWA : %.6f seconds" $durecho ##################################################################### Step 2: Sort and Index ##################################################################start=$(date +%s.%N)echo SortSam `date`java -Djava.io.tmpdir=$TMPDIR -Xmx40g -jar $PICARD SortSam SORT_ORDER=coordinate INPUT=$sample.sam OUTPUT=$sample.bamsamtools index $sample.bamecho SortSam `date`dur=$(echo "$(date +%s.%N) - $start" | bc)printf "Execution time for SortSam : %.6f seconds" $durecho rm $sample.sam ##################################################################### Step 3: Basic Statistics ################################################################start=$(date +%s.%N)echo stats `date`samtools flagstat $sample.bam > ${sample}.alignment.flagstatsamtools stats $sample.bam > ${sample}.alignment.statecho plot-bamstats -p ${sample}_QC ${sample}.alignment.statecho stats `date`dur=$(echo "$(date +%s.%N) - $start" | bc)printf "Execution time for Basic Statistics : %.6f seconds" $durecho ############################################################ Step 4: multiple filtering for bam files ############################################################MarkDuplicates###start=$(date +%s.%N)echo MarkDuplicates `date`java -Djava.io.tmpdir=$TMPDIR -Xmx40g -jar $PICARD MarkDuplicates \INPUT=$sample.bam OUTPUT=${sample}_marked.bam METRICS_FILE=$sample.metricsecho MarkDuplicates `date`dur=$(echo "$(date +%s.%N) - $start" | bc)printf "Execution time for MarkDuplicates : %.6f seconds" $durecho rm $sample.bam ###FixMateInfo###start=$(date +%s.%N)echo FixMateInfo `date`java -Djava.io.tmpdir=$TMPDIR -Xmx40g -jar $PICARD FixMateInformation \INPUT=${sample}_marked.bam OUTPUT=${sample}_marked_fixed.bam SO=coordinatesamtools index ${sample}_marked_fixed.bamecho FixMateInfo `date`dur=$(echo "$(date +%s.%N) - $start" | bc)printf "Execution time for FixMateInfo : %.6f seconds" $durecho rm ${sample}_marked.bam ############################################################ Step 5: gatk process bam files ###################################################################### SplitNCigar ###start=$(date +%s.%N)echo SplitNCigar `date`java -Djava.io.tmpdir=$TMPDIR -Xmx40g -jar $GATK -T SplitNCigarReads \ -R $GENOME -I ${sample}_marked_fixed.bam -o ${sample}_marked_fixed_split.bam \-rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW_N_CIGAR_READS #--fix_misencoded_quality_scores## --fix_misencoded_quality_scores only if phred 64 echo SplitNCigar `date`dur=$(echo "$(date +%s.%N) - $start" | bc)printf "Execution time for SplitNCigar : %.6f seconds" $durecho rm ${sample}_marked_fixed.bam # rm ${sample}.sam ${sample}.bam ${sample}_marked.bam ${sample}_marked_fixed.bam###RealignerTargetCreator###start=$(date +%s.%N)echo RealignerTargetCreator `date`java -Djava.io.tmpdir=$TMPDIR -Xmx40g -jar $GATK -T RealignerTargetCreator \-I ${sample}_marked_fixed_split.bam -R $GENOME -o ${sample}_target.intervals \-known $Mills_indels -known $KG_indels -nt 5echo RealignerTargetCreator `date`dur=$(echo "$(date +%s.%N) - $start" | bc)printf "Execution time for RealignerTargetCreator : %.6f seconds" $durecho ###IndelRealigner###start=$(date +%s.%N)echo IndelRealigner `date`java -Djava.io.tmpdir=$TMPDIR -Xmx40g -jar $GATK -T IndelRealigner \-I ${sample}_marked_fixed_split.bam -R $GENOME -targetIntervals ${sample}_target.intervals \-o ${sample}_realigned.bam -known $Mills_indels -known $KG_indelsecho IndelRealigner `date`dur=$(echo "$(date +%s.%N) - $start" | bc)printf "Execution time for IndelRealigner : %.6f seconds" $durecho rm ${sample}_marked_fixed_split.bam###BaseRecalibrator###start=$(date +%s.%N)echo BaseRecalibrator `date`java -Djava.io.tmpdir=$TMPDIR -Xmx40g -jar $GATK -T BaseRecalibrator \-I ${sample}_realigned.bam -R $GENOME -o ${sample}_temp.table -knownSites $DBSNPecho BaseRecalibrator `date`dur=$(echo "$(date +%s.%N) - $start" | bc)printf "Execution time for BaseRecalibrator : %.6f seconds" $durecho ###PrintReads###start=$(date +%s.%N)echo PrintReads `date`java -Djava.io.tmpdir=$TMPDIR -Xmx40g -jar $GATK -T PrintReads \-R $GENOME -I ${sample}_realigned.bam -o ${sample}_recal.bam -BQSR ${sample}_temp.tablesamtools index ${sample}_recal.bamecho PrintReads `date`dur=$(echo "$(date +%s.%N) - $start" | bc)printf "Execution time for PrintReads : %.6f seconds" $durecho rm ${sample}_realigned.bamchmod uga=r ${sample}_recal.bam ################################################################### Step 6: gatk call snp/indel################################################################## start=$(date +%s.%N)echo HaplotypeCaller `date`java -Djava.io.tmpdir=$TMPDIR -Xmx40g -jar $GATK -T HaplotypeCaller \-R $GENOME -I ${sample}_recal.bam --dbsnp $DBSNP \-stand_emit_conf 10 -o ${sample}_raw.snps.indels.vcfecho HaplotypeCaller `date`dur=$(echo "$(date +%s.%N) - $start" | bc)printf "Execution time for HaplotypeCaller : %.6f seconds" $durecho java -Djava.io.tmpdir=$TMPDIR -Xmx40g -jar $GATK -T SelectVariants -R $GENOME \-selectType SNP \-V ${sample}_raw.snps.indels.vcf -o ${sample}_raw_snps.vcfjava -Djava.io.tmpdir=$TMPDIR -Xmx40g -jar $GATK -T SelectVariants -R $GENOME \-selectType INDEL \-V ${sample}_raw.snps.indels.vcf -o ${sample}_raw_indels.vcf## :''## for SNPjava -Djava.io.tmpdir=$TMPDIR -Xmx40g -jar $GATK -T VariantFiltration -R $GENOME \--filterExpression "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" \--filterName "my_snp_filter" \-V ${sample}_raw_snps.vcf -o ${sample}_filtered_snps.vcf java -Djava.io.tmpdir=$TMPDIR -Xmx40g -jar $GATK -T SelectVariants -R $GENOME \--excludeFiltered \-V ${sample}_filtered_snps.vcf -o ${sample}_filtered_PASS.snps.vcfjava -Djava.io.tmpdir=$TMPDIR -Xmx40g -jar $GATK -T VariantEval -R $GENOME \-eval ${sample}_filtered_PASS.snps.vcf -o ${sample}_filtered_PASS.snps.vcf.eval## for INDELjava -Djava.io.tmpdir=$TMPDIR -Xmx40g -jar $GATK -T VariantFiltration -R $GENOME \--filterExpression "QD < 2.0 || FS > 200.0 || ReadPosRankSum < -20.0" \--filterName "my_indel_filter" \-V ${sample}_raw_indels.vcf -o ${sample}_filtered_indels.vcf java -Djava.io.tmpdir=$TMPDIR -Xmx40g -jar $GATK -T SelectVariants -R $GENOME \--excludeFiltered \-V ${sample}_filtered_indels.vcf -o ${sample}_filtered_PASS.indels.vcfjava -Djava.io.tmpdir=$TMPDIR -Xmx40g -jar $GATK -T VariantEval -R $GENOME \-eval ${sample}_filtered_PASS.indels.vcf -o ${sample}_filtered_PASS.indels.vcf.eval
*/
比对成功后得到的bam文件如下;
7.7G Aug 26 19:22 NPC10F-N_marked_fixed.bam7.7G Aug 26 22:59 NPC10F-N_marked_fixed_split.bam7.7G Aug 26 23:57 NPC10F-N_realigned.bam 15G Aug 27 03:45 NPC10F-N_recal.bam 7.0G Aug 26 19:49 NPC10F-T_marked_fixed.bam7.0G Aug 26 22:55 NPC10F-T_marked_fixed_split.bam7.0G Aug 26 23:48 NPC10F-T_realigned.bam 13G Aug 27 03:12 NPC10F-T_recal.bam
Snp-calling结束后得到的vcf如下:
82576 NPC10F-N_raw_indels.vcf 863243 NPC10F-N_raw_snps.vcf 945753 NPC10F-N_recal_raw.snps.indels.vcf 89604 NPC10F-T_raw_indels.vcf 819657 NPC10F-T_raw_snps.vcf 909190 NPC10F-T_recal_raw.snps.indels.vcf
这些vcf里面的变异位点还需要进行简单的过滤,或者只提取外显子区域的变异位点。(从WGS测序得到的VCF文件里面提取位于外显子区域的【直播】我的基因组84)
消耗时间如下;
# for NPC10F-N_1.fastq.gz NPC10F-N_2.fastq.gz NPC10F-Nbwa Sat Aug 26 16:04:44 CST 2017bwa Sat Aug 26 17:07:11 CST 2017SortSam Sat Aug 26 17:07:11 CST 2017SortSam Sat Aug 26 17:45:04 CST 2017stats Sat Aug 26 17:45:04 CST 2017plot-bamstats -p NPC10F-N_QC NPC10F-N.alignment.statstats Sat Aug 26 17:56:07 CST 2017MarkDuplicates Sat Aug 26 17:56:07 CST 2017MarkDuplicates Sat Aug 26 18:40:25 CST 2017FixMateInfo Sat Aug 26 18:40:25 CST 2017FixMateInfo Sat Aug 26 19:26:39 CST 2017SplitNCigar Sat Aug 26 22:20:48 CST 2017SplitNCigar Sat Aug 26 22:59:32 CST 2017RealignerTargetCreator Sat Aug 26 22:59:32 CST 2017RealignerTargetCreator Sat Aug 26 23:17:35 CST 2017IndelRealigner Sat Aug 26 23:17:35 CST 2017IndelRealigner Sat Aug 26 23:57:35 CST 2017BaseRecalibrator Sat Aug 26 23:57:35 CST 2017BaseRecalibrator Sun Aug 27 01:27:39 CST 2017PrintReads Sun Aug 27 01:27:39 CST 2017PrintReads Sun Aug 27 03:52:03 CST 2017#for NPC10F-T_1.fastq.gz NPC10F-T_2.fastq.gz NPC10F-Tbwa Sat Aug 26 16:54:14 CST 2017bwa Sat Aug 26 17:48:07 CST 2017SortSam Sat Aug 26 17:48:07 CST 2017SortSam Sat Aug 26 18:20:48 CST 2017stats Sat Aug 26 18:20:48 CST 2017plot-bamstats -p NPC10F-T_QC NPC10F-T.alignment.statstats Sat Aug 26 18:30:45 CST 2017MarkDuplicates Sat Aug 26 18:30:45 CST 2017MarkDuplicates Sat Aug 26 19:11:40 CST 2017FixMateInfo Sat Aug 26 19:11:40 CST 2017FixMateInfo Sat Aug 26 19:52:37 CST 2017SplitNCigar Sat Aug 26 22:20:54 CST 2017SplitNCigar Sat Aug 26 22:55:53 CST 2017RealignerTargetCreator Sat Aug 26 22:55:53 CST 2017RealignerTargetCreator Sat Aug 26 23:14:19 CST 2017IndelRealigner Sat Aug 26 23:14:19 CST 2017IndelRealigner Sat Aug 26 23:48:43 CST 2017BaseRecalibrator Sat Aug 26 23:48:43 CST 2017BaseRecalibrator Sun Aug 27 01:10:26 CST 2017PrintReads Sun Aug 27 01:10:26 CST 2017PrintReads Sun Aug 27 03:18:33 CST 2017
外显子的QC结果(这个是我自己写的脚本)是:
## for NPC10F-N32541594 2392028455 0.982188933530586 73.506800404430118711840 934414537 0.971564415370185 49.937073906147117075251 505674931 0.895198983425405 29.614494744469615656543 241509710 0.819070615186704 15.4254812189383## for NPC10F-T32348763 2946257280 0.97636879840624 91.077896239803718182204 1054428864 0.94406442121146 57.992356922186116075260 555403212 0.842772759844003 34.550185315820714181528 265599059 0.741905340358179 18.7285219900141
可以看到T的测序深度高于N,符合实验设计。T的外显子区域平均测序深度高达91X,是比较好的数据了,而且外显子旁边50bp范围区域平均测序深度还有57X,旁边100bp区域还有34X,也说明这款芯片的捕获效率比较好
因为是配对数据,还可以走somatic mutation calling流程
reference=/home/jianmingzeng/reference/genome/human_g1k_v37/human_g1k_v37.fastaGENOME=/home/jianmingzeng/reference/genome/human_g1k_v37/human_g1k_v37.fastaTMPDIR=/home/jianmingzeng/tmp/softwarenormal_bam=NPC10F-N_recal.bamtumor_bam=NPC10F-T_recal.bamsample=NPC10F######################################################################## Step : Run VarScan ##################################################################normal_pileup="samtools mpileup -q 1 -f $reference $normal_bam";tumor_pileup="samtools mpileup -q 1 -f $reference $tumor_bam";# Next, issue a system call that pipes input from these commands into VarScan :java -Djava.io.tmpdir=$TMPDIR -Xmx40g -jar ~/biosoft/VarScan/VarScan.v2.3.9.jar \somatic <($normal_pileup) <($tumor_pileup) $samplejava -jar ~/biosoft/VarScan/VarScan.v2.3.9.jar processSomatic $sample.snp######################################################################## Step : Run Mutect2 ################################################################## java -Djava.io.tmpdir=$TMPDIR -Xmx40g -jar $GATK -T MuTect2 \-R $GENOME -I:tumor $tumor_bam -I:normal $normal_bam \--dbsnp $DBSNP -o ${sample}-mutect2.vcf######################################################################## Step : Run Muse######################################################################~/biosoft/muse/muse call -O $sample -f $reference $tumor_bam $normal_bam~/biosoft/muse/muse sump -I ${sample}.MuSE.txt -E –O ${sample}.vcf –D $DBSNP
其中varscan耗费两个半小时,结果如下:
893539210 positions in tumor891822458 positions shared in normal79827518 had sufficient coverage for comparison79718238 were called Reference26 were mixed SNP-indel calls and filtered102492 were called Germline3703 were called LOH2569 were called Somatic490 were called Unknown0 were called VariantReading input from NPC10F.snpOpening output files: NPC10F.snp.Somatic NPC10F.snp.Germline NPC10F.snp.LOH101352 VarScan calls processed2342 were Somatic (556 high confidence)95144 were Germline (4830 high confidence)3502 were LOH (3484 high confidence)
这3个软件找到的somatic mutation可以互相对比一下,差异主要是在哪里,都值得自己去探究,这样最终确定的肿瘤外显子测序数据分析流程就更有把握。
软件太多了,我就不一一列出具体代码了,还有很多需要下载的参考基因组,变异数据库也是以前直播基因组的时候已经反复提到过,也不赘述啦。
conda install -c bioconda bedtoolsconda install -c bioconda bwaconda install -c bioconda samtoolscd ~/biosoftmkdir sratoolkit && cd sratoolkitwget https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/current/sratoolkit.current-centos_linux64.tar.gztar zxvf sratoolkit.current-centos_linux64.tar.gz~/biosoft/sratoolkit/sratoolkit.2.8.2-1-centos_linux64/bin/fastdump -h## https://sourceforge.net/projects/picard/## https://github.com/broadinstitute/picardcd ~/biosoftmkdir picardtools && cd picardtoolswget http://ncu.dl.sourceforge.net/project/picard/picard-tools/1.119/picard-tools-1.119.zipunzip picard-tools-1.119.zipmkdir 2.9.2 && cd 2.9.2 wget https://github.com/broadinstitute/picard/releases/download/2.9.2/picard.jarcd ~/biosoft## https://sourceforge.net/projects/varscan/files/## http://varscan.sourceforge.net/index.htmlmkdir VarScan && cd VarScan wget https://sourceforge.net/projects/varscan/files/VarScan.v2.3.9.jar cd ~/biosoftmkdir SnpEff && cd SnpEff## http://snpeff.sourceforge.net/## http://snpeff.sourceforge.net/SnpSift.html ## http://snpeff.sourceforge.net/SnpEff_manual.htmlwget http://sourceforge.net/projects/snpeff/files/snpEff_latest_core.zip ## java -jar snpEff.jar download GRCh37.75## java -Xmx4G -jar snpEff.jar -i vcf -o vcf GRCh37.75 example.vcf > example_snpeff.vcfunzip snpEff_latest_core.zip