一般来说,测序结果如下如所示,包括barcode和部分insert。而barcode部分在demultiplexing会被去除,剩下的就只有测到的一部分insert序列。
但是,在实际操作中,由于各种原因,可能会出现“测通”的情况,也就是一部分adapter序列也被测序仪读取到,这时就要进行去接头操作
我们此处使用trim_galore。当然,可以完成这一步的软件非常多,读者可以自行探索
(rna) csw@taikang-bio:~/projects/rna-seq/1.sra$ trim_galore --help
USAGE:
trim_galore [options] <filename(s)>
-h/--help Print this help message and exits.
-v/--version Print the version information and exits.
-q/--quality <INT> Trim low-quality ends from reads in addition to adapter removal. Default Phred score: 20.
## 碱基质量小于多少的reads会被去除,一般选择Q20,严格的话也可以选用Q30
--phred33 Instructs Cutadapt to use ASCII+33 quality scores as Phred scores (Sanger/Illumina 1.9+ encoding) for quality trimming. Default: ON.
## 默认fastq文件使用phred33
-a/--adapter <STRING> Adapter sequence to be trimmed. If not specified explicitly, Trim Galore will
try to auto-detect whether the Illumina universal, Nextera transposase or Illumina
small RNA adapter sequence was used.
## 如果没有指定接头序列,trim_galore会自动检测接头序列并去除
--stringency <INT> Overlap with adapter sequence required to trim a sequence. Defaults to a very stringent setting of 1.
## 在3‘末端至少有几个连续碱基被认为是接头序列之后会被去除,默认是1个
-e <ERROR RATE> Maximum allowed error rate (no. of errors divided by the length of the matching region) (default: 0.1)
## 允许的错误率是多少
--length <INT> Discard reads that became shorter than length INT because of either
quality or adapter trimming. A value of '0' effectively disables
this behaviour. Default: 20 bp.
For paired-end files, both reads of a read-pair need to be longer than
<INT> bp to be printed out to validated paired-end files (see option --paired).
If only one read became too short there is the possibility of keeping such
unpaired single-end reads (see --retain_unpaired). Default pair-cutoff: 20 bp.
## 在去除接头之后,长度小于这个值的reads将被舍弃。对于PE测序,一个pair两个reads的长度都要大于这个值。当然也可以保留其中超过阈值的一条而舍弃另一条。
-o/--output_dir <DIR> If specified all output will be written to this directory instead of the current
directory.
## 输出文件夹,不指定会自动创建
-j/--cores INT Number of cores to be used for trimming [default: 1].
## 多线程运行
--paired This option performs length trimming of quality/adapter/RRBS trimmed reads for
paired-end files. To pass the validation test, both sequences of a sequence pair
are required to have a certain minimum length which is governed by the option
--length (see above). If only one read passes this length threshold the
other read can be rescued (see option --retain_unpaired). Using this option lets
you discard too short read pairs without disturbing the sequence-by-sequence order
of FastQ files which is required by many aligners.
## PE测序需要指定这个参数
--retain_unpaired If only one of the two paired-end reads became too short, the longer
read will be written to either '.unpaired_1.fq' or '.unpaired_2.fq'
output files. The length cutoff for unpaired single-end reads is
governed by the parameters -r1/--length_1 and -r2/--length_2. Default: OFF.
## 保留pair中通过length阈值的一条,舍弃另一条
PE测序可以保留pair中通过length阈值的一条,舍弃另一条,也可以同时舍弃两条。这时我们应该如何取舍?关于这个问题,主要有两种观点:
关于这个问题的细节,可以在黄树嘉大神的公众号《碱基矿工》查看他们的精彩论战。
私以为,可以保留SE之后看SE占总reads数的比例。如果非常小(自己界定,比如<1%?)就可以舍弃。如果比较大,就分别进行对比。
mkdir 4.trimg
cd ./4.trimg/
ln -s ~/path/to/*rmrRNA.fq.gz ./
cat ../SRR_Acc_List.txt | while read id
do
echo "trim_galore --length 35 --paired --retain_unpaired --cores 16 -o ./ ${id}_rm_1.fq.gz ${id}_rm_2.fq.gz"
done > trim.sh
nohup bash trim.sh &
每一个fastq文件会产生3个文件:val.fq.gz,unpaired.fq.gz以及去接头报告。
可以看到此处SE只有不到10M,相对于总量2G的测序数据不值一提,所以可以舍弃,后续分析会比较简便。
自动检测接头序列
这部分会重新说明一下我们制定的处理参数
用来过滤的文件的一个整体状况。有33.9%的reads含有接头,因为质量问题被去掉的碱基有0.3%。
fastqc -t 16 -o ./ ./*.fq.gz
multiqc ./*zip -o ./
可以看到主要就是接头序列没有了(图片右下角)