hello,hello!小伙伴们大家好,我是小编豆豆,最近小编在开发宏基因组流程,很多公司和小伙伴在开发流程时候,都会花大量的时间研究脚本或者软件的参数,很少有小伙伴们开发完流程或者软件会使用模拟数据来对其检验运行出来的结果是否正确。对于环境微生物(宏基因组或扩增子)来说,除了使用ZymoBIOMICS[1]微生物标准品测序数据和一些已经发表的公共数据来验证,还可以在NCBI下载基因组完成图、草图、16s rRNA等序列,使用软件将基因组打断,模拟测序数据来进行流程验证。
今天小编将结合自己前段时间的项目,给小伙伴们分享一个使用基因组数据生成测序数据的小工具——InSilicoSeq[2-3],该工具能够模拟宏基因和扩增子的测序数据。测序读长有PE150,PE300等,功能非常强大。
查看软件帮助文档
iss -h
usage: iss [subcommand] [options]
InSilicoSeq: A sequencing simulator
options:
-h, --help show this help message and exit
-v, --version print software version and exit
available subcommands:
model generate an error model from a bam file
generate simulate reads from an error model
常用参数解读
1)model子命令:从bam文件中获取测序错误率模型
2)generate子命令:根据提供的fasta序列,生成扩增子和宏基因组测序数据
usage: iss generate [-h] [--quiet | --debug] [--seed <int>] [--cpus <int>] [--genomes <genomes.fasta> [<genomes.fasta> ...]] [--draft <draft.fasta> [<draft.fasta> ...]] [--n_genomes <int>] [--ncbi [<str> ...]] [--n_genomes_ncbi [<int> ...]]
[--abundance <str> | --abundance_file <abundance.txt> | --coverage <str> | --coverage_file <coverage.txt> | --readcount_file <readcount.txt>] [--n_reads <int>] [--mode <str>] [--model <npz>] [--gc_bias] [--compress] --output <fastq>
[--sequence_type {metagenomics,amplicon}] [--fragment-length <int>] [--fragment-length-sd <int>] [--store_mutations]
simulate reads from an error model
arguments:
--help, -h show this help message and exit
--quiet, -q Disable info logging. (default: False).
--debug, -d Enable debug logging. (default: False).
--seed <int> Seed all the random number generators
--cpus <int>, -p <int>
number of cpus to use. (default: 2).
--genomes <genomes.fasta> [<genomes.fasta> ...], -g <genomes.fasta> [<genomes.fasta> ...]
Input genome(s) from where the reads will originate
--draft <draft.fasta> [<draft.fasta> ...]
Input draft genome(s) from where the reads will originate
--n_genomes <int>, -u <int>
How many genomes will be used for the simulation. is set with --genomes/-g or/and --draft to take random genomes from the input multifasta
--ncbi [<str> ...], -k [<str> ...]
Download input genomes from NCBI. Requires --n_genomes/-u option. Can be bacteria, viruses, archaea or a combination of the three (space-separated)
--n_genomes_ncbi [<int> ...], -U [<int> ...]
How many genomes will be downloaded from NCBI. Required if --ncbi/-k is set. If more than one kingdom is set with --ncbi, multiple values are necessary (space-separated).
--abundance <str>, -a <str>
abundance distribution (default: lognormal). Can be uniform, halfnormal, exponential, lognormal or zero-inflated-lognormal.
--abundance_file <abundance.txt>, -b <abundance.txt>
abundance file for coverage calculations (default: None).
--coverage <str>, -C <str>
coverage distribution. Can be uniform, halfnormal, exponential, lognormal or zero-inflated-lognormal.
--coverage_file <coverage.txt>, -D <coverage.txt>
file containing coverage information (default: None).
--readcount_file <readcount.txt>, -R <readcount.txt>
file containing read_count information (default: None).
--n_reads <int>, -n <int>
Number of reads to generate (default: 1000000). Allows suffixes k, K, m, M, g and G (ex 0.5M for 500000).
--mode <str>, -e <str>
Error model. If not specified, using kernel density estimation (default: kde). Can be kde, basic or perfect
--model <npz>, -m <npz>
Error model file. (default: None). Use HiSeq, NextSeq, NovaSeq, MiSeq or Miseq-[20,24,28,32] for a pre-computed error model provided with the software, or a file generated with iss model. If you do not wish to use a model, use --mode basic or --mode
perfect. The name of the built-in models are case insensitive.
--gc_bias, -c If set, may fail to sequence reads with abnormal GC content. (default: False)
--compress, -z Compress the output in gzip format (default: False).
--output <fastq>, -o <fastq>
Output file path and prefix (Required)
--sequence_type {metagenomics,amplicon}, -t {metagenomics,amplicon}
Type of sequencing. Can be metagenomics or amplicon (default: metagenomics).
--fragment-length <int>, -l <int>
Fragment length for metagenomics sequencing (default: None).
--fragment-length-sd <int>, -s <int>
Fragment length standard deviation for metagenomics sequencing (default: None).
--store_mutations, -M
Generates an additional VCF file with the mutations introduced in the reads
常用参数解读
1)--quiet:程序运行过程中是否显示运行日志,默认显示
2)--seed:设置随机种子,便于结果可重复性
3)--cpus:程序支持多进程,加速程序运行,默认使用2个进程运行
4)--genomes:提供的序列为基因组完成图
5) --draft:提供的序列为基因组草图
6)--n_genomes:从提供的基因文件中随机选择n条个基因组作为生成模拟数据
7)--abundance:使用统计模型生成每个基因组丰度,支持均匀分布(Uniform Distribution),半正态分布(Half-Normal Distribution),指数分布(Exponential Distribution),对数正态分布(Lognormal Distribution),零膨胀对数正态分布(Zero-Inflated Lognormal Distribution),默认为lognormal
8)--abundance_file:自定义每个基因组的丰度,是一个列表文件,第一列是基因组丰度,第二列是基因组丰度信息
9) --coverage:使用统计模型生成每个基因组测序乘数,支持均匀分布(Uniform Distribution),半正态分布(Half-Normal Distribution),指数分布(Exponential Distribution),对数正态分布(Lognormal Distribution),零膨胀对数正态分布(Zero-Inflated Lognormal Distribution),默认为lognormal
10)--coverage_file:自定义每个基因组的乘数,是一个列表文件,第一列是基因组丰度,第二列是基因组丰度信息
11)--n_read:生成reads的数目,默认为1000000,支持k, K, m, M, g , G等单位
12)--model:设定测序错误模型,支持的模型如下
13)--mode:提供自己训练的错误模型文件
14)--gc_bias:是否存在GC偏好
15) --compress:输出文件是否为压缩文件
16)--sequence_type:测序数据类型:metagenomics,amplicon
17)--fragment-length:测序打断片段
18)--fragment-length-sd:打断片段的标准差
实战演习
1.使用统计模型生成每个基因组丰度
iss generate --seed 1234 --cpus 16 --genomes SRS121011.fasta --model novaseq --n_reads 0.5M --output sample --compress --sequence_type metagenomics --fragment-length 400 --fragment-length-sd 10
说明:
1)测试基因组fasta文件:https://osf.io/thser/download
2)结果目录如下图所示:
默认没有生成的基因组突变文件.vcf文件的大小为0,可以使用参数--store_mutations, -M生成基因组vcf文件。
3)fastq文件统计
注意:参数--n_reads表示reads1和reads2加起来的reads数目,而不是reads对的丰度。
4)同时软件还生成一个基因组丰度文件sample_abundance.txt,如下图所示:
5)生成fastq
2.自定义每个基因组的的丰度
iss generate --seed 1234 --cpus 16 --genomes SRS121011.fasta --abundance_file default_abundance.txt --model novaseq --n_reads 0.5M --output default --compress --sequence_type metagenomics --fragment-length 400 --fragment-length-sd 10
说明:自定义基因组丰度文件格式。
参考资料
[1].https://www.zymoresearch.com/collections/zymobiomics-microbial-community-standards
[2].https://insilicoseq.readthedocs.io/en/latest/
[3].Hadrien Gourlé,Oskar Karlsson-Lindsjö,Juliette Hayer,et el.Simulating Illumina metagenomic data with InSilicoSeq, Bioinformatics, Volume 35, Issue 3, February 2019, Pages 521–522, https://doi.org/10.1093/bioinformatics/bty630