I have long been looking for a way to chain different workflows together. I first tried Nextflow, but its syntax gave me a headache. Then I discovered Snakemake, a Python-based framework, and with great relief I set out to learn it.
As the tutorial puts it, learning a workflow language like this is an investment:
the benefits of learning Snakemake will continue to pay dividends for years to come.
A traditional shell-script pipeline is input-oriented. Take sequencing data as an example: download, filtering, quality control, alignment, and so on. The trouble is that when a step fails, you may spend a lot of time locating the step (or steps) that went wrong before you can debug it.
Snakemake, by contrast, takes an output-oriented, backward-looking approach: the workflow first declares the output files you want, then chooses the appropriate input files and tools to produce them.
A Snakemake workflow boils down to two steps: 1) define a set of rules; 2) specify the desired outputs, and Snakemake works out which tools or steps are needed to produce them.
This output-oriented approach has a number of advantages.
Snakemake also pairs nicely with conda. The official tutorial installs it via mamba:
conda install -n base -c conda-forge mamba
mamba create -c conda-forge -c bioconda -n snakemake snakemake
conda activate snakemake
If the help page shows up, the installation succeeded:
$ snakemake -h | head
usage: snakemake [-h] [--dry-run] [--profile PROFILE] [--cache [RULE ...]]
[--snakefile FILE] [--cores [N]] [--jobs [N]]
[--local-cores N] [--resources [NAME=INT ...]]
[--set-threads RULE=THREADS [RULE=THREADS ...]]
[--max-threads MAX_THREADS]
[--set-resources RULE:RESOURCE=VALUE [RULE:RESOURCE=VALUE ...]]
[--set-scatter NAME=SCATTERITEMS [NAME=SCATTERITEMS ...]]
[--default-resources [NAME=INT ...]]
[--preemption-default PREEMPTION_DEFAULT]
[--preemptible-rules PREEMPTIBLE_RULES [PREEMPTIBLE_RULES ...]]
Suppose we have downloaded some sequencing data:
mkdir -p data/raw
touch data/raw/00{1..5}_R{1,2}.fq
Five paired-end samples in total:
$ tree
.
└── data
└── raw
├── 001_R1.fq
├── 001_R2.fq
├── 002_R1.fq
├── 002_R2.fq
├── 003_R1.fq
├── 003_R2.fq
├── 004_R1.fq
├── 004_R2.fq
├── 005_R1.fq
└── 005_R2.fq
2 directories, 10 files
Next, let's write our first Snakefile:
rule trim_awesome:
    input:
        "data/raw/001_R1.fq",
        "data/raw/001_R2.fq"
    output:
        "results/awesome/001_R1.fq",
        "results/awesome/001_R2.fq"
    shell:
        "TrimmoMcAwesome {input} {output}"
Suppose we have a program called TrimmoMcAwesome that filters sequencing reads. We can write a Snakemake rule around it, trim_awesome: the rule takes the reads in raw as input, runs them through TrimmoMcAwesome, and writes the results to awesome.
With the Snakefile in place, we can run snakemake in the same directory:
$ snakemake -np
Building DAG of jobs...
Job stats:
job count min threads max threads
------------ ------- ------------- -------------
trim_awesome 1 1 1
total 1 1 1
[Tue May 24 06:55:56 2022]
rule trim_awesome:
input: data/raw/001_R1.fq, data/raw/001_R2.fq
output: results/awesome/001_R1.fq, results/awesome/001_R2.fq
jobid: 0
resources: tmpdir=/tmp
TrimmoMcAwesome data/raw/001_R1.fq data/raw/001_R2.fq results/awesome/001_R1.fq results/awesome/001_R2.fq
Job stats:
job count min threads max threads
------------ ------- ------------- -------------
trim_awesome 1 1 1
total 1 1 1
This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.
The -n flag makes this a dry run; -p prints the shell commands to the terminal.
Let's walk through the output Snakemake printed above.
The first block shows how many jobs will run:
Building DAG of jobs...
Job stats:
job count min threads max threads
------------ ------- ------------- -------------
trim_awesome 1 1 1
total 1 1 1
The second block shows the concrete values bound to the rule's input and output:
rule trim_awesome:
input: data/raw/001_R1.fq, data/raw/001_R2.fq
output: results/awesome/001_R1.fq, results/awesome/001_R2.fq
jobid: 0
resources: tmpdir=/tmp
The third block shows the actual command, followed by a summary of the job stats:
TrimmoMcAwesome data/raw/001_R1.fq data/raw/001_R2.fq results/awesome/001_R1.fq results/awesome/001_R2.fq
Job stats:
job count min threads max threads
------------ ------- ------------- -------------
trim_awesome 1 1 1
total 1 1 1
Because the rule in our Snakefile declares concrete output files, snakemake can run without arguments; otherwise we would have to name the target files explicitly when invoking it:
snakemake -np results/awesome/001_R1.fq results/awesome/001_R2.fq
What happens if, with the rule already defined, we request different output files on the command line?
$ snakemake -np results/awesome/002_R1.fq results/awesome/002_R2.fq
Building DAG of jobs...
MissingRuleException:
No rule to produce results/awesome/002_R2.fq (if you use input functions make sure that they don't raise unexpected exceptions).
Does that mean every pair of reads needs its own dedicated rule, the same block copied over and over with only the file names swapped?
Worse, such rules are painful to maintain. Change one thing and everything else must follow: if TrimmoMcAwesome needs different parameters, do we really edit every single copy?
rule trim_awesome_001:
    ...
rule trim_awesome_002:
    ...
A great man once said, "Regex is my light; wildcards are my sun."
With wildcards, we can rewrite the rule as follows:
rule trim_awesome:
    input:
        "data/raw/{sample}_R1.fq",
        "data/raw/{sample}_R2.fq"
    output:
        "results/awesome/{sample}_R1.fq",
        "results/awesome/{sample}_R2.fq"
    shell:
        "TrimmoMcAwesome {input} {output}"
Now if we run snakemake directly again:
$ snakemake -np
Building DAG of jobs...
WorkflowError:
Target rules may not contain wildcards. Please specify concrete files or a rule without wildcards at the command line, or have a rule without wildcards at the very top of your workflow (e.g. the typical "rule all" which just collects all results you want to generate in the end).
The wildcard encodes the naming pattern of our input and output files, but Snakemake cannot know which concrete files we want. So we have to name the output files explicitly:
snakemake -np results/awesome/002_R1.fq results/awesome/002_R2.fq
It works!
Snakemake matched the files we requested against the wildcard positions in the rule, resolving sample to 002 and substituting it into the inputs.
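The matching Snakemake performs here can be sketched in plain Python. This is a simplified illustration, not Snakemake's actual implementation: turn the output pattern into a regex with a named group per wildcard, extract the wildcard value from the requested target, then substitute it back into the input pattern.

```python
import re


def pattern_to_regex(pattern):
    """Turn 'results/awesome/{sample}_R1.fq' into a regex with one
    named capture group per {wildcard}."""
    parts = re.split(r"(\{\w+\})", pattern)
    pieces = []
    for part in parts:
        m = re.fullmatch(r"\{(\w+)\}", part)
        # wildcard -> greedy named group; literal text -> escaped
        pieces.append(f"(?P<{m.group(1)}>.+)" if m else re.escape(part))
    return "".join(pieces)


def match_wildcards(pattern, target):
    """Return the wildcard values if target fits the pattern, else None."""
    m = re.fullmatch(pattern_to_regex(pattern), target)
    return m.groupdict() if m else None


def fill(pattern, wildcards):
    """Substitute wildcard values back into a pattern."""
    return pattern.format(**wildcards)


wc = match_wildcards("results/awesome/{sample}_R1.fq",
                     "results/awesome/002_R1.fq")
print(wc)                                   # {'sample': '002'}
print(fill("data/raw/{sample}_R1.fq", wc))  # data/raw/002_R1.fq
```

A target that fits no rule's output pattern yields no match anywhere, which is exactly the MissingRuleException we saw above.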
Likewise, we can use shell brace expansion on the command line to request several samples at once:
$ snakemake -np results/awesome/00{1..3}_R{1,2}.fq
Building DAG of jobs...
Job stats:
job count min threads max threads
------------ ------- ------------- -------------
trim_awesome 3 1 1
total 3 1 1
[Tue May 24 07:28:57 2022]
rule trim_awesome:
input: data/raw/002_R1.fq, data/raw/002_R2.fq
output: results/awesome/002_R1.fq, results/awesome/002_R2.fq
jobid: 2
wildcards: sample=002
resources: tmpdir=/tmp
TrimmoMcAwesome data/raw/002_R1.fq data/raw/002_R2.fq results/awesome/002_R1.fq results/awesome/002_R2.fq
[Tue May 24 07:28:57 2022]
rule trim_awesome:
input: data/raw/003_R1.fq, data/raw/003_R2.fq
output: results/awesome/003_R1.fq, results/awesome/003_R2.fq
jobid: 0
wildcards: sample=003
resources: tmpdir=/tmp
TrimmoMcAwesome data/raw/003_R1.fq data/raw/003_R2.fq results/awesome/003_R1.fq results/awesome/003_R2.fq
[Tue May 24 07:28:57 2022]
rule trim_awesome:
input: data/raw/001_R1.fq, data/raw/001_R2.fq
output: results/awesome/001_R1.fq, results/awesome/001_R2.fq
jobid: 1
wildcards: sample=001
resources: tmpdir=/tmp
TrimmoMcAwesome data/raw/001_R1.fq data/raw/001_R2.fq results/awesome/001_R1.fq results/awesome/001_R2.fq
Job stats:
job count min threads max threads
------------ ------- ------------- -------------
trim_awesome 3 1 1
total 3 1 1
This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.
A workflow with a single rule is not much different from an ordinary script; a real workflow naturally has more than one step.
Suppose that besides TrimmoMcAwesome we also have a program called TrimStupendous.
We simply add a rule for it:
rule trim_awesome:
    input:
        "data/raw/{sample}_R1.fq",
        "data/raw/{sample}_R2.fq"
    output:
        "results/awesome/{sample}_R1.fq",
        "results/awesome/{sample}_R2.fq"
    shell:
        "TrimmoMcAwesome {input} {output}"

rule trim_stupendous:
    input:
        "data/raw/{sample}_R1.fq",
        "data/raw/{sample}_R2.fq"
    output:
        "results/stupendous/{sample}_R1.fq",
        "results/stupendous/{sample}_R2.fq"
    shell:
        "TrimStupendous {input} {output}"
Now when calling snakemake, we need to request the outputs of both rules:
$ snakemake -np results/awesome/00{1..2}_R{1,2}.fq results/stupendous/00{1..2}_R{1,2}.fq
Building DAG of jobs...
Job stats:
job count min threads max threads
--------------- ------- ------------- -------------
trim_awesome 2 1 1
trim_stupendous 2 1 1
total 4 1 1
[Tue May 24 07:35:47 2022]
rule trim_stupendous:
input: data/raw/001_R1.fq, data/raw/001_R2.fq
output: results/stupendous/001_R1.fq, results/stupendous/001_R2.fq
jobid: 2
wildcards: sample=001
resources: tmpdir=/tmp
TrimStupendous data/raw/001_R1.fq data/raw/001_R2.fq results/stupendous/001_R1.fq results/stupendous/001_R2.fq
[Tue May 24 07:35:47 2022]
rule trim_stupendous:
input: data/raw/002_R1.fq, data/raw/002_R2.fq
output: results/stupendous/002_R1.fq, results/stupendous/002_R2.fq
jobid: 3
wildcards: sample=002
resources: tmpdir=/tmp
TrimStupendous data/raw/002_R1.fq data/raw/002_R2.fq results/stupendous/002_R1.fq results/stupendous/002_R2.fq
[Tue May 24 07:35:47 2022]
rule trim_awesome:
input: data/raw/001_R1.fq, data/raw/001_R2.fq
output: results/awesome/001_R1.fq, results/awesome/001_R2.fq
jobid: 0
wildcards: sample=001
resources: tmpdir=/tmp
TrimmoMcAwesome data/raw/001_R1.fq data/raw/001_R2.fq results/awesome/001_R1.fq results/awesome/001_R2.fq
[Tue May 24 07:35:47 2022]
rule trim_awesome:
input: data/raw/002_R1.fq, data/raw/002_R2.fq
output: results/awesome/002_R1.fq, results/awesome/002_R2.fq
jobid: 1
wildcards: sample=002
resources: tmpdir=/tmp
TrimmoMcAwesome data/raw/002_R1.fq data/raw/002_R2.fq results/awesome/002_R1.fq results/awesome/002_R2.fq
Job stats:
job count min threads max threads
--------------- ------- ------------- -------------
trim_awesome 2 1 1
trim_stupendous 2 1 1
total 4 1 1
This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.
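With several rules defined, Snakemake picks, for each requested target, the rule whose output pattern matches it, and collapses targets that share a rule and wildcard value into one job; that is how the job stats arrive at two trim_awesome and two trim_stupendous jobs. A toy sketch of that dispatch (simplified, not Snakemake's real scheduler):

```python
import re
from collections import Counter

# Output patterns for the two rules above; a single job per (rule, sample)
# produces both the R1 and R2 files.
RULES = {
    "trim_awesome": r"results/awesome/(?P<sample>.+)_R[12]\.fq",
    "trim_stupendous": r"results/stupendous/(?P<sample>.+)_R[12]\.fq",
}


def plan(targets):
    """Map each target file to the rule that can produce it, then
    deduplicate into one job per (rule, sample) pair."""
    jobs = set()
    for target in targets:
        for rule, regex in RULES.items():
            m = re.fullmatch(regex, target)
            if m:
                jobs.add((rule, m.group("sample")))
                break
        else:
            raise ValueError(f"No rule to produce {target}")
    return Counter(rule for rule, _ in jobs)


targets = [f"results/{d}/{s}_R{r}.fq"
           for d in ("awesome", "stupendous")
           for s in ("001", "002")
           for r in (1, 2)]
print(plan(targets))  # 2 jobs per rule, matching the Job stats above
```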
Normally, Snakemake sends each file independently from input through the rules to output, unless the rule's input explicitly lists multiple files, for example as named inputs like fq1 and fq2.
So what about something like bcftools joint calling, which takes a whole set of BAM files at once? Surely we don't type every input by hand?
Snakemake thoughtfully provides the expand function:
rule bcftools_call:
    input:
        fa="data/genome.fa",
        bam=expand("sorted_reads/{sample}.bam", sample=SAMPLES),
        bai=expand("sorted_reads/{sample}.bam.bai", sample=SAMPLES)
    output:
        "calls/all.vcf"
    shell:
        "bcftools mpileup -f {input.fa} {input.bam} | "
        "bcftools call -mv - > {output}"
All we need is a SAMPLES list.
Say our samples are the letters A through Z:
import string
SAMPLES = list(string.ascii_uppercase)  # avoid shadowing the built-in str
>>> SAMPLES
['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z']
With ordinary Python we can build the input list programmatically instead of typing every entry by hand.
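Under the hood, expand essentially formats the pattern once per wildcard value, taking the product when several wildcard lists are given. A minimal imitation (my_expand is a hypothetical name; this is not Snakemake's actual code):

```python
from itertools import product


def my_expand(pattern, **wildcards):
    """Simplified imitation of Snakemake's expand():
    format the pattern with every combination of the wildcard lists."""
    keys = list(wildcards)
    return [pattern.format(**dict(zip(keys, combo)))
            for combo in product(*(wildcards[k] for k in keys))]


SAMPLES = ["A", "B"]
print(my_expand("sorted_reads/{sample}.bam", sample=SAMPLES))
# ['sorted_reads/A.bam', 'sorted_reads/B.bam']
```

With two wildcard lists, e.g. sample and read, every combination is generated, just as the real expand does.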