Snakemake: 搭建生信工作流程 (2)

Published at 2021-05-14 22:49

Author：zhixy

书接上回，Snakemake: 搭建生信工作流程 (1) 详细介绍了snakemake的基本用法。

下面介绍一些高级功能。

为 rule 指定运行线程数

rule bwa_map:
    input:
        "data/genome.fa",
        "data/samples/{sample}.fastq"
    output:
        "mapped_reads/{sample}.bam"
    threads: 8
    shell:
        "bwa mem -t {threads} {input} | samtools view -Sb - > {output}"

threads指令，规定 rule bwa_map 将使用 8 个线程运行 shell 所定义的命令 (通过{threads}传递给bwa mem 的参数 -t)。

这里需要说明的是，当运行整个工作流时，--cores 参数定义了整个工作流要使用的线程数，假设--cores 10，那么 rule bwa_map 每次运行时用掉 8 个线程，剩下的 2 个线程 snakemake 会分配给其它不超过 2 个线程 rules。

分配计算资源

除了 threads，snakemake还通过 resources 指令提供了对内存mem_mb 和硬盘disk_mb的资源分配。

rule:
    input:     ...
    output:    ...
    resources:
        mem_mb=100
    shell:
        "..."

对资源的限定，可直接定义具体的数值 (如上所示)，也可通过attempt参数，来根据任务被重运行的次数而逐步提高。运行次数在执行 snakemake 时通过--restart-times参数指定。

rule:
    input:    ...
    output:   ...
    resources:
        mem_mb=lambda wildcards, attempt: attempt * 100
    shell:
        "..."

上例中，第一次运行使用内存上限为100MB，不成功而重新运行第二次时，则会调整到200MB，直至成功，或超出运行次数上限。

巧用配置文件

到目前为止，我们通过在 Snakefile 中提供 Python 列表来指定了要使用哪些样本。但是，通常我们希望工作流是可定制的，以便它可以轻松地适应新数据。为此，snakemake 提供了一种配置文件机制 config file mechanism。

配置文件可以是 JSON 或 YAML 格式, 并用 configfile 指令定义。

先编辑一个新的文件，名为config.yaml，写入以下内容：

samples:
    A: data/samples/A.fastq
    B: data/samples/B.fastq
    C: data/samples/C.fastq

然后在 varients_calling.snakefile 的文件首写入：

configfile: "config.yaml"

snakemake 将加载配置文件并将其内容存储到一个名为config的全局可用字典中。

然后可将 rule bcftools_call 改为以下形式：

rule bcftools_call:
    input:
        fa="data/genome.fa",
        bam=expand("sorted_reads/{sample}.sorted.bam", sample=config["samples"]),
        bai=expand("sorted_reads/{sample}.sorted.bam.bai", sample=config["samples"])
    output:
        "calls/all.vcf"
    shell:
        "bcftools mpileup -f {input.fa} {input.bam} | "
        "bcftools call -mv - > {output}"

通过配置文件config.yaml，可将分析数据对象从 Snakefile 中独立出来，当下次针对不同的数据进行分析时，只需要更改config.yaml即可。

为 rule 添加静态非文件型参数

有些时候，shell命令不仅仅是由input和output中的文件组成，还需要一些静态的参数设置。如果把这些参数放在input里，则会因为找不到文件而出错，所以需要专门的params指令来设置这些参数。

例如在序列比对时，要给每个比对bam结果，添加不同的 read group的ID，可以使用bwa mem的-R参数，例如：

bwa mem -R "@RG\tID:A\tSM:A" 为来自 A 样本 (A.fastq) 的reads添加 read group ID。

那么为来自每个样本序列文件的reads添加不同的read group ID，则可以通过params实现：

rule bwa_map:
    input:
        "data/genome.fa",
        "data/samples/{sample}.fastq"
    output:
        "mapped_reads/{sample}.bam"
    params:
        rg=r"@RG\tID:{sample}\tSM:{sample}"
    shell:
        "bwa mem -R '{params.rg}' {input} | samtools view -Sb - > {output}"

为防止转义，rg取值的字符串前，需要加 'r'。另外，shell 命令中{params.rg}通配符也需要'包裹。

生成日志文件

在执行大型工作流时，通常需要将每个作业 job 的日志记录输出存储到一个单独的文件中，而不是仅仅将所有日志记录输出打印到屏幕终端。而且当多个作业并行运行时，这会导致混乱的输出。为此，snakemake 允许为 rule 指定日志文件。

日志文件通过 log 指令来定义，且处理方式与 output 指令类似，但它们不受 rule 匹配 (输入输出流) 的影响，并且在作业失败时也不会被清理。

将 rule bwa_map修改如下：

rule bwa_map:
    input:
        "data/genome.fa",
        "data/samples/{sample}.fastq"
    output:
        "mapped_reads/{sample}.bam"
    log:
        "logs/bwa_mem/{sample}.log"
    shell:
        "bwa mem {input} | samtools view -Sb - > {output}"

日志文件必须包含与输出文件完全相同的通配符 (上例中即{sample})，以避免同一 rule 的不同 job 之间发生文件名冲突。

临时与保护文件

在示例工作流中，有两个 rules 会产生 bam 文件，即 bwa_map 和 samtools_sort。由于生成的 bam 文件需要大量的磁盘空间。要节省磁盘空间，可以将输出文件标记为临时文件。一旦执行所有消耗作业（需要作为输入），snakemake 将删除标记的文件。

rule bwa_map:
    input:
        "data/genome.fa",
        "data/samples/{sample}.fastq"
    output:
        temp("mapped_reads/{sample}.bam")
    threads: 8
    shell:
        "bwa mem -t {threads} {input} | samtools view -Sb - > {output}"

与 rule bwa_map 衔接的是 rule samtools_sort，所以当 samtools_sort 执行完成时 bwa_map 产生的 bam 文件会被删除。

然后，由于通过序列比对和排序而得到的 bam 文件的计算成本很高，因此保护最终的 bam 文件不受意外删除或修改是合理的。所以将 rule samtools_sort 修改如下：

rule samtools_sort:
    input:
        "mapped_reads/{sample}.bam"
    output:
        protected("sorted_reads/{sample}.bam")
    shell:
        "samtools sort -f {input} {output}"

作业成功执行后，snakemake 将写保护文件系统中的输出文件，以免意外覆盖或删除。

运行Python代码块

要在snakemake执行过程执行inline Python代码块，需借助run指令。

rule report:
    input:
        "calls/all.vcf"
    output:
        "report.html"
    run:
        from snakemake.utils import report

        with open(input[0]) as vcf:
            n_calls = sum(1 for l in vcf if not l.startswith("#"))

        report("""
        An example variant calling workflow
        ===================================

        Reads were mapped to the Yeast
        reference genome and variants were called jointly with
        SAMtools/BCFtools.

        This resulted in {n_calls} variants (see Table T1_).
        """, output[0], T1=input[0])

在run内，可以像正常写Python代码一样，把需要运行的Python语法一行行的写入。

运行外部脚本

一个 rule 除了shell 和inline Python代码块，还可以通过script指令调用外部脚本。例如：

rule R_ploting:
    input:
        "path/to/inputfile",
        "path/to/other/inputfile"
    output:
        "path/to/outputfile",
        "path/to/another/outputfile"
    script:
        "scripts/script.R"

当运行snakemake时，程序会在当前目录中寻找子目录scripts，并调用相应的脚本程序。

为 rule 定制conda环境

当运行一个Snakemake工作流时，需要一个conda环境，例如用于演示学习的snakemake-tutorial环境，被称为全局环境。

但仅仅依赖于全局环境或许还不够，对于不同的规则 rules 可能还有 Python2 和 Python3 的区别，又可能不同 rules 调用的第三方程序在同一环境中有冲突。所以还得为某些规则创建单独的环境。

snakemake提供了conda指令，用于定义 rule 内的环境。执行Snakefile时，用参数--use-conda来解析 rule 中的 conda 规则。

rule bwa_map:
    input:
        "data/genome.fa",
        "data/samples/{sample}.fastq"
    output:
        "mapped_reads/{sample}.bam"
    conda:
        "envs/map.yaml"
    shell:
        "bwa mem {input} | samtools view -Sb - > {output}"

随后在 snakemake 执行的目录下创建 envs 文件夹，增加 map.yaml 文件，内容如下:

name: map
channels:
  - bioconda
  - conda-forge
dependencies:
  - bwa=0.7.17
  - samtools=1.9
show_channel_urls: true

当用snakmake --use-conda执行时，会在.snakemake/conda下创建专门的 conda 环境用于处理当前 rule。对于当前项目，该conda环境创建之后就会一直用于该规则，除非 yaml 文件发生改变。如果希望在实际运行项目之前先创建好环境，那么可以使用--create-envs-only参数。

特定conda环境的yaml文件，可用conda env export -n conda_env_name -f conda_env_name.yaml生成。

Columns

Python ICNP Others R Linux Phylogenetics Phylogenomics Genomics Reference Evolution Bioinformatics Protocols Metagenomics Statistics MinIO—分布式对象存储服务器

A Lab of Microbial Systematics and Evolution