Snakemake executor plugin: pcluster-slurm

Author: John Major

SLURM is a widely used batch system for high-performance compute clusters.

AWS Parallel Cluster is a framework to deploy and manage dynamically scalable HPC clusters on AWS, running SLURM as the batch system. pcluster handles creating, configuring, and deleting the cluster compute nodes. Nodes may be spot or dedicated. Note that the AWS Parallel Cluster port of slurm has a few small but critical differences from the standard slurm distribution. Importantly, sacct is not enabled, so anything that relies on this command will break.

This executor plugin lets snakemake use slurm on AWS Parallel Cluster as an executor, running from a pcluster head node. For example usage of this executor, see: daylily.

Installation

Install this plugin with pip or mamba, e.g.:

pip install snakemake-executor-plugin-pcluster-slurm

Usage

In order to use the plugin, run Snakemake (>=8.0) with the corresponding value for the executor flag:

snakemake --executor pcluster-slurm ...

with ... being any additional arguments you want to use.

The executor plugin has the following settings:

Settings

| CLI argument | Description | Default | Choices | Required | Type |
|---|---|---|---|---|---|
| --pcluster-slurm-init-seconds-before-status-checks VALUE | Defines the time in seconds before the first status check is performed after job submission. | 40 | | | |
| --pcluster-slurm-requeue VALUE | Allow requeuing of preempted or failed jobs if the cluster does not requeue by default. Results in sbatch ... --requeue ... Has no effect if not set. | False | | | |

Further details

The Executor Plugin for HPC Clusters using the SLURM Batch System

The general Idea

Once you have created a valid AWS Parallel Cluster that uses slurm as its scheduler, using this plugin takes three steps: (a) log in to your cluster's head node, (b) activate your snakemake environment as usual (snakemake-executor-plugin-pcluster-slurm should be pip-installed in it), and (c) run a snakemake workflow with --executor pcluster-slurm. Snakemake will then submit your jobs as cluster jobs, and pcluster slurm will manage creating compute nodes, bidding on them, managing budget constraints, spinning them down, etc. It is magic, really.

A Few Important pcluster slurm Peculiarities

  • slurm for AWS Parallel Cluster does not ask for an account.

  • --wrap behaves oddly, so I avoid it and pass the job script as a heredoc instead: <<EOF\n#!/bin/bash\n<submitstring>\nEOF.

  • --partition is required. Set it with slurm_partition=<your SLURM partition>, or give several as a comma-separated list: slurm_partition=<your SLURM partition1>,<your SLURM partition2>,<your SLURM partition3>.

  • The --comment flag is set to RandD unless a string is found in the environment variable SMK_SLURM_COMMENT. If project-level tagging of resources is enabled in your pcluster cluster, this comment must match one of the per-user allowed projects specified when the ephemeral cluster was created. For more on project tagging, budget tracking, and blocking on exceeded budgets, see HERE.
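The peculiarities above can be sketched as a small helper that assembles an sbatch call: no account flag, a mandatory partition, a comment that defaults to RandD unless SMK_SLURM_COMMENT is set, and the job script handed over on standard input (the Python counterpart of the <<EOF heredoc) rather than through --wrap. The function name build_sbatch_call and its parameters are hypothetical; this is an illustration under those assumptions, not the plugin's actual code.

```python
import os


def build_sbatch_call(partition, submit_string, env=None):
    """Return (argv, script) for an sbatch call that avoids --wrap.

    Hypothetical helper for illustration only; not taken from the plugin.
    """
    env = os.environ if env is None else env
    if not partition:
        raise ValueError("a partition is required on pcluster slurm")
    # Fall back to RandD when SMK_SLURM_COMMENT is unset or empty.
    comment = env.get("SMK_SLURM_COMMENT") or "RandD"
    # No --account flag: slurm on AWS Parallel Cluster does not ask for one.
    argv = ["sbatch", f"--partition={partition}", f"--comment={comment}"]
    # sbatch reads the batch script from stdin when no script file is named.
    script = f"#!/bin/bash\n{submit_string}\n"
    return argv, script


if __name__ == "__main__":
    argv, script = build_sbatch_call("<your SLURM partition>", "echo hello", env={})
    print(argv)
    print(script, end="")
```

On a head node, the pair could then be passed to subprocess.run(argv, input=script, text=True), which feeds the script to sbatch on stdin.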

Inherited Docs, Not Verified for pcluster slurm

W A R N I N G

I do find it maddeningly confusing how to set sbatch flags globally so that they apply to all rules… I need to try again to figure this out.

To specify them at the command line, define them as default resources:

$ snakemake --executor pcluster-slurm --default-resources slurm_partition=<your SLURM partition>

If individual rules require e.g. a different partition, you can override the default per rule:

$ snakemake --executor pcluster-slurm --default-resources slurm_partition=<your SLURM partition> --set-resources <somerule>:slurm_partition=<some other partition>

Usually, it is advisable to persist such settings via a configuration profile, which can be provided system-wide, per user, and in addition per workflow.
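As an illustration of how default resources and per-rule overrides combine, the following minimal sketch merges the two the way the commands above describe: every rule starts from the defaults, and a rule listed under the per-rule settings replaces only the keys it names. The function effective_resources and the partition and rule names are made up for this example, and the sketch mirrors the described behaviour rather than Snakemake's internal implementation.

```python
def effective_resources(rule, default_resources, set_resources):
    """Return the resources one rule ends up with (illustrative only)."""
    merged = dict(default_resources)             # start from --default-resources
    merged.update(set_resources.get(rule, {}))   # apply --set-resources for this rule
    return merged


if __name__ == "__main__":
    defaults = {"slurm_partition": "general"}              # hypothetical partition
    overrides = {"somerule": {"slurm_partition": "big"}}   # hypothetical rule and partition
    print(effective_resources("somerule", defaults, overrides))
    print(effective_resources("otherrule", defaults, overrides))
```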

By default, the executor waits 40 seconds before its first check of the job status. This behaviour can be altered with --pcluster-slurm-init-seconds-before-status-checks=<time in seconds>.

Ordinary SMP jobs

Most jobs will be carried out by programs that are either single-core scripts or threaded programs, hence SMP (shared memory programs) in nature. Any given threads and mem_mb requirements will be passed to SLURM:

rule a:
    input: ...
    output: ...
    threads: 8
    resources:
        mem_mb=14000

This will give jobs from this rule 14GB of memory and 8 CPU cores. Using the SLURM executor plugin, we can alternatively define:

rule a:
    input: ...
    output: ...
    resources:
        cpus_per_task=8,
        mem_mb=14000

instead of the threads parameter. Parameters in the resources section will take precedence.

MPI jobs

Snakemake's SLURM backend also supports MPI jobs; see the MPI section of the Snakemake documentation (snakefiles-mpi) for details. When using MPI with SLURM, it is advisable to use srun as the MPI starter.

rule calc_pi:
  output:
      "pi.calc",
  log:
      "logs/calc_pi.log",
  resources:
      tasks=10,
      mpi="srun",
  shell:
      "{resources.mpi} -n {resources.tasks} calc-pi-mpi > {output} 2> {log}"

Note that the -n {resources.tasks} is not necessary in the case of SLURM, but it should be kept in order to allow execution of the workflow on other systems, e.g. by replacing srun with mpiexec:

$ snakemake --set-resources calc_pi:mpi="mpiexec" ...
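To see why keeping -n {resources.tasks} makes the rule portable, the shell template from the calc_pi rule can be rendered outside Snakemake with plain string formatting. This is only a sketch: the render helper and the use of SimpleNamespace as a stand-in for Snakemake's resources object are assumptions made for illustration.

```python
from types import SimpleNamespace

# The shell template from the calc_pi rule, unchanged.
SHELL_TEMPLATE = "{resources.mpi} -n {resources.tasks} calc-pi-mpi > {output} 2> {log}"


def render(mpi, tasks=10):
    """Fill the template the way the rule's resources would (illustrative)."""
    resources = SimpleNamespace(mpi=mpi, tasks=tasks)
    return SHELL_TEMPLATE.format(
        resources=resources, output="pi.calc", log="logs/calc_pi.log"
    )


if __name__ == "__main__":
    print(render("srun"))     # launcher used under SLURM
    print(render("mpiexec"))  # launcher after the --set-resources override
```

Only the launcher changes between the two commands; the rank count travels with the rule, which is what mpiexec needs on systems that do not infer it from the scheduler.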

To submit “ordinary” MPI jobs, submitting with tasks (the MPI ranks) is sufficient. Alternatively, on some clusters, it might be convenient to just configure nodes. Consider using a combination of tasks and cpus_per_task for hybrid applications (those that use ranks (multiprocessing) and threads). A detailed topology layout can be achieved using the slurm_extra parameter (see below) using further flags like --distribution.

Running Jobs locally

Not all Snakemake workflows are adapted for heterogeneous environments, particularly clusters. Users might want to avoid the submission of all rules as cluster jobs. Non-cluster jobs should usually include short jobs, e.g. internet downloads or plotting rules.

To label a rule as a non-cluster rule, use the localrules directive. Place it on top of a Snakefile as a comma-separated list like:

localrules: <rule_a>, <rule_b>

Advanced Resource Specifications

A workflow rule may support several resource specifications. For a SLURM cluster, a mapping between Snakemake and SLURM needs to be performed.

You can use the following specifications:

| SLURM | Snakemake | Description |
|---|---|---|
| --partition | slurm_partition | the partition a rule/job is to use |
| --time | runtime | the walltime per job in minutes |
| --constraint | constraint | may hold features on some clusters |
| --mem | mem, mem_mb | memory a cluster node must provide (mem: string with unit, mem_mb: int) |
| --mem-per-cpu | mem_mb_per_cpu | memory per reserved CPU |
| --ntasks | tasks | number of concurrent tasks / ranks |
| --cpus-per-task | cpus_per_task | number of cpus per task (in case of SMP, rather use threads) |
| --nodes | nodes | number of nodes |
| --clusters | clusters | comma separated string of clusters |

Each of these can be part of a rule, e.g.:

rule:
    input: ...
    output: ...
    resources:
        slurm_partition=<partition name>,
        runtime=<some number>

Please note: as --mem and --mem-per-cpu are mutually exclusive on SLURM clusters, their corresponding resource flags mem/mem_mb and mem_mb_per_cpu are mutually exclusive, too. You can reserve either the memory a compute node has to provide (--mem flag) or the memory required per CPU (--mem-per-cpu flag). Depending on your cluster’s settings, hyperthreads may be enabled; SLURM does not make any distinction between real CPU cores and those provided by hyperthreads. If the nodes parameter is not given, SLURM will try to find nodes that satisfy the combination of mem_mb_per_cpu and cpus_per_task.
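The mapping between Snakemake resources and sbatch flags, including the mutual exclusion of the two memory settings, can be sketched as a small translation function. The name sbatch_flags is hypothetical, and appending an M suffix to the megabyte values is a choice made for this example; the sketch follows the mapping listed above and is not the plugin's own code.

```python
def sbatch_flags(resources):
    """Translate Snakemake-style resources into sbatch flags (illustrative)."""
    has_mem = "mem" in resources or "mem_mb" in resources
    if has_mem and "mem_mb_per_cpu" in resources:
        # --mem and --mem-per-cpu cannot be combined on SLURM.
        raise ValueError("mem/mem_mb and mem_mb_per_cpu are mutually exclusive")
    simple = [
        ("slurm_partition", "--partition"),
        ("runtime", "--time"),  # walltime in minutes
        ("constraint", "--constraint"),
        ("tasks", "--ntasks"),
        ("cpus_per_task", "--cpus-per-task"),
        ("nodes", "--nodes"),
        ("clusters", "--clusters"),
    ]
    flags = [f"{flag}={resources[key]}" for key, flag in simple if key in resources]
    if "mem" in resources:
        flags.append(f"--mem={resources['mem']}")  # string that carries its own unit
    elif "mem_mb" in resources:
        flags.append(f"--mem={resources['mem_mb']}M")
    if "mem_mb_per_cpu" in resources:
        flags.append(f"--mem-per-cpu={resources['mem_mb_per_cpu']}M")
    return flags


if __name__ == "__main__":
    print(sbatch_flags({"slurm_partition": "<partition name>", "runtime": 60, "mem_mb": 14000}))
```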

Note that it is usually advisable to avoid specifying SLURM (and compute infrastructure) specific resources (like constraint) inside your workflow because that can limit the reproducibility when executed on other systems. Consider using the --default-resources and --set-resources flags to specify such resources at the command line or (ideally) within a profile.

A sample configuration file as specified by the --workflow-profile flag might look like this:

default-resources:
    slurm_partition: "<your default partition>"

set-resources:
    <rulename>:
        slurm_partition: "<other partition>" # deviating partition for this rule
        runtime: 60 # 1 hour
        slurm_extra: "'--nice=150'"
        mem_mb_per_cpu: 1800
        cpus_per_task: 40

Multicluster Support

For scheduling reasons, multicluster support is provided by the clusters flag in resources sections. Note that you have to write clusters, not cluster!

Additional Custom Job Configuration

SLURM installations can support custom plugins, which may add support for additional flags to sbatch. In addition, there are various batch options not directly supported via the resource definitions shown above. You may use the slurm_extra resource to specify additional flags to sbatch:

rule myrule:
    input: ...
    output: ...
    resources:
        slurm_extra="'--qos=long --mail-type=ALL --mail-user=<your email>'"

Again, rather use a profile to specify such resources.

Software Recommendations

While Snakemake mainly relies on Conda for reproducible execution, many clusters impose file number limits in their “HOME” directory. In this case, run mamba clean -a occasionally for persisting environments.

Note, snakemake --sdm conda ... works as intended.

To ensure that this plugin is working, install it in your base environment for the desired workflow.

HPC clusters provide so-called environment modules. Some clusters do not allow using Conda (and its derivatives). In this case, or when a particular software is not provided by a Conda channel, Snakemake can be instructed to use environment modules. The --sdm env-modules flag will trigger loading modules defined for a specific rule, e.g.:

rule ...:
   ...
   envmodules:
       "bio/VinaLC"

This will, internally, trigger a module load bio/VinaLC immediately prior to execution.

Note that

  • environment modules are best specified in a configuration file.

  • Using environment modules can be combined with conda and apptainer (--sdm env-modules conda apptainer), which will then be only used as a fallback for rules not defining environment modules.

Inquiring about Job Information and Adjusting the Rate Limiter

The executor plugin for SLURM uses unique job names to inquire about job status. This ensures that status queries for the jobs of a workflow do not put too much strain on the batch system’s database. Human-readable information is stored in the comment of each job; it is a combination of the rule name and wildcards. You can ask for it with the sacct or squeue commands (on AWS Parallel Cluster, where sacct is not enabled, use squeue), e.g.:

sacct -o JobID,State,Comment%40

Note, the “%40” after Comment ensures a width of 40 characters. This setting may be changed at will. If the width is too small, SLURM will abbreviate the column with a + sign.

For running jobs, you can use the squeue command:

squeue -u $USER -o %i,%P,%.10j,%.40k

Here, the .<number> settings for the ID and the comment ensure a sufficient width, too.
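If you want to post-process that listing, the comma-separated lines can be split into fields with a few lines of Python. This is a sketch under stated assumptions: the helper parse_squeue is hypothetical, the header line is assumed to be suppressed with squeue's -h (--noheader) option, and the sample line in the demo is invented to match the shape of the format string, not captured from a real cluster.

```python
def parse_squeue(text):
    """Parse lines produced by: squeue -h -u $USER -o %i,%P,%.10j,%.40k"""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split at most three times: the comment (rule name plus wildcards)
        # is the last field and may itself contain commas.
        job_id, partition, name, comment = line.split(",", 3)
        rows.append(
            {
                "job_id": job_id.strip(),
                "partition": partition.strip(),
                "name": name.strip(),        # padded to the requested width
                "comment": comment.strip(),  # padded to the requested width
            }
        )
    return rows


if __name__ == "__main__":
    sample = "1234,compute,   align_a,   rule_align_sample=A,lane=1\n"  # invented line
    for row in parse_squeue(sample):
        print(row)
```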

Snakemake will check the status of your jobs 40 seconds after submission. Another attempt will be made in 10 seconds, then 20, etcetera with an upper limit of 180 seconds.
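The timing described above can be pictured as a sequence of waiting times. The sketch below assumes that the wait after the first check grows by 10 seconds each time (10, 20, 30, ...) until it reaches the 180 second cap; the text would also fit other growth patterns, so treat the step size as an assumption. The generator status_check_delays is hypothetical and not part of the plugin.

```python
from itertools import islice


def status_check_delays(initial=40, step=10, cap=180):
    """Yield the waiting time in seconds before each status check.

    Assumes linear growth by `step` after the first check (an assumption).
    """
    yield initial  # wait before the first check after submission
    wait = step
    while True:
        yield min(wait, cap)  # never wait longer than the cap
        wait += step


if __name__ == "__main__":
    print(list(islice(status_check_delays(), 6)))
```

The initial value corresponds to the --pcluster-slurm-init-seconds-before-status-checks setting listed under Settings.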

Using Profiles

When using profiles, a command line may become shorter. A sample profile could look like this:

executor: pcluster-slurm
latency-wait: 60
default-storage-provider: fs
shared-fs-usage:
  - persistence
  - software-deployment
  - sources
  - source-cache
remote-job-local-storage-prefix: "<your node local storage prefix>"
local-storage-prefix: "<your local storage prefix, e.g. on login nodes>"

This configuration sets the executor, allows for file system latency, and enables automatic stage-in of files using the file system storage plugin.

On a cluster with a scratch directory per job id, the prefix within jobs might be:

remote-job-local-storage-prefix: "<scratch>/$SLURM_JOB_ID"

On a cluster with a scratch directory per user, the prefix within jobs might be:

remote-job-local-storage-prefix: "<scratch>/$USER"

Note that the path <scratch> needs to be taken from your cluster's documentation.

Further note that you need to set the SNAKEMAKE_PROFILE environment variable in your ~/.bashrc file, e.g.:
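To preview what such a prefix resolves to inside a job, the variable references can be expanded against a chosen environment. This is a minimal sketch using Python's string.Template; the helper expand_prefix and the /scratch path are assumptions for illustration, and the sketch makes no claim about where in Snakemake the expansion actually happens.

```python
from string import Template


def expand_prefix(prefix, env):
    """Expand $VAR references in a storage prefix (illustrative only).

    Unknown variables are left untouched instead of raising an error.
    """
    return Template(prefix).safe_substitute(env)


if __name__ == "__main__":
    # /scratch stands in for the cluster-specific <scratch> path.
    print(expand_prefix("/scratch/$SLURM_JOB_ID", {"SLURM_JOB_ID": "4242"}))
    print(expand_prefix("/scratch/$USER", {"USER": "alice"}))
```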

export SNAKEMAKE_PROFILE="$HOME/.config/snakemake"

This is ongoing development. Eventually you will be able to annotate different file access patterns.

Retries - Or Trying again when a Job failed

Some cluster jobs may fail. In this case Snakemake can be instructed to try another submit before the entire workflow fails, in this example up to 3 times:

snakemake --retries=3

If a workflow fails entirely (e.g. when there are cluster failures), it can be resumed as any other Snakemake workflow:

snakemake --rerun-incomplete

To prevent failures due to faulty parameterization, we can dynamically adjust the runtime behaviour:

Dynamic Parameterization

Using dynamic parameterization we can react to different inputs and prevent our HPC jobs from failing.

Input size of files may vary. If we have an estimate for the RAM requirement due to varying input file sizes, we can use this to dynamically adjust our jobs.

Runtime adjustments can be made in a Snakefile:

def get_time(wildcards, attempt):
    return f"{1 * attempt}h"

rule foo:
    input: ...
    output: ...
    resources:
        runtime=get_time
    ...

or in a workflow profile

set-resources:
    foo:
        runtime: f"{1 * attempt}h"

Be sure to use sensible settings for your cluster and make use of parallel execution (e.g. threads) and global profiles to avoid I/O contention.
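The same pattern works for memory when the RAM requirement can be estimated from the input size. The following is a minimal sketch: the factor of 2.0 and the floor of 1000 MB are placeholder values rather than recommendations, and the helper scaled_mem_mb is hypothetical. The wrapper get_mem_mb uses the argument names Snakemake passes to resource callables, with input.size_mb giving the total input size in MB.

```python
def scaled_mem_mb(input_size_mb, attempt, factor=2.0, floor_mb=1000):
    """Estimate mem_mb from the input size and scale it with the attempt number.

    factor and floor_mb are illustrative placeholders; tune them per tool.
    """
    estimate = max(floor_mb, int(input_size_mb * factor))
    return estimate * attempt  # attempt is 1 on the first try, 2 on the first retry, ...


def get_mem_mb(wildcards, input, attempt):
    # In a Snakefile, reference this function as: resources: mem_mb=get_mem_mb
    return scaled_mem_mb(input.size_mb, attempt)


if __name__ == "__main__":
    for attempt in (1, 2, 3):
        print(attempt, scaled_mem_mb(input_size_mb=3500, attempt=attempt))
```

Combined with --retries, a job that ran out of memory is resubmitted with a larger request instead of failing the workflow on the first attempt.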

Nesting Jobs (or Running this Plugin within a Job)

Some environments provide a shell within a SLURM job, for instance, IDEs started in on-demand context. If Snakemake attempts to use this plugin to spawn jobs on the cluster, this may work just as intended. Or it might not: depending on cluster settings or individual settings, submitted jobs may be ill-parameterized or will not find the right environment.

If the plugin detects that it is running within a job, it will therefore issue a warning and pause for 5 seconds.
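A check of this kind can be sketched as follows. SLURM sets the SLURM_JOB_ID environment variable inside a job; that the plugin relies on exactly this variable is an assumption here, and the function warn_if_inside_job is hypothetical.

```python
import os
import time
import warnings


def warn_if_inside_job(env=None, pause_seconds=5, sleep=time.sleep):
    """Warn and pause when running inside a SLURM job (illustrative only)."""
    env = os.environ if env is None else env
    if "SLURM_JOB_ID" not in env:
        return False
    warnings.warn(
        "Running inside a SLURM job: submitted jobs may be ill-parameterized "
        "or may not find the right environment."
    )
    sleep(pause_seconds)  # give the user a moment to notice and abort
    return True


if __name__ == "__main__":
    print(warn_if_inside_job(env={}))  # no job id in this environment
```

The sleep function is injectable so the behaviour can be exercised without actually waiting.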

Summary:

When put together, a typical command line looks like this, assuming an unlimited number of jobs (-j unlimited) and a data path that is not relative to the workflow (--directory):

$ snakemake --workflow-profile <path> \
> -j unlimited \
> --default-resources slurm_partition=<default partition> \
> --configfile config/config.yaml \
> --directory <path>