Lesson 6 supplement (Swarm)
Swarm in Biowulf
In Biowulf, we can create a swarm script to help parallelize tasks. For instance, rather than downloading one file after another, we can use a swarm script to download multiple sequencing data files in parallel from the NCBI SRA study Zaire ebolavirus sample sequencing from the 2014 outbreak in Sierra Leone, West Africa. Here, we will download the first 10000 reads of each of the following sequencing data files in this study:
- SRR1553606
- SRR1553416
- SRR1553417
- SRR1553418
- SRR1553419
While we could run five individual fastq-dump commands (one after another) in an interactive session to download the sequencing data files, it is easier to use the swarm script below, where each fastq-dump command runs as an individual sub-job, allowing us to download in parallel.
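For reference, a single interactive download of one of these files might look like the following sketch (this assumes an interactive session with the sratoolkit module already loaded):

# --split-files writes paired-end reads to separate _1 and _2 FASTQ files
# -X 10000 limits the download to the first 10000 reads (spots)
fastq-dump --split-files -X 10000 SRR1553606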
We can copy and paste the following swarm script into the nano editor, save it as SRP045416.swarm, and then submit it as a job to Biowulf to accomplish our download.
#SWARM --job-name SRP045416
#SWARM --sbatch "--mail-type=ALL --mail-user=wuz8@nih.gov"
#SWARM --gres=lscratch:15
#SWARM --module sratoolkit
fastq-dump --split-files -X 10000 SRR1553606
fastq-dump --split-files -X 10000 SRR1553416
fastq-dump --split-files -X 10000 SRR1553417
fastq-dump --split-files -X 10000 SRR1553418
fastq-dump --split-files -X 10000 SRR1553419
The first four lines in the script, which start with #SWARM, are not run as part of the script; they are directives for requesting resources on Biowulf. The four swarm directives are interpreted as follows:
- --job-name: assigns the job name (i.e., SRP045416)
- --sbatch "--mail-type=ALL --mail-user=wuz8@nih.gov": asks Biowulf to email all job notifications
- --gres=lscratch:15: asks for 15 GB of local scratch space for temporary files
- --module sratoolkit: loads the sratoolkit module so we can run fastq-dump
We can request other compute resources for swarm jobs (see https://hpc.nih.gov/apps/swarm.html).
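For example, if each command needed more memory or CPUs, additional directives could be added to the swarm file. The sketch below assumes that -g requests gigabytes of memory per sub-job and -t requests CPUs per sub-job; check the swarm documentation linked above for the exact options and their defaults.

#SWARM -g 4     # request 4 GB of memory per sub-job (assumed option; see swarm docs)
#SWARM -t 2     # request 2 CPUs per sub-job (assumed option; see swarm docs)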
To submit the swarm script, we run:
swarm -f SRP045416.swarm
Note that upon submission of the swarm script, Biowulf assigns an overall job ID (58101981 in the example below).
[wuz8@cn4269 wuz8]$ swarm -f SRP045416.swarm
58101981
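Because swarm prints only the overall job ID to standard output (as seen above), it can be captured in a shell variable at submission time and reused in later commands, a sketch being:

jobid=$(swarm -f SRP045416.swarm)   # capture the overall job ID
squeue -j $jobid                    # check the status of only this job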
We can use squeue to check the status of this job. By doing so, we see that each of the fastq-dump commands runs as a sub-job, labeled 58101981_0, 58101981_1, 58101981_2, 58101981_3, and 58101981_4. The number of sub-jobs reflects the number of commands in the swarm script. Another way of interpreting swarm is that it offers efficiency by enabling multiple jobs to run in parallel through the submission of a single script to the Biowulf batch system.
[wuz8@cn4269 wuz8]$ squeue -u wuz8
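Once the sub-jobs finish, they no longer appear in squeue, but their records can still be reviewed with Slurm's accounting tool. The sketch below uses sacct, which is standard Slurm; the fields requested are illustrative, and the job ID would be replaced with your own.

sacct -j 58101981 --format=JobID,JobName,State,Elapsed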