Lesson 6 supplement (Swarm)
Swarm in Biowulf
In Biowulf, we can create a swarm script to help parallelize tasks. For instance, rather than downloading one file after another, we can use a swarm script to download multiple sequencing data files in parallel from the NCBI SRA study Zaire ebolavirus sample sequencing from the 2014 outbreak in Sierra Leone, West Africa. Here, we will download the first 10000 reads of each of the following sequencing data files in this study:
- SRR1553606
- SRR1553416
- SRR1553417
- SRR1553418
- SRR1553419
While we could run five individual fastq-dump commands (one after another) in an interactive session to download the sequencing data files, it is easier to use the swarm script below, where each fastq-dump command runs as an individual sub-job, allowing us to download in parallel.
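For reference, a single interactive download of one of these files might look like the following sketch (this assumes an interactive session with the sratoolkit module already loaded):

# --split-files writes paired-end reads to separate _1 and _2 FASTQ files
# -X 10000 limits the download to the first 10000 reads (spots)
fastq-dump --split-files -X 10000 SRR1553606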
We can copy and paste the following swarm script into the nano editor, save it as SRP045416.swarm, and then submit it as a job to Biowulf to accomplish our download.
#SWARM --job-name SRP045416
#SWARM --sbatch "--mail-type=ALL --mail-user=wuz8@nih.gov"
#SWARM --gres=lscratch:15
#SWARM --module sratoolkit
fastq-dump --split-files -X 10000 SRR1553606
fastq-dump --split-files -X 10000 SRR1553416
fastq-dump --split-files -X 10000 SRR1553417
fastq-dump --split-files -X 10000 SRR1553418
fastq-dump --split-files -X 10000 SRR1553419
The first four lines in the script, which start with #SWARM, are not run as part of the script; they are directives for requesting resources on Biowulf. The four swarm directives are interpreted as follows:
- --job-name: assigns the job name (i.e., SRP045416)
- --sbatch "--mail-type=ALL --mail-user=wuz8@nih.gov": asks Biowulf to email all job notifications
- --gres=lscratch:15: asks for 15 GB of local scratch space for temporary files
- --module sratoolkit: loads the sratoolkit module so we can run fastq-dump
We can request other compute resources for swarm jobs (see https://hpc.nih.gov/apps/swarm.html).
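For example, if each command needed more memory or CPUs, additional directives could be added to the swarm file. The sketch below assumes that -g requests gigabytes of memory per sub-job and -t requests CPUs per sub-job; check the swarm documentation linked above for the exact options and their defaults.

#SWARM -g 4     # request 4 GB of memory per sub-job (assumed option; see swarm docs)
#SWARM -t 2     # request 2 CPUs per sub-job (assumed option; see swarm docs)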
To submit the swarm script, we run:
swarm -f SRP045416.swarm
Note that upon submission of the swarm script, Biowulf assigns an overall job ID (58101981 in the example below).
[wuz8@cn4269 wuz8]$ swarm -f SRP045416.swarm
58101981
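Because swarm prints only the overall job ID to standard output (as seen above), it can be captured in a shell variable at submission time and reused in later commands, a sketch being:

jobid=$(swarm -f SRP045416.swarm)   # capture the overall job ID
squeue -j $jobid                    # check the status of only this job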
We can use squeue to check the status of this job. By doing so, we see that each of the fastq-dump commands runs as a sub-job, labeled 58101981_0, 58101981_1, 58101981_2, 58101981_3, and 58101981_4. The number of sub-jobs reflects the number of commands in the swarm script. Another way of interpreting swarm is that it offers efficiency by enabling multiple jobs to run in parallel through the submission of a single script to the Biowulf batch system.
[wuz8@cn4269 wuz8]$ squeue -u wuz8
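Once the sub-jobs finish, they no longer appear in squeue, but their records can still be reviewed with Slurm's accounting tool. The sketch below uses sacct, which is standard Slurm; the fields requested are illustrative, and the job ID would be replaced with your own.

sacct -j 58101981 --format=JobID,JobName,State,Elapsed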