Lesson 5: Submitting jobs to the Biowulf batch system

Learning objectives

After this lesson, participants will be able to

Describe shell and swarm scripts
Use the Nano editor to edit scripts
Submit shell and swarm scripts to the Biowulf batch system

Connecting to Biowulf

To get started, open the Command Prompt (Windows) or the Terminal (Mac) and connect to Biowulf. You need to be connected to the NIH network, either by being on campus or through VPN. Recall from lesson 1 that you use the ssh command below to connect to Biowulf, where username is the student account ID that was assigned to you (see student assignments). When prompted for your password, nothing appears on the screen as you type; keep typing and then press Enter.

ssh username@biowulf.nih.gov

After connecting to Biowulf, change into the data directory. Again, replace username with the student account ID.

cd /data/username

Swarm scripts

On Biowulf, a swarm script helps parallelize tasks. For example, it can download multiple sequencing data files from the NCBI SRA study "Zaire ebolavirus sample sequencing from the 2014 outbreak in Sierra Leone, West Africa" in parallel, rather than one file after another. The example here downloads the first 10000 reads from each of the following sequencing data files in this study.

SRR1553606
SRR1553416
SRR1553417
SRR1553418
SRR1553419

Make a folder called SRP045416.

mkdir SRP045416

cd SRP045416

Create a file called SRP045416.swarm in the nano editor.

nano SRP045416.swarm

Copy and paste the following script into the editor.

#SWARM --job-name SRP045416
#SWARM --sbatch "--mail-type=ALL --mail-user=username@nih.gov"
#SWARM --partition=student
#SWARM --gres=lscratch:15 
#SWARM --module sratoolkit 

fastq-dump --split-files -X 10000 SRR1553606
fastq-dump --split-files -X 10000 SRR1553416
fastq-dump --split-files -X 10000 SRR1553417
fastq-dump --split-files -X 10000 SRR1553418
fastq-dump --split-files -X 10000 SRR1553419

In the swarm script above, the first five lines start with #SWARM. They are not run as commands; they are directives that request resources on Biowulf. The swarm directives are interpreted as follows:

--job-name: assigns the job name (i.e. SRP045416)
--sbatch: "--mail-type=ALL --mail-user=username@nih.gov" asks Biowulf to email all job notifications (replace username with your NIH username)
--partition: selects which partition to use (student accounts need to use the student partition)
--gres: requests a generic resource (i.e. 15 GB of local temporary storage space, specified as lscratch:15)
--module: loads modules (i.e. sratoolkit, which houses fastq-dump for downloading sequencing data from the Sequence Read Archive)

After editing a file using nano, press control-x to exit. When prompted to save, press "y", then press Enter to confirm the file name.
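As an alternative to typing each fastq-dump line by hand, the command lines can be generated with a loop. The following is a minimal sketch and not part of the original lesson. The output file name SRP045416_generated.swarm is a hypothetical name, chosen so that the file created in nano is not overwritten; the directives are copied from the script above.

```shell
#!/bin/bash
# Sketch: build a swarm file from a list of SRA accessions.
# The output name is hypothetical; it avoids overwriting SRP045416.swarm.
swarm_file="SRP045416_generated.swarm"

# Write the #SWARM directives first (same as in the lesson's script).
# The quoted 'EOF' keeps the shell from expanding anything inside the block.
cat > "$swarm_file" <<'EOF'
#SWARM --job-name SRP045416
#SWARM --sbatch "--mail-type=ALL --mail-user=username@nih.gov"
#SWARM --partition=student
#SWARM --gres=lscratch:15
#SWARM --module sratoolkit

EOF

# Append one fastq-dump command per accession.
for acc in SRR1553606 SRR1553416 SRR1553417 SRR1553418 SRR1553419
do
    echo "fastq-dump --split-files -X 10000 $acc" >> "$swarm_file"
done

# Show the result for review before submitting.
cat "$swarm_file"
```

The generated file should match the hand-written one, so either could be passed to swarm. The loop becomes more useful as the list of accessions grows.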

To submit SRP045416.swarm:

swarm -f SRP045416.swarm

Use sjobs to check job status and resource allocation. Figure 1 shows the information provided by sjobs when SRP045416.swarm was submitted.

sjobs

Some important columns in Figure 1 include the following.

JobID
St, which provides the job status
- R for running
- PD for pending
Walltime, which indicates how much time was allocated for the job
Number of CPUs and memory assigned

Note that the swarm script was assigned job ID 17387605 and there are five sub-jobs as indicated by [0-4], which corresponds to the five commands in the script.
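Because each command line in the swarm file becomes one sub-job here, the number of sub-jobs can be anticipated by counting the lines that are neither blank nor directives. The sketch below assumes default swarm settings and no commands continued across lines. It creates a small demonstration file (demo.swarm, a hypothetical name) so that it can be run on its own.

```shell
#!/bin/bash
# Sketch: count the command lines in a swarm file.
# demo.swarm is a hypothetical file created here only for illustration.
cat > demo.swarm <<'EOF'
#SWARM --job-name demo
#SWARM --module sratoolkit

fastq-dump --split-files -X 10000 SRR1553606
fastq-dump --split-files -X 10000 SRR1553416
fastq-dump --split-files -X 10000 SRR1553417
EOF

# Drop directive/comment lines (starting with #), then count the
# remaining lines that contain at least one non-space character.
n_commands=$(grep -v '^#' demo.swarm | grep -c '[^[:space:]]')
echo "commands: $n_commands"
```

Pointing the same grep pipeline at SRP045416.swarm should report 5, in line with the [0-4] sub-job range.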

"By default, each subjob is allocated a 1.5 gb of memory and 1 core (consisting of 2 cpus)." -- Biowulf swarm documentation

An advantage of using the command line and scripting to analyze data is the ability to automate, which is useful when working with multiple input files such as the fastq files derived from sequencing experiments. A bash script can collect statistics with seqkit for the fastq files that were just downloaded. Create a script called SRP045416_stats.sh.

nano SRP045416_stats.sh

Copy and paste the following into the editor.

#!/bin/bash
#SBATCH --job-name=SRP045416_stats
#SBATCH --mail-type=ALL
#SBATCH --mail-user=username@nih.gov
#SBATCH --mem=1gb
#SBATCH --partition=student
#SBATCH --time=00:02:00
#SBATCH --output=SRP045416_stats_log

#LOAD REQUIRED MODULES
module load seqkit

#CREATE TEXT FILE TO STORE THE seqkit stat OUTPUT
touch SRP045416_stats.txt

#CREATE A FOR LOOP TO LOOP THROUGH THE FASTQ FILES AND GENERATE STATISTICS
#Use ">>" to redirect and append output to a file
for file in *.fastq
do
    seqkit stat "$file" >> SRP045416_stats.txt
done

Explanation of the SRP045416_stats.sh script.

Lines that start with "#" are comments and are not run as a part of the script
A shell script starts with #!/bin/bash. The "#!" is known as the sha-bang (or shebang), and what follows it is the path to the command interpreter (i.e. /bin/bash)
Lines that start with #SBATCH are directives. Because these lines start with "#", they will not be run as a part of the script. However, these lines are important because they instruct Biowulf on when and where to send job notifications as well as what resources need to be allocated.
- job-name: (name of the job)
- mail-type: (type of notification emails to receive, ALL will send all notifications including begin, end, cancel)
- mail-user: (where to send notification emails, replace with NIH email)
- mem: (RAM or memory required for the job)
- partition: (which partition to use; student accounts will need to use the student partition)
- time: (how much time should be allotted for the job, written as hours:minutes:seconds; the script requests 2 minutes with 00:02:00)
- output: (name of the log file)
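The time directive in the script is written as hours:minutes:seconds. As a quick check of how much time a value such as 00:02:00 requests, the sketch below converts it to seconds. It only illustrates the format and does not query the batch system.

```shell
#!/bin/bash
# Sketch: convert an HH:MM:SS walltime string to seconds.
walltime="00:02:00"   # value used in SRP045416_stats.sh

# Split on ":" and combine hours, minutes, and seconds.
seconds=$(echo "$walltime" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }')
echo "$walltime = $seconds seconds"
```

For 00:02:00 this reports 120 seconds, that is, 2 minutes.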

To submit this script:

sbatch SRP045416_stats.sh

To view the output file SRP045416_stats.txt:

cat SRP045416_stats.txt

file                format  type  num_seqs    sum_len  min_len  avg_len  max_len
SRR1553416_1.fastq  FASTQ   DNA     10,000  1,010,000      101      101      101
file                format  type  num_seqs    sum_len  min_len  avg_len  max_len
SRR1553416_2.fastq  FASTQ   DNA     10,000  1,010,000      101      101      101
file                format  type  num_seqs    sum_len  min_len  avg_len  max_len
SRR1553417_1.fastq  FASTQ   DNA     10,000  1,010,000      101      101      101
file                format  type  num_seqs    sum_len  min_len  avg_len  max_len
SRR1553417_2.fastq  FASTQ   DNA     10,000  1,010,000      101      101      101
file                format  type  num_seqs    sum_len  min_len  avg_len  max_len
SRR1553418_1.fastq  FASTQ   DNA     10,000  1,010,000      101      101      101
file                format  type  num_seqs    sum_len  min_len  avg_len  max_len
SRR1553418_2.fastq  FASTQ   DNA     10,000  1,010,000      101      101      101
file                format  type  num_seqs    sum_len  min_len  avg_len  max_len
SRR1553419_1.fastq  FASTQ   DNA     10,000  1,010,000      101      101      101
file                format  type  num_seqs    sum_len  min_len  avg_len  max_len
SRR1553419_2.fastq  FASTQ   DNA     10,000  1,010,000      101      101      101
file                format  type  num_seqs    sum_len  min_len  avg_len  max_len
SRR1553606_1.fastq  FASTQ   DNA     10,000  1,010,000      101      101      101
file                format  type  num_seqs    sum_len  min_len  avg_len  max_len
SRR1553606_2.fastq  FASTQ   DNA     10,000  1,010,000      101      101      101
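
The header line is repeated because each pass through the for loop runs seqkit stat separately, and every run prints its own header. A minimal sketch for keeping only the first header is shown below. It writes a two-file excerpt of the output above to a hypothetical file (stats_excerpt.txt) so that it can be run on its own, and saves the result to stats_clean.txt (also a hypothetical name). The same awk command could be pointed at SRP045416_stats.txt.

```shell
#!/bin/bash
# Sketch: remove repeated header lines from concatenated seqkit stat output.
# stats_excerpt.txt is a hypothetical file holding an excerpt of the output above.
cat > stats_excerpt.txt <<'EOF'
file                format  type  num_seqs    sum_len  min_len  avg_len  max_len
SRR1553416_1.fastq  FASTQ   DNA     10,000  1,010,000      101      101      101
file                format  type  num_seqs    sum_len  min_len  avg_len  max_len
SRR1553416_2.fastq  FASTQ   DNA     10,000  1,010,000      101      101      101
EOF

# Keep line 1 (the first header) and any later line whose first column is not "file".
awk 'NR == 1 || $1 != "file"' stats_excerpt.txt > stats_clean.txt
cat stats_clean.txt
```

The cleaned file has one header followed by one row per fastq file, which is easier to read and to load into a spreadsheet.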