Lesson 5: Submitting jobs to the Biowulf batch system
Learning objectives
After this lesson, participants will
- Be able to describe shell and swarm scripts
- Use the Nano editor to edit scripts
- Submit shell and swarm scripts to the Biowulf batch system
Connecting to Biowulf
To get started, open the Command Prompt (Windows) or the Terminal (Mac) and connect to Biowulf. Remember that you need to be connected to the NIH network, either by being on campus or through VPN. Recall from lesson 1 that you use the ssh command below to connect to Biowulf, where username is the student account ID that was assigned to you (see student assignments). When prompted for your password, nothing will appear on screen as you type; keep typing and press Enter when done.
ssh username@biowulf.nih.gov
After connecting to Biowulf, change into the data directory. Again, replace username with the student account ID.
cd /data/username
Swarm scripts
In Biowulf, a swarm script helps parallelize tasks. For example, it can download multiple sequencing data files from the NCBI SRA study "Zaire ebolavirus sample sequencing from the 2014 outbreak in Sierra Leone, West Africa" in parallel, rather than one file after another. The example here downloads the first 10000 reads of each of the following sequencing data files in this study.
- SRR1553606
- SRR1553416
- SRR1553417
- SRR1553418
- SRR1553419
Make a folder called SRP045416.
mkdir SRP045416
cd SRP045416
Create a file called SRP045416.swarm in the nano editor.
nano SRP045416.swarm
Copy and paste the following script into the editor.
#SWARM --job-name SRP045416
#SWARM --sbatch "--mail-type=ALL --mail-user=username@nih.gov"
#SWARM --partition=student
#SWARM --gres=lscratch:15
#SWARM --module sratoolkit
fastq-dump --split-files -X 10000 SRR1553606
fastq-dump --split-files -X 10000 SRR1553416
fastq-dump --split-files -X 10000 SRR1553417
fastq-dump --split-files -X 10000 SRR1553418
fastq-dump --split-files -X 10000 SRR1553419
In the swarm script above, the first five lines start with #SWARM. They are not run as commands; they are directives that request resources on Biowulf. The directives are interpreted as follows:
- --job-name: assigns the job name (i.e. SRP045416)
- --sbatch: passes options to sbatch; "--mail-type=ALL --mail-user=username@nih.gov" asks Biowulf to email all job notifications (replace username with your NIH username)
- --partition: selects the partition to run on (student accounts need to use the student partition)
- --gres: asks for a generic resource (i.e. 15 GB of local temporary storage, specified as lscratch:15)
- --module: loads modules (i.e. sratoolkit, which provides fastq-dump for downloading sequencing data from the Sequence Read Archive)
After editing a file using nano, press control-x to exit. When prompted to save, press "y", then press Enter to confirm the file name.
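With more than a handful of accessions, typing one fastq-dump line per file becomes tedious. As a minimal sketch, assuming the accession IDs are held in a bash array, the loop below writes the same directives and command lines to a file. The file name SRP045416_generated.swarm and the variable names are hypothetical, chosen so that the file edited in nano is not overwritten.

```shell
#!/bin/bash
# Accession IDs from the list in this lesson.
accessions=(SRR1553606 SRR1553416 SRR1553417 SRR1553418 SRR1553419)

# Hypothetical output name, so the file edited in nano is left untouched.
swarm_file=SRP045416_generated.swarm

# Write the #SWARM directives first (replace username with your NIH username).
cat > "$swarm_file" <<'EOF'
#SWARM --job-name SRP045416
#SWARM --sbatch "--mail-type=ALL --mail-user=username@nih.gov"
#SWARM --partition=student
#SWARM --gres=lscratch:15
#SWARM --module sratoolkit
EOF

# Append one fastq-dump command per accession.
for acc in "${accessions[@]}"; do
    echo "fastq-dump --split-files -X 10000 $acc" >> "$swarm_file"
done

# Print the generated file for a visual check.
cat "$swarm_file"
```

The generated file could then be submitted in the same way as the hand-written one, with swarm -f.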
To submit SRP045416.swarm
swarm -f SRP045416.swarm
Use sjobs to check job status and resource allocation. Figure 1 shows the information provided by sjobs when SRP045416.swarm was submitted.
sjobs
Some important columns in Figure 1 include the following.
- JobID
- St, which provides the job status
  - R for running
  - PD for pending
- Walltime, which indicates how much time was allocated for the job
- Number of CPUs and memory assigned
Note that the swarm script was assigned job ID 17387605 and that there are five sub-jobs, as indicated by [0-4], which corresponds to the five commands in the script.
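The [0-4] range follows directly from the number of command lines in the swarm file. The sketch below recreates the file contents in a temporary file, counts the lines that are not #SWARM directives, and estimates the total request. It assumes one subjob per command line and the swarm defaults of 1.5 GB of memory and 2 CPUs per subjob described in the Biowulf swarm documentation; the variable names are hypothetical.

```shell
#!/bin/bash
# Recreate the swarm file in a temporary location (illustration only).
tmp_swarm=$(mktemp)
cat > "$tmp_swarm" <<'EOF'
#SWARM --job-name SRP045416
#SWARM --sbatch "--mail-type=ALL --mail-user=username@nih.gov"
#SWARM --partition=student
#SWARM --gres=lscratch:15
#SWARM --module sratoolkit
fastq-dump --split-files -X 10000 SRR1553606
fastq-dump --split-files -X 10000 SRR1553416
fastq-dump --split-files -X 10000 SRR1553417
fastq-dump --split-files -X 10000 SRR1553418
fastq-dump --split-files -X 10000 SRR1553419
EOF

# Lines that do not start with "#" are commands; each one is assumed to become a subjob.
n_subjobs=$(grep -c -v '^#' "$tmp_swarm")

# Assumed defaults: 2 CPUs and 1.5 GB of memory per subjob.
total_cpus=$(( n_subjobs * 2 ))
total_mem=$(LC_ALL=C awk -v n="$n_subjobs" 'BEGIN { printf "%.1f", n * 1.5 }')

echo "subjobs=$n_subjobs (indices 0-$(( n_subjobs - 1 ))) cpus=$total_cpus mem_gb=$total_mem"
rm -f "$tmp_swarm"
```

Comparing such an estimate with the CPU and memory columns reported by sjobs is a quick way to confirm that the job was allocated what was expected.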
"By default, each subjob is allocated a 1.5 gb of memory and 1 core (consisting of 2 cpus)." -- Biowulf swarm documentation
An advantage of using command line and scripting to analyze data is the ability to automate, which is desired when working with multiple input files such as fastq files derived from sequencing experiments. A bash script can help obtain stats using seqkit for the fastq files that were just downloaded. Create a script called SRP045416_stats.sh.
nano SRP045416_stats.sh
Copy and paste the following into the editor.
#!/bin/bash
#SBATCH --job-name=SRP045416_stats
#SBATCH --mail-type=ALL
#SBATCH --mail-user=username@nih.gov
#SBATCH --mem=1gb
#SBATCH --partition=student
#SBATCH --time=00:02:00
#SBATCH --output=SRP045416_stats_log
#LOAD REQUIRED MODULES
module load seqkit
#CREATE TEXT FILE TO STORE THE seqkit stat OUTPUT
touch SRP045416_stats.txt
#CREATE A FOR LOOP TO LOOP THROUGH THE FASTQ FILES AND GENERATE STATISTICS
#Use ">>" to redirect and append output to a file
for file in *.fastq; do
    seqkit stat "$file" >> SRP045416_stats.txt
done
Explanation of the SRP045416_stats.sh script.
- Lines that start with "#" are comments and are not run as a part of the script
- A shell script starts with #!/bin/bash, where "#!" is known as the shebang (or sha-bang). Following "#!" is the path to the command interpreter (i.e. /bin/bash).
- Lines that start with #SBATCH are directives. Because these lines start with "#", they will not be run as a part of the script. However, these lines are important because they instruct Biowulf on when and where to send job notifications as well as what resources need to be allocated.
- job-name: (name of the job)
- mail-type: (type of notification emails to receive, ALL will send all notifications including begin, end, cancel)
- mail-user: (where to send notification emails, replace with NIH email)
- mem: (RAM or memory required for the job)
- partition: (which partition to use; student accounts will need to use the student partition)
- time: (how much time should be allotted for the job, in hours:minutes:seconds; 00:02:00 requests 2 minutes)
- output: (name of the log file)
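The script relies on ">>" to append the statistics of each file. The sketch below illustrates why ">" would not work inside the loop. It uses echo as a stand-in for seqkit stat, and hypothetical file and variable names, so that it can be run anywhere.

```shell
#!/bin/bash
# Hypothetical scratch files for the demonstration.
overwrite_out=$(mktemp)
append_out=$(mktemp)

for file in sample_1.fastq sample_2.fastq sample_3.fastq; do
    # ">" truncates the file on every iteration, so only the last line survives.
    echo "stats for $file" > "$overwrite_out"
    # ">>" adds to the end of the file, so every iteration is kept.
    echo "stats for $file" >> "$append_out"
done

# Count the lines that ended up in each file.
overwrite_lines=$(grep -c '' "$overwrite_out")
append_lines=$(grep -c '' "$append_out")
echo "with >  : $overwrite_lines line(s)"
echo "with >> : $append_lines line(s)"

rm -f "$overwrite_out" "$append_out"
```

Because ">>" never clears the file, running SRP045416_stats.sh a second time would add another copy of the results to SRP045416_stats.txt; delete the file first if a fresh table is wanted.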
To submit this script
sbatch SRP045416_stats.sh
To view the output file SRP045416_stats.txt
cat SRP045416_stats.txt
file format type num_seqs sum_len min_len avg_len max_len
SRR1553416_1.fastq FASTQ DNA 10,000 1,010,000 101 101 101
file format type num_seqs sum_len min_len avg_len max_len
SRR1553416_2.fastq FASTQ DNA 10,000 1,010,000 101 101 101
file format type num_seqs sum_len min_len avg_len max_len
SRR1553417_1.fastq FASTQ DNA 10,000 1,010,000 101 101 101
file format type num_seqs sum_len min_len avg_len max_len
SRR1553417_2.fastq FASTQ DNA 10,000 1,010,000 101 101 101
file format type num_seqs sum_len min_len avg_len max_len
SRR1553418_1.fastq FASTQ DNA 10,000 1,010,000 101 101 101
file format type num_seqs sum_len min_len avg_len max_len
SRR1553418_2.fastq FASTQ DNA 10,000 1,010,000 101 101 101
file format type num_seqs sum_len min_len avg_len max_len
SRR1553419_1.fastq FASTQ DNA 10,000 1,010,000 101 101 101
file format type num_seqs sum_len min_len avg_len max_len
SRR1553419_2.fastq FASTQ DNA 10,000 1,010,000 101 101 101
file format type num_seqs sum_len min_len avg_len max_len
SRR1553606_1.fastq FASTQ DNA 10,000 1,010,000 101 101 101
file format type num_seqs sum_len min_len avg_len max_len
SRR1553606_2.fastq FASTQ DNA 10,000 1,010,000 101 101 101
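Because seqkit stat is called once per file, the header line is repeated before every row. One possible way to tidy the table afterwards is to keep the first header and drop the repeats with awk. The sketch below works on a small copy of the output above; the file names stats_demo.txt and stats_clean.txt are hypothetical.

```shell
#!/bin/bash
# Recreate the first rows of SRP045416_stats.txt in a demo file.
cat > stats_demo.txt <<'EOF'
file format type num_seqs sum_len min_len avg_len max_len
SRR1553416_1.fastq FASTQ DNA 10,000 1,010,000 101 101 101
file format type num_seqs sum_len min_len avg_len max_len
SRR1553416_2.fastq FASTQ DNA 10,000 1,010,000 101 101 101
EOF

# Keep line 1 (the header) and every later line whose first column is not "file".
awk 'NR == 1 || $1 != "file"' stats_demo.txt > stats_clean.txt

cat stats_clean.txt
```

Passing all files to a single seqkit stat call (for example, seqkit stat *.fastq) is another option that should give one table with a single header.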