Lesson 4: Biowulf modules, swarm, and batch jobs

Quick review

The previous lessons taught participants how to connect to Biowulf and navigate through the environment.

Learning objectives

After this lesson, participants should be able to

Find bioinformatics applications that are installed on Biowulf
Load applications that are installed on Biowulf
Describe the Biowulf batch system
Use nano to edit files
Use swarm to submit a group of commands to the Biowulf batch system
Submit a script to the Biowulf batch system

Commands that will be discussed

module avail: list available applications on Biowulf
module spider: list or search for available applications and their versions
module whatis: get application description
module load: load an application
nano: open the Unix text editor to edit files
touch: create a blank text file

Before getting started

Sign onto Biowulf using the assigned student account. Remember, Windows users will need to open the Command Prompt and Mac users will need to open the Terminal. Also remember to connect to the NIH network either by being on campus or through VPN before attempting to sign in. The command to sign in to Biowulf is below, where username should be replaced by the student ID.

ssh username@biowulf.nih.gov

See here for student account assignment. Enter NIH credentials to see the student account assignment sheet after clicking the link.

After connecting to Biowulf, change into the data directory. Again, replace username with the student account ID.

cd /data/username

Biowulf does not keep data in the student accounts after class, so copy the folder unix_on_biowulf_2023_documents in /data/classes/BTEP to the present working directory, which should be /data/username.

cp -r /data/classes/BTEP/unix_on_biowulf_2023_documents .

Change into unix_on_biowulf_2023_documents.

cd unix_on_biowulf_2023_documents
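
The copy step above can be wrapped in a small function that refuses to run when either directory is missing. This is a minimal sketch and not part of the class materials; the function name copy_class_files is made up for illustration.

```shell
# copy_class_files SRC DEST
# Copy the directory SRC into the existing directory DEST.
# (Hypothetical helper; not a Biowulf command.)
copy_class_files() {
    src="$1"
    dest="$2"
    if [ ! -d "$src" ]; then
        echo "source directory not found: $src" >&2
        return 1
    fi
    if [ ! -d "$dest" ]; then
        echo "destination directory not found: $dest" >&2
        return 1
    fi
    cp -r "$src" "$dest"/
}

# On Biowulf the call would look like this (replace username):
# copy_class_files /data/classes/BTEP/unix_on_biowulf_2023_documents /data/username
```

Checking both directories first gives a clear message instead of a partial copy when a path is mistyped.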

Bioinformatics applications on Biowulf

Biowulf houses thousands of applications. To get a list of applications that are available on Biowulf, use the module command with its avail subcommand.

module avail

Use the up and down arrow keys to scroll through the list, or use the space bar to scroll one page at a time. Hit q to exit the module list and return to the prompt.

module spider also lists applications but displays results in a different format. Hit q to exit module spider.

Use module spider followed by the application name to search for a specific application. For instance, search for fastqc, which is used to assess the quality of Next Generation Sequencing data.

module spider fastqc

Biowulf keeps the current version and previous versions of an application. By default, Biowulf loads the current (latest) version.

-------------------------------------------------------------------------------------------
  fastqc:
-------------------------------------------------------------------------------------------
     Versions:
        fastqc/0.11.8
        fastqc/0.11.9

-------------------------------------------------------------------------------------------
  For detailed information about a specific "fastqc" package (including how to load the modules) use the module's full name.
  Note that names that have a trailing (E) are extensions provided by other modules.
  For example:

     $ module spider fastqc/0.11.9
-------------------------------------------------------------------------------------------

To find out how to load FASTQC

module spider fastqc/0.11.9

-----------------------------------------------------------------------------------------------------------------------
  fastqc: fastqc/0.11.9
-----------------------------------------------------------------------------------------------------------------------

    This module can be loaded directly: module load fastqc/0.11.9

    Help:
      This module sets up the environment for using fastqc.

To find out what FASTQC does

module whatis fastqc

fastqc/0.11.9       : fastqc: It provide quality control functions to next gen sequencing data.
fastqc/0.11.9       : Version: 0.11.9

Working with Biowulf bioinformatics applications

This exercise will demonstrate how to use a Biowulf bioinformatics application called seqkit. The skills can be used for running other applications that are available on Biowulf.

What is seqkit?

module whatis seqkit

seqkit/2.1.0        : A cross-platform and ultrafast toolkit for FASTA/Q file manipulation in Golang

Before doing anything computationally intensive, request an interactive session.

sinteractive

Load seqkit or any other tool (tools will not load in the login node)

module load seqkit

Change into the SRR1553606 directory

cd SRR1553606

ls

There are two Next Generation Sequencing fastq files in this folder

SRR1553606_1.fastq  SRR1553606_2.fastq

Use the stat subcommand of seqkit to get some statistics about SRR1553606_1.fastq.

seqkit stat SRR1553606_1.fastq

file                format  type  num_seqs    sum_len  min_len  avg_len  max_len
SRR1553606_1.fastq  FASTQ   DNA     10,000  1,010,000      101      101      101

Convert SRR1553606_1.fastq to fasta using the fq2fa subcommand of seqkit.

seqkit fq2fa SRR1553606_1.fastq

>SRR1553606.1 1 length=101
ATACACATCTCCGAGCCCACGAGACCTCTCTACATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAACAGGAGTCGCCCAGCCCTGCTCAACGAGCTGCAG
>SRR1553606.2 2 length=101
CAACAACAACACTCATCACCAAGATACCGGAGAAGAGAGTGCCAGCAGCGGGAAGCTAGGCTTAATTACCAATACTATTGCTGGAGTAGCAGGACTGATCA

Submitting jobs to the Biowulf batch system

For this portion of the class, change back to the /data/username folder

cd /data/username

Then make a new directory called SRP045416 and change into it.

mkdir SRP045416

cd SRP045416

On Biowulf, a swarm script can help parallelize tasks. For example, multiple sequencing data files from the NCBI SRA study "Zaire ebolavirus sample sequencing from the 2014 outbreak in Sierra Leone, West Africa" can be downloaded in parallel rather than one file after another. The example here will download the first 10000 reads of the following sequencing data files in this study.

SRR1553606
SRR1553416
SRR1553417
SRR1553418
SRR1553419

Create a file called SRP045416.swarm in the nano editor.

nano SRP045416.swarm

Copy and paste the following script into the editor.

#SWARM --job-name SRP045416
#SWARM --sbatch "--mail-type=ALL --mail-user=username@nih.gov"
#SWARM --partition=student
#SWARM --gres=lscratch:15 
#SWARM --module sratoolkit 

fastq-dump --split-files -X 10000 SRR1553606
fastq-dump --split-files -X 10000 SRR1553416
fastq-dump --split-files -X 10000 SRR1553417
fastq-dump --split-files -X 10000 SRR1553418
fastq-dump --split-files -X 10000 SRR1553419

In the swarm script above, the first five lines start with #SWARM. They are not run as commands; they are directives for requesting resources on Biowulf. The swarm directives are interpreted as below:

--job-name
- assigns the job name (i.e. SRP045416)
--sbatch "--mail-type=ALL --mail-user=username@nih.gov"
- asks Biowulf to email all job notifications (replace username with NIH username)
--partition
- selects the partition to run on (student accounts will need to use the student partition)
--gres
- asks for a generic resource (i.e. 15 GB of local temporary storage space, specified as lscratch:15)
--module
- loads modules (i.e. sratoolkit, which houses fastq-dump for downloading sequencing data from the Sequence Read Archive)

After editing a file using nano, hit control-x to exit. When prompted to save, hit "y", then hit Enter to confirm the file name.

To submit SRP045416.swarm

swarm -f SRP045416.swarm
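
When the list of accessions grows, typing one fastq-dump line per file becomes tedious. As a sketch, the swarm file can be written by a loop instead. The file names accessions.txt and SRP045416_generated.swarm are made up for this example; the directives and the fastq-dump options are the ones used in the script above (the email directive is left out for brevity).

```shell
# One accession per line.
cat > accessions.txt <<'EOF'
SRR1553606
SRR1553416
SRR1553417
SRR1553418
SRR1553419
EOF

swarm_file="SRP045416_generated.swarm"   # hypothetical file name

# Write the directives first (">" creates or overwrites the file).
cat > "$swarm_file" <<'EOF'
#SWARM --job-name SRP045416
#SWARM --partition=student
#SWARM --gres=lscratch:15
#SWARM --module sratoolkit

EOF

# Append one fastq-dump command per accession (">>" adds to the end).
while read -r acc; do
    echo "fastq-dump --split-files -X 10000 $acc" >> "$swarm_file"
done < accessions.txt

cat "$swarm_file"
```

The generated file would be submitted the same way, with swarm -f SRP045416_generated.swarm.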

Use sjobs to check job status and resource allocation. Figure 1 shows the information provided by sjobs when SRP045416.swarm was submitted.

sjobs

Some important columns in Figure 1 include the following.

JobID
St, which provides the job status
- R for running
- PD for pending
Walltime, which indicates how much time was allocated for the job
Number of CPUs and memory assigned

Note that the swarm script was assigned job ID 1436172 and there are five sub-jobs as indicated by [0-4], which corresponds to the five commands in the script. Biowulf assigned 5 CPUs (see cpus queued) and 7.5 GB of memory, or 1.5 GB per sub-job (see mem queued), for the swarm script.

Figure 1: Use sjobs to check status and resource allocation after submitting a job to Biowulf.
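
The total memory reported for the swarm is the per-sub-job memory multiplied by the number of sub-jobs. A quick check of that arithmetic is sketched below, using awk because the shell itself only does integer math. The function name total_mem_gb is made up for the example.

```shell
# total_mem_gb SUBJOBS GB_PER_SUBJOB
# Print the total memory, in GB, for a swarm of SUBJOBS sub-jobs.
total_mem_gb() {
    awk -v n="$1" -v m="$2" 'BEGIN { printf "%.1f\n", n * m }'
}

total_mem_gb 5 1.5    # five sub-jobs at 1.5 GB each; prints 7.5
```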

After the swarm script finishes, use ls to list the contents of the directory. Use the -1 option to show one item per line.

ls -1

There are swarm log files with .e and .o extensions. Importantly, the fastq files were downloaded.

SRP045416_65452913_0.e
SRP045416_65452913_0.o
SRP045416_65452913_1.e
SRP045416_65452913_1.o
SRP045416_65452913_2.e
SRP045416_65452913_2.o
SRP045416_65452913_3.e
SRP045416_65452913_3.o
SRP045416_65452913_4.e
SRP045416_65452913_4.o
SRP045416.swarm
SRR1553416_1.fastq
SRR1553416_2.fastq
SRR1553417_1.fastq
SRR1553417_2.fastq
SRR1553418_1.fastq
SRR1553418_2.fastq
SRR1553419_1.fastq
SRR1553419_2.fastq
SRR1553606_1.fastq
SRR1553606_2.fastq
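
With --split-files, each accession is expected to produce a _1 and a _2 file, so a long listing can be checked by a loop rather than by eye. The following is a minimal sketch; the function name missing_fastq is made up, and it assumes the files follow the accession_1.fastq and accession_2.fastq naming seen above.

```shell
# missing_fastq DIR ACCESSION...
# Print the name of every expected FASTQ file that is absent from DIR.
missing_fastq() {
    local dir="$1"
    shift
    local acc mate
    for acc in "$@"; do
        for mate in 1 2; do
            [ -f "$dir/${acc}_${mate}.fastq" ] || echo "${acc}_${mate}.fastq"
        done
    done
}

# Check the current directory; no output means nothing is missing.
missing_fastq . SRR1553606 SRR1553416 SRR1553417 SRR1553418 SRR1553419
```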

An advantage of using command line and scripting to analyze data is the ability to automate, which is desired when working with multiple input files such as fastq files derived from sequencing experiments. A bash script can help obtain stats using seqkit for the fastq files that were just downloaded. Create a script called SRP045416_stats.sh.

nano SRP045416_stats.sh

Copy and paste the following into the editor.

#!/bin/bash
#SBATCH --job-name=SRP045416_stats
#SBATCH --mail-type=ALL
#SBATCH --mail-user=username@nih.gov
#SBATCH --mem=1gb
#SBATCH --partition=student
#SBATCH --time=00:02:00
#SBATCH --output=SRP045416_stats_log

#LOAD REQUIRED MODULES
module load seqkit

#CREATE TEXT FILE TO STORE THE seqkit stat OUTPUT
touch SRP045416_stats.txt

#CREATE A FOR LOOP TO LOOP THROUGH THE FASTQ FILES AND GENERATE STATISTICS
#Use ">>" to redirect and append output to a file
for file in *.fastq
do
    # Quote "$file" so that file names with spaces are handled correctly
    seqkit stat "$file" >> SRP045416_stats.txt
done

To submit this script

sbatch SRP045416_stats.sh

After the job finishes, use cat to view the contents of SRP045416_stats.txt.

cat SRP045416_stats.txt

file                format  type  num_seqs    sum_len  min_len  avg_len  max_len
SRR1553416_1.fastq  FASTQ   DNA     10,000  1,010,000      101      101      101
file                format  type  num_seqs    sum_len  min_len  avg_len  max_len
SRR1553416_2.fastq  FASTQ   DNA     10,000  1,010,000      101      101      101
file                format  type  num_seqs    sum_len  min_len  avg_len  max_len
SRR1553417_1.fastq  FASTQ   DNA     10,000  1,010,000      101      101      101
file                format  type  num_seqs    sum_len  min_len  avg_len  max_len
SRR1553417_2.fastq  FASTQ   DNA     10,000  1,010,000      101      101      101
file                format  type  num_seqs    sum_len  min_len  avg_len  max_len
SRR1553418_1.fastq  FASTQ   DNA     10,000  1,010,000      101      101      101
file                format  type  num_seqs    sum_len  min_len  avg_len  max_len
SRR1553418_2.fastq  FASTQ   DNA     10,000  1,010,000      101      101      101
file                format  type  num_seqs    sum_len  min_len  avg_len  max_len
SRR1553419_1.fastq  FASTQ   DNA     10,000  1,010,000      101      101      101
file                format  type  num_seqs    sum_len  min_len  avg_len  max_len
SRR1553419_2.fastq  FASTQ   DNA     10,000  1,010,000      101      101      101
file                format  type  num_seqs    sum_len  min_len  avg_len  max_len
SRR1553606_1.fastq  FASTQ   DNA     10,000  1,010,000      101      101      101
file                format  type  num_seqs    sum_len  min_len  avg_len  max_len
SRR1553606_2.fastq  FASTQ   DNA     10,000  1,010,000      101      101      101
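
One detail of the touch and ">>" pattern is worth knowing: touch does not empty a file that already exists, so running the script a second time would append a second copy of every line. The sketch below shows the same loop pattern with the output file emptied first. It uses made-up FASTQ files in a temporary directory and counts reads with wc as a stand-in for seqkit, so it can be tried anywhere.

```shell
# Two tiny made-up FASTQ files in a scratch directory.
workdir=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n' > "$workdir/a.fastq"
printf '@r1\nAC\n+\nII\n@r2\nGT\n+\nII\n' > "$workdir/b.fastq"

out="$workdir/read_counts.txt"
: > "$out"    # create the file, or empty it if it already exists

for file in "$workdir"/*.fastq
do
    lines=$(wc -l < "$file")
    # A FASTQ record is four lines, so reads = lines / 4.
    echo "$(basename "$file") $((lines / 4))" >> "$out"
done

cat "$out"
```

In SRP045416_stats.sh, the equivalent change would be replacing the touch line with ": > SRP045416_stats.txt".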

Explanation of the SRP045416_stats.sh script.

Lines that start with "#" are comments and are not run as a part of the script
A shell script starts with #!/bin/bash, where "#!" is known as the shebang (also written sha-bang). Following "#!" is the path to the command interpreter (i.e. /bin/bash).
Lines that start with #SBATCH are directives. Because these lines start with "#", they will not be run as a part of the script. However, these lines are important because they instruct Biowulf on when and where to send job notification as well as what resources need to be allocated.
- job-name: (name of the job)
- mail-type: (type of notification emails to receive, ALL will send all notifications including begin, end, cancel)
- mail-user: (where to send notification emails, replace with NIH email)
- mem: (RAM or memory required for the job)
- partition: (which partition to use; student accounts will need to use the student partition)
- time: (how much time should be allotted for the job; 00:02:00 requests 2 minutes)
- output: (name of the log file)
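
The --time value in the script is written as hours:minutes:seconds. A small helper can make such a value easier to read by converting it to seconds. This is a sketch that handles only the three-field form; the function name hms_to_seconds is made up for the example.

```shell
# hms_to_seconds HH:MM:SS
# Convert a three-field time value to a number of seconds.
hms_to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

hms_to_seconds 00:02:00    # the value used in the script; prints 120
hms_to_seconds 00:10:00    # ten minutes; prints 600
```

A job that runs past its requested time is stopped by the scheduler, so it is worth checking the value before submitting.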