
Interactive and Batch Jobs on Biowulf

Learning Objectives

After this class, the participant should have obtained the foundations for performing compute-intensive tasks on Biowulf. Specifically, the participant should be able to

  • Describe Biowulf and high performance computing (HPC) systems
  • Work interactively on Biowulf compute nodes
  • Submit shell and swarm scripts to the Biowulf batch system

Background on Biowulf

What is Biowulf

Biowulf is the Unix/Linux-based high performance computing (HPC) system at NIH. With over 90,000 processor cores and 30 petabytes of storage space, it is much more powerful than a personal computer. As depicted in Figure 1, a high performance computing system is composed of a cluster of individual computers. Each computer has its own processors, data storage, and random access memory (RAM). The individual computers are referred to as nodes. It is helpful to know the terms in Figure 1, as they are encountered when working on Biowulf or other high performance computing systems.

Figure 1: Example of high performance computing structure.

How does Biowulf manage and allocate compute resources?

Biowulf and other HPC systems use SLURM (Simple Linux Utility for Resource Management) to manage and allocate computation resources.

Note

"Slurm allocates on the basis of cores." -- Biowulf swarm tutorial

Why use Biowulf?

  • Compute power
  • Many bioinformatics and other scientific software packages available
  • System and software maintenance performed by Biowulf staff
  • Option for users to install their own software
  • Ability to transfer large datasets

Interactive sessions

Upon connecting to Biowulf, the user lands on the login node.

Warning

Do not use the login node to perform any computation-intensive tasks on Biowulf. Instead, either submit a job with sufficient resources requested (if staying on the login node) or request an interactive session for testing and development.

The login node is meant for the following (Source: Biowulf interactive jobs)

  • Submitting jobs (main purpose)
  • Editing/compiling code
  • File management
  • File transfer
  • Brief testing of code or debugging (about 20 minutes or less)

Interactive sessions allow users to work on Biowulf compute nodes and are suitable for

  • Testing/debugging CPU-intensive code
  • Pre/post-processing of data
  • Use of graphical applications

Note

Always work in the data directory! Use pwd to check whether the present working directory is /data/username, where username is the user's Biowulf account ID. If not, change into it using cd /data/username.
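
For example, assuming the shell variable $USER holds the Biowulf account ID (the default in a Linux login shell):

pwd
cd /data/$USER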

To request an interactive session with default resource allocations use

sinteractive

Biowulf will search for available resources and allocate them. The number 66874095 below is the ID assigned specifically to this interactive job. If the user terminates the interactive session and requests a new one, a different job ID will be assigned to that interactive session.

salloc: Pending job allocation 66874095
salloc: job 66874095 queued and waiting for resources
salloc: job 66874095 has been allocated resources
salloc: Granted job allocation 66874095
salloc: Waiting for resource configuration
salloc: Nodes cn4271 are ready for job

To find out how many resources have been allocated, use jobhist followed by the job ID.

jobhist 66874095

Figure 2 shows an example of the output from the jobhist command.

Figure 2: The default sinteractive allocation is 1 core (2 CPUs) and 0.768 GB/CPU (1.536 GB but rounded to 2 GB in the terminal) of memory and a walltime of 8 hours.

When done with the interactive session

exit

When done with Biowulf

exit

Options for sinteractive can be found at https://hpc.nih.gov/docs/userguide.html#int. The options include

  • --mem: to specify the amount of memory
  • --gres: to request generic resources such as temporary storage space (lscratch)
  • --constraint: to restrict the job to a specific set of nodes (for example, the CCR-funded nodes)

"NCI-CCR has funded 153 nodes (4548 physical cores, 9096 cpus with hyperthreading) in the Biowulf cluster, and CCR users have priority access to these nodes." -- Biowulf NCI CCR partition

To request the CCR-funded nodes for an interactive session

sinteractive --constraint=ccr

Note

Each node in Biowulf has space that can be used to store temporary data (lscratch). This space can be used by applications that write temporary files. To request lscratch space, include the --gres option with sinteractive.

Requesting an interactive session with

  • 15 GB of temporary storage space
  • 6 GB of memory
sinteractive --gres=lscratch:15 --mem=6gb
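
Once the session starts, the allocated lscratch space is available under a job-specific directory. Following the convention used in the batch script later in this document, that path is /lscratch/$SLURM_JOB_ID:

cd /lscratch/$SLURM_JOB_ID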

Note

Remember to use module load to make a software package available before using it.
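
A few related module commands are also handy for checking the environment (the module name shown is just an example):

module avail sratoolkit   # search for available modules matching a name
module list               # show modules currently loaded in this session
module unload sratoolkit  # unload a module when it is no longer needed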

Download sequence data from the Sequence Read Archive using fastq-dump from the sratoolkit.

module load sratoolkit

The options and argument for the fastq-dump command below are

  • -X: the maximum number of sequences (spots) to download; this example downloads the first 200000 sequences.
  • --skip-technical: do not download any technical sequences.
  • --split-files: for paired-end sequencing; downloads the forward and reverse reads into separate files.
  • Argument: the SRA sequence accession (i.e., SRR23341296, which is derived from an RNA sequencing study)
fastq-dump -X 200000 --skip-technical --split-files SRR23341296

Load seqkit to get some statistics about the downloaded sequences using its stats command.

module load seqkit
seqkit stats SRR23341296*.fastq
file                 format  type  num_seqs     sum_len  min_len  avg_len  max_len
SRR23341296_1.fastq  FASTQ   DNA    200,000  30,000,000      150      150      150
SRR23341296_2.fastq  FASTQ   DNA    200,000  30,000,000      150      150      150

Use HISAT2, a splice-aware aligner, to map SRR23341296_1.fastq and SRR23341296_2.fastq to the human genome. Because SRR23341296_1.fastq and SRR23341296_2.fastq are derived from RNA sequencing, they require a splice-aware aligner.

module load hisat

The options and arguments for the hisat2 command below are

  • -x: to specify the path of the reference genome (ie. $HISAT_INDEXES/grch38/genome)
  • -1: "Comma-separated list of files containing mate 1s (filename usually includes _1), e.g. -1 flyA_1.fq,flyB_1.fq. Sequences specified with this option must correspond file-for-file and read-for-read with those specified in <m2>. Reads may be a mix of different lengths. If - is specified, hisat2 will read the mate 1s from the “standard in” or “stdin” filehandle." -- HISATS2 manual
  • -2: "Comma-separated list of files containing mate 2s (filename usually includes _2), e.g. -2 flyA_2.fq,flyB_2.fq. Sequences specified with this option must correspond file-for-file and read-for-read with those specified in <m1>. Reads may be a mix of different lengths. If - is specified, hisat2 will read the mate 2s from the “standard in” or “stdin” filehandle." -- HISATS2 manual
  • -S: specify name of the alignment output
  • Arguments:
    • Files to be aligned (SRR23341296_1.fastq and SRR23341296_2.fastq)
    • Name of the alignment output (SRR23341296.sam)
hisat2 -x $HISAT_INDEXES/grch38/genome -1 SRR23341296_1.fastq -2 SRR23341296_2.fastq -S SRR23341296.sam
200000 reads; of these:
  200000 (100.00%) were paired; of these:
    13538 (6.77%) aligned concordantly 0 times
    178979 (89.49%) aligned concordantly exactly 1 time
    7483 (3.74%) aligned concordantly >1 times
    ----
    13538 pairs aligned concordantly 0 times; of these:
      763 (5.64%) aligned discordantly 1 time
    ----
    12775 pairs aligned 0 times concordantly or discordantly; of these:
      25550 mates make up the pairs; of these:
        17752 (69.48%) aligned 0 times
        6942 (27.17%) aligned exactly 1 time
        856 (3.35%) aligned >1 times
95.56% overall alignment rate

Job information, such as resources allocated and used, can be found in the Biowulf user dashboard. See Figure 3 for an example.

Figure 3: Users can find information regarding resources allocated and used by a job submitted to the Biowulf batch system via the user dashboard. The important statistics include memory allocated and used (top panel) as well as CPUs allocated and used (middle panel). This example was derived from a hisat2 RNA sequencing alignment run in an interactive session with 4 GB of memory requested (sinteractive --gres=lscratch:15 --mem=4gb). The memory usage reached the allocation limit, and this alignment did not run successfully.

Submitting jobs to the Biowulf batch system

Users can test and run commands one by one in an interactive session. However, this is not practical for longer workflows and makes it hard for users to document analysis steps.

Shell script

A script allows users to string together analysis steps and keep them in one document. It can be reused with or without modification to analyze similar datasets.

The script (SRR23341296_statistics.sh) below uses seqkit stat to generate statistics for SRR23341296_1.fastq and SRR23341296_2.fastq. Important features of shell scripts are listed below.

  • They have the extension .sh
  • They start with #!/bin/bash
    • "#!" is known as the shebang (sha-bang)
    • Following #! is the path to the command interpreter (i.e., /bin/bash)
  • Lines that begin with #SBATCH are not run as part of the script; these are SLURM directives, which provide critical information regarding the job submission, including
    • Name of the job (--job-name)
    • What types of job status notifications to send (--mail-type)
    • Where to send job status notifications (--mail-user)
    • Memory to allocate (--mem)
    • Time to allocate (--time)
    • Name of the job output log (--output)
  • To add a comment, start a line with #. Comments are not run as part of the script

Other options for SLURM directives can be found on the Biowulf website

#!/bin/bash
#SBATCH --job-name=SRR23341296_statistics
#SBATCH --mail-type=ALL
#SBATCH --mail-user=wuz8@nih.gov
#SBATCH --mem=2gb
#SBATCH --time=00:05:00
#SBATCH --output=SRR23341296_statistics_log

# Step 1: load modules

module load seqkit

# Step 2: get statistics for SRR23341296_1.fastq and SRR23341296_2.fastq
## A "for" loop is used to loop through fastq files with the SRR23341296 pattern
## The statistics for each fastq file will be written to SRR23341296_statistics.txt (note >> writes and appends to a file)

for file in SRR23341296*.fastq;
do
seqkit stat $file >> SRR23341296_statistics.txt;
done

To submit the script

sbatch SRR23341296_statistics.sh

The job ID assigned is 2763496

To check on the job status, use the jobhist command followed by the job ID. Figure 4 shows the jobhist output.

jobhist 2763496

Figure 4: The jobhist command retrieves the status as well as the resources allocated for a job. Note that the status of the job is PENDING, as indicated by the "State" column. Biowulf will update the job status to RUNNING or COMPLETED accordingly.
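
While a job is queued or running, its status can also be checked with SLURM's squeue command; the -u option limits the listing to a given user:

squeue -u $USER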

To cancel a job

scancel job_id

Once the job has completed, view the statistics written by the script:

cat SRR23341296_statistics.txt
file                 format  type  num_seqs     sum_len  min_len  avg_len  max_len
SRR23341296_1.fastq  FASTQ   DNA    200,000  30,000,000      150      150      150
file                 format  type  num_seqs     sum_len  min_len  avg_len  max_len
SRR23341296_2.fastq  FASTQ   DNA    200,000  30,000,000      150      150      150

The shell script shown below (hbr_uhr.sh) analyzes RNA sequencing data derived from the Human Brain Reference (HBR) and Universal Human Reference (UHR) study. It aligns sequences to the genome, calculates gene expression counts from the alignment results, and identifies differentially expressed genes; it is an example of how a script can be used for multi-step analyses. In the script below, two additional sbatch directives were included.

  • --cpus-per-task: request a specified number of CPUs for the job (6 in this case)
  • --gres=lscratch:5: request 5 GB of temporary storage space. The path to the temporary storage space was exported as the variable TMPDIR using export TMPDIR=/lscratch/$SLURM_JOB_ID, where $SLURM_JOB_ID is the variable storing the job's ID.
#!/bin/bash
#SBATCH --job-name=hbr_uhr
#SBATCH --mail-type=ALL
#SBATCH --mail-user=wuz8@nih.gov
#SBATCH --mem=4gb
#SBATCH --cpus-per-task=6
#SBATCH --time=00:10:00
#SBATCH --output=hbr_uhr_log
#SBATCH --gres=lscratch:5

export TMPDIR=/lscratch/$SLURM_JOB_ID

## This script will perform differential expression analysis on the HBR/UHR dataset

## Step 1: load modules
module load hisat
module load samtools
module load subread
module load R

## Step 2: create folder for output

mkdir hbr_uhr_output

## Step 3: align sequencing reads to genome using hisat2

cat hbr_uhr/reads/ids.txt | parallel -j 6  "hisat2 -x hbr_uhr/refs/22 -1 hbr_uhr/reads/{}_R1.fq -2 hbr_uhr/reads/{}_R2.fq -S hbr_uhr_output/{}.sam"

## Step 4: convert the alignment output SAM files to sorted, indexed BAM files

cat hbr_uhr/reads/ids.txt | parallel -j 6 "samtools sort -o hbr_uhr_output/{}.bam hbr_uhr_output/{}.sam"
cat hbr_uhr/reads/ids.txt | parallel -j 6 "samtools index -b hbr_uhr_output/{}.bam hbr_uhr_output/{}.bam.bai"

## Step 5: obtain expression counts table from alignment results

featureCounts -p -a hbr_uhr/refs/22.gtf -g gene_name -o hbr_uhr_output/hbr_uhr_counts.txt hbr_uhr_output/HBR_1.bam hbr_uhr_output/HBR_2.bam hbr_uhr_output/HBR_3.bam hbr_uhr_output/UHR_1.bam hbr_uhr_output/UHR_2.bam hbr_uhr_output/UHR_3.bam

## Step 6: wrangle the expression counts results to pass onto differential expression analysis

sed '1d' hbr_uhr_output/hbr_uhr_counts.txt | cut -f1,7-12 | awk '{gsub(/hbr_uhr_output\//,""); print}' | tr '\t' ',' > hbr_uhr_output/counts.csv

## Step 7: perform differential expression analysis

cp hbr_uhr/design.csv ./hbr_uhr_output/design.csv

cd hbr_uhr_output

Rscript ../code/deseq2.r

To submit the script

sbatch hbr_uhr.sh
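
While the job runs, output that would normally appear in the terminal is written (with some buffering) to the file named in the --output directive (hbr_uhr_log above), so progress can be followed with, for example:

tail -f hbr_uhr_log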

Swarm

Swarm is a way to submit multiple commands to the Biowulf batch system, where each command is run as an independent job, allowing for parallelization. Below is an example of a swarm script (srp045416_download.swarm) that uses fastq-dump from the sratoolkit to download next generation sequencing data from the Sequence Read Archive (SRA).

  • Swarm scripts have the extension .swarm
  • Lines that start with #SWARM are not run as part of the script; these are directives that tell the Biowulf batch system what resources (i.e., memory, time, temporary storage, modules) are needed
#SWARM --job-name SRP045416
#SWARM --sbatch "--mail-type=ALL --mail-user=wuz8@nih.gov"
#SWARM --gres=lscratch:15 
#SWARM --module sratoolkit 

fastq-dump --split-files -X 10000 SRR1553606 -O SRP045416_fastq
fastq-dump --split-files -X 10000 SRR1553416 -O SRP045416_fastq
fastq-dump --split-files -X 10000 SRR1553417 -O SRP045416_fastq
fastq-dump --split-files -X 10000 SRR1553418 -O SRP045416_fastq
fastq-dump --split-files -X 10000 SRR1553419 -O SRP045416_fastq

Before submitting the swarm script, make a new directory called SRP045416_fastq to store the FASTQ files.

mkdir SRP045416_fastq

To submit a swarm script, use the swarm command, where -f specifies the name of the swarm file.

swarm -f srp045416_download.swarm

The ID assigned to the swarm job is

2755884

To check on job status (see Figure 5 for output).

jobhist 2755884

Figure 5. jobhist output for the swarm script. Note that this screen capture was taken after the job completed, as indicated by COMPLETED in the "State" column. This swarm script created five sub-jobs, labeled "2755884_0" through "2755884_4" (see the Jobid column). Each sub-job was allocated 2 GB of memory and 2 CPUs.
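
These per-command defaults can be raised when submitting the swarm: the -g option requests memory in GB per command and -t requests threads (CPUs) per command. A brief sketch (the values below are illustrative only):

swarm -f srp045416_download.swarm -g 4 -t 2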

Summary

This session has provided the audience with the basics of utilizing Biowulf's compute resources. The audience should now be able to work interactively on Biowulf's compute nodes or submit batch jobs. The information presented here only scratches the surface; below are resources for learning more and obtaining help.

Biowulf training on YouTube

Interactive sessions, Swarm, Batch jobs

Biowulf user guide

Biowulf monthly Zoom consults

Biowulf training and Zoom consults

Biowulf batch job coding club: scripts and data

To obtain the scripts and data used for this coding club, sign onto Biowulf and change into the data directory. Then

wget https://bioinformatics.ccr.cancer.gov/docs/btep-coding-club/batch_job_coding_club.zip
mkdir batch_job_coding_club
unzip batch_job_coding_club.zip -d batch_job_coding_club
cd batch_job_coding_club

Everything should be in the /data/username/batch_job_coding_club directory, where username is the user's Biowulf account ID.