Lesson 4: Working with bioinformatics software on Biowulf

Learning objectives

After this lesson, participants will

Know how to request an interactive session on Biowulf
Know how to find software that is installed on Biowulf
Be able to sign onto Helix and download sequencing data from SRA
Be able to load software that is installed on Biowulf and become familiar with running some bioinformatics applications from the Unix command line

Connecting to Biowulf

To get started, open the Command Prompt (Windows) or the Terminal (Mac) and connect to Biowulf. Remember that you need to be connected to the NIH network, either by being on campus or through VPN. Recall from lesson 1 that you use the ssh command below to connect to Biowulf, where username is the student account ID that was assigned to you (see student assignments). When prompted for your password, nothing will appear on the screen as you type; keep typing and press Enter when done.

ssh username@biowulf.nih.gov

Requesting an interactive session

Recall

The Biowulf login node is meant for submitting jobs to the batch system and should not be used for any computationally intensive tasks. To test computationally intensive tasks without submitting a job, request an interactive session to work on one of Biowulf's compute nodes.

To request an interactive session do the following.

sinteractive

salloc: Pending job allocation 17385251
salloc: job 17385251 queued and waiting for resources
salloc: job 17385251 has been allocated resources
salloc: Granted job allocation 17385251
salloc: Waiting for resource configuration
salloc: Nodes cn4298 are ready for job
srun: error: x11: no local DISPLAY defined, skipping
error: unable to open file /tmp/slurm-spank-x11.17385251.0
slurmstepd: error: x11: unable to read DISPLAY value

Note

The number 17385251 in the sinteractive output is the job ID. This is important because users can reference it to view job details and to cancel jobs submitted to the batch system.
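The job ID does not have to be copied by hand. As a minimal sketch, the hypothetical helper job_id_from_log below pulls it out of a "Granted job allocation" line like the one in the output above. Inside a running session, Slurm also sets the SLURM_JOB_ID environment variable, which is printed here with a placeholder for when it is not set.

```shell
# Hypothetical helper: extract the job ID from salloc output read on stdin.
job_id_from_log() {
    sed -n 's/^salloc: Granted job allocation \([0-9][0-9]*\)$/\1/p'
}

# Example using a line copied from the output above.
echo "salloc: Granted job allocation 17385251" | job_id_from_log

# Inside an interactive session Slurm sets SLURM_JOB_ID; elsewhere print a placeholder.
echo "Current job ID: ${SLURM_JOB_ID:-not inside a Slurm job}"

# A job could then be cancelled with the Slurm command scancel, for example:
#   scancel 17385251
```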

Important

The prompt changes from username@biowulf to username@cn#### once the interactive session has started, where "cn####" is the name of one of the Biowulf compute nodes.
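The same check can be made from a script instead of reading the prompt. The sketch below assumes only the naming pattern described above (compute nodes start with "cn" followed by digits, the login node is called biowulf); node_type is a hypothetical helper name.

```shell
# Hypothetical helper: classify a host name using the naming pattern described above.
node_type() {
    case "$1" in
        cn[0-9]*) echo "compute node" ;;
        biowulf*) echo "login node" ;;
        *)        echo "other host" ;;
    esac
}

node_type cn4298          # prints "compute node"
node_type biowulf         # prints "login node"

# Classify the machine this shell is running on.
node_type "$(uname -n)"
```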

Above, sinteractive was run without options (i.e. with the defaults). To see which resources those defaults provide, run jobhist followed by the job ID.

jobhist 17385251

JobId              : 17385251
User               : wuz8
Submitted          : 20240118 17:33:17
Started            : 20240118 17:33:25
Ended              : 

Jobid        Partition       State  Nodes  CPUs      Walltime       Runtime         MemReq  MemUsed  Nodelist
17385251    interactive     RUNNING      1     2       8:00:00          9:05            2GB      3MB  cn4298

Note

The default sinteractive allocation is 1 core (2 CPUs), 0.768 GB of memory per CPU, and a walltime of 8 hours. Although MemReq shows that 2 GB of RAM was requested, the actual request is about 1.5 GB (0.768 GB x 2 CPUs = 1.536 GB); Biowulf rounds the displayed value to the nearest integer.
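The arithmetic in the note can be reproduced on the command line. This is only an illustration of the calculation using the two default values quoted above; the variable names are made up for this example.

```shell
# Default sinteractive request as stated above: 2 CPUs at 0.768 GB per CPU.
cpus=2
mem_per_cpu_gb=0.768

# awk handles the floating-point multiplication that plain shell arithmetic cannot.
total_gb=$(awk -v c="$cpus" -v m="$mem_per_cpu_gb" 'BEGIN { printf "%.3f", c * m }')

echo "Requested memory: ${total_gb} GB"   # prints "Requested memory: 1.536 GB"
```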

Partitions

"Partitions define limitations that restrict the resources that can be requested for a job submitted to that partition. The limitations affect the maximum run time, the amount of memory, and the number of available CPU cores (which are called CPUs in Slurm)." "Jobs should be submitted to the partition that best matches the required resources." -- https://wiki.hpcuser.uni-oldenburg.de/index.php?title=Partitions

NCI-CCR, NHLBI and NINDS, and NIMH have buy-in nodes (partitions). To request an interactive session for the NCI-CCR partition, use sinteractive --constraint=ccr. See the following links from Biowulf regarding the buy-in nodes.

NCI-CCR: https://hpc.nih.gov/docs/ccr.html
NHLBI and NINDS: https://hpc.nih.gov/docs/forgo.html
NIMH: https://hpc.nih.gov/docs/nimh.html

Software on Biowulf

Biowulf staff have installed many applications, including those used in genomic data analysis. To view the applications that are available on Biowulf, use the module command with its avail subcommand. This prints a list of applications that can be navigated with the up and down arrow keys; press "q" to exit the list.

module avail

To list only the default version of each application, include the -d option in module avail.

module -d avail

To check if a specific application is available, append the name of the module after module avail. For instance, we check for the sequence aligner STAR below.

module avail star
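For a case-insensitive or pattern-based search, the listing can also be filtered with grep. The sketch below rests on an assumption: on many Lmod installations the module avail listing is written to standard error, so it is redirected with 2>&1 before the pipe. find_module is a hypothetical helper name, and the function prints a message instead of failing on machines where the module command does not exist.

```shell
# Hypothetical helper: search the module listing for a pattern, ignoring case.
find_module() {
    if type module >/dev/null 2>&1; then
        # Assumption: the listing goes to standard error, hence 2>&1.
        module avail 2>&1 | grep -i -- "$1" || echo "no module matching '$1'"
    else
        echo "module command not available on this machine"
    fi
}

find_module star
```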

We can use the whatis subcommand to see information regarding a specific tool and also to confirm if Biowulf has it installed. For instance, we can check for fastqc, which is an application used to assess quality of high throughput sequencing data. The output provides a description of what the tool does and the default version if we load the tool. The whatis subcommand is case sensitive.

module whatis fastqc

To load an application we can use module load. Let's load sratoolkit and fastqc. By default, the latest version of an application is loaded.

module load fastqc

[+] Loading fastqc  0.11.9

module load sratoolkit

The following error is obtained when loading sratoolkit. It is triggered because sratoolkit writes temporary files and therefore requires local temporary storage space.

Lmod has detected the following error: 

This module requires allocation of /lscratch. Please see

   https://hpc.nih.gov/docs/userguide.html#local

or contact staff@hpc.nih.gov for more information.

While processing the following module(s):
    Module fullname    Module Filename
    ---------------    ---------------
    sratoolkit/3.0.10  /usr/local/lmod/modulefiles/sratoolkit/3.0.10.lua

To resolve the above issue with loading sratoolkit, exit the interactive session.

exit

srun: error: cn4298: task 0: Exited with exit code 1
salloc: Relinquishing job allocation 17385251
salloc: Job allocation 17385251 has been revoked.

Request another interactive session with local temporary storage on the assigned Biowulf compute node. To do this, append the --gres option to sinteractive, where gres stands for generic resources. Set gres to lscratch (i.e. local temporary storage) and indicate the size in gigabytes. For instance, the command below asks for 15 gigabytes of local temporary storage.

sinteractive --gres=lscratch:15

Then run module load sratoolkit and module load fastqc.
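Once the new session has started, it can be useful to confirm that the local scratch space is really there. The sketch below assumes that the allocation appears as a directory named /lscratch/ followed by the job ID, and that Slurm has set SLURM_JOB_ID; check_lscratch is a hypothetical helper name. Outside such a session it simply reports that no allocation was found.

```shell
# Hypothetical helper: report whether a local scratch directory exists for this job.
# Assumption: the allocation is exposed as /lscratch/$SLURM_JOB_ID on the compute node.
check_lscratch() {
    scratch_dir="/lscratch/${SLURM_JOB_ID:-}"
    if [ -n "${SLURM_JOB_ID:-}" ] && [ -d "$scratch_dir" ]; then
        echo "Local scratch is available at $scratch_dir"
        df -h "$scratch_dir"    # show how much space is free
    else
        echo "No /lscratch allocation found; request one with --gres=lscratch:SIZE"
    fi
}

check_lscratch
```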

Exploring bioinformatics tools

Here, we will download some high throughput genomic sequences from NCBI SRA. The data that we will download were derived from sequencing of the Zaire Ebola virus. See the NCBI SRA page for this study for more details. For this part of the exercise, sign onto Helix.

Start a new Terminal (Mac) or Command Prompt (Windows).

ssh username@helix.nih.gov

We will use a command called fastq-dump from the sratoolkit to grab the first 10000 reads of this sequencing run. In the syntax for fastq-dump:

--split-files generates two files that contain the forward and reverse reads from paired-end sequencing.
-X sets how many reads to obtain (here, just the first 10000 reads, to save time and computational resources for this class).
Finally, we enter the SRA accession number of the sequencing data that we want to download (SRR1553606 in this example).

Create a directory called SRR1553606 in your data directory to store the sequencing data, then change into it.

mkdir /data/username/SRR1553606

cd /data/username/SRR1553606

Load sratoolkit, then run fastq-dump.

module load sratoolkit

fastq-dump --split-files -X 10000 SRR1553606

After the download has completed, there should be two FASTQ files.

ls

SRR1553606_1.fastq  SRR1553606_2.fastq
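A quick sanity check on the download is to count the reads in each file. The sketch below assumes that every FASTQ record occupies exactly four lines (header, sequence, separator, qualities), so the number of reads is the line count divided by four. count_reads is a hypothetical helper name, and the demo runs on a tiny made-up file so it works anywhere.

```shell
# Hypothetical helper: number of reads in a FASTQ file, assuming 4 lines per record.
count_reads() {
    awk 'END { print NR / 4 }' "$1"
}

# On the downloaded data each file should report 10000:
#   count_reads SRR1553606_1.fastq

# Self-contained demo on a made-up two-read file.
demo=$(mktemp)
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n' > "$demo"
count_reads "$demo"    # prints 2
rm -f "$demo"
```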

Go back to the Terminal or Command Prompt with the Biowulf interactive session and change into the /data/username/SRR1553606 folder.

The first task in analyzing high throughput sequencing data is to perform a quality check using tools such as FASTQC. Look at the help documents to learn how to run FASTQC.

fastqc --help

The command starts with fastqc, followed by its arguments, which are the files that the user wants to quality check.

fastqc seqfile1 seqfile2 .. seqfileN

fastqc SRR1553606_1.fastq SRR1553606_2.fastq

Listing the contents of the directory will reveal the FASTQC results in html and zip format. The html file can be viewed locally in a web browser, while the zip file, when expanded, contains text summaries and the individual quality metric images presented in the html file.

SRR1553606_1.fastq
SRR1553606_1_fastqc.html
SRR1553606_1_fastqc.zip
SRR1553606_2.fastq
SRR1553606_2_fastqc.html
SRR1553606_2_fastqc.zip
SRR1553606_fastqc_log
SRR1553606_fastqc.sh
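The text summaries can also be read without leaving the terminal. The sketch below assumes the usual FastQC layout, in which the zip file holds a summary.txt with tab-separated lines of the form status, module name, file name. tally_fastqc is a hypothetical helper name, and the demo input is made up rather than taken from the real report.

```shell
# Hypothetical helper: tally PASS/WARN/FAIL from a FastQC summary.txt read on stdin.
# Assumes tab-separated lines: STATUS<TAB>module name<TAB>file name.
tally_fastqc() {
    awk -F '\t' '{ n[$1]++ }
        END { printf "PASS=%d WARN=%d FAIL=%d\n", n["PASS"], n["WARN"], n["FAIL"] }'
}

# On Biowulf, after running fastqc (the path inside the zip is an assumption):
#   unzip -p SRR1553606_1_fastqc.zip SRR1553606_1_fastqc/summary.txt | tally_fastqc

# Self-contained demo with made-up lines in the same format.
printf 'PASS\tBasic Statistics\tdemo.fastq\nWARN\tPer base sequence content\tdemo.fastq\n' | tally_fastqc
# prints "PASS=1 WARN=1 FAIL=0"
```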

Seqkit is a package that enables users to find and work with sequences. The stats function can be used to obtain fastq file statistics.

module load seqkit

seqkit stats SRR1553606_1.fastq SRR1553606_2.fastq

file                format  type  num_seqs    sum_len  min_len  avg_len  max_len
SRR1553606_1.fastq  FASTQ   DNA     10,000  1,010,000      101      101      101
SRR1553606_2.fastq  FASTQ   DNA     10,000  1,010,000      101      101      101
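To see where these numbers come from, the length columns can be recomputed with awk. This is a rough cross-check rather than a replacement for seqkit: it assumes four lines per FASTQ record and only looks at the sequence line of each record. fastq_stats is a hypothetical helper name, and the demo uses a made-up file.

```shell
# Hypothetical helper: print num_seqs, sum_len, min_len, avg_len and max_len
# for a FASTQ file, assuming each record is exactly 4 lines long.
fastq_stats() {
    awk 'NR % 4 == 2 {
             len = length($0); n++; sum += len
             if (n == 1 || len < min) min = len
             if (len > max) max = len
         }
         END { if (n) printf "%d %d %d %.0f %d\n", n, sum, min, sum / n, max }' "$1"
}

# On the downloaded data:
#   fastq_stats SRR1553606_1.fastq

# Self-contained demo: two reads of length 4 and 6.
demo=$(mktemp)
printf '@r1\nACGT\n+\nIIII\n@r2\nACGTAC\n+\nIIIIII\n' > "$demo"
fastq_stats "$demo"    # prints "2 10 4 5 6"
rm -f "$demo"
```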