Lesson 4: Working with bioinformatics software on Biowulf
Learning objectives
After this lesson, participants will
- Know how to request an interactive session on Biowulf
- Know how to find software that is installed on Biowulf
- Be able to sign onto Helix and download sequencing data from SRA.
- Be able to load software that is installed on Biowulf and become familiar with running some bioinformatics applications from the Unix command line
Connecting to Biowulf
To get started, open the Command Prompt (Windows) or the Terminal (Mac) and connect to Biowulf. Remember that you need to be connected to the NIH network, either by being on campus or through VPN. Recall from lesson 1 that you use the ssh command below to connect to Biowulf, where username is the student account ID that was assigned to you (see student assignments). When prompted for your password, nothing will appear on the screen as you type, but keep typing.
ssh username@biowulf.nih.gov
Requesting an interactive session
Recall
The Biowulf login node is meant for job submission to the batch system and should not be used to perform any computation intensive tasks. For testing computation intensive tasks without submitting a job, request an interactive session to work on one of Biowulf's compute nodes.
To request an interactive session do the following.
sinteractive
salloc: Pending job allocation 17385251
salloc: job 17385251 queued and waiting for resources
salloc: job 17385251 has been allocated resources
salloc: Granted job allocation 17385251
salloc: Waiting for resource configuration
salloc: Nodes cn4298 are ready for job
srun: error: x11: no local DISPLAY defined, skipping
error: unable to open file /tmp/slurm-spank-x11.17385251.0
slurmstepd: error: x11: unable to read DISPLAY value
Note
The number 17385251 in the sinteractive output is the job ID. This is important because users can reference it to view job details and to cancel jobs when submitting to the batch system.
Important
The prompt changes to username@cn#### from username@biowulf when successfully connected to an interactive session, where "cn####" is the name of one of the Biowulf compute nodes.
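The prompt check described above can also be done in a script. The following is a minimal sketch, not an official Biowulf tool: is_compute_node is a hypothetical helper name, and it assumes compute node names start with "cn" followed by digits, as in the output shown earlier. SLURM_JOB_ID is the environment variable Slurm sets inside a job.

```shell
# Succeed if a hostname looks like a compute node (cn followed by at least one digit).
is_compute_node() {
  case "$1" in
    cn[0-9]*) return 0 ;;
    *)        return 1 ;;
  esac
}

# Short hostname of the current machine (everything before the first dot).
host=$(uname -n)
host=${host%%.*}

if is_compute_node "$host"; then
  echo "On compute node $host, job ${SLURM_JOB_ID:-unknown}"
else
  echo "Not on a compute node; run sinteractive before heavy work"
fi
```

A guard like this at the top of an analysis script is one way to avoid accidentally running heavy work on the login node.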
Above, sinteractive was run without options (i.e., with the defaults). Running jobhist with the job ID shows the resources that were allocated.
jobhist 17385251
JobId : 17385251
User : wuz8
Submitted : 20240118 17:33:17
Started : 20240118 17:33:25
Ended :
Jobid Partition State Nodes CPUs Walltime Runtime MemReq MemUsed Nodelist
17385251 interactive RUNNING 1 2 8:00:00 9:05 2GB 3MB cn4298
Note
The default sinteractive allocation is 1 core (2 CPUs), 0.768 GB of memory per CPU, and a walltime of 8 hours. Although MemReq shows that 2 GB of RAM was requested, the actual request is about 1.5 GB (0.768 GB x 2 CPUs). Biowulf just rounds the displayed value to the nearest integer.
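The arithmetic in the note above can be reproduced in the shell. This is only an illustration using the default values quoted in the note; the variable names are made up for the example.

```shell
cpus=2                 # CPUs in the default allocation
mem_per_cpu_gb=0.768   # default memory per CPU, in GB

# Total memory actually requested, to three decimal places.
total_gb=$(awk -v c="$cpus" -v m="$mem_per_cpu_gb" 'BEGIN { printf "%.3f", c * m }')

# The same value rounded to the nearest integer, as in the MemReq column.
rounded_gb=$(awk -v t="$total_gb" 'BEGIN { printf "%.0f", t }')

echo "Actual request: ${total_gb} GB (displayed as ${rounded_gb}GB)"
```

Changing cpus or mem_per_cpu_gb shows how the displayed value would differ from the actual request for other allocations.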
Partitions
"Partitions define limitations that restrict the resources that can be requested for a job submitted to that partition. The limitations affect the maximum run time, the amount of memory, and the number of available CPU cores (which are called CPUs in Slurm)." -- https://wiki.hpcuser.uni-oldenburg.de/index.php?title=Partitions. "Jobs should be submitted to the partition that best matches the required resources." -- https://wiki.hpcuser.uni-oldenburg.de/index.php?title=Partitions.
NCI-CCR, NHLBI and NINDS, and NIMH have buy-in nodes (partitions). To request an interactive session for the NCI-CCR partition, use sinteractive --constraint=ccr. See the following links from Biowulf regarding the buy-in nodes.
- NCI-CCR: https://hpc.nih.gov/docs/ccr.html
- NHLBI and NINDS: https://hpc.nih.gov/docs/forgo.html
- NIMH: https://hpc.nih.gov/docs/nimh.html
Software on Biowulf
Biowulf staff has installed many applications, including those used in genomic data analysis. In general, to view the applications that are available on Biowulf, we can use the module command, with its avail subcommand. This will essentially print out a list of applications that are on Biowulf and we can use the up and down arrows to navigate and view the list. We hit "q" to exit this list.
module avail
To list only the default version of each application, include the -d option in module avail.
module -d avail
To check if a specific application is available, append the name of the module to module avail. For instance, we do that for the STAR sequence aligner below.
module avail star
We can use the whatis subcommand to see information regarding a specific tool and also to confirm if Biowulf has it installed. For instance, we can check for fastqc, which is an application used to assess quality of high throughput sequencing data. The output provides a description of what the tool does and the default version if we load the tool. The whatis subcommand is case sensitive.
module whatis fastqc
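The availability check can be wrapped in a small function for use in scripts. This is a sketch under assumptions: module_available is a hypothetical helper name, it relies on the terse listing from module -t avail, and it reads the listing from standard error because that is where Lmod normally writes it. The module name is used as a plain pattern, so names containing special characters may need escaping.

```shell
# Succeed if at least one module matching the given name is listed.
# Returns 2 when no module command exists (for example, off the cluster).
module_available() {
  if ! type module >/dev/null 2>&1; then
    echo "module command not found" >&2
    return 2
  fi
  # Terse listing prints one name/version per line; match case-insensitively.
  module -t avail "$1" 2>&1 | grep -qi "^$1/"
}
```

For example, module_available fastqc && module load fastqc loads fastqc only if it is listed.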
To load an application, we can use module load. Let's load fastqc and sratoolkit. By default, the latest version of an application is loaded.
module load fastqc
[+] Loading fastqc 0.11.9
module load sratoolkit
The following error is obtained when loading sratoolkit. It is triggered because sratoolkit writes temporary files and requires local temporary storage space.
Lmod has detected the following error:
This module requires allocation of /lscratch. Please see
https://hpc.nih.gov/docs/userguide.html#local
or contact staff@hpc.nih.gov for more information.
While processing the following module(s):
Module fullname Module Filename
--------------- ---------------
sratoolkit/3.0.10 /usr/local/lmod/modulefiles/sratoolkit/3.0.10.lua
To resolve the above issue with loading sratoolkit, exit the interactive session.
exit
srun: error: cn4298: task 0: Exited with exit code 1
salloc: Relinquishing job allocation 17385251
salloc: Job allocation 17385251 has been revoked.
Request another interactive session with local temporary storage on the assigned Biowulf compute node. To do this, append the --gres option to sinteractive, where gres stands for generic resources. Set gres to lscratch (i.e., local temporary storage) and indicate the size in gigabytes. For instance, the command below asks for 15 gigabytes of local temporary storage.
sinteractive --gres=lscratch:15
Then run module load sratoolkit and module load fastqc again.
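Before loading modules that need local scratch, it can help to confirm that the allocation exists. The sketch below assumes the scratch directory for a job is /lscratch/ followed by the Slurm job ID; lscratch_dir is a hypothetical helper, and its optional arguments exist only so the function can be tried outside the cluster.

```shell
# Print the job's local scratch directory, or fail if none was allocated.
# $1: job ID (defaults to SLURM_JOB_ID); $2: scratch root (defaults to /lscratch).
lscratch_dir() {
  job_id="${1:-${SLURM_JOB_ID:-}}"
  root="${2:-/lscratch}"
  if [ -n "$job_id" ] && [ -d "$root/$job_id" ]; then
    echo "$root/$job_id"
  else
    return 1
  fi
}

if dir=$(lscratch_dir); then
  echo "Local scratch: $dir"
  module load sratoolkit fastqc
else
  echo "No local scratch found; request it with: sinteractive --gres=lscratch:15"
fi
```

Run inside a session started without --gres, this prints the reminder instead of triggering the Lmod error shown earlier.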
Exploring bioinformatics tools
Here, we will download some high throughput genomic sequences from NCBI SRA. The data that we will download were derived from sequencing of the Zaire Ebola virus. See the NCBI SRA page for this study for more details. For this part of the exercise, sign onto Helix.
Start a new Terminal (Mac) or Command Prompt (Windows).
ssh username@helix.nih.gov
We will use a command called fastq-dump within the sratoolkit to grab the first 10000 reads for this sequencing run. In the syntax for fastq-dump:
- --split-files will generate two files that contain the forward and reverse reads from paired-end sequencing.
- -X allows us to specify how many reads we want to obtain (here, we just want the first 10000 reads to save time and computation resources for this class).
- Finally, we enter the SRA accession number of the sequencing data that we want to download (SRR1553606 in this example).
Create a directory called SRR1553606 in /data/username to store the sequencing data, then change into it.
mkdir /data/username/SRR1553606
cd /data/username/SRR1553606
Load sratoolkit, then run fastq-dump.
module load sratoolkit
fastq-dump --split-files -X 10000 SRR1553606
After the download has completed, there should be two fastq files.
ls
SRR1553606_1.fastq SRR1553606_2.fastq
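A quick way to confirm that each file holds the expected 10000 reads is to count lines and divide by four. This is a sketch that assumes each FASTQ record occupies exactly four lines with no line wrapping; count_reads is a hypothetical helper name.

```shell
# Count reads in a FASTQ file, assuming four lines per record.
count_reads() {
  awk 'END { print NR / 4 }' "$1"
}

for fq in SRR1553606_1.fastq SRR1553606_2.fastq; do
  if [ -f "$fq" ]; then
    echo "$fq: $(count_reads "$fq") reads"
  else
    echo "$fq: not found"
  fi
done
```

If the download finished as described, both files should report the number passed to -X.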
Go back to the Terminal or Command Prompt with the Biowulf interactive session and change into the /data/username/SRR1553606 folder.
The first task in analyzing high throughput sequencing data is to perform a quality check using tools such as FASTQC. Look at the help documents to learn how to run FASTQC.
fastqc --help
The command construct starts with fastqc followed by the arguments, which are the files that the user wants to perform a quality check on.
fastqc seqfile1 seqfile2 .. seqfileN
fastqc SRR1553606_1.fastq SRR1553606_2.fastq
Listing the contents of the directory will reveal the FASTQC results in html and zip format. The html file can be viewed locally in a web browser, while the zip file, when expanded, contains text summaries and the individual quality metric images presented in the html file.
SRR1553606_1.fastq
SRR1553606_1_fastqc.html
SRR1553606_1_fastqc.zip
SRR1553606_2.fastq
SRR1553606_2_fastqc.html
SRR1553606_2_fastqc.zip
SRR1553606_fastqc_log
SRR1553606_fastqc.sh
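The text summaries inside the zip file can be read without a browser. The sketch below assumes the archive contains a tab-separated summary.txt whose first column is PASS, WARN, or FAIL and whose second column is the name of the quality check; flagged_modules is a hypothetical helper name.

```shell
# Read a FASTQC summary.txt on stdin and print every check that did not pass.
flagged_modules() {
  awk -F '\t' '$1 != "PASS" { print $1 ": " $2 }'
}

zip=SRR1553606_1_fastqc.zip
if [ -f "$zip" ]; then
  # unzip -p writes the named archive member to standard output.
  unzip -p "$zip" SRR1553606_1_fastqc/summary.txt | flagged_modules
fi
```

This gives a short list of checks worth inspecting in the html report.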
Seqkit is a package that enables users to find and work with sequences. The stats function can be used to obtain fastq file statistics.
module load seqkit
seqkit stats SRR1553606_1.fastq SRR1553606_2.fastq
file                format  type  num_seqs    sum_len  min_len  avg_len  max_len
SRR1553606_1.fastq  FASTQ   DNA     10,000  1,010,000      101      101      101
SRR1553606_2.fastq  FASTQ   DNA     10,000  1,010,000      101      101      101
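For paired-end data, the forward and reverse files should contain the same number of reads. The sketch below checks this from the tab-delimited output of seqkit stats -T; same_read_count is a hypothetical helper name, and it assumes num_seqs is the fourth column, as in the table above.

```shell
# Read seqkit stats -T output on stdin; succeed only if every file
# reports the same num_seqs (fourth column).
same_read_count() {
  awk -F '\t' '
    NR == 1 { next }               # skip the header row
    { gsub(/,/, "", $4) }          # tolerate thousands separators
    NR == 2 { first = $4 }
    $4 != first { mismatch = 1 }
    END { exit (mismatch || NR < 2) }
  '
}

if command -v seqkit >/dev/null 2>&1 && [ -f SRR1553606_1.fastq ] && [ -f SRR1553606_2.fastq ]; then
  if seqkit stats -T SRR1553606_1.fastq SRR1553606_2.fastq | same_read_count; then
    echo "Read counts match"
  else
    echo "Read counts differ"
  fi
fi
```

The function fails when the input has no data rows, so an empty or missing listing is not reported as a match.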