
Lesson 5: Interactive sessions, modules, and bioinformatics applications on Biowulf

Quick review

In the previous lesson, we learned to move, rename, and remove files as well as directories in Unix. Commands that we learned include

  • mv (to move or rename files or directories)
  • tree (to generate a directory tree)
  • rm (to remove files or directories)
  • rmdir (to remove empty directories)
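These review commands can be exercised safely in a throwaway directory (the file and directory names below are made up for illustration):

```shell
# Create a throwaway directory with a subdirectory and a file.
mkdir -p review_demo/subdir
touch review_demo/notes.txt

# mv renames (or moves) files and directories.
mv review_demo/notes.txt review_demo/renamed.txt

# rm removes files; rmdir removes only *empty* directories.
rm review_demo/renamed.txt
rmdir review_demo/subdir
rmdir review_demo
```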

Lesson objectives

After this lesson, we should be able to

  • Request an interactive session on Biowulf
  • Know how to find out what applications are available on Biowulf
  • Know how to download high throughput sequencing data from NCBI SRA
  • Be able to assess quality of high throughput sequencing data

Unix commands that we will learn in this lesson

  • sinteractive (to request an interactive session on Biowulf)
  • module (to view, load, or unload applications that are installed on Biowulf)
  • fastq-dump (to download FASTQ files from NCBI SRA)
  • head (to view beginning of a file; defaults to the first 10 lines)
  • fastqc (to assess sequencing data quality)

Requesting an interactive session

Recall that we are not supposed to use the login nodes to perform any computationally intensive tasks on Biowulf. Instead, we should either submit a job with sufficient resources requested (if staying on the login node) or request an interactive session if we are going to do some testing and development.

The login node is meant for the following (Source: Biowulf accounts and login node)

  • Submitting jobs (main purpose)
  • Editing/compiling code
  • File management
  • File transfer
  • Brief testing of code or debugging (under 20 minutes)

Today, we are going to learn to request an interactive session, which is suitable for

  • testing/debugging CPU-intensive code
  • pre-/post-processing of data
  • use of graphical applications

To start an interactive node, type sinteractive at the prompt and press Enter/Return on your keyboard.

sinteractive

You will see a message similar to the one shown below as the resource request is processed and allocated. We only need to use the sinteractive command once per session; if we try to start an interactive node on top of another interactive node, we will get a message asking why we want to start another node. Note that our prompt switches from username@biowulf to username@cn#### (where #### is a number) to denote that we are now on a compute node rather than the login node. In this example, I was connected to cn4269. Note the job ID of 55405280, which we will come back to in a bit. Ignore the X11 errors that show up.

[wuz8@biowulf ~]$ sinteractive
salloc: Pending job allocation 55405280
salloc: job 55405280 queued and waiting for resources
salloc: job 55405280 has been allocated resources
salloc: Granted job allocation 55405280
salloc: Waiting for resource configuration
salloc: Nodes cn4269 are ready for job
srun: error: x11: no local DISPLAY defined, skipping
error: unable to open file /tmp/slurm-spank-x11.55405280.0
slurmstepd: error: x11: unable to read DISPLAY value
[wuz8@cn4269 ~]$ 

The default sinteractive allocation is 1 core (2 CPUs), 0.768 GB of memory per CPU, and a walltime of 8 hours. We can use the jobhist command followed by $SLURM_JOBID (recall the job ID of 55405280 above), which gives us information on the resources we asked for as well as the amount of time we have spent on the job.

jobhist $SLURM_JOBID

Note that while the MemReq shows 2 GB of RAM was requested, it is actually 1.5 GB of RAM (0.768 GB x 2 CPU). Biowulf just rounded to the nearest integer.

SLURM_JOBID is known as an environmental variable in the Unix world (see below for the definition). We can set environmental variables for many things, including long directory paths that we would not want to repeatedly type. To reference an environmental variable, we prefix it with "$".

"Environment variables or ENVs basically define the behavior of the environment. They can affect the processes ongoing or the programs that are executed in the environment." -- https://www.geeksforgeeks.org/environment-variables-in-linux-unix/.
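For example, we can define our own environmental variable to hold a long directory path (the path below is hypothetical) and then reference it with "$":

```shell
# Store a long path in an environmental variable (hypothetical path).
export PROJECT_DIR="/data/$USER/ebola_project"

# Reference it by prefixing "$"; quoting protects against spaces.
echo "$PROJECT_DIR"
# We could then do, for example: cd "$PROJECT_DIR"
```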

NCI CCR partition

Note that when we ran jobhist $SLURM_JOBID above, a table with information regarding the interactive session appears. One of the columns is labeled partition and it tells us that we were taken to the interactive partition upon requesting an interactive session.

"Partitions define limitations that restrict the resources that can be requested for a job submitted to that partition. The limitations affect the maximum run time, the amount of memory, and the number of available CPU cores (which are called CPUs in Slurm)." -- https://wiki.hpcuser.uni-oldenburg.de/index.php?title=Partitions. "Jobs should be submitted to the partition that best matches the required resources." -- https://wiki.hpcuser.uni-oldenburg.de/index.php?title=Partitions.

"NCI-CCR has funded 153 nodes (4548 physical cores, 9096 cpus with hyperthreading) in the Biowulf cluster, and CCR users have priority access to these nodes. This priority status will last until February 20, 2021 (FY2017 funded nodes), Apr 15, 2022 (FY2018 funded nodes) and May 18, 2023 (FY2019 funded nodes)." -- Biowulf NCI CCR partition

To request an interactive session in the CCR partition use

sinteractive --constraint=ccr

In the above, --constraint is an option to sinteractive that restricts the job to nodes with a given feature (here, the CCR-funded nodes). Other useful options can be found in the Biowulf user guide.

To learn more about partitions on Biowulf, see https://hpc.nih.gov/docs/userguide.html. Use freen to see the available and free resources for the different Biowulf partitions. To check the limits on Biowulf partitions, use the batchlim command.

To terminate an interactive session, type exit at the prompt.

Requesting lscratch space

Remember that each node in Biowulf has some amount of local space that can be used to store temporary data (lscratch). This space is useful for applications that write many temporary files, such as the sratoolkit. To request lscratch space, include the --gres option with sinteractive. The option --gres stands for generic resource.

After terminating our current interactive session with exit, we can request another interactive session, this time with lscratch space.

exit

Successful exit of an interactive session produces the message below where interactive_session_job_id is the job ID of the interactive session.

exit
salloc: Relinquishing job allocation interactive_session_job_id

In the example below, we set --gres (i.e., generic resource) to lscratch, followed by ":" and then the amount of space we need. Here we ask for 15 GB, so the construct is lscratch:15. GB is the default unit when requesting lscratch space.

sinteractive --gres=lscratch:15
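Inside the job, the allocated space appears under /lscratch/$SLURM_JOB_ID, where SLURM_JOB_ID is set by Slurm. A common pattern, sketched here with a fallback for when no job is running, is to point the temporary-file variable TMPDIR at it:

```shell
# If Slurm has set a job ID, use the job's lscratch directory;
# otherwise fall back to /tmp. This is a sketch and assumes
# lscratch space was actually requested with --gres.
if [ -n "$SLURM_JOB_ID" ]; then
    export TMPDIR="/lscratch/$SLURM_JOB_ID"
else
    export TMPDIR="/tmp"
fi
echo "$TMPDIR"
```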

An application that requires lscratch is the sratoolkit, which can be used to download high throughput sequencing data from NCBI SRA.

Modules

Biowulf staff have installed many applications, including those used in genomic data analysis. In general, to view the applications that are available on Biowulf, we can use the module command with its avail subcommand. This prints a list of the applications installed on Biowulf; we can use the up and down arrows to navigate the list and hit "q" to exit it.

module avail

To list only the default version of each application, include the -d option in module avail.

module -d avail

To check if a specific application is available, you can append the name of the module after module avail. For instance, we do that with the genomic sequence aligner STAR below.

module avail star

We can use the whatis subcommand to see information regarding a specific tool and also to confirm if Biowulf has it installed. For instance, we can check for fastqc, which is an application used to assess quality of high throughput sequencing data. The output provides a description of what the tool does and the default version if we load the tool. The whatis subcommand is case sensitive.

module whatis fastqc

If we type module whatis fast and then hit the Tab key (i.e., to tab complete), we can see the applications that begin with fast. Here, we have several different versions of fastqc available.

To load an application we can use module load. Let's load the sratoolkit and fastqc. By default, the default version of an application is loaded (usually the latest).

module load sratoolkit
Running on cn4303  ... 
[+] Loading sratoolkit 3.0.2  ...
module load fastqc
[+] Loading fastqc  0.11.9 

Note that we can change the version of an application (provided that Biowulf has it installed). For instance, to use version 0.11.2 of fastqc rather than 0.11.9, all we have to do is reload the application, appending a "/" followed by the version number.

module load fastqc/0.11.2
[-] Unloading fastqc  0.11.9 
[+] Loading fastqc  0.11.2 

The following have been reloaded with a version change:
  1) fastqc/0.11.9 => fastqc/0.11.2

But, we will be using fastqc in a bit so let's reload with the latest version, which is 0.11.9.

module load fastqc

Modules that you have loaded are unloaded when you exit Biowulf, so you will need to reload them at your next sign-in.

Exploring bioinformatics tools

Here, we will download some high throughput genomic sequences from NCBI SRA. The data that we will download were derived from sequencing of the Zaire Ebola virus. See the NCBI SRA page for this study for more details.

We will use a command called fastq-dump within the sratoolkit to grab the first 10000 reads for this sequencing run. In the syntax for fastq-dump

  • --split-files will generate two files that contain the forward and reverse reads from paired-end sequencing.
  • -X allows us to input how many reads we want to obtain (here, we just want the first 10000 reads to save time and computation resources for this class)
  • Finally, we enter the SRA accession number of the sequencing data that we want to download (SRR1553606 in this example).
  • We will download this into our data directory, so change into it if you are not there already.
cd /data/username
fastq-dump --split-files -X 10000 SRR1553606

Listing the contents of our data directory, we should see the two FASTQ files that were downloaded.

ls 
SRR1553606_1.fastq
SRR1553606_2.fastq

We can use head -n 4 to view the first 4 lines of the SRR1553606_1.fastq file to see what a FASTQ file looks like. We will talk a bit more about the head command in Lesson 7.

head -n 4 SRR1553606_1.fastq

Essentially, FASTQ files contain our high throughput sequencing data. Each sequence read starts with a metadata header line that begins with "@", followed by the actual sequence, then a "+", and finally a line of quality scores that tell us the error likelihood for each base in the read. This pattern of four lines repeats for all of the sequencing reads in our FASTQ file (in this case, we should have 10000 reads because that is how many we asked fastq-dump to download).
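Because each read occupies exactly four lines, we can sanity-check a FASTQ file with standard Unix tools. The two helper functions below are a sketch (the function names are ours, and the read count assumes unwrapped sequence lines, which holds for fastq-dump output):

```shell
# Count reads in a FASTQ file: total line count divided by 4.
count_reads() {
    echo $(( $(wc -l < "$1") / 4 ))
}

# Convert one quality character to its Phred score, assuming the
# standard Phred+33 (Sanger/Illumina 1.8+) encoding: ASCII code - 33.
# printf '%d' "'c" prints the ASCII code of the character c.
phred_score() {
    echo $(( $(printf '%d' "'$1") - 33 ))
}
```

For our files, count_reads SRR1553606_1.fastq should report 10000, and phred_score I reports 40, i.e., roughly a 1-in-10,000 chance that the base was called incorrectly.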

One of the things we need to do after receiving our sequencing data is to assess the quality of the data. We can do this using fastqc.

fastqc --help

From the fastqc help documentation, we see that, in general, to run fastqc we just need to provide it the names of the FASTQ files.

fastqc seqfile1 seqfile2 .. seqfileN

Let's wrap up this lesson by running fastqc for SRR1553606_1.fastq and SRR1553606_2.fastq

fastqc SRR1553606_1.fastq SRR1553606_2.fastq

As it runs, we can see the analysis progress.

Started analysis of SRR1553606_1.fastq
Approx 10% complete for SRR1553606_1.fastq
Approx 20% complete for SRR1553606_1.fastq
Approx 30% complete for SRR1553606_1.fastq
Approx 40% complete for SRR1553606_1.fastq
Approx 50% complete for SRR1553606_1.fastq
Approx 60% complete for SRR1553606_1.fastq
Approx 70% complete for SRR1553606_1.fastq
Approx 80% complete for SRR1553606_1.fastq
Approx 90% complete for SRR1553606_1.fastq
Approx 100% complete for SRR1553606_1.fastq
Analysis complete for SRR1553606_1.fastq
Started analysis of SRR1553606_2.fastq
Approx 10% complete for SRR1553606_2.fastq
Approx 20% complete for SRR1553606_2.fastq
Approx 30% complete for SRR1553606_2.fastq
Approx 40% complete for SRR1553606_2.fastq
Approx 50% complete for SRR1553606_2.fastq
Approx 60% complete for SRR1553606_2.fastq
Approx 70% complete for SRR1553606_2.fastq
Approx 80% complete for SRR1553606_2.fastq
Approx 90% complete for SRR1553606_2.fastq
Approx 100% complete for SRR1553606_2.fastq
Analysis complete for SRR1553606_2.fastq

The quality assessment reports for SRR1553606_1.fastq and SRR1553606_2.fastq are written to SRR1553606_1_fastqc.html and SRR1553606_2_fastqc.html, respectively, as is evident when we list the contents of our data folder after fastqc has completed. To view these reports, we will need to transfer them to our local desktop (discussed in Lesson 6).

In the ls command below, we use -1 to list one item per row.

ls -1
SRR1553606_1.fastq
SRR1553606_1_fastqc.html
SRR1553606_1_fastqc.zip
SRR1553606_2.fastq
SRR1553606_2_fastqc.html
SRR1553606_2_fastqc.zip
SRR1553606_fastqc_log
SRR1553606_fastqc.sh