Lesson 4: Working on Biowulf
Lesson 3 Review
- Flags and command options
- Wildcards (*)
- Tab complete
- Accessing user history with the "up" and "down" arrows
- cat, head, and tail
- Working with file content (input, output, and append)
- Combining commands with the pipe (|)
- grep
- for loops
- File Permissions
Lesson Objectives
- Learn about the slurm system by working on Biowulf, including batch jobs, swarm jobs, and interactive sessions.
- Retrieve data from NCBI through a batch job.
- Learn how to troubleshoot failed jobs.
For this lesson, you will need to connect to Biowulf.
Working on Biowulf
Now that we are becoming more proficient at the command line, we can use these skills to do more than navigate our file system. We can actually begin working on Biowulf. Today's lesson will focus on submitting computational jobs on the Biowulf compute nodes.
Login using ssh
To get started, as always, we will need to log in to Biowulf. Make sure you are on VPN.
Open your Terminal if you are using a Mac, or the Command Prompt if you are using a Windows machine.
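Then connect with ssh (biowulf.nih.gov is the cluster's standard login address):
ssh username@biowulf.nih.gov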
where username is your NIH/Biowulf login username. Remember to use the student account username here.
Type in your password at the prompt. The cursor will not move as you type your password!
When you log in to Biowulf, you are automatically in your home directory (/home/$USER
). This directory is very small and not suitable for large data files or analysis.
Use the cd command to change to your data directory (/data/$USER), where $USER is an environment variable holding your username.
If you do not yet have one, create a directory to work in for this lesson and move to that directory.
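For example, assuming we call the directory lesson4 (the name is just an illustration):
cd /data/$USER
mkdir lesson4
cd lesson4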
Note
When working on Biowulf, you cannot run computational tools on the "login node". Instead, you need to work on a node or nodes with resources sufficient for what you are doing.
To run jobs on Biowulf, you must designate them as interactive, batch, or swarm jobs. Failure to do this may result in a temporary account lockout.
What kind of work can we do on the login node?
The login node can be used for:
- Submitting resource-intensive tasks as jobs
- Editing and compiling code
- File management and data transfers on a small scale
Batch Jobs
Most jobs on Biowulf should be run as batch jobs using the "sbatch" command.
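The general form (yourscript.sh is the placeholder name used throughout this section):
sbatch yourscript.sh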
Here, yourscript.sh is a shell script containing the job commands, including input, output, cpus-per-task, and other steps. Batch scripts always start with #!/bin/bash or a similar call. The shebang (#!) tells the computer which command interpreter to use, in this case the Bourne-again shell (bash).
For example, to submit a job checking sequence quality using fastqc
(MORE ON THIS LATER), you may create a script named fastqc.sh
:
Inside the script, you may include something like this:
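(A sketch; output_dir and the seqfile names are placeholders.)
#!/bin/bash
module load fastqc
fastqc -o output_dir -f fastq seqfile1 seqfile2 ... seqfileN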
- -o names the output directory
- -f states the format of the input file(s)
- seqfile1 ... seqfileN are the names of the sequence files
Note
fastqc is available via Biowulf's module system, so the module must be loaded (module load fastqc) before running the command.
For more information on running batch jobs on Biowulf, please see: https://hpc.nih.gov/docs/userguide.html.
Multi-threaded jobs and sbatch options
In high-performance computing, multithreading lets a single program split itself into multiple "workers" that can run at the same time on different parts of the computer's brain (CPUs), speeding up complex tasks. Many bioinformatics programs use multi-threading.
For multi-threaded jobs, you will need to set --cpus-per-task
. You can do this at the command line or from within your script.
Example at the command line:
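A sketch (replace # with the number of threads):
sbatch --cpus-per-task=# yourscript.sh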
In your script:
#!/bin/bash
#SBATCH --job-name qc
#SBATCH --mail-type BEGIN,END
#SBATCH --cpus-per-task #
module load fastqc
fastqc -o output_dir -t $SLURM_CPUS_PER_TASK -f fastq seqfile1 seqfile2 ... seqfileN
Within the script we can use directives denoted by #SBATCH to supply command line options such as --cpus-per-task. If included within the script, you will not need to specify them at the command line when submitting the job. You should also pass the environment variable $SLURM_CPUS_PER_TASK to the program's thread argument (-t in the example above). Some other useful directives include --job-name, which assigns a name to the submitted job, and --mail-type, which you can use to direct slurm to send you an email when a job begins, ends, or both.
Tip
The jobscript should always be the last argument of sbatch
.
To see more sbatch
options, use
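sbatch --help
The man page (man sbatch) gives the full list.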
Some slurm commands
Once you submit a job, you will need to interact with the slurm
system to manage or view details about submitted jobs.
Here are some useful commands for this purpose:
Courtesy of NIH HPC Team Training Documentation
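A few commonly used examples (check each command's help for full usage; jobload and jobhist are Biowulf-specific helpers discussed later in this lesson):
squeue -u $USER   # list your pending and running jobs
scancel 12345     # cancel a job by its job id
jobload           # CPU and memory load of your running jobs
jobhist 12345     # resource usage summary for a completed job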
Standard error and standard output
Output that you would expect to appear in the terminal (e.g., standard error and standard output) will not in batch mode. Rather, these will be written by default to slurm######.out
in the submitting directory, where ######
represents the job id. These can be redirected using --output=/path/to/dir/filename
and --error=/path/to/dir/filename
on the command line or as an #SBATCH
directive.
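For example, as #SBATCH directives (%j expands to the job id; the paths are illustrations):
#SBATCH --output=/data/$USER/myjob_%j.out
#SBATCH --error=/data/$USER/myjob_%j.err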
Partitions
Your job may be in a waiting phase ("Pending" or "PD") depending on available resources. You can specify a particular node partition using --partition
.
Use freen
to see what's available.
Summary of partitions
Student partition
The student accounts have their own partition for running jobs, --partition=student
. This will need to be included on the command line or in the job script.
Walltime
The default walltime, or amount of time allocated to a job, is 2 hours on the norm
partition. To change the walltime, use --time=d-hh:mm:ss
.
Here are the walltimes by partition:
You can change the walltime after submitting a job using
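On Biowulf this is done with the newwall utility; a sketch, using an example job id (see the HPC user guide for exact syntax):
newwall --jobid 34516111 --time 1-00:00:00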
Submit an actual job
Let's submit a batch job. We are going to download data from the Sequence Read Archive (SRA), a public repository of high throughput, short read sequencing data. We will discuss the SRA a bit more in detail in Lesson 5. For now, our goal is to simply download multiple fastq files associated with a specific BioProject on the SRA. We are interested in downloading RNAseq files associated with BioProject PRJNA578488, which "aimed to determine the genetic and molecular factors that dictate resistance to WNT-targeted therapy in intestinal tumor cells".
We will learn how to pull the files we are interested in directly from SRA at a later date. For now, we will use the run information stored in sra_files_PRJNA578488.txt
.
Let's copy this file to our working directory and view using less
.
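A sketch (the source path is a placeholder; use wherever the file is provided for the class):
cp /path/to/sra_files_PRJNA578488.txt .
less sra_files_PRJNA578488.txt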
Now, let's build a script downloading a single run, SRR10314042
, to a directory called /data/$USER/testscript
.
Open the text editor nano
and create a script named filedownload.sh
.
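nano filedownload.sh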
Inside our script we will type
#!/bin/bash
#SBATCH --cpus-per-task=6
#SBATCH --gres=lscratch:10
#SBATCH --partition=student
#load module
module load sratoolkit
fasterq-dump -t /lscratch/$SLURM_JOB_ID SRR10314042
Notice our use of sbatch directives inside the script.
fasterq-dump uses 6 threads by default, so we are specifying --cpus-per-task=6. This can be modified once we get an idea of the CPUs needed for the job. The -t option assigns the location to be used for temporary files. Here, we are using /lscratch/$SLURM_JOB_ID, which is created upon job submission. Because we are using local scratch space, we also need to request an allocation, which we are setting to 10 GB (--gres=lscratch:10). The final directive, --partition=student, is required because we are using student accounts for this lesson. If you are with the Center for Cancer Research and using your own account, you can use --partition=ccr.
Remember:
- Default compute allocation = 1 physical core = 2 CPUs
- Default memory per CPU = 2 GB; therefore, default memory allocation = 4 GB
Now, let's run the script.
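sbatch filedownload.sh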
Let's check our job status.
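One way is with the standard slurm command:
squeue -u $USER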
Once the job status changes from PD
(pending) to R
(running), let's check the job status.
Some other useful job monitoring commands include jobload and jobhist. The latter is useful once the job completes.
When the job completes, we should have a file called SRR10314042.fastq
.
Now, what if we want to download all of the runs from sra_files_PRJNA578488.txt? We could use a for loop, GNU parallel, which acts similarly to a for loop (MORE ON THIS NEXT LESSON), or we could submit multiple fasterq-dump jobs in a job array, one subjob per run accession.
Note
There are instructions for running SRA-Toolkit on Biowulf here.
Swarm-ing on Biowulf
Swarm is for running a group of commands (a job array) on Biowulf. swarm reads a list of command lines and automatically submits them to the system as subjobs. "By default, swarm runs one command per core on a node, making optimum use of a node. Thus, a node with 16 cores will run 16 commands in parallel." To create a swarm file, you can use nano or another text editor and put all of your command lines in a file called file.swarm (file is just a placeholder). Then you will use the swarm command to execute it.
For example,
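swarm -f file.swarm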
Note
By default, each subjob is allocated 1.5 gb of memory and 1 core (consisting of 2 cpus) --- hpc.nih.gov
Swarm creates two output files for each command line, one each for STDOUT (file.o) and STDERR (file.e). You can look into these files with the less
command to see any important messages.
For example,
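Assuming swarm's default output naming of swarm_<jobid>_<subjob>.o and .e (check your submitting directory for the exact names):
less swarm_12345_0.o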
View the swarm
options using
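swarm --help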
For more information on swarm-ing on Biowulf, please see: https://hpc.nih.gov/apps/swarm.html
Let's create a swarm job
To retrieve all the files at once, you can create a swarm job.
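Open a new swarm file in nano (the filename below is just an illustration):
nano sra_download.swarm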
Copy the following and paste into the swarm file.
#SWARM --threads-per-process 6
#SWARM --gb-per-process 4
#SWARM --gres=lscratch:20
#SWARM --module sratoolkit
#SWARM --partition=student
fasterq-dump -v -t /lscratch/$SLURM_JOB_ID SR10314043 #Add error to see this in job output
fasterq-dump -t /lscratch/$SLURM_JOB_ID SRR10314044
fasterq-dump -t /lscratch/$SLURM_JOB_ID SRR10314045
There is advice for generating a swarm file using a for loop and echo in the swarm user guide. If you can help it, do not type each of these commands by hand.
Something like this could generate the lines above:
cat ../sra_files_PRJNA578488.txt | while read line; do
echo 'fasterq-dump -v -t /lscratch/$SLURM_JOB_ID' $line >> script.sh;
done
Fix the code
You may have noticed that the while
loop above does not produce the same lines. The order of the files is in reverse, and there is an extra line of code.
fasterq-dump -v -t /lscratch/$SLURM_JOB_ID SRR10314045
fasterq-dump -v -t /lscratch/$SLURM_JOB_ID SRR10314044
fasterq-dump -v -t /lscratch/$SLURM_JOB_ID SRR10314043
fasterq-dump -v -t /lscratch/$SLURM_JOB_ID
How can we fix the lines written to script.sh? Use the skills you have learned and a little help from Google.
Let's run our swarm file. Because we included our directives within the swarm file, the only option we need to include is -f for the swarm file name.
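Assuming the example filename from above:
swarm -f sra_download.swarm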
Running Interactive Jobs
Interactive nodes are suitable for testing/debugging CPU-intensive code, pre/post-processing of data, and running graphical applications that require a GUI.
To start an interactive node, type sinteractive
at the command line and press Enter/Return on your keyboard.
You will see something like this printed to your screen. You only need to use the sinteractive
command once per session. If you try to start an interactive node on top of another interactive node, you will get a message asking why you want to start another node.
[username@biowulf ]$ sinteractive
salloc.exe: Pending job allocation 34516111
salloc.exe: job 34516111 queued and waiting for resources
salloc.exe: job 34516111 has been allocated resources
salloc.exe: Granted job allocation 34516111
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3317 are ready for job
srun: error: x11: no local DISPLAY defined, skipping
[username@cn3317 ]$
You can use many of the same options for sinteractive
as you can with sbatch
. The default sinteractive allocation is 1 core (2 CPUs), 3 GB of memory (1.5 GB per CPU), and a walltime of 8 hours.
For example,
sinteractive --gres=lscratch:20 --cpus-per-task=6
module load sratoolkit
fasterq-dump -t /lscratch/$SLURM_JOBID SRR2048331 -O /data/$USER/sra
To terminate / cancel the interactive session use
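exit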
Exiting from Biowulf
To disconnect the remote connection on Biowulf, use
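exit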
Transferring files to Biowulf
Before ending this lesson, I want to briefly discuss file transfers. At some point you will need to know how to get your data on Biowulf. The NIH HPC Team has fantastic documentation on this subject at https://hpc.nih.gov/docs/transfer.html.
Remember, Helix (helix.nih.gov) is the interactive data transfer and file management node for the NIH HPC Systems. If you are interactively transferring files to and from NIH HPC directories, you should connect to Helix:
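ssh username@helix.nih.gov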
where username is your Biowulf username.
Recommended methods for file transfers
Large scale transfers
For large scale transfers and large files use, Globus. Find instructions here.
Small scale transfers
For small scale transfers, use scp to or from Helix, or drag and drop files by mounting HPC Systems directories on your local machine.
For example, to copy Module_1
to your local machine, use:
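A sketch, run from your local machine (the /data path is an assumption; adjust it to where Module_1 actually lives):
scp -r username@helix.nih.gov:/data/username/Module_1 .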
Note
This command and the next are issued from your local computer. The .
in the above command means that you are copying Module_1
into your current working directory.
To copy from your local machine to Biowulf, use:
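Again a sketch run from your local machine (the destination path is an assumption):
scp -r Module_1 username@helix.nih.gov:/data/username/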
Make sure to substitute username with your Biowulf username.
To mount the HPC directories to your local computer, follow the instructions here.
Help Session
Practice the skills learned in this lesson here.
So you think you know Biowulf?
Quiz yourself using the hpc.nih.gov biowulf-quiz.