Getting Started with Biowulf

Biowulf is the NIH high-performance computing (HPC) cluster: a Linux cluster with more than 105,000 processors. The NIH HPC systems also house "hundreds of scientific programs, packages and databases" (https://hpc.nih.gov/apps/).

Bioinformatics analyses often need more memory and compute time than an individual (local) computer can provide. Consider running a task on Biowulf when it is memory-intensive or can be run in parallel to reduce the time to completion. To obtain a Biowulf account, see the Biowulf help pages. Accounts are available to all NIH employees and contractors listed in the NIH Enterprise Directory for a nominal fee of $35 a month.

Working on the NIH High Performance Unix Cluster Biowulf

Logging into Biowulf from MacOS

Find the program "Terminal" on your machine, and enter the following command at the prompt:

ssh username@biowulf.nih.gov

where "username" is your NIH/Biowulf login username.

If this is your first time logging into Biowulf, you will see a warning statement with a yes/no choice. Type "yes".
Type your password at the prompt and hit "return/enter" on your keyboard. NOTE: The cursor will not move as you type your password! Don't let this fool you; type the password only once.
When you see the command prompt dollar sign "$", you will know you are logged in.

[username@biowulf ~] $

Logging into Biowulf from Windows 10 OS

Open the Command Prompt and start an "SSH" (secure shell) session:

ssh username@biowulf.nih.gov

where "username" is your NIH/Biowulf login username.

If this is your first time logging into Biowulf, you will see a warning statement with a yes/no choice. Type "yes".
Type your password at the prompt and hit "return/enter" on your keyboard. NOTE: The cursor will not move as you type your password! Don't let this fool you; type the password only once.
When you see the command prompt dollar sign "$", you will know you are logged in.

Working on Biowulf - two things you should always do.

When you log into Biowulf, you are automatically placed in your home directory (/home/username). This directory is very small and not suitable for large data files or analysis.

First, use the "cd" command to change to your /data directory.

$ cd /data/username

where "username" is your username.
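If you prefer not to type your username each time, the step above can be sketched with the standard $USER environment variable. This is a minimal sketch that assumes $USER holds your Biowulf login name and that your data directory follows the /data/username pattern shown above; the fallback text "username" is only a placeholder.

```shell
# Build the data directory path from the login name.
# Assumption: $USER matches your Biowulf username.
data_dir="/data/${USER:-username}"

# Change into the directory only if it exists, so the sketch is
# harmless on a machine that has no /data area.
if [ -d "$data_dir" ]; then
    cd "$data_dir"
    echo "Now working in $data_dir"
else
    echo "$data_dir not found; are you logged into Biowulf?"
fi
```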

Second, do not run your analyses on the "login node", which is where you land when you log in. Instead, work on a node or nodes with sufficient resources for what you are doing. For now, you will use the "sinteractive" command to start an interactive session.

$ sinteractive

Being a good citizen on Biowulf

To run jobs on Biowulf, you must designate them as interactive, batch, or swarm. Failure to do this may result in termination of your account.

Running Interactive Jobs

Interactive sessions are suitable for routine tasks and debugging. To start an interactive session, type "sinteractive" at the command line "$" and press Enter/Return on your keyboard.

$ sinteractive

You will see output like the following printed to your screen. It may take a minute or so for the command to finish. You'll know it's done when you get your command line dollar sign "$" back. You only need to use the "sinteractive" command once per session. If you try to start an interactive session from inside another interactive session, you will get a message asking why you want to start another one.

[username@biowulf ]$ sinteractive    
salloc.exe: Pending job allocation 34516111    
salloc.exe: job 34516111 queued and waiting for resources    
salloc.exe: job 34516111 has been allocated resources    
salloc.exe: Granted job allocation 34516111    
salloc.exe: Waiting for resource configuration    
salloc.exe: Nodes cn3317 are ready for job    
srun: error: x11: no local DISPLAY defined, skipping    
[username@cn3317 ]$
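In the output above, the prompt changes from "biowulf" to a compute node name such as "cn3317" once the session starts. A small check along these lines can confirm where you are before you launch anything heavy. It is only a sketch: node_kind is a hypothetical helper, and it assumes the login node's host name begins with "biowulf", as the prompts above suggest.

```shell
# Classify a host name as the login node or not.
# Assumption: the login node's short host name starts with "biowulf".
node_kind() {
    short_name="${1%%.*}"    # drop any domain suffix, e.g. ".nih.gov"
    case "$short_name" in
        biowulf*) echo "login node" ;;
        *)        echo "not the login node" ;;
    esac
}

# uname -n prints the name of the machine you are currently on.
echo "You are on: $(node_kind "$(uname -n)")"
```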

Batch Jobs

Most jobs on Biowulf should be run as batch jobs using the "sbatch" command.

$ sbatch yourscript.sh

Here "yourscript.sh" contains the job commands, including input, output, cpus-per-task, and other settings. Batch scripts always start with "#!/bin/bash".

For example:

#!/bin/bash

module load fastqc
fastqc -o output_dir -f fastq seqfile1 seqfile2 ... seqfileN

where:

-o names the output directory,
-f states the format of the input file(s), and
seqfile1 ... seqfileN are the names of the sequence files.
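One way to create such a script without opening an editor is a here-document. The sketch below writes a version of the example to yourscript.sh and checks it for shell syntax errors before you submit it. The input names sample1.fastq and sample2.fastq are hypothetical placeholders, and the mkdir line is added because fastqc expects the output directory to exist already.

```shell
# Write the batch script; the quoted 'EOF' keeps the text literal.
cat > yourscript.sh <<'EOF'
#!/bin/bash

module load fastqc
mkdir -p output_dir    # fastqc does not create the output directory itself
fastqc -o output_dir -f fastq sample1.fastq sample2.fastq
EOF

# Parse the script without running it to catch syntax errors early.
bash -n yourscript.sh && echo "yourscript.sh: syntax OK"
```

The script can then be submitted as shown above with "sbatch yourscript.sh".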

For more information on running batch jobs on Biowulf, please see: https://hpc.nih.gov/docs/userguide.html.

For multi-threaded jobs, you will need to set "cpus-per-task" to the number of CPUs the job should use (shown as # below). You can do this at the command line or put it in your script.

At the command line:

$ sbatch --cpus-per-task=# yourscript.sh

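Or in your script, with an "#SBATCH" line (again, replace the final # with the number of CPUs):

#!/bin/bash
#SBATCH --cpus-per-task=#

module load fastqc
fastqc -o output_dir -t $SLURM_CPUS_PER_TASK -f fastq seqfile1 seqfile2 ... seqfileN

where -t sets the number of threads fastqc uses, taken here from the "cpus-per-task" value that Slurm stores in $SLURM_CPUS_PER_TASK.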
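The $SLURM_CPUS_PER_TASK variable may be unset, for example when you test a script outside of a Slurm job. A common shell idiom is to fall back to a default in that case. The sketch below only prints the command it would run; the variable name "threads" and the default of 1 are illustrative choices, not Biowulf requirements.

```shell
#!/bin/bash

# Use the CPU count Slurm granted, or 1 if the variable is unset or empty.
threads="${SLURM_CPUS_PER_TASK:-1}"

# Print the command instead of running it, so the sketch works anywhere.
echo "Would run: fastqc -o output_dir -t ${threads} -f fastq seqfile1"
```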

Swarm-ing on Biowulf

Swarm is a script for running a group of commands on Biowulf. Swarm reads a list of command lines and automatically submits them to the system. To create a swarm file, you can use "nano" or another text editor and put all of your command lines in a file called "file.swarm". Then you will use the "swarm" command to execute.

$ swarm -f file.swarm
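When the command lines differ only in a file name, the swarm file described above can also be generated with a loop instead of typed by hand. This is a sketch under assumptions: sample1, sample2, and sample3 are hypothetical sample names, the fastqc command line is borrowed from the batch example, and making fastqc available to swarm jobs is covered in the swarm documentation rather than here.

```shell
# Start with an empty swarm file.
: > file.swarm

# Append one command line per sample (hypothetical sample names).
for sample in sample1 sample2 sample3; do
    echo "fastqc -o output_dir -f fastq ${sample}.fastq" >> file.swarm
done

# Review the file before submitting it.
cat file.swarm
```

Each line of file.swarm is then treated by swarm as a separate command.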

Swarm creates two output files for each command line: one for STDOUT (shown here as file.o) and one for STDERR (shown here as file.e). You can look into these files with the "less" command to see any important messages.

$ less file.o
$ less file.e
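With many command lines, opening every STDERR file by hand gets tedious. A short loop can list only the ".e" files that are not empty, which are the ones most likely to hold messages worth reading. The function name report_nonempty_errors is hypothetical, and the sketch assumes the error files end in ".e" as described above.

```shell
# Print the name of every non-empty *.e file in a directory
# (defaults to the current directory).
report_nonempty_errors() {
    dir="${1:-.}"
    for f in "$dir"/*.e; do
        [ -e "$f" ] || continue    # skip the literal pattern if nothing matches
        if [ -s "$f" ]; then       # -s is true when the file has content
            echo "$f"
        fi
    done
}

report_nonempty_errors .
```

Any file it lists can then be opened with "less" as shown above.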

For more information on swarm-ing on Biowulf, please see: https://hpc.nih.gov/apps/swarm.html