Getting Started with Biowulf
Biowulf is the NIH high-performance computing (HPC) cluster. It is a Linux cluster with more than 105,000 processors. The NIH HPC systems also house "hundreds of scientific programs, packages and databases" (https://hpc.nih.gov/apps/).
Bioinformatics analyses often require more memory and compute time than an individual (local) computer can provide. Consider running a task on Biowulf when it needs a lot of memory or can be run in parallel to shorten the time to completion. To obtain a Biowulf account, see the Biowulf help pages. Accounts are available to all NIH employees and contractors listed in the NIH Enterprise Directory for a nominal fee of $35 a month.
Working on the NIH High Performance Unix Cluster Biowulf
Logging into Biowulf from macOS
Open the "Terminal" program on your machine and enter the following command at the prompt:
ssh username@biowulf.nih.gov
where "username" is your NIH/Biowulf login username.
- If this is your first time logging into Biowulf, you will see a warning statement with a yes/no choice. Type "yes".
- Type your password at the prompt. NOTE: The cursor will not move as you type your password! Don't let this fool you. Type your password once and press "return/enter" on your keyboard.
- When you see the command prompt dollar sign "$", you will know you are logged in.
[username@biowulf ~] $
Logging into Biowulf from Windows 10 OS
Open the command prompt and start an "SSH" (secure shell) session:
ssh username@biowulf.nih.gov
where "username" is your NIH/Biowulf login username.
- If this is your first time logging into Biowulf, you will see a warning statement with a yes/no choice. Type "yes".
- Type your password at the prompt. NOTE: The cursor will not move as you type your password! Don't let this fool you. Type your password once and press "return/enter" on your keyboard.
- When you see the command prompt dollar sign "$", you will know you are logged in.
Working on Biowulf - two things you should always do.
When you log into Biowulf, you start in your home directory (/home/username). This directory is very small and is not suitable for large data files or analysis.
First, use the "cd" command to change to your /data directory.
$ cd /data/username
where "username" is your username.
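A quick way to confirm that you have left your home directory is to test the current path. The sketch below is only an illustration: "in_data_dir" is a hypothetical helper name, not a Biowulf command, and it assumes your data area lives under /data as described above.

```shell
#!/bin/bash
# Hypothetical helper (not a Biowulf command): succeeds when a path lies under /data.
in_data_dir() {
    case "$1" in
        /data/*) return 0 ;;
        *)       return 1 ;;
    esac
}

# Report whether the current working directory is under /data.
if in_data_dir "$PWD"; then
    echo "OK: working in $PWD"
else
    echo "Warning: $PWD is not under /data; consider: cd /data/username"
fi
```

You could paste the function into your shell and run the check before starting an analysis.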
Second, do not run your work on the "login node". Instead, work on a compute node (or nodes) with resources sufficient for what you are doing. For now, use the "sinteractive" command to start an interactive session.
$ sinteractive
Being a good citizen on Biowulf
To run jobs on Biowulf, you must designate them as interactive, batch, or swarm. Failure to do this may result in termination of your account.
Running Interactive Jobs
Interactive sessions are suitable for routine tasks and debugging. To start one, type "sinteractive" at the command prompt "$" and press Enter/Return on your keyboard.
$ sinteractive
You will see something like this printed to your screen. It may take a minute or so for the command to finish. You'll know it's done when you get your command prompt "$" back, now showing the name of a compute node instead of "biowulf". You only need to use the "sinteractive" command once per session. If you try to start an interactive session from inside another interactive session, you will get a message asking why you want to start another one.
[username@biowulf ]$ sinteractive
salloc.exe: Pending job allocation 34516111
salloc.exe: job 34516111 queued and waiting for resources
salloc.exe: job 34516111 has been allocated resources
salloc.exe: Granted job allocation 34516111
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3317 are ready for job
srun: error: x11: no local DISPLAY defined, skipping
[username@cn3317 ]$
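The change in the prompt from "biowulf" to "cn3317" is how you can tell you have moved off the login node. A minimal sketch of the same check in a script follows. The function name "node_type" is hypothetical, and the name patterns are taken only from the prompts shown above, so treat them as assumptions. Slurm sets the SLURM_JOB_ID variable inside a job allocation, which gives a second clue.

```shell
#!/bin/bash
# Hypothetical helper: classify a host name using the prompts shown above,
# where the login node appears as "biowulf" and a compute node as "cn3317".
node_type() {
    case "$1" in
        biowulf*) echo "login" ;;
        cn[0-9]*) echo "compute" ;;
        *)        echo "unknown" ;;
    esac
}

# Use the short host name (everything before the first dot).
host="${HOSTNAME:-unknown}"
host="${host%%.*}"

echo "host type: $(node_type "$host")"
# SLURM_JOB_ID is set by Slurm inside an allocation; print "none" otherwise.
echo "job id:    ${SLURM_JOB_ID:-none}"
```

If the host type is "login" and no job id is shown, start an interactive session before doing any real work.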
Batch Jobs
Most jobs on Biowulf should be run as batch jobs using the "sbatch" command.
$ sbatch yourscript.sh
where "yourscript.sh" contains the commands for the job, along with settings such as input, output, and cpus-per-task. Batch scripts start with "#!/bin/bash" on the first line.
For example:
#!/bin/bash
module load fastqc
fastqc -o output_dir -f fastq seqfile1 seqfile2 ... seqfileN
where -o names the output directory
-f states the format of the input file(s)
and seqfile1 ... seqfileN are the names of the sequence files.
For more information on running batch jobs on Biowulf, please see: https://hpc.nih.gov/docs/userguide.html.
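As a rough sketch, a batch script like the one above can be written from the command line and checked for shell syntax errors before it is submitted. The sample file names and the temporary directory are placeholders chosen for illustration. The "mkdir -p" line is an addition, on the understanding that FastQC expects the output directory to exist already. "bash -n" only parses the script; it does not run fastqc or contact the scheduler.

```shell
#!/bin/bash
# Write a small batch script to a temporary directory (file and directory
# names here are placeholders for illustration).
workdir=$(mktemp -d)
script="$workdir/yourscript.sh"

# The quoted 'EOF' keeps the shell from expanding anything inside the script text.
cat > "$script" <<'EOF'
#!/bin/bash
module load fastqc
mkdir -p output_dir
fastqc -o output_dir -f fastq sample1.fastq sample2.fastq
EOF

# Check the script for shell syntax errors without running any command in it.
if bash -n "$script"; then
    echo "Syntax OK: $script"
    echo "Submit on Biowulf with: sbatch $script"
fi
```

A syntax check will not catch a wrong file name or a missing module, but it does catch unbalanced quotes and similar typos before the job waits in the queue.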
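For multi-threaded jobs, you will need to set "cpus-per-task", where "#" is the number of CPUs. You can do this at the command line or in your script.
At the command line:
$ sbatch --cpus-per-task=# yourscript.sh
Or in your script, as an "#SBATCH" line. In either case, pass the allocated count to the program through "$SLURM_CPUS_PER_TASK" (for fastqc, with the -t option):
#!/bin/bash
#SBATCH --cpus-per-task=#
module load fastqc
fastqc -o output_dir -t $SLURM_CPUS_PER_TASK -f fastq seqfile1 seqfile2 ... seqfileN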
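One practical detail: SLURM_CPUS_PER_TASK is only set when the job was submitted with "--cpus-per-task". The sketch below shows one way to fall back to a single thread when the variable is missing. The function name "thread_count" and the sample file names are hypothetical, and the fastqc command is only printed, not run. The "-t" option is FastQC's thread count.

```shell
#!/bin/bash
# Hypothetical helper: print the CPU count Slurm allocated to this task,
# or 1 when SLURM_CPUS_PER_TASK is unset (for example, outside a Slurm job).
thread_count() {
    echo "${SLURM_CPUS_PER_TASK:-1}"
}

threads=$(thread_count)

# Build the command as text so it can be inspected before it is run for real.
cmd="fastqc -o output_dir -t $threads -f fastq sample1.fastq sample2.fastq"
echo "$cmd"
```

With the fallback in place, the same script can be tested briefly in an interactive session and then submitted unchanged with sbatch.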
Swarm-ing on Biowulf
Swarm is a script for running a group of commands on Biowulf. It reads a list of command lines, one command per line, and automatically submits them to the system. To create a swarm file, use "nano" or another text editor and put all of your command lines in a file called "file.swarm". Then use the "swarm" command to submit it. When the jobs finish, view the output (.o) and error (.e) files with "less".
$ swarm -f file.swarm
$ less file.o
$ less file.e
For more information on swarm-ing on Biowulf, please see: https://hpc.nih.gov/apps/swarm.html
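For more than a handful of input files, the swarm file can be generated with a loop instead of typed by hand. This is a minimal sketch under stated assumptions: the sample file names and the temporary directory are placeholders, and the "--module" option in the final message is how swarm is usually asked to load a module, so confirm it against the swarm page linked above before relying on it.

```shell
#!/bin/bash
# Create a scratch directory with three empty placeholder FASTQ files
# (hypothetical names, used only so the loop has something to list).
workdir=$(mktemp -d)
touch "$workdir/sample1.fastq" "$workdir/sample2.fastq" "$workdir/sample3.fastq"
mkdir -p "$workdir/output_dir"

# Write one fastqc command per input file; swarm reads one command per line.
swarmfile="$workdir/file.swarm"
: > "$swarmfile"
for fq in "$workdir"/*.fastq; do
    echo "fastqc -o $workdir/output_dir -f fastq $fq" >> "$swarmfile"
done

# Show the generated file and the command that would submit it on Biowulf.
cat "$swarmfile"
echo "Submit on Biowulf with: swarm -f $swarmfile --module fastqc"
```

Reading the generated file before submitting it is a cheap way to catch a wrong path or a missing input.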