
Introduction to High Performance Computing at NIH: Biowulf

Learning Objectives

  1. Understand the components of an HPC system. How does this compare to your local desktop?
  2. Learn about Biowulf, the NIH HPC cluster.
  3. Learn about the command line interface and resources for learning.

What is a high performance cluster (HPC)?

A collection of standalone computers that are networked together. They will frequently have software installed that allow the coordinated running of other software across all of these computers. This allows these networked computers to work together to accomplish computing tasks faster. --- hpc-intro (Software carpentries)


When using an HPC

Note

Slurm stands for Simple Linux Utility for Resource Management.


What is Biowulf?

Biowulf is the high performance computing (HPC) system at NIH.

  • The NIH high-performance compute cluster is known as “Biowulf”
  • It is a 95,000+ processor Linux cluster
  • Can perform large numbers of simultaneous jobs
  • Jobs can be split among several nodes
  • Scientific software (600+) and databases are already installed
  • Can only be accessed on NIH campus or via VPN

When should we use Biowulf?

You should use Biowulf when:

  • Software you need is unavailable or difficult to install on your local computer but is available on Biowulf.
  • You are working with large amounts of data that can be parallelized to shorten computation time, and/or
  • You are performing memory-intensive computational tasks.

Example of High Performance Computing Structure

Essentially, Biowulf is a scaled-up version of your local computer.

In Biowulf, many computers make up the cluster. Each individual computer, or node, has disk space for storage and random access memory (RAM) for running tasks. Each node contains processors, which are further divided into cores; each core in turn provides the CPUs that the job scheduler allocates.

Info

Information on the NIH HPC architecture and hardware here.


Node types on the HPC

  • login node (head node)
    • Used for submitting resource-intensive tasks as jobs, not for running them directly
    • Editing and compiling code
    • File management and data transfers on a small scale
  • compute nodes (worker nodes)
    • For computational processes
    • Requires interaction with a job scheduling system (SLURM)
    • Batch jobs, sinteractive sessions
  • Data transfer node (For Biowulf, this is Helix.)

Info

sinteractive - work on Biowulf compute nodes interactively; suitable for testing/debugging CPU-intensive code, pre-/post-processing of data, and using graphical applications.
sbatch - for submitting shell scripts as jobs, taking away any interactive component.
swarm - used for running embarrassingly parallel code as independent jobs.
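
To make these concrete, here is a minimal sketch of each submission mode. The file names and resource flags are illustrative, not prescriptions, and the cluster-only commands are commented out because they require Biowulf itself.

```shell
# A swarm file: each line is an independent command that
# becomes its own subjob (file and sample names are illustrative)
cat > cmds.swarm <<'EOF'
gzip -k sample1.fastq
gzip -k sample2.fastq
EOF

# Cluster-only commands (run these on Biowulf, not locally):
# sinteractive --cpus-per-task=4 --mem=8g   # interactive testing/debugging session
# sbatch myscript.sh                        # non-interactive batch job
# swarm -f cmds.swarm -g 4 -t 2             # 4 GB and 2 threads per subjob
```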


The Data transfer node: Helix

  • Used for data transfers and file management on a large scale.
  • 48 core system with 1.5 TB of main memory
  • direct internet connection
  • Helix should be used when
    • you are transferring >100 GB using scp
    • gzipping a directory containing >5K files, or > 50 GB
    • copying > 150 GB of data from one directory to another.
    • uploading or downloading data from the cloud.
  • For more information on data transfers see hpc.nih.gov.
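
As a sketch, a large transfer into Biowulf storage would go through Helix rather than the login node. The archive step below runs anywhere; the `scp` commands are commented out because they require the NIH network or VPN, and "username" and all paths are placeholders.

```shell
# Bundle a directory before transfer (runs anywhere)
mkdir -p results
echo "demo output" > results/out.txt
tar -czf results.tar.gz results

# Push the archive to your data directory via Helix, the
# dedicated transfer node (placeholders; cluster/VPN required):
# scp results.tar.gz username@helix.nih.gov:/data/username/

# Pull results back down to the current local directory:
# scp username@helix.nih.gov:/data/username/results.tar.gz .
```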

Biowulf Data Storage

When your account is created, you are issued two primary storage spaces:

  • /home/$USER (16 GB)
  • /data/$USER (100 GB)

- You may request more space on /data, but this requires a legitimate justification.
- More information on data storage here.

Important

Data storage on the HPC system should not be for archival purposes.

Note

Though there aren't true backups of your data directories, there are snapshots, which capture a view of your home and data directories at specific points in time. You can learn more about snapshots in the HPC documentation.


Applications on Biowulf

  • Bioinformatics applications and other programs are available on Biowulf via modules.
  • View a list of available applications here.

Info

Loading software as environment modules allows us to better control our computational environment and easily use a large number of programs, including different versions of the same program. Modules alter the user's environment variables, such as the execution path ($PATH).


Getting an NIH HPC account

  • If you do not already have a Biowulf account, you can obtain one by following the instructions here.
  • NIH HPC accounts are available to all NIH employees and contractors listed in the NIH Enterprise Directory.
  • Obtaining an account requires PI approval and a nominal fee of $35 per month.
  • Accounts are renewed annually, contingent upon PI approval.

The Command Line Interface (CLI)

What is Unix?

  • Unix is a proprietary operating system, like Windows or macOS (which is itself Unix-based).
  • There are many Unix and Unix-like operating systems, including open source Linux and its multiple distributions.
  • Biowulf computational nodes use a Unix-like (Linux) operating system (distributions RHEL8/Rocky8).
  • Biowulf requires knowledge and use of the command line interface (shell) to direct computational functionality.
  • To work on the command line we need to be able to issue Unix commands to tell the computer what we want it to do.

Tip

A basic foundation in Unix is advantageous for most scientists, as many open-source bioinformatics tools are available or accessible via the command line on Unix-like systems.


Accessing your local terminal or command prompt

Mac OS

  • Press Cmd + Space and search for "terminal". Once open, right click on the app logo in the dock. Select Options and Keep in Dock.

Windows 10 or greater

You can start an SSH session in your command prompt by executing ssh user@machine and you will be prompted to enter your password. ---Windows documentation

To find the Command Prompt, type cmd in the search box (lower left), then press Enter to open the highlighted Command Prompt shortcut.


How much Unix do we need to learn?

To work on Biowulf you really need to understand the following:

  1. Directory navigation: what the directory tree is, how to navigate and move around with cd
  2. Absolute and relative paths: how to access files located in directories
  3. What simple Unix commands do: ls, mv, rm, mkdir, cat, man
  4. Getting help: how to find out more on what a unix command does
  5. What are “flags”: how to customize typical unix programs ls vs ls -l
  6. Shell redirection: what is the standard input and output, how to “pipe” or redirect the output of one program into the input of the other --- Biostar Handbook
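
The six points above can be exercised in any local terminal. Every directory, file name, and path in this sketch is invented for the demonstration.

```shell
# 1-2. Directory navigation with relative and absolute paths
mkdir -p project/data        # create nested directories
cd project/data              # move using a relative path
pwd                          # print the absolute path of where you are
cd ../..                     # climb two levels back up

# 3. Simple commands: create, list, view, move, remove
echo "hello" > project/notes.txt
ls project                   # list directory contents
cat project/notes.txt        # print a file
mv project/notes.txt project/data/notes.txt
rm project/data/notes.txt

# 4-5. Getting help, and flags
# man ls                     # opens an interactive manual page
ls -l project                # the -l flag gives a long listing vs plain ls

# 6. Pipes and redirection: send one program's output to another,
#    then redirect the final output into a file
ls /etc | sort | head -3 > first_three.txt
```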

Connecting to Biowulf


Establishing a remote connection

ssh username@biowulf.nih.gov  

"username" = NIH/Biowulf login username.

Note

If this is your first time logging into Biowulf, you will see a warning statement with a yes/no choice. Type "yes".

Type in your password at the prompt. The cursor will not move as you type your password!


SLURM commands

You will also need to know commands specific to the Biowulf job scheduling system:

  • sbatch submit slurm job
  • swarm submit a swarm of commands to cluster
  • sinteractive allocate an interactive session
  • sjobs show brief summary of queued and running jobs
  • squeue display status of slurm batch job
  • scancel delete slurm jobs
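
A minimal batch workflow using these commands might look like the following. The #SBATCH resource values are illustrative, and the submission commands are commented out because they only work on the cluster.

```shell
# Write a minimal batch script (directives are illustrative)
cat > myjob.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=demo
#SBATCH --cpus-per-task=2
#SBATCH --mem=4g
#SBATCH --time=00:10:00

echo "Running on $(hostname)"
EOF

# On Biowulf (cluster-only; commented here):
# sbatch myjob.sh        # submit; prints a job ID
# sjobs                  # brief summary of your queued/running jobs
# squeue -u $USER        # status of your jobs in the queue
# scancel 12345678       # cancel by job ID (example ID)
```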

How to load / unload a module

  • To see a list of available software in modules use

      module avail  
      module avail [appname|string|regex]  
      module -d avail #(list only default versions)  
    

  • To load a module

        module load appname  
        module load appname/version    
    

  • To see loaded modules

        module list    
    

  • To unload modules

      module unload appname  
      module purge #(unload all modules)    
    

Note

You may also create and use your own modules.
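
As a sketch of a typical session (samtools is one of the installed applications, but the version and install path below are invented for illustration):

```shell
# On Biowulf (cluster-only; commented here):
# module avail samtools    # search for the application
# module load samtools     # load the default version
# module list              # confirm what is loaded
# which samtools           # the tool is now on your PATH
# module purge             # unload everything

# Under the hood, a module load mainly edits environment
# variables; loading roughly prepends a bin directory to PATH
# (the path below is illustrative):
NEWPATH="/usr/local/apps/samtools/1.17/bin:$PATH"
echo "${NEWPATH%%:*}" > first_path_entry.txt   # first PATH entry
cat first_path_entry.txt
```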


Getting help on Biowulf: NIH HPC Documentation

The NIH HPC systems are well-documented at hpc.nih.gov.


Additional HPC help


Learning Unix: Classes / Courses


Additional Unix Resources:


Key points

  • Biowulf is the high performance computing cluster at NIH.
  • To work on Biowulf, you will need to use the command line interface, which requires some knowledge of unix commands.
  • When you apply for a Biowulf account you will be issued two primary storage spaces:
    1. /home/$USER (16 GB)
    2. /data/$USER (100 GB).
  • Hundreds of pre-installed bioinformatics programs are available through the module system.
  • Computational tasks on Biowulf should be submitted as a job (sbatch, swarm) or through an interactive session (sinteractive).
  • Do not run computational tasks on the login node.