Introduction to High Performance Computing at NIH: Biowulf
Learning Objectives
- Understand the components of an HPC system. How does this compare to your local desktop?
- Learn about Biowulf, the NIH HPC cluster.
- Learn about the command line interface and resources for learning.
What is a high performance cluster (HPC)?
A collection of standalone computers that are networked together. They will frequently have software installed that allows the coordinated running of other software across all of these computers. This allows these networked computers to work together to accomplish computing tasks faster. --- hpc-intro (Software Carpentry)
When using an HPC
- We use a command line interface (CLI) and the Secure Shell protocol (SSH) to establish a remote connection to the login node (head node).
- HPCs are remote resources that require connections over sometimes slow or intermittent interfaces (WiFi, VPNs). It is therefore more practical to direct functionality over the command line using plain text. Most of us are likely used to a graphical user interface (GUI), which is a point-and-click interface, so we will describe how the CLI differs a bit later.
- The cluster head node distributes compute tasks using a scheduling system (e.g., SLURM).
Note
Slurm stands for Simple Linux Utility for Resource Management.
What is Biowulf?
Biowulf is the high performance computing (HPC) system at NIH.
- The NIH high-performance compute cluster is known as “Biowulf”
- It is a 95,000+ processor Linux cluster
- Can perform large numbers of simultaneous jobs
- Jobs can be split among several nodes
- Scientific software (600+) and databases are already installed
- Can only be accessed on NIH campus or via VPN
When should we use Biowulf?
You should use Biowulf when:
- Software is unavailable or difficult to install on your local computer and is available on Biowulf.
- You are working with large amounts of data that can be parallelized to shorten computational time AND/OR
- You are performing computational tasks that are memory intensive.
Example of High Performance Computing Structure
Essentially, Biowulf is a scaled-up version of your local computer.
In Biowulf, many computers (nodes) make up the cluster. Each node has disk space for storage and random access memory (RAM) for running tasks. A node contains one or more processors; each processor is divided into cores, and each core can present multiple CPUs to the job scheduler (e.g., through hyperthreading).
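You can inspect these components yourself on any Linux machine, including a Biowulf node once you have a session there. A minimal sketch using standard Linux tools:

```shell
# Number of logical CPUs (what SLURM counts as "CPUs") visible to the OS
nproc

# Total and available memory (RAM), in human-readable units
free -h

# CPU topology: sockets (processors), cores per socket, threads per core
lscpu | grep -E 'Socket|Core|Thread'
```

Running these on your local desktop and then on a Biowulf node makes the difference in scale concrete.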
Info
Information on the NIH HPC architecture and hardware can be found here.
Node types on the HPC
- login node (head node)
  - Used for submitting resource-intensive tasks as jobs
  - Editing and compiling code
  - File management and data transfers on a small scale
- compute nodes (worker nodes)
  - For computational processes
  - Require interaction with the job scheduling system (SLURM)
  - Batch jobs, `sinteractive` sessions
- data transfer node (for Biowulf, this is Helix)
Info
sinteractive
- Work on Biowulf compute nodes interactively; suitable for testing/debugging CPU-intensive code, pre-/post-processing of data, and using graphical applications.
sbatch
- For submitting shell scripts to run as jobs, without any interactive component.
swarm
- For running embarrassingly parallel code as independent jobs.
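For example, an embarrassingly parallel workload can be expressed as a plain-text swarm file with one independent command per line. This is a minimal sketch; the filenames and the `gzip` task are hypothetical stand-ins for your own commands:

```shell
# Create a swarm file: each line is an independent command that
# swarm will run as its own job on the cluster
cat > compress.swarm <<'EOF'
gzip sample1.fastq
gzip sample2.fastq
gzip sample3.fastq
EOF

# On Biowulf you would then submit the whole set with:
#   swarm -f compress.swarm
```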
The Data transfer node: Helix
- Used for data transfers and file management on a large scale.
- 48-core system with 1.5 TB of main memory
- Direct internet connection
- Helix should be used when:
  - you are transferring >100 GB using `scp`
  - gzipping a directory containing >5K files or >50 GB
  - copying >150 GB of data from one directory to another
  - uploading or downloading data from the cloud
- For more information on data transfers see hpc.nih.gov.
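As a concrete illustration of the gzipping case: bundling and compressing a directory works the same on Helix as on any Linux system. The directory and file names below are made up for the sketch:

```shell
# Stand-in for a real project directory with many files
mkdir -p myproject
touch myproject/a.txt myproject/b.txt

# Bundle and compress the directory into a single archive;
# for >5K files or >50 GB, run this on Helix rather than the login node
tar -czf myproject.tar.gz myproject

# Verify the archive contents
tar -tzf myproject.tar.gz
```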
Biowulf Data Storage
- Each account is issued two primary storage spaces by default: `/home/$USER` (16 GB) and `/data/$USER` (100 GB).
- You may request more space on `/data`, but this requires a legitimate justification.
- More information on data storage here.
Important
Data storage on the HPC system should not be for archival purposes.
Note
Though there aren't true backups of your data directories, there are snapshots, which capture a view of your home and data directories at a specific point in time. You can learn more about snapshots in the HPC documentation.
Applications on Biowulf
- Bioinformatics applications and other programs are available on Biowulf via modules.
- View a list of available applications here.
Info
Loading software as environment modules allows us to better control our computational environment and easily use a large number of programs, and even different versions of the same programs. Modules alter the user's environment variables, such as the execution path.
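The "execution path" is the `PATH` environment variable: a colon-separated list of directories the shell searches, in order, when you type a command. A minimal sketch of the mechanism modules rely on (the directories and the `mytool` script are made up for illustration):

```shell
# Two directories, each providing a different "version" of the same tool
mkdir -p v1/bin v2/bin
printf '#!/bin/sh\necho "version 1"\n' > v1/bin/mytool
printf '#!/bin/sh\necho "version 2"\n' > v2/bin/mytool
chmod +x v1/bin/mytool v2/bin/mytool

# Prepending a directory to PATH changes which executable is found first;
# this is essentially what `module load appname/version` arranges for you
export PATH="$PWD/v1/bin:$PATH"
mytool   # prints "version 1"

export PATH="$PWD/v2/bin:$PATH"
mytool   # prints "version 2"
```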
Getting an NIH HPC account
- If you do not already have a Biowulf account, you can obtain one by following the instructions here.
- NIH HPC accounts are available to all NIH employees and contractors listed in the NIH Enterprise Directory.
- Obtaining an account requires PI approval and a nominal fee of $35 per month.
- Accounts are renewed annually, contingent upon PI approval.
The Command Line Interface (CLI)
What is Unix?
- Unix is a proprietary operating system, like Windows or macOS (which itself is Unix-based).
- There are many Unix and Unix-like operating systems, including open source Linux and its multiple distributions.
- Biowulf computational nodes use a Unix-like (Linux) operating system (distributions RHEL8/Rocky8).
- Biowulf requires knowledge and use of the command line interface (shell) to direct computational functionality.
- To work on the command line we need to be able to issue Unix commands to tell the computer what we want it to do.
Tip
A basic foundation of Unix is advantageous for most scientists, as many bioinformatics open-source tools are available or accessible by command line on Unix-like systems.
Accessing your local terminal or command prompt
Mac OS
- Type `cmd + spacebar` and search for "terminal". Once open, right click on the app logo in the dock and select `Options` > `Keep in Dock`.
Windows 10 or greater
You can start an SSH session in your command prompt by executing `ssh user@machine` and you will be prompted to enter your password. --- Windows documentation
To find the Command Prompt, type `cmd` in the search box (lower left), then press `Enter` to open the highlighted Command Prompt shortcut.
How much Unix do we need to learn?
To work on Biowulf you really need to understand the following:
- Directory navigation: what the directory tree is, how to navigate and move around with `cd`
- Absolute and relative paths: how to access files located in directories
- What simple Unix commands do: `ls`, `mv`, `rm`, `mkdir`, `cat`, `man`
- Getting help: how to find out more on what a Unix command does
- What "flags" are: how to customize typical Unix programs, e.g., `ls` vs `ls -l`
- Shell redirection: what the standard input and output are, how to "pipe" or redirect the output of one program into the input of another --- Biostar Handbook
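To make the last point concrete, here is a small self-contained example of redirection and pipes (the file name and its contents are made up):

```shell
# Redirection: > sends a command's standard output to a file
printf 'apple\nbanana\napple\n' > fruits.txt

# Pipes: | feeds the output of one program into the input of the next;
# sort groups duplicate lines so uniq -c can count them
sort fruits.txt | uniq -c

# Chaining further: count how many distinct lines there are
sort fruits.txt | uniq -c | wc -l   # prints 2
```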
Connecting to Biowulf
- To connect to Biowulf, we use a secure shell (SSH) protocol.
- SSH is used to open an encrypted network connection between two machines, allowing you to send and receive text and data without having to worry about prying eyes. See `man ssh` for details.
Establishing a remote connection
ssh username@biowulf.nih.gov
where "username" is your NIH/Biowulf login username.
Note
If this is your first time logging into Biowulf, you will see a warning statement with a yes/no choice. Type "yes".
Type in your password at the prompt. The cursor will not move as you type your password!
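Optionally, you can save yourself typing by adding a host alias to your SSH client configuration. The sketch below writes to a demo file so it is safe to run anywhere; in practice the entry goes in `~/.ssh/config`, and `username` is a placeholder for your own login:

```shell
# Append a host alias (demo file; the real location is ~/.ssh/config)
cat >> ssh_config_demo <<'EOF'
Host biowulf
    HostName biowulf.nih.gov
    User username
EOF

# With this entry in ~/.ssh/config, `ssh biowulf` becomes equivalent to
#   ssh username@biowulf.nih.gov
```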
SLURM commands
You will also need to know commands specific to the Biowulf job scheduling system:
- `sbatch`: submit a SLURM job
- `swarm`: submit a swarm of commands to the cluster
- `sinteractive`: allocate an interactive session
- `sjobs`: show a brief summary of queued and running jobs
- `squeue`: display the status of SLURM batch jobs
- `scancel`: delete SLURM jobs
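As a sketch of how these fit together, here is a minimal batch script; the resource requests and the script body are placeholder examples, so consult hpc.nih.gov for options appropriate to your job:

```shell
# Write a minimal SLURM batch script; #SBATCH lines request resources
cat > myjob.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=demo
#SBATCH --cpus-per-task=2
#SBATCH --mem=4g
#SBATCH --time=00:10:00

echo "Running on $(hostname)"
EOF

# On Biowulf you would then:
#   sbatch myjob.sh     # submit the job; prints a job ID
#   squeue -u $USER     # check its status in the queue
#   scancel <jobid>     # cancel it if needed
```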
How to load / unload a module
- To see a list of available software in modules: `module avail`, `module avail [appname|string|regex]`, `module -d`
- To load a module: `module load appname` or `module load appname/version`
- To see loaded modules: `module list`
- To unload modules: `module unload appname`, or `module purge` to unload all modules
Note
You may also create and use your own modules.
Getting help on Biowulf: NIH HPC Documentation
The NIH HPC systems are well-documented at hpc.nih.gov.
Additional HPC help
- Contact staff@hpc.nih.gov
  The HPC team welcomes questions and is happy to offer guidance to address your concerns.
- Monthly Zoom consult sessions
  The HPC team offers monthly Zoom consult sessions. "All problems and concerns are welcome, from scripting problems to node allocation, to strategies for a particular project, to anything that is affecting your use of the HPC systems. The Zoom details are emailed to all Biowulf users the week of the consult."
- Bioinformatics Training and Education Program
  If you experience any difficulties or challenges, especially with different bioinformatics applications, please do not hesitate to email us at BTEP.
Learning Unix: Classes / Courses
- Introduction to Biowulf (May – Jun, 2023)
- Introduction to Unix on Biowulf (Jan – Feb, 2023)
- Bioinformatics for Beginners: Module 1 Unix/Biowulf
Additional Unix Resources:
Key points
- Biowulf is the high performance computing cluster at NIH.
- To work on Biowulf, you will need to use the command line interface, which requires some knowledge of Unix commands.
- When you apply for a Biowulf account, you will be issued two primary storage spaces: `/home/$USER` (16 GB) and `/data/$USER` (100 GB).
- Hundreds of pre-installed bioinformatics programs are available through the `module` system.
- Computational tasks on Biowulf should be submitted as a job (`sbatch`, `swarm`) or run through an interactive session (`sinteractive`).
- Do not run computational tasks on the login node.