Bioinformatics Training and Education Program

Tips for Submitting Batch Jobs to Biowulf

Biowulf is the Unix/Linux-based high-performance computing (HPC) cluster at NIH. If you do not currently use Biowulf for your data analysis needs, consider the following advantages:

  • Much more compute power than a personal computer, so users can work with large datasets.
  • The ability to transfer large datasets using Globus.
  • More than 900 scientific software modules are installed and ready to use.
  • Access to databases and reference genomes.
  • Users can install personal software.
  • The ability to work interactively on Biowulf’s compute nodes.
  • The ability to submit a job to the system, walk away from it, and check on results when Biowulf sends an email notification that the job has been completed.

While the thought of submitting non-interactive jobs via the command line may seem intimidating, we hope to alleviate some of your concerns in this month’s topic spotlight, which outlines the job submission process on Biowulf.

Before submitting a job to Biowulf, users must first write a script, which can be done using one of the text editors installed on Biowulf. Biowulf accommodates shell scripts, which have the “.sh” extension, and swarm scripts, which have the “.swarm” extension. Shell scripts are good for tying together analysis steps, which facilitates reproducibility and reuse. Components of a shell script include:

  • Path to the command interpreter (e.g., “#!/bin/bash”)
  • Lines containing job directives, which start with #SBATCH. Job directives tell Biowulf how many computing resources and how much time a job needs. Users can also request that job status notifications be sent to their NIH email.
  • The commands that perform various steps of the analysis.

The command to submit a shell script is sbatch, which is followed by the name of the script. See https://bioinformatics.ccr.cancer.gov/docs/btep-coding-club/biowulf_batch_jobs/#shell-script for an example of shell script submission to Biowulf.
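As a minimal sketch of the three components listed above, the snippet below writes a small shell script and submits it only if sbatch is available. The script name (fastqc_job.sh), the resource values, the fastqc module, and the input file name are illustrative assumptions, not Biowulf recommendations.

```shell
# Write a hypothetical batch script; the name and resource values are illustrative assumptions.
cat > fastqc_job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=fastqc_demo
#SBATCH --cpus-per-task=2
#SBATCH --mem=4g
#SBATCH --time=01:00:00
#SBATCH --mail-type=END

# Analysis commands go below the directives.
module load fastqc
mkdir -p qc_out
fastqc -t "$SLURM_CPUS_PER_TASK" -o qc_out sample.fastq.gz
EOF

# Check the script for syntax errors without running it.
bash -n fastqc_job.sh

# Submit only where Slurm is available (for example, on a Biowulf login node).
if command -v sbatch >/dev/null 2>&1; then
    sbatch fastqc_job.sh
fi
```

The interpreter line comes first, the #SBATCH directives follow it, and the analysis commands come last. The directives must appear before the first command, because Slurm stops reading them once it reaches an executable line.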

Swarm scripts are appropriate when users want to run a group of commands. Each command in a swarm script will run as its own job, thus facilitating parallelization. Swarm scripts also use directives at the beginning to tell Biowulf the compute resources and applications needed. These directives start with #SWARM. To submit a swarm script, use swarm -f, which is followed by the name of the swarm script. See https://bioinformatics.ccr.cancer.gov/docs/btep-coding-club/biowulf_batch_jobs/#swarm for an example of swarm script submission to Biowulf.
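To illustrate the one-command-per-line layout described above, the sketch below generates a small swarm file and submits it only if the swarm command is present. The sample names, the fastqc command, and the resource values are hypothetical, and the #SWARM options shown are assumptions that should be checked against the swarm documentation.

```shell
# Build a hypothetical swarm file: one directive line, then one command per sample.
# Sample names and resource values are placeholders.
{
    echo "#SWARM --threads-per-process 2 --gb-per-process 4 --time 01:00:00 --module fastqc"
    for sample in sampleA sampleB sampleC; do
        echo "fastqc -t 2 -o qc_out ${sample}.fastq.gz"
    done
} > fastqc.swarm

# Inspect the generated file before submitting.
cat fastqc.swarm

# Submit only where the swarm command exists (for example, on a Biowulf login node).
if command -v swarm >/dev/null 2>&1; then
    mkdir -p qc_out
    swarm -f fastqc.swarm
fi
```

Generating the file with a loop keeps the commands identical apart from the sample name, which makes it easy to scale the same analysis to many inputs.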

The computational resources needed depend on many factors and require trial and error to optimize. However, the following tips may be helpful.

  • Consult the Biowulf user dashboard (https://hpcnihapps.cit.nih.gov/auth/dashboard/), which is an excellent tool that documents the resources (i.e., memory and CPUs) requested and used. Users can modify the computational resource requirements for a job based on the information provided in the user dashboard.
  • Get to know the software. For instance, the RNA sequencing aligners STAR and HISAT2 have different memory requirements.
  • Refer to the Biowulf user guide (https://hpc.nih.gov/docs/userguide.html). For instance, the user guide recommends 30-45 GB of memory when aligning to mammalian genomes and 45 GB of memory when aligning to the human genome using STAR (https://hpc.nih.gov/apps/STAR.html). On the other hand, less than 8 GB of memory can be used when aligning to the GRCh38 human genome using HISAT2 (per communication with Biowulf staff).
  • Consult with Biowulf staff either by email (staff@hpc.nih.gov) or at the monthly virtual consults (https://hpc.nih.gov/training/).
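One way to act on the dashboard numbers is to take the peak memory a previous run used and request slightly more for the next run. The sketch below does that arithmetic; the helper name mem_request_gb, the 20% default headroom, the 30 GB peak value, and the script name star_align.sh are all illustrative assumptions, not Biowulf guidance.

```shell
# Round an observed peak memory value (in GB) up to a request with extra headroom.
# The 20% default headroom is an illustrative assumption, not a Biowulf rule.
mem_request_gb() {
    local peak_gb=$1
    local headroom_pct=${2:-20}
    # Integer arithmetic; adding 99 before dividing by 100 rounds up to a whole GB.
    echo $(( (peak_gb * (100 + headroom_pct) + 99) / 100 ))
}

peak_gb=30   # hypothetical peak usage read from the user dashboard
request_gb=$(mem_request_gb "$peak_gb")

# Print the submission command rather than running it.
echo "sbatch --mem=${request_gb}g star_align.sh"
```

Options given on the sbatch command line take precedence over the #SBATCH directives inside the script, so a resource request can be adjusted between runs without editing the script itself.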

Submitting jobs to the Biowulf system lets users harness its compute power for data analysis. While some trial and error may be necessary to determine the optimal parameters for running jobs, batch submission remains the best method for job reuse, reproducibility, and parallelization.

If you need help submitting jobs on Biowulf, there is extensive documentation available on the Biowulf (https://hpc.nih.gov) and BTEP (https://bioinformatics.ccr.cancer.gov/btep) websites. You can also contact Biowulf staff at staff@hpc.nih.gov or BTEP at ncibtep@nih.gov.

Joe Wu – (BTEP)