Lesson 5: Working on Biowulf
Lesson 4 Review
- Flags and command options
- Wildcards (*)
- Tab complete
- Accessing user history with the "up" and "down" arrows
- cat, head, and tail
- Working with file content (input, output, and append)
- Combining commands with the pipe (|)
- grep
- for loops
- File permissions
Lesson Objectives
- Learn about the slurm system by working on Biowulf: batch jobs, swarm jobs, and interactive sessions
- Retrieve data from NCBI through a batch job
- Learn how to troubleshoot failed jobs
NOTE: For this session, you will need to either log in to your Biowulf account or use a student account.
Working on Biowulf
Now that we are becoming more proficient at the command line, we can use these skills to begin working on Biowulf. Today's lesson will focus on submitting computational jobs on the Biowulf compute nodes.
Log in using ssh
Log in to Biowulf. Make sure you are connected to the VPN.
Open your Terminal if you are using a Mac, or the Command Prompt if you are using a Windows machine.
ssh username@biowulf.nih.gov
When you log into Biowulf, you are automatically placed in your home directory (/home/$USER). This directory is very small and is not suitable for large data files or analysis.
Use the "cd" command to change to the /data
directory.
$ cd /data/$USER
where $USER is an environment variable holding your username.
When working on Biowulf, you cannot run computational tools on the login node. Instead, you need to work on a compute node (or nodes) with resources sufficient for your task.
To run jobs on Biowulf, you must submit them as interactive, batch, or swarm jobs. Failure to do this may result in a temporary account lockout.
Batch Jobs
Most jobs on Biowulf should be run as batch jobs using the "sbatch" command.
$ sbatch yourscript.sh
where yourscript.sh is a shell script containing the job commands, including input, output, cpus-per-task, and other steps. Batch scripts always start with #!/bin/bash or a similar line. The shebang (#!) tells the computer which command interpreter to use, in this case the Bourne-again shell (bash).
For example, to submit a job checking sequence quality using fastqc (MORE ON THIS LATER), you may create a script named fastqc.sh:
nano fastqc.sh
Inside the script, you may include something like this:
#!/bin/bash
module load fastqc
fastqc -o output_dir -f fastq seqfile1 seqfile2 ... seqfileN
Here, -o names the output directory, -f states the format of the input file(s), and seqfile1 ... seqfileN are the names of the sequence files.
Note: fastqc is available via Biowulf's module system, so the module must be loaded before running the command.
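If you are not sure whether a tool is available, or which versions exist, you can query the module system before loading; a quick sketch (the versions shown by module avail will vary):
module avail fastqc    # list available fastqc versions
module load fastqc     # load the default version
module list            # confirm which modules are currently loaded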
For more information on running batch jobs on Biowulf, please see: https://hpc.nih.gov/docs/userguide.html
Multi-threaded jobs and sbatch options
For multi-threaded jobs, you will need to set --cpus-per-task. You can do this at the command line or from within your script.
Example at the command line:
$ sbatch --cpus-per-task=# yourscript.sh
In your script:
#!/bin/bash
#SBATCH --job-name qc
#SBATCH --mail-type BEGIN,END
#SBATCH --cpus-per-task #
module load fastqc
fastqc -o output_dir -t $SLURM_CPUS_PER_TASK -f fastq seqfile1 seqfile2 ... seqfileN
Within the script, we can use directives denoted by #SBATCH to supply command line arguments such as --cpus-per-task. If these are included within the script, you will not need to specify them at the command line when submitting the job. You should also pass the environment variable $SLURM_CPUS_PER_TASK to the program's thread argument (-t for fastqc). Some other useful directives include --job-name, which assigns a name to the submitted job, and --mail-type, which directs slurm to send you an email when a job begins, ends, or both.
NOTE: the job script should always be the last argument of sbatch.
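For example, the same settings could instead be given as command line options; a sketch (the job name and values here are only illustrative):
$ sbatch --job-name=qc --mail-type=BEGIN,END --cpus-per-task=4 fastqc.sh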
To see more sbatch options, use
sbatch --help
Some slurm commands
Once you submit a job, you will need to interact with the slurm system to manage or view details about submitted jobs.
Here are some useful commands for this purpose:
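A few examples (squeue and scancel are standard Slurm commands; sjobs and jobhist are Biowulf-specific utilities, so see hpc.nih.gov for details):
squeue -u $USER      # list your pending and running jobs
sjobs -u $USER       # Biowulf-formatted summary of your jobs
jobhist 34516111     # resource usage for a completed job (example job id)
scancel 34516111     # cancel a specific job by id
scancel -u $USER     # cancel all of your jobs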
Standard error and standard output
Output that you would expect to appear in the terminal (standard output and standard error) will not be displayed in batch mode. Instead, both are written by default to slurm-######.out in the submitting directory, where ###### is the job id. They can be redirected using --output=/path/to/dir/filename and --error=/path/to/dir/filename on the command line or as #SBATCH directives.
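For example, to collect the logs under a dedicated directory (the path and file names below are placeholders, and the directory must already exist; %j is replaced by the job id):
#SBATCH --output=/data/$USER/logs/myjob_%j.out
#SBATCH --error=/data/$USER/logs/myjob_%j.err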
Partitions
Your job may be in a waiting phase ("Pending" or "PD") depending on available resources. You can specify a particular node partition using --partition.
Use freen to see what's available.
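For example, to request the norm partition explicitly at submission time (norm is the default partition mentioned below):
$ sbatch --partition=norm yourscript.sh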
Summary of partitions
Walltime
The default walltime, or amount of time allocated to a job, is 2 hours on the norm partition. To change the walltime, use --time=d-hh:mm:ss.
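For example, to request 1 day and 12 hours of walltime (an arbitrary value) at submission:
$ sbatch --time=1-12:00:00 yourscript.sh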
To see the walltime limits for each partition, use
batchlim
You can change the walltime after submitting a job using
newwall --jobid <job_id> --time <new_time_spec>
Submit an actual job
Let's submit a batch job. We are going to download data from the Sequence Read Archive (SRA), a public repository of high-throughput, short-read sequencing data. We will discuss the SRA in a bit more detail in Lesson 6. For now, our goal is simply to download multiple fastq files associated with a specific BioProject on the SRA. We are interested in downloading RNAseq files associated with BioProject PRJNA578488, which "aimed to determine the genetic and molecular factors that dictate resistance to WNT-targeted therapy in intestinal tumor cells".
We will learn how to pull the files we are interested in directly from the SRA at a later date. For now, we will use the run information stored in sra_files_PRJNA578488.txt.
Let's check out this file.
less sra_files_PRJNA578488.txt
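It can also help to gauge how many runs the file describes; a rough check, assuming each run accession (SRR...) appears on its own line:
wc -l sra_files_PRJNA578488.txt            # total lines (may include a header)
grep -c "SRR" sra_files_PRJNA578488.txt    # lines containing a run accession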
Now, let's build a script to download a single run, SRR10314042, to a directory called /data/$USER/testscript.
mkdir /data/$USER/testscript
Open the text editor nano and create a script named filedownload.sh.
nano filedownload.sh
Inside our script we will type
#!/bin/bash
#SBATCH --cpus-per-task=6
#SBATCH --gres=lscratch:10
#load module
module load sratoolkit
fasterq-dump -t /lscratch/$SLURM_JOB_ID SRR10314042
Notice our use of sbatch directives inside the script. fasterq-dump uses 6 threads by default, so we specify --cpus-per-task=6. This can be modified once we get an idea of the CPUs actually needed for the job. The -t option assigns the location to be used for temporary files. Here, we use /lscratch/$SLURM_JOB_ID, which is created upon job submission. Because we are using lscratch, we also need to request a local scratch allocation, which we set to 10 GB (--gres=lscratch:10).
Remember:
- Default compute allocation = 1 physical core = 2 CPUs
- Default memory per CPU = 2 GB; therefore, the default memory allocation = 4 GB
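If the defaults are insufficient, CPUs and memory can be requested explicitly with standard sbatch options; a sketch with arbitrary values:
$ sbatch --cpus-per-task=6 --mem=8g filedownload.sh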
Now, let's run the script.
sbatch filedownload.sh
Let's check our job status.
squeue -u $USER
Once the job status changes from PD (pending) to R (running), let's take a closer look at the running job.
sjobs -u $USER
When the job completes, we can list the directory contents to see the downloaded file, SRR10314042.fastq.
ls -lth
Now, what if we want to download all of the runs from sra_files_PRJNA578488.txt? We could use GNU parallel, which acts similarly to a for loop (MORE ON THIS NEXT LESSON), or we could submit multiple fasterq-dump jobs as a job array, with one subjob per run accession.
NOTE: There are instructions for running SRA-Toolkit on Biowulf at https://hpc.nih.gov/apps/sratoolkit.html.
Swarm-ing on Biowulf
Swarm is for running a group of commands (a job array) on Biowulf. swarm reads a list of command lines and automatically submits them to the system as subjobs. To create a swarm file, use nano or another text editor and put all of your command lines in a file called file.swarm. Then use the swarm command to execute them.
By default, each subjob is allocated 1.5 GB of memory and 1 core (consisting of 2 CPUs) --- hpc.nih.gov
$ swarm -f file.swarm
Each subjob writes its standard output and error to files named swarm_jobid_subjobid.o and swarm_jobid_subjobid.e, which you can view with less:
$ less swarm_jobid_subjobid.o
$ less swarm_jobid_subjobid.e
View the swarm options using
swarm --help
For more information on swarm-ing on Biowulf, please see: https://hpc.nih.gov/apps/swarm.html
Let's create a swarm job
To retrieve all the files at once, you can create a swarm job.
nano set.swarm
#SWARM --threads-per-process 3
#SWARM --gb-per-process 1
#SWARM --gres=lscratch:10
#SWARM --module sratoolkit
fasterq-dump -t /lscratch/$SLURM_JOB_ID SRR10314043
fasterq-dump -t /lscratch/$SLURM_JOB_ID SRR10314044
fasterq-dump -t /lscratch/$SLURM_JOB_ID SRR10314045
NOTE: Rather than typing each of these command lines by hand, you can generate them with a for loop and echo; there is an example in the swarm user guide.
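A minimal sketch of that approach, assuming the run accessions can be pulled out of sra_files_PRJNA578488.txt with a simple pattern match (adjust to the file's actual format); the #SWARM directive lines would still be added at the top of set.swarm by hand:
# extract the run accessions (SRR...) into a hypothetical helper file, runs.txt
grep -o "SRR[0-9]*" sra_files_PRJNA578488.txt | sort -u > runs.txt
# append one fasterq-dump command per accession to the swarm file
for run in $(cat runs.txt); do
    echo "fasterq-dump -t /lscratch/\$SLURM_JOB_ID $run" >> set.swarm
done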
Let's run our swarm file. Because we included our directives within the swarm file, the only option we need to include is -f for the file.
swarm -f set.swarm
Running Interactive Jobs
Interactive nodes are suitable for testing and debugging CPU-intensive code, pre- and post-processing of data, graphical applications, and GUI interfaces to applications.
To start an interactive node, type "sinteractive" at the command line prompt ("$") and press Enter/Return on your keyboard.
$ sinteractive
You will see something like this printed to your screen. You only need to use the sinteractive command once per session. If you try to start an interactive node on top of another interactive node, you will get a message asking why you want to start another node.
[username@biowulf ]$ sinteractive
salloc.exe: Pending job allocation 34516111
salloc.exe: job 34516111 queued and waiting for resources
salloc.exe: job 34516111 has been allocated resources
salloc.exe: Granted job allocation 34516111
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3317 are ready for job
srun: error: x11: no local DISPLAY defined, skipping
[username@cn3317 ]$
You can use many of the same options for sinteractive as you can with sbatch. The default sinteractive allocation is 1 core (2 CPUs), 4 GB of memory, and a walltime of 8 hours.
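For example, to request more resources up front (values are arbitrary; sinteractive accepts sbatch-style options like these):
$ sinteractive --cpus-per-task=4 --mem=8g --gres=lscratch:10 --time=4:00:00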
To terminate or cancel the interactive session, use
exit
Exiting from Biowulf
To disconnect from Biowulf, use
exit
So you think you know Biowulf?
Quiz yourself using the hpc.nih.gov biowulf-quiz.
Help Session
Let's submit some jobs on Biowulf.