Lesson 4: Working on Biowulf
Lesson 3 Review
- Flags and command options
- Wildcards (*)
- Tab complete
- Accessing user history with the "up" and "down" arrows
- cat, head, and tail
- Working with file content (input, output, and append)
- Combining commands with the pipe (|)
- grep
- for loops
- File Permissions
Lesson Objectives
- Learn about the slurm system by working on Biowulf, including batch jobs, swarm jobs, and interactive sessions.
- Retrieve data from NCBI through a batch job.
- Learn how to troubleshoot failed jobs.
For this lesson, you will need to connect to Biowulf.
Working on Biowulf
Now that we are becoming more proficient at the command line, we can use these skills to do more than navigate our file system. We can actually begin working on Biowulf. Today's lesson will focus on submitting computational jobs on the Biowulf compute nodes.
Login using ssh
To get started, as always, we will need to log in to Biowulf. Make sure you are on VPN.
Open your Terminal if you are using a Mac, or the Command Prompt if you are using a Windows machine.
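Then connect with ssh (biowulf.nih.gov is the cluster's standard login address):
ssh username@biowulf.nih.gov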
where username is your NIH/Biowulf login username. Remember to use the student account username here.
Type in your password at the prompt. The cursor will not move as you type your password!
When you log in to Biowulf, you are automatically in your home directory (/home/$USER
). This directory is very small and not suitable for large data files or analysis.
Use the cd command to change to your data directory (/data/$USER), where $USER is an environment variable holding your username.
If you do not yet have one, create a directory to work in for this lesson and move to that directory.
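For example, assuming we call the directory lesson4 (the name is just an illustration):
cd /data/$USER
mkdir lesson4
cd lesson4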
Note
When working on Biowulf, you cannot run computational tools on the "login node". Instead, you need to work on a node or nodes with resources sufficient for what you are doing.
To run jobs on Biowulf, you must designate them as interactive, batch, or swarm jobs. Failure to do this may result in a temporary account lockout.
What kind of work can we do on the login node?
The login node can be used for:
- Submitting resource-intensive tasks as jobs
- Editing and compiling code
- File management and data transfers on a small scale
Batch Jobs
Most jobs on Biowulf should be run as batch jobs using the "sbatch" command.
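The general form (yourscript.sh is the placeholder name used throughout this section):
sbatch yourscript.sh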
Here, yourscript.sh is a shell script containing the job commands, including input, output, cpus-per-task, and other steps. Batch scripts always start with #!/bin/bash or a similar call. The shebang (#!) tells the computer which command interpreter to use, in this case the Bourne-again shell (bash).
For example, to submit a job checking sequence quality using fastqc
(MORE ON THIS LATER), you may create a script named fastqc.sh
:
Inside the script, you may include something like this:
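(A sketch; output_dir and the seqfile names are placeholders.)
#!/bin/bash
module load fastqc
fastqc -o output_dir -f fastq seqfile1 seqfile2 ... seqfileN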
- -o names the output directory
- -f states the format of the input file(s)
- seqfile1 ... seqfileN are the names of the sequence files
Note
fastqc is available via Biowulf's module system, so the module must be loaded (module load fastqc) before running the command.
For more information on running batch jobs on Biowulf, please see: https://hpc.nih.gov/docs/userguide.html.
Multi-threaded jobs and sbatch options
In high-performance computing, multithreading lets a single program split itself into multiple "workers" that can run at the same time on different parts of the computer's brain (CPUs), speeding up complex tasks. Many bioinformatics programs use multi-threading.
For multi-threaded jobs, you will need to set --cpus-per-task
. You can do this at the command line or from within your script.
Example at the command line:
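A sketch (replace # with the number of threads):
sbatch --cpus-per-task=# yourscript.sh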
In your script:
#!/bin/bash
#SBATCH --job-name qc
#SBATCH --mail-type BEGIN,END
#SBATCH --cpus-per-task #
module load fastqc
fastqc -o output_dir -t $SLURM_CPUS_PER_TASK -f fastq seqfile1 seqfile2 ... seqfileN
Within the script we can use directives denoted by #SBATCH to supply command line options such as --cpus-per-task. If included within the script, you will not need to specify them at the command line when submitting the job. You should also pass the environment variable $SLURM_CPUS_PER_TASK to the program's thread argument (-t in the example above). Some other useful directives include --job-name, which assigns a name to the submitted job, and --mail-type, which you can use to direct slurm to send you an email when a job begins, ends, or both.
Tip
The jobscript should always be the last argument of sbatch
.
To see more sbatch
options, use
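sbatch --help
The man page (man sbatch) gives the full list.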
Some slurm commands
Once you submit a job, you will need to interact with the slurm
system to manage or view details about submitted jobs.
Here are some useful commands for this purpose:
Courtesy of NIH HPC Team Training Documentation
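A few commonly used examples (check each command's help for full usage; jobload and jobhist are Biowulf-specific helpers discussed later in this lesson):
squeue -u $USER   # list your pending and running jobs
scancel 12345     # cancel a job by its job id
jobload           # CPU and memory load of your running jobs
jobhist 12345     # resource usage summary for a completed job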
Standard error and standard output
Output that you would expect to appear in the terminal (e.g., standard error and standard output) will not in batch mode. Rather, these will be written by default to slurm######.out
in the submitting directory, where ######
represents the job id. These can be redirected using --output=/path/to/dir/filename
and --error=/path/to/dir/filename
on the command line or as an #SBATCH
directive.
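For example, as #SBATCH directives (%j expands to the job id; the paths are illustrations):
#SBATCH --output=/data/$USER/myjob_%j.out
#SBATCH --error=/data/$USER/myjob_%j.err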
Partitions
Your job may be in a waiting phase ("Pending" or "PD") depending on available resources. You can specify a particular node partition using --partition
.
Use freen
to see what's available.
Summary of partitions
Student partition
The student accounts have their own partition for running jobs, --partition=student
. This will need to be included on the command line or in the job script.
Walltime
The default walltime, or amount of time allocated to a job, is 2 hours on the norm
partition. To change the walltime, use --time=d-hh:mm:ss
.
Here are the walltimes by partition:
You can change the walltime after submitting a job using
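On Biowulf this is done with the newwall utility; a sketch, using an example job id (see the HPC user guide for exact syntax):
newwall --jobid 34516111 --time 1-00:00:00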
Submit an actual job
Let's submit a batch job. We are going to download data from the Sequence Read Archive (SRA), a public repository of high throughput, short read sequencing data. We will discuss the SRA a bit more in detail in Lesson 5. For now, our goal is to simply download multiple fastq files associated with a specific BioProject on the SRA. We are interested in downloading RNAseq files associated with BioProject PRJNA578488, which "aimed to determine the genetic and molecular factors that dictate resistance to WNT-targeted therapy in intestinal tumor cells".
We will learn how to pull the files we are interested in directly from SRA at a later date. For now, we will use the run information stored in sra_files_PRJNA578488.txt
.
Let's copy this file to our working directory and view using less
.
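A sketch (the source path is a placeholder; use wherever the file is provided for the class):
cp /path/to/sra_files_PRJNA578488.txt .
less sra_files_PRJNA578488.txt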
Now, let's build a script downloading a single run, SRR10314042
, to a directory called /data/$USER/testscript
.
Open the text editor nano
and create a script named filedownload.sh
.
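nano filedownload.sh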
Inside our script we will type
#!/bin/bash
#SBATCH --cpus-per-task=6
#SBATCH --gres=lscratch:10
#SBATCH --partition=student
#load module
module load sratoolkit
fasterq-dump -t /lscratch/$SLURM_JOB_ID SRR10314042
Notice our use of sbatch directives inside the script.
fasterq-dump uses 6 threads by default, so we are specifying --cpus-per-task=6. This can be modified once we get an idea of the CPUs needed for the job. The -t option assigns the location to be used for temporary files. Here, we are using /lscratch/$SLURM_JOB_ID, which is created upon job submission. Because we are using local scratch space, we also need to request an allocation, which we are setting to 10 GB (--gres=lscratch:10). The final directive, --partition=student, is required because we are using student accounts for this lesson. If you are with the Center for Cancer Research and using your own account, you can use --partition=ccr.
Remember:
- Default compute allocation = 1 physical core = 2 CPUs
- Default memory per CPU = 2 GB; therefore, default memory allocation = 4 GB
Now, let's run the script.
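sbatch filedownload.sh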
Let's check our job status.
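One way is with the standard slurm command:
squeue -u $USER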
Once the job status changes from PD
(pending) to R
(running), let's check the job status.
Some other useful job monitoring commands include jobload and jobhist. The latter is useful once the job completes.
When the job completes, we should have a file called SRR10314042.fastq
.
Now, what if we want to download all of the runs from sra_files_PRJNA578488.txt? We could use a for loop, GNU parallel, which acts similarly to a for loop (MORE ON THIS NEXT LESSON), or we could submit multiple fasterq-dump jobs in a job array, one subjob per run accession.
Note
There are instructions for running SRA-Toolkit on Biowulf here.
Swarm-ing on Biowulf
Swarm is for running a group of commands (a job array) on Biowulf. swarm reads a list of command lines and automatically submits them to the system as subjobs. "By default, swarm runs one command per core on a node, making optimum use of a node. Thus, a node with 16 cores will run 16 commands in parallel." To create a swarm file, you can use nano or another text editor and put all of your command lines in a file called file.swarm (file is just a placeholder). Then you will use the swarm command to execute it.
For example,
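swarm -f file.swarm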
Note
By default, each subjob is allocated 1.5 gb of memory and 1 core (consisting of 2 cpus) --- hpc.nih.gov
Swarm creates two output files for each command line, one each for STDOUT (file.o) and STDERR (file.e). You can look into these files with the less
command to see any important messages.
For example,
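Assuming swarm's default output naming of swarm_<jobid>_<subjob>.o and .e (check your submitting directory for the exact names):
less swarm_12345_0.o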
View the swarm
options using
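swarm --help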
For more information on swarm-ing on Biowulf, please see: https://hpc.nih.gov/apps/swarm.html
Let's create a swarm job
To retrieve all the files at once, you can create a swarm job.
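Open a new swarm file in nano (the filename below is just an illustration):
nano sra_download.swarm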
Copy the following and paste into the swarm file.
#SWARM --threads-per-process 6
#SWARM --gb-per-process 4
#SWARM --gres=lscratch:20
#SWARM --module sratoolkit
#SWARM --partition=student
fasterq-dump -v -t /lscratch/$SLURM_JOB_ID SR10314043 #Add error to see this in job output
fasterq-dump -t /lscratch/$SLURM_JOB_ID SRR10314044
fasterq-dump -t /lscratch/$SLURM_JOB_ID SRR10314045
There is advice for generating a swarm file using a for loop and echo in the swarm user guide. If you can help it, do not type each of these commands by hand.
Something like this could generate the lines above:
cat ../sra_files_PRJNA578488.txt | while read line; do
echo 'fasterq-dump -v -t /lscratch/$SLURM_JOB_ID' $line >> script.sh;
done
Fix the code
You may have noticed that the while
loop above does not produce the same lines. The order of the files is in reverse, and there is an extra line of code.
fasterq-dump -v -t /lscratch/$SLURM_JOB_ID SRR10314045
fasterq-dump -v -t /lscratch/$SLURM_JOB_ID SRR10314044
fasterq-dump -v -t /lscratch/$SLURM_JOB_ID SRR10314043
fasterq-dump -v -t /lscratch/$SLURM_JOB_ID
How can we fix the lines written to script.sh? Use the skills you have learned and a little help from Google.
Let's run our swarm file. Because we included our directives within the swarm file, the only option we need to include is -f for the swarm file name.
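Assuming the example filename from above:
swarm -f sra_download.swarm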
Running Interactive Jobs
Interactive nodes are suitable for testing/debugging CPU-intensive code, pre/post-processing of data, and running graphical applications that require a GUI.
To start an interactive node, type sinteractive
at the command line and press Enter/Return on your keyboard.
You will see something like this printed to your screen. You only need to use the sinteractive
command once per session. If you try to start an interactive node on top of another interactive node, you will get a message asking why you want to start another node.
[username@biowulf ]$ sinteractive
salloc.exe: Pending job allocation 34516111
salloc.exe: job 34516111 queued and waiting for resources
salloc.exe: job 34516111 has been allocated resources
salloc.exe: Granted job allocation 34516111
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3317 are ready for job
srun: error: x11: no local DISPLAY defined, skipping
[username@cn3317 ]$
You can use many of the same options for sinteractive
as you can with sbatch
. The default sinteractive allocation is 1 core (2 CPUs), 3 GB of memory (1.5 GB per CPU), and a walltime of 8 hours.
For example,
sinteractive --gres=lscratch:20 --cpus-per-task=6
module load sratoolkit
fasterq-dump -t /lscratch/$SLURM_JOBID SRR2048331 -O /data/$USER/sra
To terminate / cancel the interactive session use
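exit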
Exiting from Biowulf
To disconnect the remote connection on Biowulf, use
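exit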
Transferring files to Biowulf
Before ending this lesson, I want to briefly discuss file transfers. At some point you will need to know how to get your data on Biowulf. The NIH HPC Team has fantastic documentation on this subject at https://hpc.nih.gov/docs/transfer.html.
Remember, Helix (helix.nih.gov) is the interactive data transfer and file management node for the NIH HPC Systems. If you are interactively transferring files to and from NIH HPC directories, you should connect to Helix:
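ssh username@helix.nih.gov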
where username is your Biowulf username.
Recommended methods for file transfers
Large scale transfers
For large scale transfers and large files use, Globus. Find instructions here.
Small scale transfers
For small scale transfers, use scp to or from Helix, or drag and drop files by mounting HPC Systems directories on your local machine.
For example, to copy Module_1
to your local machine, use:
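A sketch, run from your local machine (the /data path is an assumption; adjust it to where Module_1 actually lives):
scp -r username@helix.nih.gov:/data/username/Module_1 .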
Note
This command and the next are issued from your local computer. The .
in the above command means that you are copying Module_1
into your current working directory.
To copy from your local machine to Biowulf, use:
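Again a sketch run from your local machine (the destination path is an assumption):
scp -r Module_1 username@helix.nih.gov:/data/username/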
Make sure to substitute username with your Biowulf username.
To mount the HPC directories to your local computer, follow the instructions here.
Help Session
Practice the skills learned in this lesson here.
So you think you know Biowulf?
Quiz yourself using the hpc.nih.gov biowulf-quiz.