Lesson 6: Submitting batch jobs and transferring between local machine and Biowulf
Quick review:
In the previous lesson, we learned to request an interactive session on Biowulf so that we could perform more compute-intensive tasks, such as downloading sequencing data from NCBI SRA and then assessing the quality of the downloaded data. The commands that we learned include:
- sinteractive (to request an interactive session)
- module avail (to view a list of the software installed on Biowulf)
- module whatis (to obtain details about a specific software package)
- module load (to load software)
- fastq-dump (to download sequencing data from NCBI SRA)
- fastqc (to assess sequencing data quality)
Lesson objectives:
After this lesson, we will
- Be able to develop a short shell script that will
  - download sequencing data from NCBI SRA
  - assess the quality of the sequencing data
- Know how to submit shell scripts as batch jobs.
- Be able to transfer files between our local computer and Biowulf.
Unix commands that we will use in this lesson:
- nano (to open the nano editor to edit files)
- sbatch (to submit jobs to Biowulf)
- squeue (to check job status)
- scancel (to cancel jobs)
- scp (to copy content between our local computer and Biowulf)
Creating shell scripts and submitting batch jobs
First, sign into Biowulf
ssh username@biowulf.nih.gov
Change into the data directory
cd /data/username
Let's then create a folder called SRR1553606_fastqc to store the FASTQ files for SRR1553606 and its fastqc output.
mkdir SRR1553606_fastqc
Next, change into SRR1553606_fastqc
cd SRR1553606_fastqc
Let's create a shell script called SRR1553606_fastqc.sh. Shell scripts have the extension ".sh". This script will accomplish the following:
- Download sequencing data (FASTQ files) for NCBI SRA accession SRR1553606. The study from which SRR1553606 was derived used paired-end sequencing, so we should get two FASTQ files.
- After downloading the FASTQ files, we will run fastqc to assess sequencing data quality.
Creating SRR1553606_fastqc.sh using the nano editor
Nano is a built-in Unix text editor, and we can use it to create SRR1553606_fastqc.sh. The syntax is nano file_to_edit, so in this example do the following.
nano SRR1553606_fastqc.sh
This will open a blank editor (since SRR1553606_fastqc.sh is a new file). At the top of the editor we see the name of the file that is opened in nano.
Figure 1: Nano opens a blank editor for us to start constructing SRR1553606_fastqc.sh.
Copy and paste the code below into the editor
#!/bin/bash
#SBATCH --job-name=SRR1553606_fastqc
#SBATCH --mail-type=ALL
#SBATCH --mail-user=wuz8@nih.gov
#SBATCH --mem=1gb
#SBATCH --partition=student
#SBATCH --time=00:05:00
#SBATCH --output=SRR1553606_fastqc_log
#SBATCH --gres=lscratch:5
export TMPDIR=/lscratch/$SLURM_JOB_ID
#LOAD REQUIRED MODULES
module load sratoolkit
module load fastqc
#USE FASTQ-DUMP FROM THE SRATOOLKIT TO DOWNLOAD FASTQ FILES FROM NCBI SRA ACCESSION SRR1553606
#FOR PAIRED END SEQUENCING, --split-files WILL WRITE THE FORWARD AND REVERSE READS TO SEPARATE FILES, SO
##WE WILL END UP WITH TWO FASTQ FILES IN THIS DOWNLOAD
#-X 10000 WILL ONLY DOWNLOAD THE FIRST 10000 READS
fastq-dump --split-files -X 10000 SRR1553606
#RUN FASTQC
fastqc -o /data/$USER/SRR1553606_fastqc SRR1553606*.fastq
After pasting the code into the editor, hit control-x. A message will appear at the bottom of the editor asking whether we would like to save the file. Hit "Y" for yes in this case.
Figure 2: Hit control-x after editing to exit nano. It will ask whether you would like to save.
Next, we will be asked what name to save the file as. Hit enter to save as SRR1553606_fastqc.sh.
Figure 3: If we choose to save, nano will ask which file name we would like to save to.
Now, let's break down SRR1553606_fastqc.sh
- Lines that start with "#" are comments and are not run as a part of the script
- A shell script starts with #!/bin/bash, where
- "#!" is known as the sha-bang
- following "#!", is the path to the command interpreter (ie. /bin/bash)
- Next, we have lines that start with #SBATCH. Because these lines start with "#", they will not be run as a part of the script. However, these lines are important because they instruct Biowulf on when and where to send job notification as well as what resources need to be allocated.
- job-name: (name of the job)
- mail-type: (type of notification emails to receive; here we want Biowulf to email us all notifications regarding the job)
- mail-user: (where to send notification emails, this should be your NIH email)
- mem: (RAM or memory required, we want 1gb)
- partition: (which partition to use; student accounts will need to use the student partition; if you are using your own Biowulf account, please remove this line)
- time: (how much time should be allotted for the job, we want 5 minutes)
- output: (name of the log file)
- gres: (request lscratch space of 5 gb - recall the sratoolkit requires us to allocate lscratch space because it has to write temporary files)
- export TMPDIR=/lscratch/$SLURM_JOB_ID sets the TMPDIR or temporary directory path to lscratch. Temporary files will be written to TMPDIR.
- Load sratoolkit and fastqc using module load.
- Download the FASTQ data using fastq-dump from the sratoolkit.
- Run fastqc. From the previous class, we know that the files downloaded are SRR1553606_1.fastq and SRR1553606_2.fastq, so we can use the wildcard "*" to denote anything between the accession (SRR1553606) and the file extension ".fastq" when providing input for fastqc. We specify the output directory using the -o option and provide the path /data/$USER/SRR1553606_fastqc, where $USER is the environment variable that holds your Biowulf username.
To submit the shell script as a job, we will do
sbatch SRR1553606_fastqc.sh
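If the submission is accepted, sbatch prints the ID that Slurm assigned to the job; this is the same ID that you will see in squeue and in the notification emails. The output looks something like the line below (the job ID here is just an example; yours will differ).
Submitted batch job 55421634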
After submission, we can use squeue to check on the job status.
squeue -u username
For instance, below is the status of a job that I submitted; the job status (ST column) is PD (pending). Note that because I am not signed in with a student account, I can use the "norm" partition for this job.
[wuz8@biowulf wuz8]$ squeue -u wuz8
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
55421634 norm unix_on_ wuz8 PD 0:00 1 (None)
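If we only want to check on one particular job rather than all of our jobs, squeue also accepts a job ID via the -j option; for example, using the job ID shown above:
squeue -j 55421634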
To cancel a job, use scancel followed by the job ID.
scancel job_id
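For example, to cancel the job shown above, or to cancel all of our own queued and running jobs at once (use this with care), we could run:
scancel 55421634
scancel -u username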
Figure 4 and Figure 5 show the emails that Biowulf will send us when a submitted job starts and when it completes. Because we set --mail-type=ALL, Biowulf will email us about all status changes for a batch job. For instance, if we cancel a job, Biowulf will send us an email (Figure 6).
Figure 4: Email sent by Biowulf to our NIH emails when a job starts.
Figure 5: Email sent by Biowulf to our NIH emails when a job completes.
Figure 6: Email sent by Biowulf to our NIH emails when a job is cancelled.
Note that we can always go into the Biowulf User Dashboard to get information about our jobs. For those on student accounts, use the student dashboard, but sign in with your own NIH username, not the student1, student2, etc. usernames. At the User Dashboard, click on "Job Info" and a table listing the jobs will appear (Figure 7). Clicking on a job (ie. 55599045 shown in Figure 7), we will see useful information such as a plot of memory and CPU usage (Figure 8), which is important because, depending on the usage, we can fine-tune the amount of resources that we request in future jobs.
Figure 7: A listing of student1's jobs in the Biowulf User Dashboard.
Figure 8: Memory and CPU usage for one of student1's jobs in the Biowulf User Dashboard.
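If you prefer the command line to the dashboard, the standard Slurm accounting command sacct reports similar information about completed jobs. Below is a minimal sketch using the job ID from Figure 7; the list of fields after --format can be adjusted to taste.
sacct -j 55599045 --format=JobID,JobName,Partition,State,Elapsed,MaxRSS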
Transferring data between Biowulf and local machine
Now that we have generated FASTQC reports for SRR1553606, we need to transfer them to our local machine so we can view them in a web browser. There are many options for transferring files between our local computer and Biowulf. See Transferring data to and from the NIH HPC systems for details.
Options for transferring between local and Biowulf are listed below.
- Mount HPC drives to local machine (slow)
- Command line tools
- secure copy (scp)
- sftp
- Use graphical sftp or scp client
- WinSCP
- Fugu
- Mobaxterm
- Globus is recommended for transferring large datasets between local and HPC
Biowulf also provides options for transferring to and from NIH Box and NIH OneDrive. See Transferring data between NIH Box or NIH OneDrive and HPC systems or Globus on NIH HPC (Biowulf) for instructions.
Some of the data transfer options involve installing new software. Submitting a ticket with service.cancer.gov will help you get that done.
Transferring data using Globus
Globus is the recommended tool for transferring large datasets between your local computer and HPC. Refer to https://hpc.nih.gov/docs/globus/ for instructions on how to set up your Globus account. Use Chrome when working with Globus.
Once you have set up your Globus account, go to https://www.globus.org/ to log in (see Figure 9).
Figure 9: Log in to your Globus account at https://www.globus.org/.
Clicking on the Log In button (Figure 9) will take you to a page where you can search for your institution or organization. The search auto-suggests, so when your institution or organization (ie. National Institutes of Health) shows up, select it (Figure 10).
Figure 10: Select your institution or organization.
Once you have selected your institution or organization, click on Continue (Figure 11) and you will be directed to complete the Globus log in process by entering your NIH credentials (Figure 12).
Figure 11: Click continue to proceed to entering your NIH credentials.
Figure 12: Complete the Globus log in process by entering your NIH credentials.
Hit Agree on the screen that appears after you have provided your NIH credentials (Figure 13). This takes you to the Globus File Manager (Figure 14).
Figure 13: Hit agree to proceed.
Figure 14: Globus file manager.
At the Globus File Manager (Figure 15), search for your local file collection (see https://hpc.nih.gov/docs/globus/ for creating a local collection). Next, select Transfer or Sync to, and the File Manager splits into two panes. In the right pane (Figure 16), search for the NIH HPC Data Transfer (Biowulf) collection; this will take you to your Biowulf home directory, but you can change into your data directory below the search box. Select the files that you want to transfer and hit Start to transfer them to your local machine. Upon a successful transfer, the files should show up in the left pane, which is your local file collection.
Figure 15: Search for the contents of your local computer at the Globus File Manager.
Figure 16: Find the NIH HPC Data Transfer (Biowulf) collection, change into your Biowulf data directory, select files, and hit Start to transfer them to your local machine.
Transferring data using scp
For the remainder of this lesson, we will learn how to use secure copy (scp) to transfer the FASTQC html reports for SRR1553606 to our local computers for viewing. While the syntax for scp is the same whether we use Windows or Mac, there are subtle differences, such as the way directory paths are expressed on Windows versus Mac. Below, we break down the instructions for using scp on both operating systems.
Directory path structure: Mac versus Windows
Before using the scp command to transfer data from Biowulf to our local computer, we should understand the directory path structure on Macs and Windows.
Mac directory path structure
Mac directory path structure follows that of Unix. For instance, the absolute path of the local Downloads folder follows the structure below. This path is absolute because it starts at the root, which is denoted by the "/" at the beginning of the path. Replace username with your NIH username, since this is what you use to sign onto your local machine.
/Users/username/Downloads
Windows directory path structure
The path to the local Downloads folder in Windows is shown below. It starts with the name of the disk drive that the folder resides on, followed by a ":". Windows uses letters of the alphabet to name disk drives. Replace username with your NIH username, since this is what you use to sign onto your local machine. Also note that the parts of a Windows directory path are separated by "\" rather than the "/" seen on Macs and Unix.
disk_name:\Users\username\Downloads
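For example, assuming Windows is installed on the C: drive and your username is wuz8, the path to the Downloads folder would be:
C:\Users\wuz8\Downloads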
scp for Windows users
We are going to place the FASTQC html reports for SRR1553606 in our Windows Downloads folder. The next step is to change into the Windows Downloads folder. Upon opening the Command Prompt, you should be in the disk_drive:\Users\username folder, where:
- disk_drive is a letter of the alphabet (ie. O in Figure 17). This is the drive on which Windows is installed, also known as the system drive.
- username is your NIH username, since you are now working locally and you use your NIH username to log into your local computer (in Figure 17, you see wuz8 because that is my NIH username).
Figure 17: Upon opening the Windows Command Prompt, you will land in your Users\username folder (where username is your NIH username).
At the Windows Command Prompt, type dir to list the contents of the Users\username folder and you will see a subfolder called Downloads (Figure 17). In Unix, we used ls to list directory contents. On Windows, dir is used to list directory contents in the Command Prompt; it is just one of the many commands from the Microsoft Disk Operating System, or MS-DOS.
dir
Next, type cd Downloads to change into your local Windows Downloads directory. Here, we see a command that Unix and MS-DOS have in common (ie. cd).
cd Downloads
In the Windows Downloads folder, use the following commands to transfer the FASTQC reports from Biowulf. Again, the "." at the end of the commands denotes "here, in the current directory". Enter the password you used to sign into Biowulf if prompted.
scp username@helix.nih.gov:/data/username/SRR1553606_fastqc/SRR1553606_1_fastqc.html .
scp username@helix.nih.gov:/data/username/SRR1553606_fastqc/SRR1553606_2_fastqc.html .
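Because both reports live in the same Biowulf directory, they can also be fetched in one command with a wildcard. This is a sketch that assumes no other .html files are in that directory; the Windows Command Prompt passes the wildcard through to the remote side unexpanded.
scp username@helix.nih.gov:/data/username/SRR1553606_fastqc/*.html .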
After the download completes, the two FASTQC html reports for SRR1553606 will appear in the Windows Downloads folder and you can view them using a web browser (Figure 18).
Figure 18: FASTQC html reports for SRR1553606 appear in the Windows Downloads folder after successful scp.
scp for Mac users
Mac users will need to open a Terminal window and then change into the local Downloads folder, which should be /Users/username/Downloads (where username is your NIH username).
cd /Users/username/Downloads
If you cannot remember the username for your government-furnished Mac, then do the following, where "~" denotes the home directory (technically, the Downloads folder is located in the home directory, which is /Users/username, where again username is your NIH username). Another option is to type echo $USER to find the username for your government-furnished Mac.
cd ~/Downloads
Once in our Downloads folder, we do the following to copy over the FASTQC html files, where username is your NIH username or one of the student accounts (ie. student1, student2, student3, etc.). Enter your password when prompted.
scp username@helix.nih.gov:/data/username/SRR1553606_fastqc/SRR1553606_1_fastqc.html .
scp username@helix.nih.gov:/data/username/SRR1553606_fastqc/SRR1553606_2_fastqc.html .
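As on Windows, both reports can be copied in a single command with a wildcard. On a Mac, quote the remote path so that your local shell does not try to expand the "*" itself (this again assumes no other .html files are in that directory).
scp "username@helix.nih.gov:/data/username/SRR1553606_fastqc/*.html" .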
If scp was successful, we should see the two FASTQC html reports for SRR1553606 in the Mac Downloads folder (Figure 19).
Figure 19: FASTQC html reports for SRR1553606 appear in the Mac Downloads folder after a successful scp.
To copy SRR1553606_1_fastqc.html and SRR1553606_2_fastqc.html from your local machine to Biowulf using scp, do the following (we will put them in our data directory). Note that we should stay in our local Downloads directory for this. Enter your password if prompted.
scp SRR1553606_1_fastqc.html username@helix.nih.gov:/data/username
scp SRR1553606_2_fastqc.html username@helix.nih.gov:/data/username
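To copy an entire folder rather than individual files, scp accepts the -r (recursive) option. A quick sketch, where my_results is a hypothetical local folder:
# my_results is a hypothetical example folder on the local machine
scp -r my_results username@helix.nih.gov:/data/username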