Lesson 4: Submitting R Scripts via command line
Learning Objectives
- Learn how to use R with less interaction.
- Learn how to deploy `sbatch` R jobs, and learn about alternatives such as `swarm`.
- Learn about R job parallelization in the context of Biowulf.
We have organized our R project directory and have set up `renv` to make our R environment a bit more reproducible. Now, we need to learn how to submit an R script. Thus far, we have been using R interactively by first obtaining an interactive compute node (`sinteractive`). However, we can submit R scripts without interaction using `sbatch` and `swarm`. This is advantageous, as we may want to include our R script in a pipeline or process thousands of files.
Running R scripts from the command line can be a powerful way to:

- Automate your R scripts
- Integrate R into production
- Call R through other tools or systems

--- Nathan Stephens, Posit Support
Example scripts

We will use a couple of example scripts in this section (`DESeq2_airway.R`, `Volcano.R`). The first script uses the R package `airway`, which contains data from Himes et al. 2014, a bulk RNA-Seq study, as a `RangedSummarizedExperiment`. The Bioconductor package `DESeq2` is then used to produce differential expression results. This R script largely follows a Bioconductor workflow on RNA-seq. The second script (`Volcano.R`) takes output from the first script and makes a volcano plot using the package `EnhancedVolcano`.
Warning
These scripts are for example purposes only; do not apply them to your own data.
Running R from command line
Before jumping into submitting scripts in job files, let's first focus on how to run R from the command line.
The primary way to run R from the command line is to call `Rscript`. `Rscript` is a binary front-end to R intended for scripting applications; basically, it is a convenience function.
Let's see this in action first in an interactive session:
sinteractive --gres=lscratch:5
module load R/4.2.2
Let's use our `renv_test` directory to see how this works. The syntax is `Rscript [options] file [args]`.
cd /data/$USER/R_on_Biowulf/renv_test
Rscript DESeq2_airway.R > DESeq2_airway.out
The defaults are `--no-echo`, which makes R run as quietly as possible, and `--no-restore`, which indicates that we do not want anything restored (e.g., objects, history, etc.). These defaults also imply `--no-save`, meaning the workspace will not be saved. Here, we have supplied no additional options or args.
As a convenience function, `Rscript` is the same as calling:
R --no-echo --no-restore --no-save --file=DESeq2_airway.R > DESeq2_airway2.out
Note

We have been using `>` to redirect stdout to a file. We can also use `<` to redirect the input file. See below.
R --no-echo --no-restore --no-save < DESeq2_airway.R > DESeq2_airway3.out
Info: Rscript --help

You can learn more about `Rscript` using `Rscript --help` and `R --help`. Notice from `R --help` that you can also use `R CMD BATCH` to run an R script from the command line. To run a script from the R console, use `source()`.
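For example, a minimal sketch of the `R CMD BATCH` alternative (by default it writes the echoed commands and their output to a `.Rout` file named after the script):

```bash
# Runs the script and writes DESeq2_airway.Rout to the current directory
R CMD BATCH DESeq2_airway.R
```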
Saving R output

Notice that we can easily save R output directed to standard output using `>`. However, this will exclude messages, warnings, and errors, which are directed to standard error. For stdout you can specify `1>` or `>`; for stderr you can specify `2>`; and for both in a single file you can specify `&>`. See here and here for more information.
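For example, using the script from above:

```bash
Rscript DESeq2_airway.R 1> DESeq2_airway.out   # stdout only
Rscript DESeq2_airway.R 2> DESeq2_airway.err   # stderr only (messages, warnings, errors)
Rscript DESeq2_airway.R &> DESeq2_airway.log   # stdout and stderr in a single file
```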
Adding command line arguments

R scripts can be run from the command line with command line arguments. Here is a great resource from Software Carpentry explaining command line arguments.

To use command line arguments with an R script, we leverage `commandArgs()`. This function creates a vector of command line arguments. When using `trailingOnly = TRUE`, `commandArgs()` only returns arguments after `R --no-echo --no-restore --file --args`.
Let's see how this works in a simple script that returns a volcano plot of our differential expression results. First, let's copy over the `Volcano.R` script to our test directory, `renv_test`.
cp /data/classes/BTEP/R_on_Biowulf_2023/scripts/Volcano.R .
The contents of `Volcano.R`:
# Create a Volcano Plot from DESeq2 differential expression results ----
library(EnhancedVolcano)
library(dplyr)

## Set command line arguments ----
args <- commandArgs(trailingOnly = TRUE)

# Stop the script if there is no command line argument
if (length(args) == 0) {
  print("Please include differential expression results!")
  stop("Requires command line argument.")
}

## Read in data ----
data <- read.csv(args[1], row.names = 1) %>% filter(!is.na(padj))
labs <- head(row.names(data), 5)

## Plot ----
EnhancedVolcano(data,
                title = "Enhanced Volcano with Airways",
                lab = rownames(data),
                selectLab = labs,
                labSize = 3,
                drawConnectors = TRUE,
                x = 'log2FoldChange',
                y = 'padj')
ggsave("./figures/Volcano.png", width = 5.5, height = 3.5,
       units = "in", dpi = 300, scale = 2)
How can we run this from the command line?
Rscript Volcano.R ./outputs/deseq2_DEGs.csv
The easiest way to check the output of this script (`Volcano.png`) is to mount our HPC system directories locally.
Info: Packages used to parse command-line arguments

There are also several packages that can be used to parse command-line arguments, such as `getopt`, `optparse`, `optigrab`, `argparse`, `docopt`, and `GetoptLong`.
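As a minimal sketch (not part of the lesson scripts), here is how the positional argument in `Volcano.R` might be replaced with a named option using `optparse`; the option name `--degs` is hypothetical:

```r
library(optparse)

# Define a named option; --degs is a hypothetical name for this example
option_list <- list(
  make_option(c("-d", "--degs"), type = "character", default = NULL,
              help = "CSV file of differential expression results")
)
opt <- parse_args(OptionParser(option_list = option_list))

if (is.null(opt$degs)) {
  stop("Requires --degs argument.")
}
data <- read.csv(opt$degs, row.names = 1)
```

This would be run as `Rscript Volcano.R --degs ./outputs/deseq2_DEGs.csv`.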
Rendering Rmarkdown files from the command line

In addition to R scripts, we can render Rmarkdown files directly from the command line by adding an R expression (an object that represents an action that can be performed by R) directly to our `Rscript` command using the `-e` expression flag.
cp /data/classes/BTEP/R_on_Biowulf_2023/rmarkdown/Volcano.Rmd .
cp ./outputs/deseq2_DEGs.csv DEGs.csv
Rscript -e "rmarkdown::render('Volcano.Rmd',params=list(args = 'DEGs.csv'))"
To make this work, parameters had to be added to the YAML header of the Rmarkdown file.
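For reference, a minimal sketch of what such a YAML header might look like (the actual header of `Volcano.Rmd` may differ; the key point is that the parameter name, `args`, matches the name passed to `params` in the `render()` call, and is accessed inside the Rmarkdown as `params$args`):

```yaml
---
title: "Volcano Plot"
output: html_document
params:
  args: "DEGs.csv"  # default value; overridden by render(params = ...)
---
```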
Using sbatch

R batch jobs are similar to any other batch job. A batch script ('rjob.sh') is created that sets up the environment and runs the R code. --- R/Bioconductor on Biowulf

Default allocations for an `sbatch` job include 2 CPUs with a default memory per CPU of 2 GB. Therefore, the default memory allocation is 4 GB.
More about sbatch

`sbatch` is used to submit batch jobs, which are resource provisions that run applications on compute nodes and do not require supervision or interaction. To submit a batch job, a job script containing a list of unix commands to be executed by the job is typically required. This script may also include resource requirements (job directives) telling the job scheduler what types of resources are needed for the job. While bash shell scripting is typically used to write these files, other shells can also be used.
Features of job scripts:

- If using a bash shell, the file typically ends in `.sh`.
- File content starts with a shebang (`#!`) followed by the path to the interpreter (`/bin/bash`) on the first line.
- Content may include SLURM job directives denoted by `#SBATCH` at the beginning of the script, directly following `#!/bin/bash`. These can provide information to the Biowulf batch system such as:
    - Partition (default = "norm", `--partition`)
    - Name of the job (`--job-name`)
    - What types of job status notifications to send (`--mail-type`)
    - Where to send job status notifications (`--mail-user`)
    - Memory to allocate (`--mem`)
    - Time to allocate (`--time`)
    - CPUs per task (# of threads if multithreaded) (`--cpus-per-task`)
- Following `#SBATCH` directives, you can include comments throughout your list of commands using `#`.
See important sbatch flags here and complete options with `sbatch --help`.
Submitting the R script as a job using sbatch

We will create and submit a job script using `sbatch` that will run the R scripts in the project we created in Lesson 3 (`MyNewProject`).
Example job script:
nano rjob.sh
#!/bin/bash
#SBATCH --gres=lscratch:5
#SBATCH --mail-type=BEGIN,END
#Load the R module
module load R/4.2.2
#change to project directory
cd /data/$USER/R_on_Biowulf/MyNewProject
#Run R scripts using Rscript
Rscript ./R/DESeq2_airway.R
Rscript ./R/Volcano.R ./outputs/deseq2_DEGs.csv
`Ctrl+O`, `return`, `Ctrl+X`
The R script should be run in the project directory (`MyNewProject`) to take advantage of `renv`.
We included the job directives `--gres=lscratch:5` and `--mail-type=BEGIN,END`. `--gres=lscratch:5` ensures that we have 5 GB of lscratch space for temporary storage. `--mail-type=BEGIN,END` directs the job scheduler to send us an email when the job starts and ends. This email will by default go to your NIH email.
Note: stdout & stderr

For an sbatch job, a stdout and stderr file is automatically generated (by default, `slurm######.out` in the submitting directory). This can be modified using the sbatch flags / directives `--output=/path/to/dir/filename` and `--error=/path/to/dir/filename`.
Note: command line flags vs directives
You can also include job flags at the time of job submission. If these conflict with #SBATCH directives, the command line flags take priority.
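For example, a hypothetical override of the mail settings in `rjob.sh`:

```bash
# --mail-type on the command line takes priority over the
# #SBATCH --mail-type=BEGIN,END directive inside rjob.sh
sbatch --mail-type=END rjob.sh
```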
Let's submit the script.
sbatch rjob.sh
A job ID will be returned if the submission (`rjob.sh`) is correct.
Using swarm

Swarm is a way to submit multiple commands to the Biowulf batch system; each command is run as an independent job with identical resources, allowing for parallelization.

- Swarm scripts have the extension `.swarm`.
- Lines that start with `#SWARM` are not run as part of the script; these are directives that tell the Biowulf batch system what resources (i.e., memory, time, temporary storage, modules) are needed.

See here for submitting R swarm jobs.
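As a minimal sketch (the input file names here are hypothetical, and the `#SWARM` directives mirror the sbatch directives used above), a swarm file for our volcano script might look like:

```bash
# rjobs.swarm -- each non-directive line runs as an independent job
#SWARM --gres=lscratch:5
#SWARM --module R/4.2.2
Rscript ./R/Volcano.R ./outputs/deseq2_DEGs_groupA.csv
Rscript ./R/Volcano.R ./outputs/deseq2_DEGs_groupB.csv
```

This would be submitted with `swarm -f rjobs.swarm`. Note that in practice each command should write to a distinct output file; `Volcano.R` as written hardcodes its output path.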
Rswarm

There is also a utility, `Rswarm`, that may interest you in specific cases.

Rswarm is a utility to create a series of R input files from a single R (master) template file with different output filenames and with unique random number generator seeds. It will simultaneously create a swarm command file that can be used to submit the swarm of R jobs. Rswarm was originally developed by Lori Dodd and Trevor Reeve with modifications by the Biowulf staff.

`Rswarm` is great for simulations; see an example use case of rswarm here.
Parallelizing code
Can you speed up your code with parallelization?
Considerations:
- Levels of parallelization: multiprocessing vs. multithreading.

  The most common form of parallelism in R is multiprocessing. This is usually explicitly done by you or the package you are using. There are some parts of base R and the underlying math libraries that can multithread, which is mostly implicit parallelism. You can check whether your code can take advantage of that: allocate, for example, 4 CPUs and then run your script with different settings of the `$OMP_NUM_THREADS` or `$MKL_NUM_THREADS` environment variables. If you see a significant speed-up and the dashboard data shows that it used multiple CPUs, then it's worth using more than one CPU. It is important to always test parallel efficiency and monitor actual usage of CPUs and memory with the dashboard or using the dashboard_cli command. For running jobs there is also `jobload`. --- R on Biowulf, NIH HPC Team
- Can the job be split into multiple independent processes? If yes, consider an R swarm job.
- Are there functions in the code that support multiple threads? If so, you can take advantage of multi-threading.
- Is there an `lapply`/`sapply` call? Consider replacing it with `mclapply` (see the sketch after this list).
- Is there a `for` loop? Consider using `foreach` for parallel execution (also shown below).
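As a minimal sketch of the last two suggestions (assuming 4 allocated CPUs; `doParallel` is one possible `foreach` backend, not prescribed by the lesson):

```r
library(parallel)
library(foreach)
library(doParallel)

# A stand-in for slow, independent work
slow_square <- function(x) {
  Sys.sleep(0.1)
  x^2
}

# Serial baseline
res_serial <- sapply(1:8, slow_square)

# Multiprocessing with mclapply (forked processes; Linux, as on Biowulf)
res_mc <- mclapply(1:8, slow_square, mc.cores = 4)

# A for loop rewritten with foreach and a registered parallel backend
registerDoParallel(cores = 4)
res_foreach <- foreach(x = 1:8, .combine = c) %dopar% slow_square(x)
stopImplicitCluster()
```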
You may find this resource on parallelizing R code helpful.
However, see tips from the NIH HPC R/Bioconductor documentation for specific considerations on:
1. Using the `parallel` package
2. Using the `BiocParallel` package
3. Implicit multi-threading
Info: Pitfalls around parallelizing R code

Some R packages will detect all cores on a node even if they are not allocated (e.g., `parallel::detectCores()`). You should use `parallelly::availableCores()` to detect allocated CPUs. --- R on Biowulf, HPC Team
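For example, the two can be compared directly from within a job (the numbers reported depend on the node and your allocation):

```r
parallel::detectCores()        # all cores physically present on the node
parallelly::availableCores()   # only the CPUs allocated to your job
```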
See specific examples regarding parallelization and troubleshooting in the NIH HPC training R on Biowulf.
Need help running your R code on Biowulf?
If you experience difficulties with running R on Biowulf, you should:
- Read the `R` docs on Biowulf.
- Contact the HPC team at staff@hpc.nih.gov.
- Attend monthly HPC walk-in virtual consultations.
Also, please feel free to email us at ncibtep@nih.gov