Lesson 4: Submitting R Scripts via command line
Learning Objectives
- Learn how to use R with less interaction.
- Learn how to deploy `sbatch` R jobs, and learn about alternatives such as `swarm`.
- Learn about R job parallelization in the context of Biowulf.
We have organized our R project directory and have set up `renv` to make our R environment a bit more reproducible. Now, we need to learn how to submit an R script. Thus far, we have been using R interactively by first obtaining an interactive compute node (`sinteractive`). However, we can submit R scripts without interaction using `sbatch` and `swarm`. This is advantageous, as we may want to include our R script in a pipeline or process thousands of files.
Running R scripts from the command line can be a powerful way to:

- Automate your R scripts
- Integrate R into production
- Call R through other tools or systems

--- Nathan Stephens, Posit Support
Example scripts

We will use a couple of example scripts in this section (`DESeq2_airway.R`, `Volcano.R`). The first script uses the R package `airway`, which contains data from Himes et al. 2014, a bulk RNA-Seq study, as a `RangedSummarizedExperiment`. The Bioconductor package `DESeq2` is then used to produce differential expression results. This R script largely follows a Bioconductor workflow on RNA-seq. The second script (`Volcano.R`) takes output from the first script and makes a volcano plot using the package `EnhancedVolcano`.
Warning
These scripts are for example purposes only; do not apply them to your own data.
Running R from command line
Before jumping into submitting scripts in job files, let's first focus on how to run R from the command line.
The primary way to run R from the command line is to call `Rscript`. `Rscript` is a binary front-end to R intended for scripting applications; basically, it is a convenience function.
Let's see this in action first in an interactive session:
sinteractive --gres=lscratch:5
module load R/4.2.2
Let's use our `renv_test` directory to see how this works. The syntax is `Rscript [options] file [args]`.
cd /data/$USER/R_on_Biowulf/renv_test
Rscript DESeq2_airway.R > DESeq2_airway.out
The defaults are `--no-echo`, which makes R run as quietly as possible, and `--no-restore`, which indicates that we do not want anything restored (e.g., objects, history, etc.). These defaults also imply `--no-save`, meaning the workspace will not be saved. Here, we have supplied no additional options or args.
As a convenience function, `Rscript` is the same as calling:
R --no-echo --no-restore --no-save --file=DESeq2_airway.R > DESeq2_airway2.out
Note

We have been using `>` to redirect stdout to a file. We can also use `<` to redirect the input file. See below.
R --no-echo --no-restore --no-save < DESeq2_airway.R > DESeq2_airway3.out
Info: Rscript --help

You can learn more about `Rscript` using `Rscript --help` and `R --help`. Notice from `R --help` that you can also use `R CMD BATCH` to run an R script from the command line. To run a script from the R console, use `source()`.
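For example, a minimal sketch of the `R CMD BATCH` alternative (by default it writes the echoed commands and their output to a `.Rout` file named after the script):

```bash
# Runs the script and writes DESeq2_airway.Rout to the current directory
R CMD BATCH DESeq2_airway.R
```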
Saving R output

Notice that we can easily save R output directed to standard output using `>`. However, this will exclude messages, warnings, and errors, which are directed to standard error. For stdout you can specify `1>` or `>`; for stderr you can specify `2>`; and for both in a single file you can specify `&>`. See here and here for more information.
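For example, using the script from above:

```bash
Rscript DESeq2_airway.R 1> DESeq2_airway.out   # stdout only
Rscript DESeq2_airway.R 2> DESeq2_airway.err   # stderr only (messages, warnings, errors)
Rscript DESeq2_airway.R &> DESeq2_airway.log   # stdout and stderr in a single file
```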
Adding command line arguments

R scripts can be run from the command line with command line arguments. Here is a great resource from Software Carpentry explaining command line arguments.

To use command line arguments with an R script, we leverage `commandArgs()`. This function creates a vector of command line arguments. When using `trailingOnly = TRUE`, `commandArgs()` only returns arguments after `R --no-echo --no-restore --file --args`.
Let's see how this works in a simple script that returns a volcano plot of our differential expression results. First, let's copy over the `Volcano.R` script to our test directory, `renv_test`.
cp /data/classes/BTEP/R_on_Biowulf_2023/scripts/Volcano.R .
The contents of `Volcano.R`:
# Create a Volcano Plot from DESeq2 differential expression results ----
library(EnhancedVolcano)
library(dplyr)

## Set command line arguments ----
args <- commandArgs(trailingOnly = TRUE)

# Stop the script if there is no command line argument
if (length(args) == 0) {
  print("Please include differential expression results!")
  stop("Requires command line argument.")
}

## Read in data ----
data <- read.csv(args[1], row.names = 1) %>% filter(!is.na(padj))
labs <- head(row.names(data), 5)

## Plot ----
EnhancedVolcano(data,
                title = "Enhanced Volcano with Airways",
                lab = rownames(data),
                selectLab = labs,
                labSize = 3,
                drawConnectors = TRUE,
                x = 'log2FoldChange',
                y = 'padj')
ggsave("./figures/Volcano.png", width = 5.5, height = 3.5,
       units = "in", dpi = 300, scale = 2)
How can we run this from the command line?
Rscript Volcano.R ./outputs/deseq2_DEGs.csv
The easiest way to check the output of this script (`Volcano.png`) is to mount our HPC system directories locally.
Info: Packages used to parse command-line arguments

There are also several packages that can be used to parse command-line arguments, such as `getopt`, `optparse`, `optigrab`, `argparse`, `docopt`, and `GetoptLong`.
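As a minimal sketch (not part of the lesson scripts), here is how the positional argument in `Volcano.R` might be replaced with a named option using `optparse`; the option name `--degs` is hypothetical:

```r
library(optparse)

# Define a named option; --degs is a hypothetical name for this example
option_list <- list(
  make_option(c("-d", "--degs"), type = "character", default = NULL,
              help = "CSV file of differential expression results")
)
opt <- parse_args(OptionParser(option_list = option_list))

if (is.null(opt$degs)) {
  stop("Requires --degs argument.")
}
data <- read.csv(opt$degs, row.names = 1)
```

This would be run as `Rscript Volcano.R --degs ./outputs/deseq2_DEGs.csv`.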
Rendering Rmarkdown files from the command line

In addition to R scripts, we can render Rmarkdown files directly from the command line by adding an R expression (an object that represents an action that can be performed by R) directly to our `Rscript` command using the `-e` expression flag.
cp /data/classes/BTEP/R_on_Biowulf_2023/rmarkdown/Volcano.Rmd .
cp ./outputs/deseq2_DEGs.csv DEGs.csv
Rscript -e "rmarkdown::render('Volcano.Rmd',params=list(args = 'DEGs.csv'))"
To make this work, parameters had to be added to the YAML header of the Rmarkdown file.
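For reference, a minimal sketch of what such a YAML header might look like (the actual header of `Volcano.Rmd` may differ; the key point is that the parameter name, `args`, matches the name passed to `params` in the `render()` call, and is accessed inside the Rmarkdown as `params$args`):

```yaml
---
title: "Volcano Plot"
output: html_document
params:
  args: "DEGs.csv"  # default value; overridden by render(params = ...)
---
```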
Using sbatch

R batch jobs are similar to any other batch job. A batch script ('rjob.sh') is created that sets up the environment and runs the R code. --- R/Bioconductor on Biowulf

Default allocations for an `sbatch` job include 2 CPUs with a default memory per CPU of 2 GB. Therefore, the default memory allocation is 4 GB.
More about sbatch

`sbatch` is used to submit batch jobs, which are resource provisions that run applications on compute nodes and do not require supervision or interaction. To submit a batch job, a job script containing a list of unix commands to be executed by the job is typically required. This script may also include resource requirements (job directives) telling the job scheduler what types of resources are needed for the job. While bash shell scripting is typically used to write these files, other shells can also be used.
Features of job scripts:

- If using a bash shell, the file typically ends in `.sh`.
- File content starts with a shebang (`#!`) followed by the path to the interpreter (`/bin/bash`) on the first line.
- Content may include SLURM job directives denoted by `#SBATCH` at the beginning of the script, directly following `#!/bin/bash`. These can provide information to the Biowulf batch system such as:
    - Partition (default = "norm", `--partition`)
    - Name of the job (`--job-name`)
    - What types of job status notifications to send (`--mail-type`)
    - Where to send job status notifications (`--mail-user`)
    - Memory to allocate (`--mem`)
    - Time to allocate (`--time`)
    - CPUs per task (# of threads if multithreaded) (`--cpus-per-task`)
- Following `#SBATCH` directives, you can include comments throughout your list of commands using `#`.
See important sbatch flags here and complete options with `sbatch --help`.
Submitting the R script as a job using sbatch

We will create and submit a job script using `sbatch` that will run the R scripts in the project we created in Lesson 3 (`MyNewProject`).
Example job script:
nano rjob.sh
#!/bin/bash
#SBATCH --gres=lscratch:5
#SBATCH --mail-type=BEGIN,END
#Load the R module
module load R/4.2.2
#change to project directory
cd /data/$USER/R_on_Biowulf/MyNewProject
#Run R scripts using Rscript
Rscript ./R/DESeq2_airway.R
Rscript ./R/Volcano.R ./outputs/deseq2_DEGs.csv
`Ctrl+O`, `return`, `Ctrl+X`
The R script should be run in the project directory (`MyNewProject`) to take advantage of `renv`.
We included the job directives `--gres=lscratch:5` and `--mail-type=BEGIN,END`. `--gres=lscratch:5` ensures that we have 5 GB of lscratch space for temporary storage. `--mail-type=BEGIN,END` directs the job scheduler to send us an email when the job starts and ends. This email will by default go to your NIH email.
Note: stdout & stderr

For an sbatch job, a stdout and stderr file is automatically generated (by default, `slurm######.out` in the submitting directory). This can be modified using the sbatch flags / directives `--output=/path/to/dir/filename` and `--error=/path/to/dir/filename`.
Note: command line flags vs directives
You can also include job flags at the time of job submission. If these conflict with #SBATCH directives, the command line flags take priority.
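For example, a hypothetical override of the mail settings in `rjob.sh`:

```bash
# --mail-type on the command line takes priority over the
# #SBATCH --mail-type=BEGIN,END directive inside rjob.sh
sbatch --mail-type=END rjob.sh
```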
Let's submit the script.
sbatch rjob.sh
A job ID will be returned if the submission (`rjob.sh`) is correct.
Using swarm

Swarm is a way to submit multiple commands to the Biowulf batch system; each command is run as an independent job with identical resources, allowing for parallelization.

- Swarm scripts have the extension `.swarm`.
- Lines that start with `#SWARM` are not run as part of the script; these are directives that tell the Biowulf batch system what resources (i.e., memory, time, temporary storage, modules) are needed.

See here for submitting R swarm jobs.
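As a minimal sketch (the input file names here are hypothetical, and the `#SWARM` directives mirror the sbatch directives used above), a swarm file for our volcano script might look like:

```bash
# rjobs.swarm -- each non-directive line runs as an independent job
#SWARM --gres=lscratch:5
#SWARM --module R/4.2.2
Rscript ./R/Volcano.R ./outputs/deseq2_DEGs_groupA.csv
Rscript ./R/Volcano.R ./outputs/deseq2_DEGs_groupB.csv
```

This would be submitted with `swarm -f rjobs.swarm`. Note that in practice each command should write to a distinct output file; `Volcano.R` as written hardcodes its output path.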
Rswarm

There is also a utility, `Rswarm`, that may interest you in specific cases.

Rswarm is a utility to create a series of R input files from a single R (master) template file with different output filenames and with unique random number generator seeds. It will simultaneously create a swarm command file that can be used to submit the swarm of R jobs. Rswarm was originally developed by Lori Dodd and Trevor Reeve with modifications by the Biowulf staff.

`Rswarm` is great for simulations; see an example use case of rswarm here.
Parallelizing code
Can you speed up your code with parallelization?
Considerations:
- Levels of parallelization: multiprocessing vs. multithreading.

  The most common form of parallelism in R is multiprocessing. This is usually explicitly done by you or the package you are using. There are some parts of base R and the underlying math libraries that can multithread, which is mostly implicit parallelism. You can check whether your code can take advantage of that: allocate, for example, 4 CPUs and then run your script with different settings of the `$OMP_NUM_THREADS` or `$MKL_NUM_THREADS` environment variables. If you see a significant speed-up and the dashboard data shows that it used multiple CPUs, then it's worth using more than one CPU. It is important to always test parallel efficiency and monitor actual usage of CPUs and memory with the dashboard or using the dashboard_cli command. For running jobs there is also `jobload`. --- R on Biowulf, NIH HPC Team
- Can the job be split into multiple independent processes? If yes, consider an R swarm job.
- Are there functions in the code that support multiple threads? If so, you can take advantage of multi-threading.
- Is there an `lapply`/`sapply` call? Consider replacing it with `mclapply` (see the sketch after this list).
- Is there a `for` loop? Consider using `foreach` for parallel execution (also shown below).
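As a minimal sketch of the last two suggestions (assuming 4 allocated CPUs; `doParallel` is one possible `foreach` backend, not prescribed by the lesson):

```r
library(parallel)
library(foreach)
library(doParallel)

# A stand-in for slow, independent work
slow_square <- function(x) {
  Sys.sleep(0.1)
  x^2
}

# Serial baseline
res_serial <- sapply(1:8, slow_square)

# Multiprocessing with mclapply (forked processes; Linux, as on Biowulf)
res_mc <- mclapply(1:8, slow_square, mc.cores = 4)

# A for loop rewritten with foreach and a registered parallel backend
registerDoParallel(cores = 4)
res_foreach <- foreach(x = 1:8, .combine = c) %dopar% slow_square(x)
stopImplicitCluster()
```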
You may find this resource on parallelizing R code helpful.
However, see tips from the NIH HPC R/Bioconductor documentation for specific considerations on:
1. Using the `parallel` package
2. Using the `BiocParallel` package
3. Implicit multi-threading
Info: Pitfalls around parallelizing R code

Some R packages will detect all cores on a node even if they are not allocated (e.g., `parallel::detectCores()`). You should use `parallelly::availableCores()` to detect allocated CPUs. --- R on Biowulf, HPC Team
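For example, the two can be compared directly from within a job (the numbers reported depend on the node and your allocation):

```r
parallel::detectCores()        # all cores physically present on the node
parallelly::availableCores()   # only the CPUs allocated to your job
```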
See specific examples regarding parallelization and troubleshooting in the NIH HPC training R on Biowulf.
Need help running your R code on Biowulf?
If you experience difficulties with running R on Biowulf, you should:
- Read the `R` docs on Biowulf.
- Contact the HPC team at staff@hpc.nih.gov.
- Attend monthly HPC walk-in virtual consultations.
Also, please feel free to email us at ncibtep@nih.gov