Lesson 3: R Project Management and renv
Learning objectives
- Discuss the importance of reproducibility
- Learn ways to make R analyses more reproducible
- Learn how to set up and organize an R project
- Learn how to use renv for R package management
What is the 2023 NIH Data Management and Sharing Policy?
The 2023 NIH Data Management and Sharing Policy took effect on January 25, 2023. This policy requires that NIH intramural researchers plan for data management and sharing prior to conducting scientific research. To do this, scientists are required to submit a Data Management and Sharing plan and comply with the approved plan. While the policy highlights the types of data that should be managed and shared and provides links to further resources, it does not provide any guidance on managing and sharing the code needed to truly replicate an analysis.
Sharing data and reporting on analysis steps is not enough to reproduce scientific results.
Figure from Peng 2012, Science, doi: 10.1126/science.1213847.
On the reproducibility spectrum, we should strive for "Full replication". Ultimately, this includes making an analysis executable with a fully functioning computational environment. We aren't going to get that far today. However, we will discuss some ways to organize data, code, and package dependencies to improve data analysis sharing and collaboration.
How can we make our R analyses more reproducible?
There are many ways to increase collaboration and document and share code and results using R.
Some examples include:
- RMarkdown / Quarto (i.e., literate programming)
R Markdown provides an unified authoring framework for data science, combining your code, its results, and your prose commentary. R Markdown documents are fully reproducible and support dozens of output formats, like PDFs, Word files, slideshows, and more. --- R4DS
Note
Quarto is the next generation of R Markdown with new and enhanced features.
RMarkdown and Quarto can be used to communicate analysis steps and results. They can specifically be used to:
- Create a data science lab notebook
- Share and report results to collaborators and others via specific output formats (e.g., html, pdf, etc.)
- Create a dashboard of results (via flexdashboard)
Tip
Always include a code chunk calling sessionInfo() at the end of your RMarkdown file. This will yield crucial information about your R session, including your operating system requirements, R version, and package versions.
- R Project
RStudio projects are self-contained project directories that include the data, code, outputs, and other related files to reproduce an analysis. When you use relative file paths (relative to the project directory), it is fairly easy to reproduce any results within the project.
We are going to leverage the benefits of an R project to enhance reproducibility.
- Version Control (e.g. Git) (Recommended)
Version control is a great way to enhance data management and collaboration. When you use version control, you can easily track changes that you make to your code and eliminate the need for multiple copies of a script (e.g., Final, Finalv2, Final_final, etc.). Version control is easy to use with R packages and R projects.
Info
For more information on using Git with R, check out https://happygitwithr.com/index.html and https://raps-with-r.dev/git.html.
- R package
R packages are loadable extensions that contain code, data, documentation, and tests in a standardized shareable format that can easily be installed by R users. While the primary repository for R packages is CRAN, you can also readily distribute R packages directly from GitHub.
Here are some resources if interested in bundling your analysis in an R package:
- Containerization
Containerizing a computational environment using Docker or Singularity freezes that environment, including the operating system, so that results are truly reproducible.
Other tips for reproducible programming
- Incorporate functional programming
  - Eliminate repetitive code with well-written functions, making code easier to test, document, and share.
  - Make code as independent from the global environment as possible.
- Use literate programming
  - Use R Markdown, Quarto, etc. to generate parameterized reports.
Today we will focus on organizing our data, code, and documents in an R project.
R Project Management and renv
R projects allow us to easily share data, code, and other related information, but this only scratches the surface of what is required for true data analysis reproducibility. We won't take all steps to make our project reproducible today, but beyond basic project organization, it is fairly easy to document and manage package dependencies.
Too often an R script will fail simply due to a clash in package dependencies. Versions are important. R versions change over time, Bioconductor versions evolve, and R packages change. While we can include sessionInfo() at the end of a script or markdown file, this only records the environment; it does nothing to help us rebuild the infrastructure surrounding our code. Thankfully, there are R packages available that help us do just that.
Check out this chapter from R 4 Data Science.
Introducing renv (reproducible environments)
The renv package is a new effort to bring project-local R dependency management to your projects. The goal is for renv to be a robust, stable replacement for the Packrat package, with fewer surprises and better default behaviors.
Underlying the philosophy of renv is that any of your existing workflows should just work as they did before – renv helps manage library paths (and other project-specific state) to help isolate your project’s R dependencies, and the existing tools you’ve used for managing R packages (e.g. install.packages(), remove.packages()) should work as they did before. --- renv
In a nutshell, renv will allow us to recreate the package environment reported by sessionInfo(). However, it is not perfect, and it requires extra storage because it creates a per-project library.
Note
renv does not manage R versions. You will need to make sure you are using an appropriate version of R to recreate an R project library. Because Biowulf uses module environments for R installations, this isn't a huge hurdle.
Main functions
renv creates a project-local library containing the R packages used in your project.
The primary functions and workflow are as follows:
- renv::init() initializes the project for use with renv and creates a project library. This is only required once. Once initialized, you work in the project as normal. renv::init() will detect package dependencies based on library() and require() calls in R scripts found in the R project.
Note
You can initialize a project without dependency discovery and installation using renv::init(bare = TRUE).
- renv::snapshot() updates the renv.lock file, saving the state of the project library.
- renv::restore() restores the state of the R environment to replicate what is recorded in the lockfile.
Getting Started: Setting up our R Project
Connect to Biowulf, obtain an interactive session, load R
Let's connect remotely to Biowulf.
ssh username@biowulf.nih.gov
When prompted, type your password and press enter.
Navigate to your /data/$USER directory.
cd /data/$USER
Get an interactive session.
sinteractive --gres=lscratch:5
Let's make a class directory.
mkdir R_on_Biowulf
cd R_on_Biowulf
Load R version 4.2.2.
module load R/4.2.2
Note
R/4.3.0 became the default R installation as of May 2023.
Set up an R project
Things to consider:
- R Projects are generally created with intent to use with RStudio; you do not need to create an "R project" to organize a project directory.
- When creating a project directory:
  - Create a consistent directory structure with the top level as the project directory
  - All inputs and outputs (where possible) should be contained within a project directory
  - "never use absolute paths in your scripts, because they hinder sharing: no one else will have exactly the same directory configuration as you" --- R4DS
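As a minimal shell sketch of the points above, a project skeleton can be created once and then addressed with paths relative to the project root. The project name demo_project, the sub-directory names, and the example file are illustrative assumptions, not a required layout.

```shell
# Sketch only: "demo_project" and the sub-directory names are illustrative.
project_root="$(mktemp -d)/demo_project"

# mkdir -p creates missing directories and leaves existing ones untouched,
# so this is safe to re-run.
mkdir -p "$project_root/R" "$project_root/data" "$project_root/outputs" \
         "$project_root/figures" "$project_root/docs"

# Work from the project root and refer to files with relative paths.
cd "$project_root"
printf 'sample,count\nA,1\n' > data/example.csv   # made-up input file
cp data/example.csv outputs/example_copy.csv      # relative, not absolute

# List the top level of the project.
ls
```

Because every path is relative to the project root, the same commands work unchanged if the whole directory is copied to another location or another user's account.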
We will not be using an IDE, but we will create an R project using the R package usethis, which is accessible via devtools. usethis is a "package that facilitates interactive workflows for R project creation and development".
Create the R project
#open R
R
#create project
usethis::create_project(path = "MyNewProject", open = TRUE, rstudio = FALSE)
#quit R
q()
v Creating 'MyNewProject/'
v Setting active project to '/vf/users/emmonsal/R_on_Biowulf/MyNewProject'
v Creating 'R/'
v Writing a sentinel file '.here'
* Build robust paths within your project via `here::here()`
* Learn more at <https://here.r-lib.org>
When prompted:
Save workspace image? [y/n/c]: n
The argument open = TRUE activates the new project and sets the project directory as the working directory. The argument rstudio = FALSE writes a .here file that allows the project directory to be recognized as the top level of a project.
ls
Now, we will see our new directory MyNewProject. Let's copy our R scripts to our new project directory.
cd MyNewProject
cp /data/classes/BTEP/R_on_Biowulf_2023/scripts/*.R ./R
Initialize and activate renv in the project
Cache directory set-up
First, let's set up our renv cache location. renv uses a global cache to reduce duplicate installs of packages across projects.
When using renv with the global package cache, the project library is instead formed as a directory of symlinks (or, on Windows, junction points) into the renv global package cache. --- renv
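The idea in the quote above can be illustrated with plain symlinks. This is only a sketch with a simplified, hypothetical directory layout (cache, projectA, projectB, and examplepkg are made-up names); it does not reproduce the real structure of the renv cache.

```shell
# Illustration only: a simplified, hypothetical layout, not renv's real cache structure.
demo="$(mktemp -d)"
mkdir -p "$demo/cache/examplepkg-1.0" "$demo/projectA/library" "$demo/projectB/library"
echo "package files" > "$demo/cache/examplepkg-1.0/DESCRIPTION"

# Each project library holds a symlink to the single cached copy.
ln -s "$demo/cache/examplepkg-1.0" "$demo/projectA/library/examplepkg"
ln -s "$demo/cache/examplepkg-1.0" "$demo/projectB/library/examplepkg"

# Both links resolve to the same directory, so the package is stored once.
readlink "$demo/projectA/library/examplepkg"
readlink "$demo/projectB/library/examplepkg"
```

Two projects that need the same package version therefore share one installed copy, which is why the location and size of the cache matter more than the size of any single project library.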
By default, the renv cache will be created in your home directory, which can quickly fill up if using Bioconductor packages. We are going to instead create a cache in our /data/$USER directory.
mkdir -p /data/$USER/.cache/R/renv
Create a .Renviron file within your home directory.
nano ~/.Renviron
RENV_PATHS_ROOT=/data/$USER/.cache/R/renv
Replace $USER with your actual username.
Use ctrl+O to save, return, and ctrl+X to exit.
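If you prefer not to open an editor, the same line can be appended from the shell. The sketch below is one possible approach, not part of renv: add_renv_root is a hypothetical helper name, and the example runs against a scratch file so that nothing in your home directory is changed.

```shell
# Sketch: append the cache setting to an .Renviron-style file without an editor.
# add_renv_root is a hypothetical helper name, not part of renv or R.
add_renv_root() {
  renviron_file="$1"   # file to update, e.g. ~/.Renviron
  cache_dir="$2"       # cache location, written out in full
  setting="RENV_PATHS_ROOT=${cache_dir}"
  # Only append if the exact line is not already present.
  if ! grep -qxF "$setting" "$renviron_file" 2>/dev/null; then
    printf '%s\n' "$setting" >> "$renviron_file"
  fi
}

# Try it on a scratch file first. The shell expands the username here,
# so the file ends up containing the literal path.
user_name="${USER:-$(id -un)}"
scratch_file="$(mktemp)"
add_renv_root "$scratch_file" "/data/${user_name}/.cache/R/renv"
add_renv_root "$scratch_file" "/data/${user_name}/.cache/R/renv"   # second call is a no-op
cat "$scratch_file"
```

To apply it for real, pass "$HOME/.Renviron" as the first argument instead of the scratch file. The duplicate check keeps the file from accumulating repeated lines if the command is run more than once.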
Important
The renv cache only needs to be set up once regardless of the version of R you are using as long as you created a user level .Renviron file establishing its location.
Run renv::init()
Once we have done this, we can activate renv within our project. But, first, let's verify the location of our renv cache.
R
renv::paths$cache() # Check the cache location
renv::init(bioconductor = "3.16") #initialize renv in the project
Info
We can initialize renv with a specific version of Bioconductor. This eliminates later headaches as Bioconductor updates to newer versions. See here for more information.
Warning
This step takes about 5-10 minutes.
Now that we have initialized renv with this project, let's check our R library paths.
.libPaths()
You should see the renv project library listed first, meaning it is prioritized over the module (site) libraries.
[1] "/vf/users/$USER/R_on_Biowulf/MyNewProject/renv/library/R-4.2/x86_64-pc-linux-gnu"
[2] "/usr/local/apps/R/4.2/site-library_4.2.2"
[3] "/usr/local/apps/R/4.2/4.2.2/lib64/R/library"
The library snapshot resulted in an error due to GenomeInfoDb [installed 1.35.14 != latest 1.34.9].
Let's update the installation of GenomeInfoDb as suggested by the prompt.
Packages from Bioconductor can be installed by using the bioc:: prefix. ---renv vignette
renv::install("bioc::GenomeInfoDb")
Note
Without specifying a version of Bioconductor to be used with a project (e.g., renv::init(bioconductor = "3.16")), install("bioc::GenomeInfoDb") will attempt to install the latest-available version from Bioconductor (v.3.17).
Call renv::snapshot() to save the state of the project library to the lockfile.
renv::snapshot()
We see a long list of packages being written to the lockfile and the following message:
The version of R recorded in the lockfile will be updated:
- R [* -> 4.2.2]
Do you want to proceed? [y/N]:
Type y.
* Lockfile written to '/vf/users/$USER/R_on_Biowulf/MyNewProject/renv.lock'.
This was successful and the lockfile was written. The lockfile is necessary to restore the project at a later date.
Establish a consistent project structure
Now that we have renv set up with our project, let's also establish a project structure.
Let's exit R and edit our .Rprofile.
Note
When we ran renv::init(), a local .Rprofile file was created with the code source("renv/activate.R"). This code is necessary "to automatically load and use the private [renv] library for new R sessions launched from the project root directory" (renv).
q()
Save workspace image? [y/n/c]: n
nano .Rprofile
Delete the single line in the .Rprofile file, and paste the following, which was borrowed from a blog on data management, into the Rproject .Rprofile:
.First <- function() {
  # Create the standard project directories; existing ones are left untouched.
  for (d in c("figures", "outputs", "data", "docs")) {
    dir.create(file.path(getwd(), d), showWarnings = FALSE)
  }
  # Initialize renv on first use; otherwise activate the project library.
  if (!("renv" %in% list.files())) {
    renv::init()
  } else {
    source("renv/activate.R")
  }
  cat("\nWelcome to your R-Project:", basename(getwd()), "\n")
}
Use ctrl+O to save, return, and ctrl+X to exit.
This code creates several directories (i.e., figures, outputs, data, and docs) and initializes the project for use with the renv package using renv::init() or, if already initialized, activates the renv project library using renv/activate.R. This will not overwrite directories that have already been created.
Note
You can change these directory names to whatever works best with your organization style. The key, however, is to stay as consistent as possible across projects.
Info
Using version control (Git via GitHub) is an even better way to manage data and share inputs, code, and results. You can easily manage a GitHub repository or create a new repository using the usethis package. Also, check out this resource to learn more about version control and R.
We will need to start a new R session for the .Rprofile to take effect.
R
q()
Save workspace image? [y/n/c]: n
Now we are ready to work with our project files.
Test it
Before we end today's lesson, let's test out renv.
We will create a new directory and transfer our R script and lock file. We will then restore the project and run our R script.
#change directory to the parent directory (R_on_Biowulf)
cd ..
#make test directory
mkdir renv_test
#copy files to test directory
cp MyNewProject/renv.lock renv_test/
cp MyNewProject/R/DESeq2_airway.R renv_test/
#change directory to test directory
cd renv_test
#load R module if not already loaded and start R session
module load R/4.2.2
R
#Restore renv library
renv::restore()
Now, we can run our script.
source("DESeq2_airway.R")
Note
The R version used to restore a project from the MyNewProject renv.lock file must be the same as the R version recorded in that lockfile.
Also, because the required packages were already in our renv cache, updating the test library was much faster.
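Since the R version has to match, it can help to check which version a lockfile records before loading a module. The sketch below is a simple line-based parse that assumes the usual pretty-printed layout of renv.lock, with a top-level "R" block containing a "Version" field. The helper name lock_r_version and the trimmed lockfile are made up for illustration, and a proper JSON parser would be more robust.

```shell
# Sketch: read the R version recorded in a lockfile before restoring.
# lock_r_version is a hypothetical helper; the lockfile below is a trimmed, made-up example.
lock_r_version() {
  awk '
    /"R" *: *[{]/ { in_r = 1; next }        # entered the top-level "R" block
    in_r && /"Version"/ {                    # first Version field inside it
      gsub(/[^0-9.]/, "", $0); print; exit   # keep only digits and dots
    }
  ' "$1"
}

demo_lock="$(mktemp)"
cat > "$demo_lock" <<'EOF'
{
  "R": {
    "Version": "4.2.2",
    "Repositories": []
  },
  "Packages": {}
}
EOF

r_version="$(lock_r_version "$demo_lock")"
echo "Lockfile expects R ${r_version}; e.g. module load R/${r_version}"
```

In the test directory above, the same function could be pointed at renv.lock instead of the demo file to confirm which R module to load before calling renv::restore().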
Next Lesson
In the final lesson, we will learn more about submitting jobs using R on Biowulf.
Acknowledgements
- Making your analysis portable and reproducible from DIY transcriptomics.
- Efficient Data Management in R.
- R Packages with renv