Lesson 3: R Project Management and renv

Learning objectives

Discuss the importance of reproducibility
Learn ways to make R analyses more reproducible
Learn how to set up and organize an R project
Learn how to use renv for R package management

The 2023 NIH Data Management and Sharing Policy took effect on January 25, 2023. This policy requires that NIH intramural researchers plan for data management and sharing prior to conducting scientific research. To do this, scientists are required to submit a Data Management and Sharing plan and comply with the approved plan. While the policy highlights the types of data that should be managed and shared and provides links to further resources, it does not provide any guidance on managing and sharing the code needed to truly replicate an analysis.

Sharing data and reporting on analysis steps is not enough to reproduce scientific results.

Figure from Peng 2012, Science, doi: 10.1126/science.1213847.

On the reproducibility spectrum, we should strive for "Full replication". Ultimately, this includes making an analysis executable with a fully functioning computational environment. We aren't going to get that far today. However, we will discuss some ways to organize data, code, and package dependencies to improve data analysis sharing and collaboration.

How can we make our R analyses more reproducible?

There are many ways to increase collaboration and document and share code and results using R.

Some examples include:

RMarkdown / Quarto (i.e., literate programming)

R Markdown provides an unified authoring framework for data science, combining your code, its results, and your prose commentary. R Markdown documents are fully reproducible and support dozens of output formats, like PDFs, Word files, slideshows, and more. --- R4DS

Note

Quarto is the next generation of R Markdown with new and enhanced features.

RMarkdown and Quarto can be used to communicate analysis steps and results. They can specifically be used to:
1. Create a data science lab notebook
2. Share and report results to collaborators and others via specific output formats (e.g., html, pdf, etc.)
3. Create a dashboard of results (via flexdashboard)
Tip

Always include a code chunk calling sessionInfo() at the end of your RMarkdown file. This will yield crucial information about your R session, including your operating system, R version, and package versions.
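
As a rough sketch of what that final chunk could contain, the snippet below prints the session details and also saves them to a text file. The file name sessionInfo.txt and the use of tempdir() are illustrative assumptions, not part of the lesson; in a real project you would more likely write to a folder inside the project directory.

```r
# Capture details about the current R session
si <- sessionInfo()

# R version and operating system, as recorded by sessionInfo()
print(si$R.version$version.string)
print(si$running)

# Save the full printed report alongside your results
# (tempdir() is used here only so the example runs anywhere)
out_file <- file.path(tempdir(), "sessionInfo.txt")
writeLines(capture.output(print(si)), out_file)
```

Keeping a saved copy means the information travels with the results even when the rendered report is not shared.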
R Project

RStudio projects are self-contained project directories that include the data, code, outputs, and other related files to reproduce an analysis. When you use relative file paths (relative to the project directory), it is fairly easy to reproduce any results within the project.

We are going to leverage the benefits of an R project to enhance reproducibility.
Version Control (e.g. Git) (Recommended)

Version control is a great way to enhance data management and collaboration. When you use version control, you can easily track changes that you make to your code and eliminate the need for multiple copies of a script (e.g., Final, Finalv2, Final_final, etc.). Version control is easy to use with R packages and R projects.

Info

For more information on using Git with R, check out https://happygitwithr.com/index.html and https://raps-with-r.dev/git.html.
R package

R packages are loadable extensions that contain code, data, documentation, and tests in a standardized shareable format that can easily be installed by R users. While the primary repository for R packages is CRAN, you can also readily distribute R packages directly from GitHub.

Here are some resources if interested in bundling your analysis in an R package:
Containerization

Containerizing a computational environment using Docker or Singularity freezes the computational environment, including the operating system, so that results are truly reproducible.

Other tips for reproducible programming

Incorporate functional programming
1. Eliminate repetitive code with well-written functions, making code easier to test, document, and share.
2. Make code as independent from the global environment as possible.
Use literate programming
1. Use R Markdown, Quarto, etc. to generate parameterized reports.
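
To make the functional programming tip concrete, here is a minimal sketch. The function name rescale01 and the toy data frame are made up for illustration. The point is that the repeated calculation lives in one function whose only input is its argument, so it does not depend on objects in the global environment.

```r
# One well-named function replaces copy-pasted scaling code.
# Everything it needs comes in through the argument x.
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# Toy data (hypothetical values)
df <- data.frame(a = c(1, 5, 9), b = c(10, 20, 40))

# Apply the same function to every column instead of repeating code
df[] <- lapply(df, rescale01)
print(df)
```

Because the logic sits in one place, a fix or a test only has to be written once.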

Today we will focus on organizing our data, code, and documents in an R project.

R Project Management and `renv`

R projects allow us to easily share data, code, and other related information, but this only scratches the surface of what is required for true data analysis reproducibility. We won't take all steps to make our project reproducible today, but beyond basic project organization, it is fairly easy to document and manage package dependencies.

Too often, an R script fails simply because of a clash in package dependencies. Versions are important. R versions change over time, Bioconductor releases evolve, and R packages change. While we can include sessionInfo() at the end of a script or markdown file, this only records the environment; it does not help us rebuild the infrastructure surrounding our code. Thankfully, there are R packages available that help us do just that.

Check out this chapter from R for Data Science.

Introducing renv (reproducible environments)

The renv package is a new effort to bring project-local R dependency management to your projects. The goal is for renv to be a robust, stable replacement for the Packrat package, with fewer surprises and better default behaviors.

Underlying the philosophy of renv is that any of your existing workflows should just work as they did before – renv helps manage library paths (and other project-specific state) to help isolate your project’s R dependencies, and the existing tools you’ve used for managing R packages (e.g. install.packages(), remove.packages()) should work as they did before.--- renv

In a nutshell, renv will allow us to recreate the package environment reported by sessionInfo(). However, it is not perfect, and it does require extra storage because it creates a per-project library.

Note

renv does not manage R versions. You will need to make sure you are using an appropriate version of R to recreate an R project library. Because Biowulf uses module environments for R installations, this isn't a huge hurdle.

Main functions

renv creates a project-local library containing the R packages that your project uses.

The primary functions and workflow are as follows:

renv::init() initializes the project for use with renv and creates a project library.

This is only required once. Once initialized, you work in the project as normal.

renv::init() will detect package dependencies based on library() and require() in R scripts found in the R project.

Note

You can initialize a project without dependency discovery and installation using renv::init(bare = TRUE).
renv::snapshot() updates the renv.lock file, saving the state of the project library.
renv::restore() restores the project library to the state recorded in the lockfile.
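
To give a feel for what dependency discovery from library() and require() calls means, here is a deliberately simplified sketch in base R. It is not how renv is implemented (renv analyzes the code itself and exposes this through renv::dependencies()). The helper name find_attached_packages and the example script lines are hypothetical.

```r
# Simplified illustration only: scan lines of R code for library()/require() calls.
# renv's own discovery is more thorough than this regular expression.
find_attached_packages <- function(script_lines) {
  pattern <- "(library|require)\\([[:space:]]*[\"']?([A-Za-z][A-Za-z0-9.]*)[\"']?"
  hits <- regmatches(script_lines, regexpr(pattern, script_lines))
  # Keep only the captured package name from each match
  unique(sub(pattern, "\\2", hits))
}

# Made-up script contents for demonstration
example_script <- c(
  "library(DESeq2)",
  "require(\"airway\")",
  "x <- 1 + 1",
  "library(ggplot2)  # plotting"
)

pkgs <- find_attached_packages(example_script)
print(pkgs)
```

Packages that are only mentioned in other ways can be missed by a pattern like this.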

Getting Started: Setting up our R Project

Connect to Biowulf, obtain an interactive session, load R

Let's connect remotely to Biowulf.

ssh username@biowulf.nih.gov

Enter your password and hit enter.

Navigate to your /data/$USER directory.

cd /data/$USER

Get an interactive session.

sinteractive --gres=lscratch:5

Let's make a class directory.

mkdir R_on_Biowulf
cd R_on_Biowulf

Load R version 4.2.2.

module load R/4.2.2

Note

R/4.3.0 became the default R installation as of May 2023.

Set up an R project

Things to consider:

R Projects are generally created with intent to use with RStudio; you do not need to create an "R project" to organize a project directory.
When creating a project directory:
- Create a consistent directory structure with the top level as the project directory
- All inputs and outputs (where possible) should be contained within a project directory
- "never use absolute paths in your scripts, because they hinder sharing: no one else will have exactly the same directory configuration as you" R4ds
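
The advice about relative paths can be sketched as follows. The directory name demo_project, the file counts.csv, and the toy values are assumptions made for this example, and tempdir() stands in for wherever your project lives.

```r
# Set up a throwaway project directory with a data/ subdirectory
project_dir <- file.path(tempdir(), "demo_project")
dir.create(file.path(project_dir, "data"), recursive = TRUE, showWarnings = FALSE)

# Work from the top level of the project, as you would in a real project
old_wd <- setwd(project_dir)

# A path written relative to the project directory works on any machine
rel_path <- file.path("data", "counts.csv")
counts <- data.frame(gene = c("geneA", "geneB"), count = c(10L, 25L))
write.csv(counts, rel_path, row.names = FALSE)
read_back <- read.csv(rel_path)
print(read_back)

# Return to the original working directory
setwd(old_wd)
```

A collaborator who copies the whole project directory can run the same lines without editing any paths.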

We will not be using an IDE but we will create an R project using the R package usethis, which is accessible via devtools. usethis is a "package that facilitates interactive workflows for R project creation and development".

Create the R project

#open R
R 
#create project
usethis::create_project(path = "MyNewProject", open = TRUE, rstudio = FALSE)
#quit R
q()

v Creating 'MyNewProject/'
v Setting active project to '/vf/users/emmonsal/R_on_Biowulf/MyNewProject'
v Creating 'R/'
v Writing a sentinel file '.here'
* Build robust paths within your project via `here::here()`
* Learn more at <https://here.r-lib.org>

When prompted:
Save workspace image? [y/n/c]: n

The argument open = TRUE activates the new project and sets the working directory to the project directory. The argument rstudio = FALSE writes a .here sentinel file (rather than an RStudio .Rproj file), which allows the project directory to be recognized as the top level of a project.
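
To illustrate how a sentinel file lets tools recognize the top level of a project, here is a simplified sketch in base R. It walks up from a starting directory until it finds a .here file. This is only an illustration of the idea, not the actual implementation used by the here package; the function name find_project_root is hypothetical.

```r
# Walk up the directory tree until a directory containing the sentinel is found
find_project_root <- function(start = getwd(), sentinel = ".here") {
  current <- normalizePath(start, mustWork = TRUE)
  repeat {
    if (file.exists(file.path(current, sentinel))) {
      return(current)
    }
    parent <- dirname(current)
    if (identical(parent, current)) {
      stop("No '", sentinel, "' file found at or above: ", start)
    }
    current <- parent
  }
}

# Demonstration in a temporary directory that mimics the lesson's project
project <- file.path(tempdir(), "MyNewProject")
dir.create(file.path(project, "R", "helpers"), recursive = TRUE, showWarnings = FALSE)
file.create(file.path(project, ".here"))

root <- find_project_root(file.path(project, "R", "helpers"))
print(basename(root))
```

Paths built from the detected root stay valid no matter which subdirectory a script is run from.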

ls

Now, we will see our new directory MyNewProject. Let's copy our R scripts to our new project directory.

cd MyNewProject
cp /data/classes/BTEP/R_on_Biowulf_2023/scripts/*.R ./R

Initialize and activate renv in the project

Cache directory set-up

First, let's set up our renv cache location. renv uses a global cache to reduce duplicate installs of packages across projects.

When using renv with the global package cache, the project library is instead formed as a directory of symlinks (or, on Windows, junction points) into the renv global package cache. --- renv

By default the renv cache will be created in your home directory, which can quickly fill up if using Bioconductor packages. We are going to instead create a cache in our /data/$USER directory.

mkdir -p /data/$USER/.cache/R/renv

Create a .Renviron file within your home directory.

nano ~/.Renviron

Add the following line:
RENV_PATHS_ROOT=/data/$USER/.cache/R/renv
Replace $USER with your actual username.

Press Ctrl+O to save, Return to confirm the file name, and Ctrl+X to exit.

Important

The renv cache only needs to be set up once regardless of the version of R you are using as long as you created a user level .Renviron file establishing its location.
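
Because R reads the user-level .Renviron file at startup, a new R session should see the variable. The following minimal check uses only base R and assumes nothing beyond the variable name RENV_PATHS_ROOT used in this lesson.

```r
# Read the environment variable; an empty string means it is not set
cache_root <- Sys.getenv("RENV_PATHS_ROOT", unset = "")

if (nzchar(cache_root)) {
  message("RENV_PATHS_ROOT is set to: ", cache_root)
  message("Directory exists: ", dir.exists(cache_root))
} else {
  message("RENV_PATHS_ROOT is not set in this session.")
}
```

If the variable is reported as not set, check the spelling in ~/.Renviron and restart R, since the file is only read when a session starts.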

Run renv::init()

Once we have done this, we can activate renv within our project. But, first, let's verify the location of our renv cache.

R
renv::paths$cache() # Check the cache location
renv::init(bioconductor = "3.16") #initialize renv in the project

Info

We can initialize renv with a specific version of Bioconductor. This eliminates later headaches as Bioconductor updates to newer versions. See here for more information.

Warning

This step takes about 5-10 minutes.

Now that we have initialized renv in this project, let's check our R library paths.

.libPaths()

You should see the renv project library listed first, meaning it is prioritized over the module (site) libraries.

[1] "/vf/users/$USER/R_on_Biowulf/MyNewProject/renv/library/R-4.2/x86_64-pc-linux-gnu"  
[2] "/usr/local/apps/R/4.2/site-library_4.2.2"                                           
[3] "/usr/local/apps/R/4.2/4.2.2/lib64/R/library"
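
The same check can be done programmatically. This sketch assumes, as in the output above, that an renv project library has "renv/library" in its path. It uses the stats package as an example only because it ships with every R installation.

```r
# Library search path, in priority order
lib_paths <- .libPaths()
print(lib_paths)

# TRUE when the first (highest-priority) library is an renv project library
uses_renv_library <- grepl("renv/library", lib_paths[1], fixed = TRUE)
print(uses_renv_library)

# Library from which a given package would be loaded
print(dirname(find.package("stats")))
```

Outside of an renv project, the second value would be expected to be FALSE.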

The library snapshot resulted in an error due to GenomeInfoDb: the installed version (1.35.14) does not match the latest available version (1.34.9).

Let's update the installation of GenomeInfoDb as suggested by the prompt.

Packages from Bioconductor can be installed by using the bioc:: prefix. ---renv vignette

renv::install("bioc::GenomeInfoDb")

Note

Without specifying a version of Bioconductor to be used with a project (e.g., renv::init(bioconductor = "3.16")), install("bioc::GenomeInfoDb") will attempt to install the latest-available version from Bioconductor (v.3.17).

Call renv::snapshot() to save the state of the project library to the lockfile.

renv::snapshot()

We see a long list of packages being written to the lockfile and the following message:

The version of R recorded in the lockfile will be updated:
- R                      [* -> 4.2.2]

Do you want to proceed? [y/N]:

Type y.

* Lockfile written to '/vf/users/$USER/R_on_Biowulf/MyNewProject/renv.lock'.

This was successful and the lockfile was written. The lockfile is necessary to restore the project at a later date.
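
The lockfile is a JSON file, so it can be inspected like any other JSON document. The sketch below writes a heavily trimmed mock lockfile to a temporary directory and reads it back. The mock contents are assumptions based on values mentioned in this lesson (a real lockfile lists every package and includes more fields), and the jsonlite package is assumed to be installed, which is why the parsing step is guarded.

```r
# A trimmed, mock lockfile for illustration; not a real project lockfile
mock_lock <- '{
  "R": { "Version": "4.2.2" },
  "Bioconductor": { "Version": "3.16" },
  "Packages": {
    "GenomeInfoDb": {
      "Package": "GenomeInfoDb",
      "Version": "1.34.9",
      "Source": "Bioconductor"
    }
  }
}'
lock_path <- file.path(tempdir(), "renv.lock")
writeLines(mock_lock, lock_path)

# Parse only if jsonlite is available (assumption: it may not be installed)
if (requireNamespace("jsonlite", quietly = TRUE)) {
  lock <- jsonlite::fromJSON(lock_path)
  print(lock$R$Version)
  # Named vector of package versions recorded in the lockfile
  pkg_versions <- vapply(lock$Packages, function(p) p$Version, character(1))
  print(pkg_versions)
}
```

Looking at the recorded versions this way can help when comparing two projects or reviewing changes to a lockfile.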

Establish a consistent project structure

Now that we have renv set up with our project, let's also establish a project structure.

Let's exit R and edit our .Rprofile.

Note

When we ran renv::init() a local .Rprofile file was created with the code source("renv/activate.R"). This code is necessary "to automatically load and use the private [renv] library for new R sessions launched from the project root directory" (renv).

q()  
Save workspace image? [y/n/c]: n
nano .Rprofile

Delete the single line in the .Rprofile file, and paste the following, which was borrowed from a blog on data management, into the Rproject .Rprofile:

.First <- function() {
  # Create the standard project subdirectories (existing ones are left untouched)
  dir.create(file.path(getwd(), "figures"), showWarnings = FALSE)
  dir.create(file.path(getwd(), "outputs"), showWarnings = FALSE)
  dir.create(file.path(getwd(), "data"), showWarnings = FALSE)
  dir.create(file.path(getwd(), "docs"), showWarnings = FALSE)

  if (!("renv" %in% list.files())) {
    renv::init()
  } else {
    source("renv/activate.R")
  }

  cat("\nWelcome to your R-Project:", basename(getwd()), "\n")
}

Press Ctrl+O to save, Return to confirm the file name, and Ctrl+X to exit.

This code creates several directories (i.e., figures, outputs, data, and docs). It then either initializes the project for use with renv using renv::init() or, if the project is already initialized, activates the renv project library by sourcing renv/activate.R. Directories that already exist are not overwritten.

Note

You can change these directory names to whatever works best with your organization style. The key, however, is to stay as consistent as possible across projects.

Info

Using version control (Git via GitHub) is an even better way to manage data and share inputs, code, and results. You can easily manage a GitHub repository or create a new repository using the usethis package. Also, check out this resource to learn more about version control and R.

We will need to start a new R session for the .Rprofile to take effect.

R
q()

Save workspace image? [y/n/c]: n

Now we are ready to work with our project files.

Test it

Before we end today's lesson, let's test out renv.

We will create a new directory and transfer our R script and lock file. We will then restore the project and run our R script.

#change directory to the parent directory (R_on_Biowulf)
cd ..
#make test directory
mkdir renv_test
#copy files to test directory
cp MyNewProject/renv.lock renv_test/
cp MyNewProject/R/DESeq2_airway.R renv_test/  
#change directory to test directory
cd renv_test
#load R module if not already loaded and start R session  
module load R/4.2.2 
R
#Restore renv library
renv::restore()

Now, we can run our script.

source("DESeq2_airway.R")

Note

The R version used to restore a project from the MyNewProject renv.lock file should match the R version recorded in that lockfile (here, R 4.2.2).

Also, because the required packages were already in our renv cache, updating the test library was much faster.
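
A quick way to check the R version before restoring is sketched below. The value "4.2.2" is taken from this lesson and is hard-coded here as an assumption; in practice you would use whatever version your own lockfile records.

```r
# R version recorded in the lockfile for this lesson (hard-coded assumption)
expected_r <- "4.2.2"

# R version of the current session
running_r <- as.character(getRversion())

versions_match <- identical(running_r, expected_r)
if (versions_match) {
  message("R ", running_r, " matches the lockfile.")
} else {
  message("Running R ", running_r, " but the lockfile records R ", expected_r, ".")
}
```

On Biowulf, a mismatch would usually be resolved by loading the matching R module before starting R.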

Next Lesson

In the final lesson, we will learn more about submitting jobs using R on Biowulf.

Acknowledgements

Making your analysis portable and reproducible from DIY transcriptomics.
Efficient Data Management in R.
R Packages with renv
