Reproducible R with Git

For scientists who analyze data through coding, versioning tools such as Git make tracking and undoing changes easy. As the images below show, without a versioning tool the number of files and file names generated through iterations of a project becomes cumbersome to keep track of. Researchers can avoid this by using versioning tools such as Git, where the history of a single file is saved rather than many renamed copies. Not only can Git be used for tracking changes locally on a personal computer, it also facilitates collaboration and code sharing on platforms such as GitHub. The goal of this Coding Club session is to get participants acquainted with versioning on a local computer using Git and R Studio. Concepts learned can be easily applied to versioning via the command line or other Integrated Development Environments (IDEs) such as Jupyter Lab and VS Code.

For an introduction to versioning with Git in general, see BTEP's Version control using Git class. That class focused on versioning on a local computer using Git via the command line. In this Coding Club session, however, participants will learn how to version using Git in R Studio.

Learning Objectives

After this class, participants will:

  • Understand the benefits of versioning.
  • Know the rationale for using R Studio projects.
  • Be informed of how to set up Git for versioning in R Studio.
  • Be able to perform versioning tasks such as:
    • Staging and committing changes.
    • Viewing the history of a file.
    • Reverting to a previous version of a file.
    • Deleting a file from tracking.
  • Be able to share code on GitHub.

R Packages Used

  • usethis
  • gitcreds

Creating an R Studio Project

The first step to versioning with Git and R Studio is to create a project. Projects are useful because they are self-contained folders that hold code and input. When a project is shared, collaborators can save it in any folder on their local computer. Then, as long as the project is opened in R Studio, they can re-run a script. In other words, projects let code refer to input through relative paths, so scripts can be run regardless of where the project is stored on the local disk.

In this example, a project called git_in_rstudio will be created in the instructor's local Documents folder. The first step is to click "File" in the R Studio menu and select "New Project".

In the subsequent dialogue box, choose to create the project in a new working directory. There are options to create a project from existing folder or open an already versioned project.

Next, choose to start a "New Project".

Finally, enter the name of the project directory (i.e., git_in_rstudio) and then click "Create Project".

In the top right corner of the R Studio window, users will see the project once it has been created. Files and directories within the project folder appear in the "Files" pane of R Studio. Users can list all of the files that Git should not track in .gitignore.

Note

"Git uses the .git folder to store all the information about the project, including the tracked files and sub-directories located within the project’s directory. If we ever delete the .git subdirectory, we will lose the project’s history." -- https://swcarpentry.github.io/git-novice/03-create.html.

After creating the project, start a new R script and name it git_in_rstudio.R.
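For readers who prefer the command line, the sketch below shows roughly what a new Git repository looks like on disk: the hidden .git folder mentioned in the note and a .gitignore file listing what Git should not track. It runs in a throwaway temporary directory, and the ignore patterns are illustrative assumptions rather than the exact entries that R Studio or usethis writes.

```shell
# Work in a throwaway directory so no real project is touched.
demo_dir="$(mktemp -d)"
cd "$demo_dir"

# Initialise a repository; this creates the hidden .git folder.
git init -q

# Example ignore patterns for R session files (illustrative only).
printf '%s\n' '.Rhistory' '.RData' '.Rproj.user' > .gitignore

# Create one file that matches an ignore pattern and one ordinary script.
touch .Rhistory git_in_rstudio.R

# Untracked files are listed with "??"; ignored files are not listed at all.
status="$(git status --porcelain)"
echo "$status"
```

Deleting the .git folder in this temporary directory would discard the repository's history, which is why the note above warns against removing it in a real project.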

If the "Create a git repository" option was not available when creating the project, R Studio does not know the path to Git on the computer. However, users can still create the project.

To resolve this issue, go to the command line and type one of the following to find the path to Git.

For Macs:

which git

For instance, on the instructor's local computer, the path is /opt/homebrew/bin/git.

For Windows:

where git

Then, go to "Tools" in the R Studio menu, select "Global Options", and choose "Git/SVN" to specify the path to the Git executable.
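Before entering the path in R Studio, it can help to confirm that the path points to a working Git executable. The following is a small sketch for macOS or Linux shells; the variable name git_path is only for illustration.

```shell
# Locate the Git executable on the PATH (POSIX equivalent of "which git").
git_path="$(command -v git)"
echo "Git found at: $git_path"

# Confirm that the executable runs and report its version.
git --version
```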

After the path to the Git executable has been set, navigate to "Tools" in the R Studio menu and select "Version Control" and then "Project Setup".

In the subsequent dialogue box, choose "Git/SVN" from the left side menu. Then select Git as the version control system.

Hit "Yes" to confirm the set up of a new Git repository.

Users can perform versioning tasks such as commit, push, pull, or view history for individual files from the toolbar.

Project wide versioning tasks can be performed in R Studio's Environment pane by clicking on the Git tab. Here, the .gitignore and .Rproj files are visible along with scripts.

Setting up Version Control for R Studio Project (usethis)

Because this class focuses on using Git with R Studio, the usethis package is a more fitting way to set up version control. Go to the console in R Studio and load the usethis package. If it is not installed, run install.packages("usethis") first.

library(usethis)

Next, find out what the use_git command does.

?use_git

In the R Studio Help pane, the following information about use_git will be shown.

Initialise a git repository
Description
use_git() initialises a Git repository and adds important files to .gitignore. If user consents, it also makes an initial commit.

Usage
use_git(message = "Initial commit")
Arguments
message 
Message to use for first commit.

Then run the command in the console:

use_git()

Users will be asked whether it is okay to make an initial commit with a commit message; select no for now. Then, select yes to restart R Studio and complete setting up the Git repository in the project.

Summary

The above sections demonstrated the first steps in versioning using Git in R Studio. These include creating an R Studio project and initializing a Git repository. The term repository simply describes a collection of files whose history is tracked by Git.

Configuring Git

It is important to set some configurations for Git so that it can keep track of who made changes and how to contact them in case questions arise. The use_git_config command from usethis sets these configurations.

?use_git_config
Configure Git
Description
Sets Git options, for either the user or the project ("global" or "local", in Git terminology). Wraps gert::git_config_set() and gert::git_config_global_set(). To inspect Git config, see gert::git_config().

Usage
use_git_config(scope = c("user", "project"), ...)
Arguments
scope   
Edit globally for the current user, or locally for the current project

... 
Name-value pairs, processed as <dynamic-dots>.

Value
Invisibly, the previous values of the modified components, as a named list.

See Also
Other git helpers: use_git(), use_git_hook(), use_git_ignore()

Examples

## Not run: 
# set the user's global user.name and user.email
use_git_config(user.name = "Jane", user.email = "jane@example.org")

The use_git_config call below adds the user's name and email to the Git configuration.

use_git_config(user.name="first.last", user.email="user@nih.gov")
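The same settings can be made with Git itself on the command line. The sketch below sets the name and email at project ("local") scope inside a temporary repository and reads them back; adding the --global flag to git config would correspond to the "user" scope instead. The name and email are the placeholders used above.

```shell
# Temporary repository so the demo does not alter any real configuration.
demo_dir="$(mktemp -d)"
cd "$demo_dir"
git init -q

# Project ("local") scope: these values apply to this repository only.
git config user.name "first.last"
git config user.email "user@nih.gov"

# Read the values back to confirm what Git will record on each commit.
configured_name="$(git config user.name)"
configured_email="$(git config user.email)"
echo "$configured_name <$configured_email>"
```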

Staging and Committing Changes

Open the blank git_in_rstudio.R script. In the Git pane, check the boxes under the "Staged" column next to git_in_rstudio.R and git_in_rstudio.Rproj to stage these two files for commit. Once the files have been staged, users will see the status icon change from "?" to "A", where "A" stands for added.

Note

"The staging area is composed of file(s) that Git should track the history of (but no history has been saved at that point)" --https://bioinformatics.ccr.cancer.gov/docs/btep-coding-club/CC2024/version_control_git_cli/version_control_git_cli/.

Next, hit the "Commit" button to open the commit window. Committing is the process of writing and saving changes to file(s). Be sure to include an informative commit message at this step, then hit the "Commit" button when ready.

After the commit, users will see a message indicating the number of files that were changed. Close this window when ready to review the changes.

In the review changes window, users can see the history, commit message, date and time of the commit, as well as the author who made the commit. The value labeled "SHA" is the commit ID. Note that in the subject column there is the word "HEAD", which refers to "the most recent commit to the current checkout branch" (source: https://www.geeksforgeeks.org/git-head/).
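The staging and committing steps above map onto a few Git commands. The following is a minimal command-line sketch run in a temporary repository with a placeholder identity; the commit message is an example only.

```shell
# Temporary repository with a placeholder identity for the demo.
demo_dir="$(mktemp -d)"
cd "$demo_dir"
git init -q
git config user.name "first.last"
git config user.email "user@nih.gov"

# Create the two empty files that the walkthrough stages.
touch git_in_rstudio.R git_in_rstudio.Rproj

# Stage both files; their status changes from "??" (untracked) to "A" (added).
git add git_in_rstudio.R git_in_rstudio.Rproj
staged="$(git status --porcelain)"
echo "$staged"

# Commit the staged files with an informative message.
git commit -q -m "Add empty script and project file"

# HEAD now points at this commit; rev-parse prints its full commit ID (SHA).
head_sha="$(git rev-parse HEAD)"
git log --oneline
```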

Tracking History

Add the following comment line to git_in_rstudio.R. Hit save afterwards and notice that the script now has the "M" (modified) status. Check the "Staged" box to add these changes for commit. Remember to add a meaningful commit message.

# This script demonstrates versioning using Git in R Studio.

In the review changes window, users can:

  • See that HEAD now points to the most recent commit, along with the commit message.
  • See the lines that were added (highlighted in green).
  • View the file at any commit by clicking on "View file @ ########", where "########" is the ID that corresponds to a commit.

Next, add the following and then commit.

# Load data
data(mtcars)

After committing the above changes, users will see that lines added in the current version are highlighted in green, while lines from the previous version that were changed or removed are highlighted in red. Next, add the following and commit again.

# Take a look at mtcars
View(mtcars)

Lines that are deleted in the current commit are highlighted in red. For instance, add the following to git_in_rstudio.R.

# Load tidyverse
library(tidyverse)

# Get average mpg for each number of cylinders
mtcars %>% group_by(cyl) %>% summarise(mean_mpg=mean(mpg))

Then, delete the following and commit again.

# Take a look at mtcars
View(mtcars)

Finally, add the code below to generate a box plot showing the distribution of mpg by number of cylinders. Commit when done.

# Look at distribution of mpg by cylinder using box plot
ggplot(mtcars, aes(x=as.factor(cyl), y=mpg))+geom_boxplot()+xlab("number of cylinders")
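The history view can also be reproduced on the command line. The sketch below, run in a temporary repository with a placeholder identity, commits a short version of the script, removes the View(mtcars) lines in a second commit, and then inspects the difference. Removed lines appear with a leading "-" (shown in red in R Studio), and git show plays the role of "View file @".

```shell
# Temporary repository with a placeholder identity for the demo.
demo_dir="$(mktemp -d)"
cd "$demo_dir"
git init -q
git config user.name "first.last"
git config user.email "user@nih.gov"

# First version: load and view the data.
printf '%s\n' '# Load data' 'data(mtcars)' \
  '# Take a look at mtcars' 'View(mtcars)' > git_in_rstudio.R
git add git_in_rstudio.R
git commit -q -m "Load and view mtcars"
first_commit="$(git rev-parse HEAD)"

# Second version: the View lines are deleted.
printf '%s\n' '# Load data' 'data(mtcars)' > git_in_rstudio.R
git commit -q -am "Remove View(mtcars)"

# List the commits that touched the script.
git log --oneline -- git_in_rstudio.R

# In the diff, lines removed since the first commit start with "-".
removed="$(git diff "$first_commit" HEAD -- git_in_rstudio.R | grep '^-View')"
echo "$removed"

# Print the script exactly as it was at the first commit.
old_version="$(git show "$first_commit":git_in_rstudio.R)"
```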

Sharing Code on GitHub

Suppose that calculating the average mpg by cylinder and obtaining a box plot of the mpg distribution by cylinder satisfies the goal of the analysis. Users can then share this code on GitHub.

Setting GitHub Credentials

If users have not already done so, create a GitHub token using the command below from the usethis package.

create_github_token()

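Tokens can also be generated from the GitHub website. See https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/managing-your-personal-access-tokens to learn how. Once a token has been generated, store it with gitcreds_set() from the gitcreds package, which prompts for the token.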

library(gitcreds)
gitcreds_set()

Pushing Code to GitHub

Then, to push to GitHub, use the following from usethis.

use_github()
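On the command line, the pushing part of this step amounts to registering a remote and pushing to it. The sketch below is an approximation only: it uses a local bare repository as a stand-in for GitHub so that it runs without a network connection or token, and it does not create a GitHub repository the way use_github() does. With a real GitHub repository, the remote URL would be that repository's URL instead of the temporary path.

```shell
# A bare repository in a temporary folder stands in for the GitHub remote.
remote_dir="$(mktemp -d)"
git init -q --bare "$remote_dir"

# Local working repository with a placeholder identity.
work_dir="$(mktemp -d)"
cd "$work_dir"
git init -q
git config user.name "first.last"
git config user.email "user@nih.gov"

echo '# This script demonstrates versioning using Git in R Studio.' > git_in_rstudio.R
git add git_in_rstudio.R
git commit -q -m "Add script"

# Register the remote under the conventional name "origin".
git remote add origin "$remote_dir"

# Push the current branch and remember origin as its upstream.
branch="$(git rev-parse --abbrev-ref HEAD)"
git push -q -u origin "$branch"

# The remote branch should now point at the same commit as the local one.
local_sha="$(git rev-parse HEAD)"
remote_sha="$(git ls-remote origin "refs/heads/$branch" | cut -f1)"
```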

Deleting GitHub Credentials

On Macs, once the GitHub credentials have been set using gitcreds_set(), the token is stored in the keychain. Use gitcreds_delete() to remove the stored GitHub credential.

Push Code to GitHub using GUI

Users can also use the green "Push" button to send code to GitHub.

Add the following code then save and commit the changes.

# Relationship between horsepower and quarter mile time
ggplot(mtcars, aes(x=hp,y=qsec))+geom_point()

Click on the green "Push" button in the Git tab of the Environment pane to send the updated git_in_rstudio.R script to GitHub.

Mac users will have to enter their computer password when prompted.

The message shown in the screen capture below will appear when the code has been successfully pushed to GitHub.

Delete a File from Tracking

Create a script called git_in_rstudio1.R and add the following. Save, stage, commit, and push to GitHub.

data("ToothGrowth")
library(tidyverse)
ggplot(ToothGrowth, aes(x=as.factor(dose), y=len, color=supp))+geom_boxplot()

Suppose that the user wants to remove git_in_rstudio1.R from tracking. Check the box next to the file in the "Files" pane and hit "Delete". Notice that a red "D" appears next to the deleted file in the Git pane.

Check the Staged column to stage git_in_rstudio1.R for commit and then go ahead and commit the changes.

A message will appear saying that git_in_rstudio1.R has been removed. Make sure that the git_in_rstudio1.R script was removed as well.
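A command-line sketch of the same deletion is shown below, using a temporary repository and a placeholder identity. Here git rm deletes the file and stages the deletion in one step, which corresponds to deleting the file and then ticking the Staged box.

```shell
# Temporary repository with a placeholder identity for the demo.
demo_dir="$(mktemp -d)"
cd "$demo_dir"
git init -q
git config user.name "first.last"
git config user.email "user@nih.gov"

# Track a second script, as in the walkthrough.
echo 'data("ToothGrowth")' > git_in_rstudio1.R
git add git_in_rstudio1.R
git commit -q -m "Add ToothGrowth script"

# Delete the file from disk and stage the deletion (status "D").
git rm -q git_in_rstudio1.R
deleted_status="$(git status --porcelain)"
echo "$deleted_status"

# Commit the deletion; the file is no longer tracked.
git commit -q -m "Remove git_in_rstudio1.R"
tracked="$(git ls-files)"
```

To stop tracking a file while keeping it on disk, git rm --cached can be used in place of git rm.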

Restoring Deleted File

To restore git_in_rstudio1.R but not track it, click on "History" in the Git pane and select the commit for the file that the user wants to restore. Click on "View file @" in the version comparison box to pull up the script.

Once the script is pulled up, click the save icon to save.
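The restore step can be sketched on the command line as well. In the temporary repository below, the file is committed, deleted in a second commit, and then written back to disk from the earlier commit with git show. Because the restored file is not staged, Git reports it as untracked ("??"). HEAD~1 is used here because the file was deleted in the latest commit; in a real project the commit ID would be taken from the history.

```shell
# Temporary repository with a placeholder identity for the demo.
demo_dir="$(mktemp -d)"
cd "$demo_dir"
git init -q
git config user.name "first.last"
git config user.email "user@nih.gov"

# Commit the script, then delete it in a second commit.
echo 'data("ToothGrowth")' > git_in_rstudio1.R
git add git_in_rstudio1.R
git commit -q -m "Add ToothGrowth script"
git rm -q git_in_rstudio1.R
git commit -q -m "Remove git_in_rstudio1.R"

# Write the file as it was one commit ago back to disk, without staging it.
git show HEAD~1:git_in_rstudio1.R > git_in_rstudio1.R

# The file exists again but is untracked.
restored_status="$(git status --porcelain)"
echo "$restored_status"
```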

Branches and Merging

# Get average quarter mile time by cylinder
mtcars %>% group_by(cyl) %>% summarise(mean_qsec=mean(qsec))

Click "Pull"

>>> /opt/homebrew/Cellar/git/2.49.0/bin/git pull --rebase
From https://github.com/JWrows2014/git_in_rstudio
 * [new branch]      development1 -> origin/development1
Already up to date.

To copy the version of git_in_rstudio.R on the development1 branch into the current branch:

git checkout development1 git_in_rstudio.R
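As a self-contained illustration of the checkout command above, the sketch below creates a development1 branch in a temporary repository, commits the quarter mile summary there, switches back to the starting branch, and then brings over only git_in_rstudio.R from development1. The repository, identity, and commit messages are assumptions for the demo; in the session the branch came from GitHub via "Pull".

```shell
# Temporary repository with a placeholder identity for the demo.
demo_dir="$(mktemp -d)"
cd "$demo_dir"
git init -q
git config user.name "first.last"
git config user.email "user@nih.gov"

# Base version of the script on the starting branch.
printf '%s\n' '# Load data' 'data(mtcars)' > git_in_rstudio.R
git add git_in_rstudio.R
git commit -q -m "Load mtcars"
base_branch="$(git rev-parse --abbrev-ref HEAD)"

# Create development1 and commit the new summary there.
git checkout -q -b development1
printf '%s\n' '# Get average quarter mile time by cylinder' \
  'mtcars %>% group_by(cyl) %>% summarise(mean_qsec=mean(qsec))' >> git_in_rstudio.R
git commit -q -am "Add mean qsec by cylinder"

# Return to the starting branch; the script is back to its base version.
git checkout -q "$base_branch"

# Copy the file from development1 into the working tree and staging area.
git checkout development1 -- git_in_rstudio.R
git commit -q -m "Bring in qsec summary from development1"
```

The "--" separates the branch name from the file path, which avoids ambiguity if a file and a branch share a name.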

Useful learning resources