Reproducible R with Git

For scientists who analyze data through coding, versioning tools such as Git make tracking and undoing changes easy. The image below shows that without versioning, the number of files and file names created through iterations of a project becomes cumbersome to keep track of. Researchers can avoid this by using a versioning tool such as Git, which saves the history of a single file throughout the project. Git can be used not only for tracking changes locally on a personal computer but also for collaboration and code sharing on platforms such as GitHub. The goal of this Coding Club session is to get participants acquainted with versioning on a local computer using Git and R Studio. The concepts learned can be easily applied to versioning via the command line or other Integrated Development Environments (IDEs) such as Jupyter Lab and VS Code.

For an introduction to versioning with Git, see BTEP's Version control using Git class, which focused on versioning on a local computer using Git via the command line. In this Coding Club session, however, participants will learn to version using Git in R Studio. The fundamental principles are the same regardless of whether versioning is done through the command line or R Studio. However, the ability to use Git within R Studio makes it convenient to track changes in R coding projects.

Tip

Git can be used to track different file types including plain text, tab-separated (TSV) or comma-separated (CSV) tabular data, markdown (including Quarto), scripts, and Jupyter Notebooks.

Learning Objectives

After this class, participants will:

  • Understand the benefits of versioning.
  • Know the rationale for using R Studio projects.
  • Be informed of how to set up Git for versioning in R Studio.
  • Be able to perform versioning tasks such as:
    • Staging and committing changes.
    • Viewing history of a file.
    • Reverting to a previous version of a file.
    • Deleting a file from tracking.
  • Share code on GitHub.

R Packages Used

The following R packages will be used in this class.

  • usethis
  • gitcreds

Creating an R Studio Project

The first step to versioning with Git in R Studio is to create a project. Projects are useful because they are self-contained folders that hold code and input. When a project is shared, collaborators can save it in any folder on their local computer, open it in R Studio, and re-run a script. In other words, projects use paths relative to the project folder, so the code can be run regardless of where the project is stored on the local disk.

In this example, a project called git_in_rstudio will be created in the Desktop directory of the instructor's local computer. The first step is to click "File" in the R Studio menu and select "New Project".

In the subsequent dialogue box, choose to create the project in a new working directory. There are also options to create a project from an existing folder or to download a project from GitHub.

Next, choose to start a "New Project".

Finally, enter the name of the project (i.e., git_in_rstudio) and then click "Create Project".

In the top right corner of the R Studio window, users will see the project once it has been created. Files and directories within the project folder appear in the "Files" pane of R Studio. Users can list all of the files that Git should not track in .gitignore.
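
As a minimal command-line sketch of how .gitignore behaves, the commands below create a throwaway repository, list two patterns in .gitignore, and ask Git whether a matching file is ignored. The temporary directory and the two patterns (.Rhistory and .Rproj.user/) are illustrative assumptions, not necessarily the exact entries R Studio writes.

```shell
# Create a throwaway repository (location chosen by mktemp; hypothetical)
repo="$(mktemp -d)"
cd "$repo"
git init -q

# List patterns that Git should not track (illustrative choices)
printf '%s\n' '.Rhistory' '.Rproj.user/' > .gitignore

# Create a file that matches one of the patterns
touch .Rhistory

# git check-ignore prints the path when it is ignored
ignored="$(git check-ignore .Rhistory)"
echo "Ignored by Git: $ignored"
```

Files matching a pattern in .gitignore do not appear as untracked in git status, which keeps the Git pane free of files that were never meant to be versioned.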

Note

"Git uses the .git folder to store all information about the project, including the tracked files and sub-directories located within the project’s directory. If we ever delete the .git subdirectory, we will lose the project’s history." -- https://swcarpentry.github.io/git-novice/03-create.html.

After creating the project, start a new R script and name it git_in_rstudio.R.

If the "Create a git repository" option was not available when creating the project, R Studio does not know the path to Git on the computer. However, users can still create the project.

To resolve this issue, go to the command line and type one of the following to find the path to Git.

For Macs:

which git

For instance, on the instructor's local computer, the path is /opt/homebrew/bin/git.

For Windows:

where git

Then, go to "Tools" in the R Studio menu, select "Global Options", and choose "Git/SVN" to specify the path to the Git executable. Note that SVN (also known as Subversion) is another versioning system, but this class focuses solely on Git.

After the path to the Git executable has been set, navigate to "Tools" in the R Studio menu and select "Version Control" and then "Project Setup".

In the subsequent dialogue box, choose "Git/SVN" from the left side menu. Then select Git as the version control system. Click "OK" and then hit "Yes" to confirm the setup of a new Git repository.

Users can perform versioning tasks such as commit, push, pull, or view history from the menu bar on top of the script pane.

Versioning tasks can be performed in R Studio's Git pane.

Setting up Version Control for R Studio Project (usethis)

The usethis package can also be used to set up version control for R Studio projects. Go to the console in R Studio and load the usethis package. If it is not installed, run install.packages("usethis") first.

library(usethis)

Next, find out what the use_git command does.

?use_git

In the R Studio Help pane, the following information about use_git will be shown. It is apparent that this command will create a new Git repository and set up the R Studio project for tracking.

Initialise a git repository
Description
use_git() initialises a Git repository and adds important files to .gitignore. If user consents, it also makes an initial commit.

Usage
use_git(message = "Initial commit")
Arguments
message 
Message to use for first commit.

To initialize the repository, run use_git() in the console.

use_git()

Users will be asked whether it is okay to make an initial commit with a commit message. Select no for now. Then, select yes to restart R Studio and complete setting up the Git repository for the project.

Summary

The above sections demonstrated the first steps in versioning using Git in R Studio. These include creating an R Studio project and initiating a Git repository. The term repository just describes a collection of files whose history is tracked by Git.
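
For readers who prefer the terminal, the repository setup summarized above can be sketched with plain Git. This is a rough command-line counterpart, not what R Studio or usethis runs internally; the temporary directory is a hypothetical stand-in for the project folder.

```shell
# Hypothetical stand-in for the project folder
repo="$(mktemp -d)"
cd "$repo"

# Turn the folder into a Git repository
git init -q

# The hidden .git folder now holds everything Git knows about the project
ls -d .git

# Ask Git whether the current folder is under version control
inside="$(git rev-parse --is-inside-work-tree)"
echo "Inside a Git work tree: $inside"
```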

Configuring Git

It is important to set some configurations for Git so that it can keep track of who made changes and how to contact them in case questions arise. The use_git_config command from usethis sets these configurations.

?use_git_config
Configure Git
Description
Sets Git options, for either the user or the project ("global" or "local", in Git terminology). Wraps gert::git_config_set() and gert::git_config_global_set(). To inspect Git config, see gert::git_config().

Usage
use_git_config(scope = c("user", "project"), ...)
Arguments
scope   
Edit globally for the current user, or locally for the current project

... 
Name-value pairs, processed as <dynamic-dots>.

Value
Invisibly, the previous values of the modified components, as a named list.

See Also
Other git helpers: use_git(), use_git_hook(), use_git_ignore()

Examples

## Not run: 
# set the user's global user.name and user.email
use_git_config(user.name = "Jane", user.email = "jane@example.org")

The use_git_config call below adds the user's name and email to the configurations.

use_git_config(user.name="first.last", user.email="user@nih.gov")

To view the configuration file, use edit_git_config().
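
The same settings can be written and read back with git config on the command line. The sketch below uses a throwaway repository and project-level ("local") scope so that it does not touch the user's global settings; adding --global to the two write commands would set them for the current user instead. The name and email are the placeholder values used above.

```shell
# Throwaway repository so the user's global settings stay untouched
repo="$(mktemp -d)"
cd "$repo"
git init -q

# Project-level ("local") settings; add --global for user-level scope
git config user.name "first.last"
git config user.email "user@nih.gov"

# Read the values back from the project configuration
configured_name="$(git config --local --get user.name)"
configured_email="$(git config --local --get user.email)"
echo "$configured_name <$configured_email>"
```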

Staging and Committing Changes

Open the blank git_in_rstudio.R script. In the Git pane, check the boxes under the "Staged" column next to git_in_rstudio.R and git_in_rstudio.Rproj to stage these two files for commit. Once the files have been staged, users will see the status icon change from "?" to "A", where "A" stands for added.

Note

"The staging area is composed of file(s) that Git should track the history of (but no history has been saved at that point)" --https://bioinformatics.ccr.cancer.gov/docs/btep-coding-club/CC2024/version_control_git_cli/version_control_git_cli/.

Next, hit the "Commit" button in the Git pane to open the review changes window. Committing is the process of writing and saving changes to file(s). Be sure to include an informative commit message at this step, then hit the "Commit" button in that window when ready.

After the commit, users will see a message indicating the number of files whose history was saved. Close this message window when ready to review changes.

Summary

The process of tracking changes using Git starts with staging or adding file(s) for commit. This does not save the file(s) history but tells Git to track it. When committed, the history of file(s) at a given point in time is saved.
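
The stage-then-commit cycle summarized above looks like this on the command line. This is a minimal sketch in a throwaway repository; the commit message is illustrative, and the name and email are the placeholder values from the configuration step.

```shell
# Throwaway repository with the placeholder identity from earlier
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "first.last"
git config user.email "user@nih.gov"

# An empty script, as in the walkthrough
touch git_in_rstudio.R

# Untracked files are listed with "??" (the "?" icon in the Git pane)
git status --short

# Stage the file; its status becomes "A" (added)
git add git_in_rstudio.R
staged_status="$(git status --short)"
echo "$staged_status"

# Commit: save the staged snapshot together with an informative message
git commit -q -m "Add empty analysis script"

# Count the commits recorded so far
commit_count="$(git rev-list --count HEAD)"
echo "Commits so far: $commit_count"
```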

Click on "History" in the Git pane to pull up the review changes window, where users can see the history, commit message, date and time of commit, as well as the author who made the commit. The number labeled "SHA" is the commit ID. Note that in the subject column there is the word "HEAD", which refers to "the most recent commit to the current checkout branch" (source: https://www.geeksforgeeks.org/git-head/).

Tracking History

Add the following comment line to git_in_rstudio.R. Hit save afterwards and notice that the script is now under the M or modified status. Check the "Staged" box to add these changes for commit. Remember to add a meaningful commit message.

# This script demonstrates versioning using Git in R Studio.

In the review changes window, users can:

  • See that HEAD is now pointing to the most current commit along with the commit message.
  • View lines that were added (highlighted in green).
  • View the file for any commit by clicking on "View file @ ########", where "########" is the ID that corresponds to a commit.

Next, add the following and then commit.

# Load data
data(mtcars)

After committing the above changes, users will see that the code added in the current version is highlighted in green.

Use the View command to take a look at mtcars and commit the changes.

# Take a look at mtcars
View(mtcars)

Lines that are deleted in the current commit are highlighted in red. For instance, add the following to git_in_rstudio.R.

# Load tidyverse
library(tidyverse)

# Get average mpg for each number of cylinders
mtcars %>% group_by(cyl) %>% summarise(mean_mpg=mean(mpg))

Then, delete the following and commit again.

# Take a look at mtcars
View(mtcars)

Finally, add the code below to generate a box plot showing the distribution of mpg by number of cylinders. Commit when done.

# Look at distribution of mpg by cylinder using box plot
ggplot(mtcars, aes(x=as.factor(cyl), y=mpg))+geom_boxplot()+xlab("number of cylinders")
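
The history that R Studio displays in the review changes window can also be inspected from a terminal. The sketch below rebuilds the first two commits of this section in a throwaway repository and then lists the history, shows the lines that changed, and prints an earlier version of the file. Commit messages are illustrative assumptions.

```shell
# Throwaway repository with the placeholder identity from earlier
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "first.last"
git config user.email "user@nih.gov"

# First version: the header comment
echo '# This script demonstrates versioning using Git in R Studio.' > git_in_rstudio.R
git add git_in_rstudio.R
git commit -q -m "Add header comment"

# Second version: load the data
printf '%s\n' '# Load data' 'data(mtcars)' >> git_in_rstudio.R
git add git_in_rstudio.R
git commit -q -m "Load mtcars"

# One line per commit, newest first (abbreviated commit ID, then message)
git --no-pager log --oneline

# Lines added (+) or removed (-) between the last two commits
git --no-pager diff HEAD~1 HEAD

# The file as it was one commit ago (like "View file @" in R Studio)
previous_version="$(git show HEAD~1:git_in_rstudio.R)"
echo "$previous_version"
```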

Sharing Code on GitHub

Suppose that calculating the average mpg by cylinder and obtaining a box plot of the mpg distribution by cylinder satisfies the goal of the analysis. Users can then share this code on GitHub.

Setting GitHub Credentials

Users who have not already done so can create a GitHub token using the command below from the usethis package.

create_github_token()

Tokens can also be generated from the GitHub website. See https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/managing-your-personal-access-tokens to learn how. Once a token is available, store it using gitcreds_set() from the gitcreds package.

library(gitcreds)
gitcreds_set()

When gitcreds_set() is run and credentials already exist, users will be given options to abort and keep the existing credentials, replace them, or view the stored password or token.

-> What would you like to do? 

1: Abort update with error, and keep the existing credentials
2: Replace these credentials
3: See the password / token

Selection: 

Pushing Code to GitHub

Then, to push to GitHub, use the following from usethis.

use_github()
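
At the Git level, pushing amounts to registering a remote and sending commits to it. The sketch below is a rough illustration of that idea rather than what use_github() does internally: a local bare repository stands in for GitHub so the example runs without an account or token. With a real GitHub repository, the remote would be the repository's URL instead. It assumes Git 2.28 or newer for the -b option of git init.

```shell
# A local bare repository stands in for GitHub (assumption for this sketch)
remote="$(mktemp -d)"
git init -q --bare "$remote"

# Throwaway project repository with one commit on the main branch
repo="$(mktemp -d)"
cd "$repo"
git init -q -b main
git config user.name "first.last"
git config user.email "user@nih.gov"
echo 'data(mtcars)' > git_in_rstudio.R
git add git_in_rstudio.R
git commit -q -m "Load mtcars"

# Register the remote under the conventional name "origin"
git remote add origin "$remote"

# Push main and record origin/main as its upstream branch
git push -q -u origin main

# The remote now holds the same commit as the local branch
local_commit="$(git rev-parse HEAD)"
remote_commit="$(git --git-dir="$remote" rev-parse main)"
echo "local:  $local_commit"
echo "remote: $remote_commit"
```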

Deleting GitHub Credentials

On Macs, once the GitHub credentials have been set using gitcreds_set(), the token is stored in the Keychain. Use gitcreds_delete() to remove the stored GitHub credential.

Push Code to GitHub using GUI

The green "Push" button in the Git pane can also be used to send code to GitHub.

Add the following code then save and commit the changes.

# Relationship between horsepower and quarter mile time
ggplot(mtcars, aes(x=hp,y=qsec))+geom_point()

Click on the green "Push" button on the Git pane to send the updated git_in_rstudio.R script to GitHub.

Users will have to enter their computer password when prompted.

The message shown in the screen capture below will appear when the code has been successfully pushed to GitHub.

Delete a File from Tracking

Create a script called git_in_rstudio1.R and add the following. Save, stage, commit, and push to GitHub.

data("ToothGrowth")
library(tidyverse)
ggplot(ToothGrowth, aes(x=as.factor(dose), y=len, color=supp))+geom_boxplot()

Suppose that the user wants to remove git_in_rstudio1.R from tracking. To do so, check the file in the "Files" pane and hit "Delete". Notice that a red "D" appears next to the deleted file in the Git pane.

Check the box next to git_in_rstudio1.R in the Staged column to stage the deletion, and then go ahead and commit the changes.

A message will appear saying that git_in_rstudio1.R has been removed. After pushing, make sure that the git_in_rstudio1.R script was removed on GitHub as well.
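
On the command line, deleting a file and staging the deletion can be done in one step with git rm. The sketch below uses a throwaway repository and a one-line version of the script; the commit messages are illustrative.

```shell
# Throwaway repository with the placeholder identity from earlier
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "first.last"
git config user.email "user@nih.gov"

# Track a one-line version of the script
echo 'data("ToothGrowth")' > git_in_rstudio1.R
git add git_in_rstudio1.R
git commit -q -m "Add ToothGrowth script"

# Delete the file and stage the deletion in one step
git rm -q git_in_rstudio1.R

# The deletion is listed with "D", like the red "D" in the Git pane
git status --short

# Record the deletion
git commit -q -m "Remove ToothGrowth script"

# No files are tracked any more
tracked_files="$(git ls-files)"
echo "Tracked files: ${tracked_files:-none}"
```

The file disappears from the working directory and from future commits, but earlier commits still contain it, which is what makes restoring it possible.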

Restoring Deleted File

To restore git_in_rstudio1.R, click on "History" in the Git pane and select the commit for the file that the user wants to restore. Click on "View file @" in the version comparison box to retrieve the script.

Once the script is retrieved, click the save icon to save.
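
A command-line sketch of the same recovery is shown below, assuming the file was deleted in the most recent commit so that the commit just before it (HEAD~1) still contains the script. In a real project, the commit ID shown in the history would be used in place of HEAD~1.

```shell
# Throwaway repository: add a script, then delete it in a second commit
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "first.last"
git config user.email "user@nih.gov"
echo 'data("ToothGrowth")' > git_in_rstudio1.R
git add git_in_rstudio1.R
git commit -q -m "Add ToothGrowth script"
git rm -q git_in_rstudio1.R
git commit -q -m "Remove ToothGrowth script"

# Copy the file out of the commit made before the deletion
git checkout HEAD~1 -- git_in_rstudio1.R

# The script is back in the working directory and staged as added
git status --short
restored="$(cat git_in_rstudio1.R)"
echo "$restored"
```

The same command with the name of an existing file reverts that file to the version stored in the chosen commit.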

Branches and Merging

During collaboration, branching in Git enables individuals to test code without influencing the main workflow that teammates are working on. When ready, code that is developed in branches can be merged into the main workflow.

Suppose that a collaborator made a branch called development1, added the following code and pushed to GitHub.

# Get average quarter mile time by cylinder
mtcars %>% group_by(cyl) %>% summarise(mean_qsec=mean(qsec))

Teammates can use the "Pull" button in the Git pane to bring the content of the development1 branch onto their local computer.

The message below will be shown when the contents of development1 on GitHub have been successfully transferred to the local computer. Git now knows about the development1 branch locally (as origin/development1).

>>> /opt/homebrew/Cellar/git/2.49.0/bin/git pull --rebase
From https://github.com/JWrows2014/git_in_rstudio
 * [new branch]      development1 -> origin/development1
Already up to date.

Click on the drop-down labeled "main" to view the available branches and to switch to a different Git branch.

To copy the version of git_in_rstudio.R in the development1 branch to the main branch, open a terminal in R Studio, change into the project directory, and make sure the main Git branch is checked out. Then use the git checkout command below to update the script in the main branch.

git checkout development1 git_in_rstudio.R

When the git_in_rstudio.R script on the main branch has been updated, the following message will appear in the terminal.

git_in_rstudio.R
Updated 1 path from 5d76e56

Staying in the terminal and on the main Git branch, type git status to see that the git_in_rstudio.R script has been modified. Users can then go back to the Git pane to stage and commit the changes.

git status
On branch main
Your branch is up to date with 'origin/main'.

Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
        modified:   git_in_rstudio.R
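
The branch workflow above can be rehearsed end to end in a throwaway repository. In this sketch the development1 branch is created locally rather than pulled from GitHub, the file contents are shortened, and the commit messages are illustrative. It assumes Git 2.28 or newer for the -b option of git init.

```shell
# Throwaway repository with one commit on main
repo="$(mktemp -d)"
cd "$repo"
git init -q -b main
git config user.name "first.last"
git config user.email "user@nih.gov"
echo '# Load data' > git_in_rstudio.R
git add git_in_rstudio.R
git commit -q -m "Start script on main"

# Create the development1 branch and commit new code there
git checkout -q -b development1
echo 'mtcars %>% group_by(cyl) %>% summarise(mean_qsec=mean(qsec))' >> git_in_rstudio.R
git commit -q -a -m "Add mean quarter mile time"

# Back on main, the script does not contain the new line yet
git checkout -q main

# Copy the development1 version of the script into main (staged automatically)
git checkout development1 -- git_in_rstudio.R
git status --short

# Commit the updated script on main
git commit -q -m "Bring in development1 version of script"
current_branch="$(git rev-parse --abbrev-ref HEAD)"
echo "On branch: $current_branch"
```

Copying a single file this way overwrites the main version with the branch version. Running git merge development1 instead would combine the full history of the two branches.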

Recap

As a recap, the process of versioning using Git in R Studio involves:

  1. Creating an R Studio project.
  2. Setting up a Git repository for the R Studio project.
  3. Staging files for commit.
  4. Committing changes to files. Remember to include an informative commit message.
  5. Pushing scripts and the project to GitHub when collaborating; this is also recommended for analysis reproducibility.

Of course, users also have the option to remove files from tracking, revert to previous versions, and merge work done on different Git branches.

Useful Learning Resources