Reproducible R with Git
For scientists who analyze data through code, versioning tools such as Git make it easy to track and undo changes. As the images below show, without a versioning tool the number of files and file names generated through the iterations of a project quickly becomes cumbersome to keep track of. Researchers can avoid this by using a versioning tool such as Git, which records the history of a single file instead of requiring a new copy for every revision. Git can be used not only for tracking changes locally on a personal computer; it also facilitates collaboration and code sharing on platforms such as GitHub. The goal of this Coding Club session is to get participants acquainted with versioning on a local computer using Git and R Studio. The concepts learned can easily be applied to versioning via the command line or in other Integrated Development Environments (IDEs) such as Jupyter Lab and VS Code.
For an introduction to versioning with Git in general, see BTEP's Version control using Git class, which focused on versioning on a local computer using Git via the command line. In this Coding Club session, however, participants will learn how to version using Git in R Studio.
Learning Objectives
After this class, participants will:
- Understand the benefits of versioning.
- Know the rationale for using R Studio projects.
- Know how to set up Git for versioning in R Studio.
- Be able to perform versioning tasks such as:
  - Staging and committing changes.
  - Viewing the history of a file.
  - Reverting to a previous version of a file.
  - Deleting a file from tracking.
  - Sharing code on GitHub.
R Packages Used
usethis
gitcreds
Creating an R Studio project
The first step to versioning with Git and R Studio is to create a project. Projects are useful because they are self-contained folders that hold both code and input. Collaborators who receive a shared R Studio project can save it in any folder on their local computer and, as long as the project is opened in R Studio, re-run its scripts. In other words, a project sets the working directory to the project folder, so relative paths to code and input resolve regardless of where the project is stored on the local disk.
In this example, a project called git_in_rstudio will be created in the instructor's local Documents folder. The first step is to click "File" in the R Studio menu and select "New Project".
In the subsequent dialogue box, choose to create the project in a new working directory. There are also options to create a project from an existing folder or to open an already versioned project.
Next, choose to start a "New Project".
Finally, enter the name of the project directory (i.e., git_in_rstudio) and then click "Create Project".
In the top right corner of the R Studio window, users will see the project name once the project has been created. Files and directories within the project folder appear in the "Files" pane of R Studio. Users can list all of the files that Git should not track in .gitignore.
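As a small illustration of how .gitignore behaves, the shell sketch below builds a throwaway repository in a temporary directory and asks Git which files it would skip. The file names (.Rhistory, analysis.R) and the two ignore rules are assumptions chosen for the example; they are not necessarily the entries that R Studio or usethis writes.

```shell
# Work in a throwaway directory so no real project is touched
repo="$(mktemp -d)"
cd "$repo"
git init -q

# Hypothetical ignore rules: an R history file and the R Studio user folder
printf '%s\n' '.Rhistory' '.Rproj.user/' > .gitignore

# Create one file that matches a rule and one that does not
touch .Rhistory analysis.R

# check-ignore prints only the paths that match a rule in .gitignore
ignored="$(git check-ignore .Rhistory analysis.R)"
echo "Ignored: $ignored"

# Ignored files are left out of the short status listing
status="$(git status --short)"
echo "$status"
```

Only analysis.R and .gitignore itself should be listed as untracked; the history file stays invisible to Git.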
Note
"Git uses the .git folder to store all the information about the project, including the tracked files and sub-directories located within the project’s directory. If we ever delete the .git subdirectory, we will lose the project’s history." -- https://swcarpentry.github.io/git-novice/03-create.html.
After creating the project, start a new R script and name it git_in_rstudio.R.
If the "Create a git repository" option was not available when creating the project, R Studio does not know the path to Git on the computer. However, users can still create the project.
To resolve this issue, go to the command line and type one of the following to find the path to Git.
For Macs:
which git
For instance, on the instructor's local computer, the path is /opt/homebrew/bin/git.
For Windows:
where git
Then, go to "Tools" in the R Studio menu, select "Global Options", and choose "Git/SVN" to specify the path to the Git executable.
After the path to the Git executable has been set, navigate to "Tools" in the R Studio menu and select "Version Control" and then "Project Setup".
In the subsequent dialogue box, choose "Git/SVN" from the left side menu. Then select Git as the version control system.
Hit "Yes" to confirm the set up of a new Git repository.
Users can perform versioning tasks such as commit, push, pull, or view history for individual files from the toolbar.
Project-wide versioning tasks can be performed in R Studio's Environment pane by clicking on the Git tab. Here, the .gitignore and .Rproj files are visible along with scripts.
Setting up Version Control for R Studio Project (usethis)
Because this class is focused on using Git with R Studio, the usethis package offers a more fitting way to set up version control than the menu steps above. Go to the console in R Studio and load the usethis package. If it is not installed, run install.packages("usethis") first.
library(usethis)
Next, find out what the use_git command does.
?use_git
In the R Studio Help pane, the following information about use_git will be shown.
Initialise a git repository
Description
use_git() initialises a Git repository and adds important files to .gitignore. If user consents, it also makes an initial commit.
Usage
use_git(message = "Initial commit")
Arguments
message
Message to use for first commit.
use_git()
Users will be asked whether it is okay to make an initial commit with a commit message; select no for now. Then, select yes to restart R Studio and complete setting up the Git repository in the project.
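For readers who also work on the command line, the first thing use_git() does corresponds roughly to git init. The sketch below runs in a temporary directory and only shows the repository being created; it does not reproduce the .gitignore entries or the prompts that usethis adds.

```shell
# Throwaway directory standing in for the project folder
repo="$(mktemp -d)"
cd "$repo"

# Initialise an empty repository; Git creates the hidden .git folder
git init -q

# The .git folder is where the project's history will be stored
ls -d .git

# Ask Git whether the current directory is inside a repository
inside="$(git rev-parse --is-inside-work-tree)"
echo "Inside a work tree: $inside"
```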
Summary
The above sections demonstrated the first steps in versioning using Git in R Studio: creating an R Studio project and initiating a Git repository. The term repository simply describes a collection of files whose history is tracked by Git.
Configuring Git
It is important to set some configurations for Git so that it can keep track of who made changes and how to contact them in case questions arise. The use_git_config command from usethis allows for setting these configurations.
?use_git_config
Configure Git
Description
Sets Git options, for either the user or the project ("global" or "local", in Git terminology). Wraps gert::git_config_set() and gert::git_config_global_set(). To inspect Git config, see gert::git_config().
Usage
use_git_config(scope = c("user", "project"), ...)
Arguments
scope
Edit globally for the current user, or locally for the current project
...
Name-value pairs, processed as <dynamic-dots>.
Value
Invisibly, the previous values of the modified components, as a named list.
See Also
Other git helpers: use_git(), use_git_hook(), use_git_ignore()
Examples
## Not run:
# set the user's global user.name and user.email
use_git_config(user.name = "Jane", user.email = "jane@example.org")
The use_git_config call below adds the user's name and email to the configurations.
use_git_config(user.name="first.last", user.email="user@nih.gov")
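The same two settings can be written with the git config command. The sketch below is an approximation under one stated assumption: it uses --local so that only a throwaway repository is modified, whereas the help text above lists "user" (global) first among the scope choices of use_git_config. Swapping --local for --global would apply the values to every repository of the current user.

```shell
# Throwaway repository so real settings are left alone
repo="$(mktemp -d)"
cd "$repo"
git init -q

# --local writes to this repository's .git/config only
git config --local user.name "first.last"
git config --local user.email "user@nih.gov"

# Read the values back to confirm they were stored
name="$(git config --local --get user.name)"
email="$(git config --local --get user.email)"
echo "$name <$email>"
```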
Staging and Committing Changes
Open the blank git_in_rstudio.R script and check the boxes under the "Staged" column next to git_in_rstudio.R and git_in_rstudio.Rproj to stage these two files for commit. Once the files have been staged, users will see the status icon change from "?" to "A", where "A" stands for added.
Note
"The staging area is composed of file(s) that Git should track the history of (but no history has been saved at that point)" --https://bioinformatics.ccr.cancer.gov/docs/btep-coding-club/CC2024/version_control_git_cli/version_control_git_cli/.
Next, hit the "Commit" button to open the review changes window. Committing is the process of recording the staged changes in the repository's history. Be sure to include an informative commit message at this step, then hit the "Commit" button in that window when ready.
After the commit, users will see a message indicating the number of files that were changed. Close this window when ready to review the changes.
In the review changes window, users can see the history, commit message, date and time of the commit, as well as the author who made the commit. The number labeled "SHA" is the commit ID. Note that in the subject column, there is the word "HEAD", which refers to "the most recent commit to the current checkout branch" (source: https://www.geeksforgeeks.org/git-head/).
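The staging and committing steps above map onto git add and git commit. The following is a minimal command-line sketch in a temporary repository; the file contents and the commit message are placeholders, and the identity is set locally only so the commit succeeds in a fresh environment.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config --local user.name "first.last"
git config --local user.email "user@nih.gov"

# Create the two files staged in the walkthrough (contents are placeholders)
touch git_in_rstudio.R
echo "Version: 1.0" > git_in_rstudio.Rproj

# Untracked files are listed with "??" (R Studio shows "?")
git status --short

# Stage both files; their status becomes "A" (added)
git add git_in_rstudio.R git_in_rstudio.Rproj
staged="$(git status --short)"
echo "$staged"

# Record the staged snapshot with an informative message
git commit -q -m "Add blank script and project file"

# HEAD now points at the new commit; rev-parse prints its SHA
sha="$(git rev-parse HEAD)"
git log --oneline
```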
Tracking History
Add the following comment line to git_in_rstudio.R. Hit save afterwards and notice that the script now has the "M" (modified) status. Check the "Staged" box to add these changes for commit. Remember to add a meaningful commit message.
# This script demonstrates versioning using Git in R Studio.
In the review changes window, users can:
- See that HEAD is now pointing to the most current commit along with the commit message.
- See the lines that were added (highlighted in green).
- View the file at any commit by clicking on "View file @ ########", where "########" is the ID that corresponds to a commit.
Next, add the following and then commit.
# Load data
data(mtcars)
After committing the above changes, users will see that the code added in the current version is highlighted in green, while code removed from the previous version is highlighted in red. Next, add the following and commit.
# Take a look at mtcars
View(mtcars)
Lines that are deleted in the current commit are highlighted in red. For instance, add the following to git_in_rstudio.R.
# Load tidyverse
library(tidyverse)
# Get average mpg for each cylinder
mtcars %>% group_by(cyl) %>% summarise(mean_mpg=mean(mpg))
Then, delete the following and commit again.
# Take a look at mtcars
View(mtcars)
Finally, add the code below to generate a box plot showing the distribution of mpg by number of cylinders. Commit when done.
# Look at distribution of mpg by cylinder using box plot
ggplot(mtcars, aes(x=as.factor(cyl), y=mpg))+geom_boxplot()+xlab("number of cylinders")
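The history views used in this section have command-line counterparts. The sketch below rebuilds the first two commits of the walkthrough in a temporary repository and then inspects them; the commit messages are made up for the example. git log lists the commits, git diff shows added and removed lines, and git show with a commit:path argument prints a file as it was at that commit, similar to "View file @".

```shell
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config --local user.name "first.last"
git config --local user.email "user@nih.gov"

# First commit: the comment line only
echo "# This script demonstrates versioning using Git in R Studio." > git_in_rstudio.R
git add git_in_rstudio.R
git commit -q -m "Add description comment"

# Second commit: load the data
printf '%s\n' "# Load data" "data(mtcars)" >> git_in_rstudio.R
git add git_in_rstudio.R
git commit -q -m "Load mtcars"

# One line per commit that touched the file, newest first
git log --oneline -- git_in_rstudio.R

# Lines added (+) or removed (-) between the previous commit and HEAD
changes="$(git diff HEAD~1 HEAD -- git_in_rstudio.R)"
echo "$changes"

# Print the file as it was one commit ago
old="$(git show HEAD~1:git_in_rstudio.R)"
echo "$old"
```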
Sharing Code on GitHub
Suppose that calculating the average mpg by cylinder and obtaining a box plot of the mpg distribution by cylinder satisfies the goal of the analysis. Users can then share this code on GitHub.
Setting GitHub Credentials
If users have not done so already, create a GitHub token using the command below from the usethis package.
create_github_token()
Tokens can also be generated from the GitHub website. See https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/managing-your-personal-access-tokens to learn how. Once a token is available, store it using gitcreds_set() from the gitcreds package.
library(gitcreds)
gitcreds_set()
Pushing Code to GitHub
Then, to push to GitHub, use the following from usethis.
use_github()
Deleting GitHub Credentials
For Macs, once the GitHub credentials have been set using gitcreds_set(), the token is stored in the keychain. Use gitcreds_delete() to remove the stored GitHub credentials.
Push Code to GitHub using GUI
Users can also use the green "Push" button to send code to GitHub.
Add the following code then save and commit the changes.
# Relationship between horsepower and quarter mile time
ggplot(mtcars, aes(x=hp,y=qsec))+geom_point()
Click on the green "Push" button in the Git tab of the Environment pane to send the updated git_in_rstudio.R script to GitHub.
Mac users will have to enter their computer password when prompted.
The message shown in the screen capture below will appear when the code has been successfully pushed to GitHub.
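Behind the "Push" button, R Studio runs git push against the remote named origin. The sketch below shows the mechanics without needing a GitHub account: as an assumption for illustration, a local bare repository stands in for the GitHub repository, so no token or network access is involved. With a real GitHub repository, the remote would be its URL instead of a local path.

```shell
work="$(mktemp -d)"
remote="$(mktemp -d)"

# A bare repository plays the role of the GitHub repository in this sketch
git init -q --bare "$remote"

cd "$work"
git init -q
git config --local user.name "first.last"
git config --local user.email "user@nih.gov"
echo "data(mtcars)" > git_in_rstudio.R
git add git_in_rstudio.R
git commit -q -m "Load mtcars"

# Register the remote under the conventional name "origin"
git remote add origin "$remote"

# Push the current branch and remember origin as its upstream
branch="$(git symbolic-ref --short HEAD)"
git push -q -u origin "$branch"

# The remote should now hold the same commit as the local branch
local_sha="$(git rev-parse HEAD)"
remote_sha="$(git -C "$remote" rev-parse "$branch")"
echo "local:  $local_sha"
echo "remote: $remote_sha"
```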
Delete a File from Tracking
Create a script called git_in_rstudio1.R and add the following. Save, stage, commit, and push to GitHub.
data("ToothGrowth")
library(tidyverse)
ggplot(ToothGrowth, aes(x=as.factor(dose), y=len, color=supp))+geom_boxplot()
Suppose that the user wants to remove git_in_rstudio1.R from tracking. Check the box next to the file in the "Files" pane and hit "Delete". Notice that a red "D" appears next to the deleted file in the Git pane.
Check the box in the "Staged" column to stage git_in_rstudio1.R for commit and then go ahead and commit the changes.
A message will appear saying that git_in_rstudio1.R has been removed. Make sure that the git_in_rstudio1.R script was removed as well.
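On the command line, the delete-then-stage sequence above can be done in one step with git rm. This is a sketch in a temporary repository, with placeholder commit messages.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config --local user.name "first.last"
git config --local user.email "user@nih.gov"

# Commit the script so that Git is tracking it
echo 'data("ToothGrowth")' > git_in_rstudio1.R
git add git_in_rstudio1.R
git commit -q -m "Add ToothGrowth script"

# git rm deletes the file from disk and stages the deletion in one step
git rm -q git_in_rstudio1.R
staged="$(git status --short)"
echo "$staged"

# Commit the deletion
git commit -q -m "Remove git_in_rstudio1.R"

# ls-files lists tracked files; the script is no longer among them
tracked="$(git ls-files)"
echo "Tracked files: $tracked"
```

To stop tracking a file while keeping it on disk, git rm --cached can be used in place of git rm.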
Restoring Deleted File
To restore git_in_rstudio1.R without tracking it, click on "History" in the Git pane and select the commit that contains the version of the file to restore. Click on "View file @" in the version comparison box to pull up the script.
Once the script is pulled up, click the save icon to save.
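A command-line approximation of this restore is to print the old version with git show and redirect it into a file. The sketch below assumes the deletion was the most recent commit, so HEAD~1 still contains the script; in a longer history the commit ID from the "History" view would be used instead.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config --local user.name "first.last"
git config --local user.email "user@nih.gov"

# Recreate the situation: the script is committed, then removed
echo 'data("ToothGrowth")' > git_in_rstudio1.R
git add git_in_rstudio1.R
git commit -q -m "Add ToothGrowth script"
git rm -q git_in_rstudio1.R
git commit -q -m "Remove git_in_rstudio1.R"

# HEAD~1 is the last commit that still contained the file.
# git show prints that version; the redirect recreates the file on disk.
git show HEAD~1:git_in_rstudio1.R > git_in_rstudio1.R

# The restored file is untracked ("??") until it is staged again
status="$(git status --short)"
echo "$status"
```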
Branches and Merging
# Get average quarter mile time by cylinder
mtcars %>% group_by(cyl) %>% summarise(mean_qsec=mean(qsec))
Click "Pull"
>>> /opt/homebrew/Cellar/git/2.49.0/bin/git pull --rebase
From https://github.com/JWrows2014/git_in_rstudio
* [new branch] development1 -> origin/development1
Already up to date.
git checkout development1 git_in_rstudio.R
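The last command above copies a single file from the development1 branch into the branch that is currently checked out. The sketch below reproduces that in a temporary repository so the effect can be seen in isolation. The commit messages are placeholders, and the "--" separator between the branch and the path is an addition that avoids ambiguity when a branch and a file share a name.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config --local user.name "first.last"
git config --local user.email "user@nih.gov"

# Starting point on the default branch
echo "data(mtcars)" > git_in_rstudio.R
git add git_in_rstudio.R
git commit -q -m "Load mtcars"
main_branch="$(git symbolic-ref --short HEAD)"

# Create and switch to a branch named development1, then commit a change
git checkout -q -b development1
echo "# Get average quarter mile time by cylinder" >> git_in_rstudio.R
git commit -q -am "Add quarter mile summary"

# Return to the original branch; the new line is absent there
git checkout -q "$main_branch"

# Copy one file from development1 into the current branch.
# The change is staged but not yet committed.
git checkout development1 -- git_in_rstudio.R
status="$(git status --short)"
echo "$status"
```

A full merge (git merge development1) would instead bring over every commit from the branch, not just this one file.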