Version control using Git

Content from this class is adapted from Software Carpentry's Version Control with Git lesson.

Learning objectives

This class will introduce Git as a version control system for files and code on a local computer. After this session, participants will be able to:

  • Understand the importance of version control
    • Describe version control
    • Provide rationale for using version control systems
  • Describe Git
  • Know how to access Git
    • Be aware of guides for installing Git on personal computer
    • Be aware of the availability of Git on Biowulf, the NIH high performance computing system
  • Define repository
  • Know the steps involved in version control, including
    • Creating a new repository
    • Understanding the difference between tracked and untracked files
    • Excluding files from being tracked
    • Staging files with changes
    • Committing changes and writing commit messages
    • Viewing commit logs
    • Comparing versions
    • Reverting to previous versions

Automated version control

Ultimately, we want to avoid this situation here:

Version control can be used to keep track of what one person did and when. Even if you aren’t collaborating with other people, keeping a record of what was changed, when, and why is extremely useful for all researchers if they ever need to come back to the project later on.

Version control is the lab notebook of the digital world: it’s what professionals use to keep track of what they’ve done and to collaborate with other people. Every large software development project relies on it, and most programmers use it for their small jobs as well. And it isn’t just for software: books, papers, small data sets, and anything that changes over time or needs to be shared can and should be stored in a version control system.

What is Git? What is a repository?

Git is a version control system originally used to enable developers to work collaboratively on large software projects. Git manages the evolution of a set of files (a repository) over time. It's not just for software: any file can be included in a repository. For example, you can use it to track changes in a data analysis project, including data files (.tsv, .csv), reports, figures, and scripts or source code.

Note

For simplicity, this class will demonstrate version control with Git on the command line using a plain text file.

Accessing Git

Git installation instructions for computers running Linux, MacOS, or Windows can be found at https://git-scm.com/book/en/v2/Getting-Started-Installing-Git. Windows users can also consider Git BASH as it can be used to ssh in to remote computers like Biowulf, the high performance computing cluster at NIH.

Note

This class demonstrates Git on a personal computer in the folder /Users/tillodc/teaching (where tillodc is the instructor's NIH username).

Setting up Git for the first time

On a command line, Git commands are written as git verb options, where verb is what we actually want to do and options is additional optional information which may be needed for the verb (or subcommand).

Before we start to use Git for the first time, it helps to configure some settings that will apply to all of a user's projects going forward (i.e., global settings).

Global settings are stored in ~/.gitconfig (Mac/Unix) or in the .gitconfig file in the user's folder under C:\Users (Windows).

Configure username

Because Git is used as a collaboration tool, it is important to set a username. This enables the project team to learn who made the changes. Here's the breakdown of the command below.

  • All Git commands begin with git
  • This is followed by a subcommand (i.e., config)
  • Next, there are options (this example sets the --global configuration)
  • The next part specifies to Git which configuration parameter should be changed (i.e., user.name)
  • Finally, enter the username (the user's first and last name would be informative)
git config --global user.name "username"

Configure email

Setting an email address lets the project team know how to contact the person who made a change if they want to discuss it.

git config --global user.email "useremail"

Tip

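If you plan to use GitHub, use the same email address that you used when you set up your GitHub account. If you elect to use a private email address with GitHub, use GitHub's no-reply address instead, which has the form username@users.noreply.github.com (or ID+username@users.noreply.github.com for newer accounts), replacing username with your GitHub one.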

Configure line endings

Caution

"As with other keys, when you hit Enter or ↵ or on Macs, Return on your keyboard, your computer encodes this input as a character. Different operating systems use different character(s) to represent the end of a line. (You may also hear these referred to as newlines or line breaks.) Because Git uses these characters to compare files, it may cause unexpected issues when editing a file on different machines. Though it is beyond the scope of this lesson, you can read more about this issue in the Pro Git book." -- https://swcarpentry.github.io/git-novice/02-setup.html

To set line ending configurations for Mac or Linux, use the following.

git config --global core.autocrlf input

For Windows, use the following.

git config --global core.autocrlf true
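
To see what these settings normalize, it can help to look at the raw bytes of a line ending. The sketch below is an added illustration rather than part of the original lesson; it assumes the standard printf, od, and wc utilities are available, and the sample text "one" is arbitrary.

```shell
# A line ending in LF only (Mac/Linux convention): the final byte is \n.
printf 'one\n' | od -c

# The same line ending in CRLF (Windows convention): \r comes before \n.
printf 'one\r\n' | od -c

# The CRLF version is exactly one byte longer than the LF version.
lf_bytes=$(printf 'one\n' | wc -c | tr -d ' ')
crlf_bytes=$(printf 'one\r\n' | wc -c | tr -d ' ')
echo "LF: $lf_bytes bytes, CRLF: $crlf_bytes bytes"
```

That extra byte per line is why the same text file can look "changed on every line" when it moves between operating systems without a line-ending setting.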

Configure default branch name

Definition

"Branches allow users to develop features, fix bugs, or safely experiment with new ideas in a contained area of your repository." -- GitHub

Any changes to a source file are associated with a branch. Configure the name of the branch created when you initialize any new repository. If you plan to use GitHub at some point, it's best to set the branch name to main.

git config --global init.defaultBranch main

Configure editor

It may be useful to configure the text editor used by Git (on many systems the default is vim). Today's instructor happens to prefer emacs:

git config --global core.editor "emacs"

Below is a list of other commonly used text editors:

Editor | Configuration command
Atom | git config --global core.editor "atom --wait"
nano | git config --global core.editor "nano -w"
BBEdit (Mac, with command line tools) | git config --global core.editor "bbedit -w"
Sublime Text (Mac) | git config --global core.editor "/Applications/Sublime\ Text.app/Contents/SharedSupport/bin/subl -n -w"
Sublime Text (Win, 32-bit install) | git config --global core.editor "'c:/program files (x86)/sublime text 3/sublime_text.exe' -w"
Sublime Text (Win, 64-bit install) | git config --global core.editor "'c:/program files/sublime text 3/sublime_text.exe' -w"
Notepad (Win) | git config --global core.editor "c:/Windows/System32/notepad.exe"
Notepad++ (Win, 32-bit install) | git config --global core.editor "'c:/program files (x86)/Notepad++/notepad++.exe' -multiInst -notabbar -nosession -noPlugin"
Notepad++ (Win, 64-bit install) | git config --global core.editor "'c:/program files/Notepad++/notepad++.exe' -multiInst -notabbar -nosession -noPlugin"
Kate (Linux) | git config --global core.editor "kate"
Gedit (Linux) | git config --global core.editor "gedit --wait --new-window"
Scratch (Linux) | git config --global core.editor "scratch-text-editor"
Emacs | git config --global core.editor "emacs"
Vim | git config --global core.editor "vim"
VS Code | git config --global core.editor "code --wait"

We can test if the editor has changed using the following command:

git config --global --edit

Viewing configs

All settings:

git config --list

System-wide settings:

git config --system --list

Global settings (user-specific):

git config --global --list

Local settings (repository-specific):

git config --local --list
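
To experiment with these commands without touching your global settings, one option is to work in a throwaway repository and use the --local scope. The following is a minimal sketch under that assumption; the folder comes from mktemp, and "Example User" and "example@example.org" are placeholder values, not real accounts.

```shell
# Create a temporary repository so global settings are left untouched.
demo=$(mktemp -d)
git init -q "$demo"
cd "$demo"

# Write repository-specific (local) values.
git config --local user.name "Example User"
git config --local user.email "example@example.org"

# Read single values back; inside this repository, local values
# take precedence over global ones.
git config --get user.name
git config --get user.email

# List the repository's local settings, including the two just written.
git config --local --list
```

Local values are stored in the repository's own .git/config file, so they disappear when the temporary folder is deleted.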

Get help with Git commands

Git documentation

Use man followed by the command of interest (i.e., git) to pull up a manual. Users will be able to page through the manual on the terminal. Hit q to exit the manual and return to the prompt.

man git 
GIT(1)                            Git Manual                            GIT(1)

NAME
       git - the stupid content tracker

SYNOPSIS
       git [-v | --version] [-h | --help] [-C <path>] [-c <name>=<value>]
           [--exec-path[=<path>]] [--html-path] [--man-path] [--info-path]
           [-p|--paginate|-P|--no-pager] [--no-replace-objects] [--bare]
           [--git-dir=<path>] [--work-tree=<path>] [--namespace=<name>]
           [--config-env=<name>=<envvar>] <command> [<args>]

DESCRIPTION
       Git is a fast, scalable, distributed revision control system with an
       unusually rich command set that provides both high-level operations and
       full access to internals.

Tip

git help git is another method for pulling up the Git manual.

Git also implements the --help or -h options for displaying help documents. For instance:

git --help

Users can also pull up a description of the Git subcommands.

git help -a
See `git help <command>` to read about a specific subcommand

Main Porcelain Commands
   add                     Add file contents to the index
   am                      Apply a series of patches from a mailbox
   archive                 Create an archive of files from a named tree
   bisect                  Use binary search to find the commit that introduced a bug
   branch                  List, create, or delete branches

Extremely useful is the ability to view a glossary of Git terms.

git help glossary
GITGLOSSARY(7)                       Git Manual                       GITGLOSSARY(7)

NAME
       gitglossary - A Git Glossary

Note

Some of the methods for viewing help just pull up the relevant section in the Git manual.

Help with a subcommand

git subcommand -h

OR

git subcommand --help

For example

git config -h
git config --help

Note

--help will print the manual for a specific subcommand

General version control workflow using Git

The general process of version control using Git involves creating a repository that stores the history, or snapshots, of the file(s) in a project. The staging area holds the changes that should go into the next snapshot (no history has been saved at that point). Once users are satisfied with a set of staged changes, they save them as a snapshot in a process called committing.

Tip

See https://gist.github.com/luismts/495d982e8c5b1a0ced4a57cf3d93cf60 for best practices on when to create a snapshot of changes (ie. committing).

Create new Git repository

Use git init followed by the folder name to initiate a new Git project. For instance, this exercise will create a new Git project called planets, which will provide information about planets (as used in the Software Carpentry version control class, https://swcarpentry.github.io/git-novice/03-create.html).

git init planets

Using ls -l to list directory content will reveal a new folder called planets. It is possible to initiate a git repository for an existing folder.

ls -l
drwxr-xr-x  3 wuz8  NIH\Domain Users  96 Mar 25 21:58 planets

Change into the folder planets and do a long listing of all content.

cd planets
ls -al
drwxr-xr-x  3 wuz8  NIH\Domain Users   96 Mar 25 21:58 .
drwxr-xr-x  3 wuz8  NIH\Domain Users   96 Mar 25 21:58 ..
drwxr-xr-x  9 wuz8  NIH\Domain Users  288 Mar 25 21:58 .git

A folder called .git was created upon creation of this repository.

Definition

"Git uses the .git folder to store all the information about the project, including the tracked files and sub-directories located within the project’s directory. If we ever delete the .git subdirectory, we will lose the project’s history." -- https://swcarpentry.github.io/git-novice/03-create.html.

Tip

Do not create a repository inside a repository.
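
As noted earlier, a repository can also be initiated for an existing folder. A hedged sketch of doing that safely is shown below: it first checks whether the folder is already inside a repository, using git rev-parse --is-inside-work-tree, and only then runs git init. The temporary folder and the file notes.txt are hypothetical stand-ins for an existing project.

```shell
# Stand-in for an existing project folder that already contains a file.
folder=$(mktemp -d)
touch "$folder/notes.txt"
cd "$folder"

# rev-parse exits with an error when run outside any repository,
# so the else branch runs only when it is safe to initialize.
if git rev-parse --is-inside-work-tree >/dev/null 2>&1; then
    echo "Already inside a repository; not running git init."
else
    git init -q
fi

# Either way, the folder is now part of a repository.
git rev-parse --is-inside-work-tree   # prints: true
```

Running git init with no folder name initializes the current directory, which is how an existing project is placed under version control.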

Get status of a repository

Because the planets repository was only just created, the command below tells us that there are no commits yet and that there is nothing to commit, since no files have been created and added to the staging area.

git status
On branch main

No commits yet

nothing to commit (create/copy files and use "git add" to track)

Tracking changes

To start learning how to track changes using Git, a text file called mars.txt will be created in the directory /Users/tillodc/teaching/planets. This file will contain notes about the planet Mars. Note that Git will be able to track plain text files (including CSV) and scripts.

nano mars.txt

Type the following in to mars.txt and then save and exit the editor.

Cold and dry, but everything is my favorite color.

cat mars.txt
Cold and dry, but everything is my favorite color.

Git will tell users that there is an untracked file when viewing the repository status. At this stage, Git does not know to track mars.txt yet.

git status
On branch main

No commits yet

Untracked files:
  (use "git add <file>..." to include in what will be committed)
    mars.txt

nothing added to commit but untracked files present (use "git add" to track)

Use git add to tell Git to track this file (i.e., place it in the staging area). Note that a snapshot of this file has not yet been created; Git just knows to track it.

git add mars.txt

Now, git status will reveal that mars.txt has been staged but not yet committed (i.e., there is no snapshot in its revision history). To create this snapshot, use git commit. Include the -m option in git commit to write a commit message, which will inform the user, the user's future self, and collaborators what the commit was about.

git status
On branch main

No commits yet

Changes to be committed:
  (use "git rm --cached <file>..." to unstage)
    new file:   mars.txt

git commit -m "Start notes on Mars as a base."
[main (root-commit) 27d9e09] Start notes on Mars as a base.
 1 file changed, 1 insertion(+)
 create mode 100644 mars.txt

Note

"When we run git commit, Git takes everything we have told it to save by using git add and stores a copy permanently inside the special .git directory. This permanent copy is called a commit (or revision)..." -- https://swcarpentry.github.io/git-novice/04-changes.html

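The output of git commit contains some important information: the branch (main), a marker that this is the first commit (root-commit), the short commit identifier (27d9e09), the commit message, and a summary of what changed (1 file changed, 1 insertion).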

After committing, git status will indicate that everything is up-to-date.

git status
On branch main
nothing to commit, working tree clean
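
The untracked, staged, and committed states described above can be watched in compact form with git status --short. The following is a minimal sketch that repeats the steps in a temporary repository; the folder and the identity "Example User" / "example@example.org" are placeholders set locally so that global settings are not affected.

```shell
# Throwaway repository with a local (repository-only) identity.
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
git config user.name "Example User"
git config user.email "example@example.org"

echo "Cold and dry, but everything is my favorite color." > mars.txt
git status --short    # ?? mars.txt   (untracked)

git add mars.txt
git status --short    # A  mars.txt   (staged, not yet committed)

git commit -q -m "Start notes on Mars as a base."
git status --short    # prints nothing: the working tree is clean
```

The two-character codes are a short form of the same information that the full git status output gives.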

Definition

"The working tree is the set of all files and folders a developer can add, edit, rename and delete during application development. More colloquially, developers often refer to the Git working tree as the workspace or the working directory. But the technical name for the collection of files and folders in a repository is the Git working tree." -- TheServerSide

Note

"The phrase working tree clean means that your working tree (meaning your directory) is clean, i.e. the files in your directory exactly match the files in the last saved snapshot version in git." -- https://chryswoods.com/introducing_git/committing.html

Definition

git log lists all commits made to a repository in reverse chronological order. The listing for each commit includes the commit’s full identifier (which starts with the same characters as the short identifier printed by the git commit command earlier), the commit’s author, when it was created, and the log message Git was given when the commit was created. -- https://swcarpentry.github.io/git-novice/04-changes.html

The command git log prints details regarding each commit for a project.

git log

Output for git log includes:

  • Full commit ID and branch in which the commit was made (ie. main) in the first line
  • The person who made the commit and email in the second line
  • Date and time in which the commit was made in the third line
  • Commit message in the fourth line

Definition

HEAD: "The most recent commit to the current checkout branch is indicated by the HEAD" -- https://www.geeksforgeeks.org/git-head/

commit 27d9e095e9144047f8a81063061be349e323825f (HEAD -> main)
Author: Joe Wu <wuz8@nih.gov>
Date:   Thu Mar 28 15:43:38 2024 -0400

    Start notes on Mars as a base.

Add the following line to the file mars.txt using a text editor (for example, nano).

The two moons may be a problem for Wolfman.

git status indicates that the mars.txt file was changed but the modifications have not been committed.

git status
On branch main
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

    modified:   mars.txt

no changes added to commit (use "git add" and/or "git commit -a")

To compare between versions use git diff.

git diff

In the git diff output

  • The first line indicates which files are being compared (here, two versions of mars.txt)
    • The first file is denoted by "a" or "-"
    • The second file is denoted by "b" or "+"
    • Prior to committing changes, the default order of comparison is old, new (i.e., diff --git old new). Thus, in the first line of the git diff output, "a" or "-" references the old version of mars.txt while "b" or "+" references the new version
  • The second line contains the abbreviated hashes that Git uses to identify the two versions of the file being compared (i.e., 077071c and 42d92e3), again in old, new order, followed by the file mode (100644).
  • The third and fourth lines list the two files being compared.
  • The fifth line contains header information enclosed by "@@". The information in the header line, from left to right, is as follows:
    • -1:
      • the negative sign denotes the version of mars.txt corresponding to 077071c (also denoted as "a")
      • the "1" indicates that the section shown starts at line 1 of this version; because no count follows, the section spans a single line
    • +1,2:
      • the positive sign denotes the version of mars.txt corresponding to 42d92e3 (also denoted as "b")
      • the "1" indicates that the section shown starts at line 1 of this version
      • the "2" indicates that the section spans 2 lines in this version
  • Finally, the changed section of mars.txt is shown. A line starting with "+" was added, a line starting with "-" was deleted, and a line starting with a space is unchanged.

In a nutshell, git diff tells us what is being compared and what has changed (either added or deleted).

Note

See https://stackoverflow.com/questions/2529441/how-to-read-the-output-from-git-diff for an interpretation of the git diff output.

diff --git a/mars.txt b/mars.txt
index 077071c..42d92e3 100644
--- a/mars.txt
+++ b/mars.txt
@@ -1 +1,2 @@
 Cold and dry, but everything is my favorite color.
+The two moons may be a problem for Wolfman.

Add the revised mars.txt to the staging area and commit the changes.

git add mars.txt
git commit -m "Add concerns about effects of Mars' moons on Wolfman"

Add the following line to mars.txt.

But the Mummy will appreciate the lack of humidity.

Add the updated mars.txt file to the staging area.

git add mars.txt

As a result of staging, git diff will not generate any output. Instead use:

git diff --staged
diff --git a/mars.txt b/mars.txt
index 42d92e3..2a282ee 100644
--- a/mars.txt
+++ b/mars.txt
@@ -1,2 +1,3 @@
 Cold and dry, but everything is my favorite color.
 The two moons may be a problem for Wolfman.
+But the Mummy will appreciate the lack of humidity.
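
The difference between git diff and git diff --staged can be seen side by side by running both before and after git add. The sketch below does this in a temporary repository; the folder and the "Example User" identity are placeholders, and the file content is reused from this lesson.

```shell
# Throwaway repository with one committed version of mars.txt.
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
git config user.name "Example User"
git config user.email "example@example.org"
echo "Cold and dry, but everything is my favorite color." > mars.txt
git add mars.txt
git commit -q -m "Start notes on Mars as a base."

# Modify the file without staging the change.
echo "The two moons may be a problem for Wolfman." >> mars.txt

git diff             # shows the added line (working tree vs. staging area)
git diff --staged    # prints nothing: nothing has been staged yet

git add mars.txt

git diff             # now prints nothing
git diff --staged    # shows the added line (staging area vs. last commit)
```

In short, git diff answers "what have I changed but not staged?", while git diff --staged answers "what will go into the next commit?".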

Now, commit the changes.

git commit -m "Discuss concerns about Mars' climate for Mummy"
[main 1d60e81] Discuss concerns about Mars' climate for Mummy
 1 file changed, 1 insertion(+)

Examine a history of what was done.

git log
commit 1d60e81fe44f5d4d4e01d42f7461c3ff4c039131 (HEAD -> main)
Author: Joe Wu <wuz8@nih.gov>
Date:   Wed Apr 3 17:21:40 2024 -0400

    Discuss concerns about Mars' climate for Mummy

commit 5a54e5703bf5aaef0f9673191ceb59c6b9cc3ec7
Author: Joe Wu <wuz8@nih.gov>
Date:   Wed Apr 3 17:17:42 2024 -0400

    Add concerns about effects of Mars' moons on Wolfman

commit 27d9e095e9144047f8a81063061be349e323825f
Author: Joe Wu <wuz8@nih.gov>
Date:   Thu Mar 28 15:43:38 2024 -0400

    Start notes on Mars as a base.
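
For longer histories, the amount of detail printed by git log can be adjusted. The following is a small sketch, again in a temporary repository with a placeholder identity, showing three variants: --oneline, -1 to limit the output to the most recent commit, and --format with the placeholders %h (short ID), %an (author name), and %s (message subject).

```shell
# Throwaway repository with two commits.
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
git config user.name "Example User"
git config user.email "example@example.org"
echo "Cold and dry, but everything is my favorite color." > mars.txt
git add mars.txt
git commit -q -m "Start notes on Mars as a base."
echo "The two moons may be a problem for Wolfman." >> mars.txt
git add mars.txt
git commit -q -m "Add concerns about effects of Mars' moons on Wolfman"

# One line per commit: short ID followed by the commit message.
git log --oneline

# Only the most recent commit, in the full format.
git log -1

# A custom layout: short ID, author name, and message subject.
git log --format='%h | %an | %s'
```

The commit IDs printed will differ from run to run, because they depend on the commit time as well as the content.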

Tip

Use git log --oneline to compress the output to one line per commit.

Note

Git does not track directories but it tracks files in directories.

Add the following to mars.txt.

An ill-considered change.

Then compare the working copy of mars.txt with the most recent commit (HEAD):

git diff HEAD mars.txt
diff --git a/mars.txt b/mars.txt
index 2a282ee..1fe3425 100644
--- a/mars.txt
+++ b/mars.txt
@@ -1,3 +1,4 @@
 Cold and dry, but everything is my favorite color.
 The two moons may be a problem for Wolfman.
 But the Mummy will appreciate the lack of humidity.
+An ill-considered change.

Because nothing is staged and mars.txt is the only modified file, git diff mars.txt and git diff HEAD produce the same result as git diff HEAD mars.txt here.

Append "~" followed by a number to HEAD to compare against a commit that many steps back. A ~1 indicates the commit previous to HEAD.

git diff HEAD~1 mars.txt

For two commits previous to HEAD, use ~2.

git diff HEAD~2 mars.txt
diff --git a/mars.txt b/mars.txt
index 077071c..1fe3425 100644
--- a/mars.txt
+++ b/mars.txt
@@ -1 +1,4 @@
 Cold and dry, but everything is my favorite color.
+The two moons may be a problem for Wolfman.
+But the Mummy will appreciate the lack of humidity.
+An ill-considered change.

Reverting to previous version

The last line, "An ill-considered change." added to mars.txt has not been staged for commit yet although git status will indicate that mars.txt has been modified.

git status
On branch main
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
    modified:   mars.txt

no changes added to commit (use "git add" and/or "git commit -a")

To go back to the version of mars.txt prior to adding "An ill-considered change.", use the following.

git checkout HEAD mars.txt
cat mars.txt
Cold and dry, but everything is my favorite color.
The two moons may be a problem for Wolfman.
But the Mummy will appreciate the lack of humidity.

Definition

"git checkout checks out (i.e., restores) an old version of a file" -- https://swcarpentry.github.io/git-novice/05-history.html

Tip

To revert to a specific version, supply the commit ID to git checkout.

git checkout 27d9e09 mars.txt             
cat mars.txt
Cold and dry, but everything is my favorite color.

The modifications from checking out the commit ID corresponding to version 1 of mars.txt are in the staging area, but not committed.

git status
On branch main
Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
    modified:   mars.txt

To go back to the most recent committed version of mars.txt:

git checkout HEAD mars.txt
cat mars.txt
Cold and dry, but everything is my favorite color.
The two moons may be a problem for Wolfman.
But the Mummy will appreciate the lack of humidity.
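
The discard step can be rehearsed safely in a temporary repository before trying it on real work. The following is a minimal sketch under that assumption, with a placeholder identity; it writes the command as git checkout HEAD -- mars.txt, where the optional -- separates the commit from the file name.

```shell
# Throwaway repository with one committed version of mars.txt.
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
git config user.name "Example User"
git config user.email "example@example.org"
echo "Cold and dry, but everything is my favorite color." > mars.txt
git add mars.txt
git commit -q -m "Start notes on Mars as a base."

# Make a change that we will regret.
echo "An ill-considered change." >> mars.txt
git status --short               #  M mars.txt  (modified, not staged)

# Restore the file from the most recent commit.
git checkout HEAD -- mars.txt
git status --short               # prints nothing: working tree clean
cat mars.txt
```

Newer versions of Git suggest git restore for the same purpose, as the hint in the git status output above shows. Either way, the discarded change cannot be recovered from Git because it was never committed.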

Ignoring things (optional)

Sometimes you don't want Git to track certain files (e.g. backup files created by your text editor, large data files, intermediate analysis files). This can be achieved using a special configuration file called .gitignore.

Let's add some dummy files:

mkdir results 
touch a.csv b.csv c.csv results/a.out results/b.out

and see what Git says:

git status
On branch main
Untracked files:
  (use "git add <file>..." to include in what will be committed)

    a.csv
    b.csv
    c.csv
    results/

nothing added to commit but untracked files present (use "git add" to track)

Putting these files under version control would be a waste of disk space. What's worse, having them all listed could distract us from changes that actually matter, so let's tell Git to ignore them.

We do this by creating a file in the root directory of our project called .gitignore:

nano .gitignore

Add the following text to the .gitignore file

*.csv
results/

These patterns tell Git to ignore any file whose name ends in .csv and everything in the results directory. (If any of these files were already being tracked, Git would continue to track them.)

Once we have created this file, the output of git status is much cleaner:

git status
On branch main
Untracked files:
  (use "git add <file>..." to include in what will be committed)

    .gitignore

nothing added to commit but untracked files present (use "git add" to track)

The only thing Git notices now is the newly-created .gitignore file. You might think we wouldn't want to track it, but everyone we're sharing our repository with will probably want to ignore the same things that we're ignoring. Let's add and commit .gitignore:

git add .gitignore
git commit -m "Ignore data files and the results folder"

git status

On branch main
nothing to commit, working tree clean

As a bonus, using .gitignore helps us avoid accidentally adding files to the repository that we don't want to track:

git add a.csv
The following paths are ignored by one of your .gitignore files:
a.csv
Use -f if you really want to add them.

If we really want to override our ignore settings, we can use git add -f to force Git to add something. For example, git add -f a.csv. We can also always see the status of ignored files if we want:

git status --ignored
On branch main
Ignored files:
 (use "git add -f <file>..." to include in what will be committed)

        a.csv
        b.csv
        c.csv
        results/

nothing to commit, working tree clean
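
When it is unclear why a file is being ignored, git check-ignore can report the responsible pattern. The sketch below recreates the .gitignore from this section in a temporary repository; the folder is a throwaway, and the -v and -q options are standard options of git check-ignore.

```shell
# Throwaway repository with the same ignore patterns as above.
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
mkdir results
touch a.csv mars.txt results/a.out
printf '*.csv\nresults/\n' > .gitignore

# -v reports the file, line number, and pattern that matched each path.
git check-ignore -v a.csv results/a.out

# -q prints nothing; the exit status is 0 if the path is ignored, 1 if not.
if git check-ignore -q mars.txt; then
    echo "mars.txt is ignored"
else
    echo "mars.txt is not ignored"
fi
```

This is particularly handy when several .gitignore files, or a global one, are in play.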

Tip

You can set up a global .gitignore file to be used in all of your projects. You will need to set your global config to point to it: git config --global core.excludesFile '~/.gitignore'

Remotes in Git (optional)

Note

Before you start: this section assumes you have set up an SSH private/public key pair to work with GitHub. See the instructions here.

Now that we've finished work on our project locally, we'd like to share it with our collaborators / the world. To this end we are going to create a remote repository that will be linked to our local repository.

1. Create a remote repository

Log in to GitHub, then click on the icon in the top right corner to create a new repository called planets:

[Figure: Creating a Repository on GitHub (Step 1)]

Name your repository "planets" and then click "Create Repository".

Note: Since this repository will be connected to a local repository, it needs to be empty. Leave "Initialize this repository with a README" unchecked, and keep "None" as options for both "Add .gitignore" and "Add a license." See the "GitHub License and README files" exercise in the Software Carpentry lesson for a full explanation of why the repository needs to be empty.

[Figure: Creating a Repository on GitHub (Step 2)]

As soon as the repository is created, GitHub displays a page with a URL and some information on how to configure your local repository:

[Figure: Creating a Repository on GitHub (Step 3)]

This effectively does the following on GitHub's servers:

mkdir planets
cd planets
git init

2. Connect local to remote repository

Now we connect the two repositories. We do this by making the GitHub repository a remote for the local repository. The home page of the repository on GitHub includes the URL string we need to identify it:

[Figure: Where to Find Repository URL on GitHub]

Definition

A remote (of a repository): A version control repository connected to another, in such a way that both can be kept in sync by exchanging commits. -- https://swcarpentry.github.io/git-novice/reference.html#remote

Click on the 'SSH' link to change the protocol from HTTPS to SSH.

[Figure: Changing the Repository URL on GitHub]

Copy that URL from the browser, go into the local planets repository, and run this command:

git remote add origin git@github.com:desireetillo/planets.git

origin is a local name used to refer to the remote repository. It could be called anything, but origin is a convention that is often used by default in git and GitHub, so it's helpful to stick with this unless there's a reason not to.

We can check that the command has worked by running git remote -v:

git remote -v

and we should see the following output:

origin   git@github.com:desireetillo/planets.git (fetch)
origin   git@github.com:desireetillo/planets.git (push)

3. Push local changes to a remote

Now that authentication is set up, we can return to the remote. This command will push the changes from our local repository to the repository on GitHub:

git push origin main

If you set up a passphrase with your SSH key, you will be prompted for it.

Enumerating objects: 16, done.
Counting objects: 100% (16/16), done.
Delta compression using up to 8 threads.
Compressing objects: 100% (11/11), done.
Writing objects: 100% (16/16), 1.45 KiB | 372.00 KiB/s, done.
Total 16 (delta 2), reused 0 (delta 0)
remote: Resolving deltas: 100% (2/2), done.
To https://github.com/desireetillo/planets.git
 * [new branch]      main -> main
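
The remote workflow can also be practiced without a GitHub account or SSH keys by using a bare repository on the local disk as a stand-in for the remote. This is only a sketch under that assumption: the folders are temporary, the identity is a placeholder, and a real GitHub remote would use a URL such as the one shown earlier instead of a local path.

```shell
# A bare repository (no working files) plays the role of the GitHub remote.
base=$(mktemp -d)
git init -q --bare "$base/remote.git"

# A regular local repository with one commit.
git init -q "$base/planets"
cd "$base/planets"
git config user.name "Example User"
git config user.email "example@example.org"
echo "Cold and dry, but everything is my favorite color." > mars.txt
git add mars.txt
git commit -q -m "Start notes on Mars as a base."

# Link the local repository to the stand-in remote and confirm the link.
git remote add origin "$base/remote.git"
git remote -v

# Push the current branch, whatever its configured name is.
branch=$(git symbolic-ref --short HEAD)
git push -q origin "$branch"

# List the branches that now exist on the remote.
git ls-remote --heads origin
```

After the push, the commit ID listed by git ls-remote matches the local HEAD, which is the sense in which the two repositories are synchronized.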

We can pull changes from the remote repository to the local one as well:

git pull origin main
From https://github.com/desireetillo/planets
 * branch            main     -> FETCH_HEAD
Already up-to-date.

Pulling has no effect in this case because the two repositories are already synchronized. If someone else had pushed some changes to the repository on GitHub, or if you had added a file on GitHub (e.g. a LICENSE or a README file), the above command would download them to our local repository.

Further Reading

Git resources

NIH-specific talks on more advanced Git subjects

Using Git in Rstudio

Using Git in VScode

Git clients