Version control using Git
Content from this class is adapted from Software Carpentry's Version Control with Git
Learning objectives
This class will introduce Git as a version control system for files and code on a local computer. After this session, participants will be able to:
- Understand the importance of version control
- Describe version control
- Provide rationale for using version control systems
- Describe Git
- Know how to access Git
- Be aware of guides for installing Git on personal computer
- Be aware of the availability of Git on Biowulf, the NIH high performance computing system
- Define repository
- Know the steps involved in version control, including
- Creating a new repository
- Understanding the difference between tracked and untracked files
- Excluding files from being tracked
- Staging files with changes
- Committing changes and writing commit messages
- Viewing commit logs
- Comparing versions
- Reverting to previous versions
Automated version control
Ultimately, we want to avoid this situation here:
Version control can be used to keep track of what one person did and when. Even if you aren’t collaborating with other people, keeping a record of what was changed, when, and why is extremely useful for all researchers if they ever need to come back to the project later on.
Version control is the lab notebook of the digital world: it’s what professionals use to keep track of what they’ve done and to collaborate with other people. Every large software development project relies on it, and most programmers use it for their small jobs as well. And it isn’t just for software: books, papers, small data sets, and anything that changes over time or needs to be shared can and should be stored in a version control system.
What is Git? What is a repository?
Git is a version control system originally used to enable developers to work collaboratively on large software projects. Git manages the evolution of a set of files (a repository) over time. It's not just for software: any file can be included in a repository. For example you can use it for tracking the changes of a data analysis project (data files (.tsv,.csv), reports, figures, and scripts/source code).
Note
For simplicity, this class will demonstrate version control with Git on the command line using a plain text file.
Accessing Git
Git installation instructions for computers running Linux, macOS, or Windows can be found at https://git-scm.com/book/en/v2/Getting-Started-Installing-Git. Windows users can also consider Git BASH, which can be used to ssh in to remote computers like Biowulf, the high performance computing cluster at NIH.
Note
This class demonstrates Git on a personal computer in the folder /Users/tillodc/teaching (where tillodc is the instructor's NIH username).
Setting up Git for the first time
On a command line, Git commands are written as git verb options, where verb is what we actually want to do and options is additional optional information which may be needed for the verb (or subcommand).
Before using Git for the first time, it helps to configure a few settings that apply to all of a user's projects going forward (i.e. global settings).
Global settings are stored in ~/.gitconfig (Mac/Unix) or in C:\Users\<username>\.gitconfig (Windows).
Configure username
Because Git is used as a collaboration tool, it is important to set a username. This enables the project team to learn who made the changes. Here's the breakdown of the command below.
- All Git commands begin with git
- This is followed by a subcommand (i.e. config)
- Next, there are options (this example sets the --global configuration)
- The next part specifies to Git which configuration parameter should be changed (i.e. user.name)
- Finally, enter the username (the user's first and last name would be informative)
git config --global user.name "username"
Configure email
Further, setting an email helps the project team know how to contact the person who made changes to the code in order to discuss them.
git config --global user.email "useremail"
Tip
If you plan to use GitHub, you should use the same email address as the one used when you set up your GitHub account. If you elect to use a private email address with GitHub, you can use username@users.noreply.github.com, replacing username with your GitHub one.
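To confirm that the two settings above were stored, the values can be read back with git config --get. The following is a minimal sketch rather than part of the class material: the name and email are hypothetical placeholders, and HOME is pointed at a throwaway directory (an assumption made here so that the example does not touch your real ~/.gitconfig).

```shell
set -e                                  # stop at the first failing command

# Assumption: a throwaway HOME keeps the real ~/.gitconfig untouched.
export HOME="$(mktemp -d)"

# Hypothetical identity; substitute your own name and email.
git config --global user.name "Ada Lovelace"
git config --global user.email "ada@example.com"

# Read the values back to confirm they were stored.
name="$(git config --global --get user.name)"
email="$(git config --global --get user.email)"
echo "$name <$email>"
```

When run against your real settings, only the last three commands are needed.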
Configure line endings
Caution
"As with other keys, when you hit Enter or ↵ or on Macs, Return on your keyboard, your computer encodes this input as a character. Different operating systems use different character(s) to represent the end of a line. (You may also hear these referred to as newlines or line breaks.) Because Git uses these characters to compare files, it may cause unexpected issues when editing a file on different machines. Though it is beyond the scope of this lesson, you can read more about this issue in the Pro Git book." -- https://swcarpentry.github.io/git-novice/02-setup.html
To set line ending configurations for Mac or Linux, use the following.
git config --global core.autocrlf input
For Windows, use the following.
git config --global core.autocrlf true
Configure default branch name
Definition
"Branches allow users to develop features, fix bugs, or safely experiment with new ideas in a contained area of your repository." -- GitHub
Any changes to a source file are associated with a branch. Configure the name of the branch created when you initialize any new repository. If you plan to use GitHub at some point, it's best to set the branch name to main.
git config --global init.defaultBranch main
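One way to check that the setting took effect is to create an empty repository and ask Git which branch it starts on. This is a sketch under two assumptions: Git 2.28 or newer (older versions ignore init.defaultBranch), and a throwaway HOME and repository directory created only for the demonstration.

```shell
set -e
export HOME="$(mktemp -d)"              # assumption: throwaway HOME for the demo

git config --global init.defaultBranch main

# Create an empty repository and ask which branch HEAD points to.
repo="$(mktemp -d)"
git init -q "$repo"
branch="$(git -C "$repo" symbolic-ref --short HEAD)"
echo "$branch"
```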
Configure editor
It may be useful to configure the text editor used by git (the default is vim). Today's instructor happens to prefer emacs:
git config --global core.editor "emacs"
Below is a list of other commonly used text editors:
Editor | Configuration command |
---|---|
Atom | $ git config --global core.editor "atom --wait" |
nano | $ git config --global core.editor "nano -w" |
BBEdit (Mac, with command line tools) | $ git config --global core.editor "bbedit -w" |
Sublime Text (Mac) | $ git config --global core.editor "/Applications/Sublime\ Text.app/Contents/SharedSupport/bin/subl -n -w" |
Sublime Text (Win, 32-bit install) | $ git config --global core.editor "'c:/program files (x86)/sublime text 3/sublime_text.exe' -w" |
Sublime Text (Win, 64-bit install) | $ git config --global core.editor "'c:/program files/sublime text 3/sublime_text.exe' -w" |
Notepad (Win) | $ git config --global core.editor "c:/Windows/System32/notepad.exe" |
Notepad++ (Win, 32-bit install) | $ git config --global core.editor "'c:/program files (x86)/Notepad++/notepad++.exe' -multiInst -notabbar -nosession -noPlugin" |
Notepad++ (Win, 64-bit install) | $ git config --global core.editor "'c:/program files/Notepad++/notepad++.exe' -multiInst -notabbar -nosession -noPlugin" |
Kate (Linux) | $ git config --global core.editor "kate" |
Gedit (Linux) | $ git config --global core.editor "gedit --wait --new-window" |
Scratch (Linux) | $ git config --global core.editor "scratch-text-editor" |
Emacs | $ git config --global core.editor "emacs" |
Vim | $ git config --global core.editor "vim" |
VS Code | $ git config --global core.editor "code --wait" |
We can test if the editor has changed using the following command:
git config --global --edit
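A non-interactive alternative is git var GIT_EDITOR, which prints the editor Git would launch without opening it. The sketch below assumes a throwaway HOME and uses the nano entry from the table above; note that an exported GIT_EDITOR environment variable takes precedence over core.editor, so it is unset first.

```shell
set -e
export HOME="$(mktemp -d)"              # assumption: throwaway HOME for the demo
unset GIT_EDITOR                        # an exported GIT_EDITOR would override core.editor

git config --global core.editor "nano -w"

# Print the editor Git would launch, without opening it.
editor="$(git var GIT_EDITOR)"
echo "$editor"
```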
Viewing configs
All settings:
git config --list
System-wide settings:
git config --system --list
Global settings (user-specific):
git config --global --list
Local settings (repository-specific):
git config --local --list
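When the same setting exists at more than one level, the repository-specific value wins over the global one. The sketch below illustrates this with the --show-origin option, which prefixes each value with the file it came from. The names are hypothetical, and the HOME and repository directories are throwaway locations assumed for the demonstration.

```shell
set -e
export HOME="$(mktemp -d)"              # assumption: throwaway HOME for the demo
git config --global user.name "Global Name"     # hypothetical value

repo="$(mktemp -d)"
git init -q "$repo"
cd "$repo"
git config --local user.name "Project Name"     # hypothetical value

# List every stored value together with the file it was read from.
git config --show-origin --get-all user.name

# The value Git actually uses inside this repository.
effective="$(git config --get user.name)"
echo "$effective"
```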
Get help with Git commands
Git documentation
Use man followed by the command of interest (i.e. git) to pull up a manual. Users will be able to page through the manual on the terminal. Hit q to exit the manual and return to the prompt.
man git
GIT(1) Git Manual GIT(1)
NAME
git - the stupid content tracker
SYNOPSIS
git [-v | --version] [-h | --help] [-C <path>] [-c <name>=<value>]
[--exec-path[=<path>]] [--html-path] [--man-path] [--info-path]
[-p|--paginate|-P|--no-pager] [--no-replace-objects] [--bare]
[--git-dir=<path>] [--work-tree=<path>] [--namespace=<name>]
[--config-env=<name>=<envvar>] <command> [<args>]
DESCRIPTION
Git is a fast, scalable, distributed revision control system with an
unusually rich command set that provides both high-level operations and
full access to internals.
Tip
git help git is another method for pulling up the Git manual.
Git also implements the --help or -h options for displaying help documents. For instance:
git --help
Users can also pull up a description of the Git subcommands.
git help -a
See `git help <command>` to read about a specific subcommand
Main Porcelain Commands
add Add file contents to the index
am Apply a series of patches from a mailbox
archive Create an archive of files from a named tree
bisect Use binary search to find the commit that introduced a bug
branch List, create, or delete branches
Extremely useful is the ability to view a glossary of Git terms.
git help glossary
GITGLOSSARY(7) Git Manual GITGLOSSARY(7)
NAME
gitglossary - A Git Glossary
Note
Some of the methods for viewing help just pull up the relevant section in the Git manual.
Help with a subcommand
git subcommand -h
OR
git subcommand --help
For example
git config -h
git config --help
Note
--help will open the full manual for a specific subcommand, while -h prints a brief usage summary in the terminal.
General version control workflow using Git
The general process of version control using Git involves creating a repository that stores the history, or snapshots, of the file(s) in a project. The staging area holds the changes that should go into the next snapshot (no history has been saved at that point). Once users are satisfied with a set of changes, these are written and saved into a snapshot in a process called committing.
Tip
See https://gist.github.com/luismts/495d982e8c5b1a0ced4a57cf3d93cf60 for best practices on when to create a snapshot of changes (ie. committing).
Create new Git repository
Use git init followed by the folder name to initiate a new Git project. For instance, this exercise will create a new Git project called planets, which will provide information about planets (as used in the Software Carpentry version control class, https://swcarpentry.github.io/git-novice/03-create.html).
git init planets
Using ls -l to list directory content will reveal a new folder called planets. It is also possible to initiate a Git repository in an existing folder by running git init inside it.
ls -l
drwxr-xr-x 3 wuz8 NIH\Domain Users 96 Mar 25 21:58 planets
Change into the folder planets and do a long listing of all content.
cd planets
ls -al
drwxr-xr-x 3 wuz8 NIH\Domain Users 96 Mar 25 21:58 .
drwxr-xr-x 3 wuz8 NIH\Domain Users 96 Mar 25 21:58 ..
drwxr-xr-x 9 wuz8 NIH\Domain Users 288 Mar 25 21:58 .git
A folder called .git was created upon creation of this repository.
Definition
"Git uses the .git folder to store all the information about the project, including the tracked files and sub-directories located within the project’s directory. If we ever delete the .git subdirectory, we will lose the project’s history." -- https://swcarpentry.github.io/git-novice/03-create.html.
Tip
Do not create a repository inside a repository.
Get status of a repository
Because the planets repository was just created, the command below tells us that there are no commits yet and that there is nothing to commit, as no file(s) have been created and added to the staging area.
git status
On branch main
No commits yet
nothing to commit (create/copy files and use "git add" to track)
Tracking changes
To start learning how to track changes using Git, a text file called mars.txt will be created in the directory /Users/tillodc/teaching/planets. This file will contain notes about the planet Mars. Note that Git is well suited to tracking plain text files (including CSV) and scripts.
nano mars.txt
Type the following in to mars.txt and then save and exit the editor.
Cold and dry, but everything is my favorite color.
cat mars.txt
Cold and dry, but everything is my favorite color.
Git will tell users that there is an untracked file when viewing the repository status. At this stage, Git does not know to track mars.txt yet.
git status
On branch main
No commits yet
Untracked files:
(use "git add <file>..." to include in what will be committed)
mars.txt
nothing added to commit but untracked files present (use "git add" to track)
Use git add to tell Git to track this file (i.e. place it in the staging area). Note that a snapshot of this file has not yet been created; Git just knows to track it.
git add mars.txt
Now, git status will reveal that mars.txt has been staged but not yet committed (i.e. there is no snapshot of its revision history). To create this snapshot, use git commit. Include the -m option in git commit to write a commit message, which will inform the user, the user's future self, as well as collaborators what the commit was about.
git status
On branch main
No commits yet
Changes to be committed:
(use "git rm --cached <file>..." to unstage)
new file: mars.txt
git commit -m "Start notes on Mars as a base."
[main (root-commit) 27d9e09] Start notes on Mars as a base.
1 file changed, 1 insertion(+)
create mode 100644 mars.txt
Note
"When we run git commit, Git takes everything we have told it to save by using git add and stores a copy permanently inside the special .git directory. This permanent copy is called a commit (or revision)..." -- https://swcarpentry.github.io/git-novice/04-changes.html
The output for git commit contains some important information.
- First line:
  - The branch on which the commit was made (main in this case)
  - The abbreviated commit ID (i.e. 27d9e09). Each commit in Git can be identified by a commit ID.
  - The commit message
- Second line:
  - The number of files that changed and what was done. Here 1 file (mars.txt) changed as a line was inserted into it.
- Third line:
  - The file that was created and its mode (100644 denotes a regular, non-executable file)
After committing, git status will indicate that everything is up-to-date.
git status
On branch main
nothing to commit, working tree clean
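The three states a file passes through (untracked, staged, committed) can be replayed in a scratch repository. The sketch below is an illustration, not the class's own transcript: it assumes a temporary directory and a hypothetical identity configured only for that repository, and it uses git status --porcelain, a compact machine-readable form of the status output shown above.

```shell
set -e
repo="$(mktemp -d)"                     # scratch repository for the demo
git init -q "$repo"
cd "$repo"
git config user.name "Example User"     # hypothetical identity, local to this repo
git config user.email "user@example.com"

echo "Cold and dry, but everything is my favorite color." > mars.txt
before="$(git status --porcelain)"      # "?? mars.txt": untracked

git add mars.txt
staged="$(git status --porcelain)"      # "A  mars.txt": staged, not yet committed

git commit -q -m "Start notes on Mars as a base."
after="$(git status --porcelain)"       # empty: working tree clean

printf '%s\n%s\n[%s]\n' "$before" "$staged" "$after"
```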
Definition
"The working tree is the set of all files and folders a developer can add, edit, rename and delete during application development. More colloquially, developers often refer to the Git working tree as the workspace or the working directory. But the technical name for the collection of files and folders in a repository is the Git working tree." -- TheServerSide
Note
"The phrase working tree clean means that your working tree (meaning your directory) is clean, i.e. the files in your directory exactly match the files in the last saved snapshot version in git." -- https://chryswoods.com/introducing_git/committing.html#:~:text=The%20phrase%20working%20tree%20clean,saved%20snapshot%20version%20in%20git.
Definition
git log "lists all commits made to a repository in reverse chronological order. The listing for each commit includes the commit’s full identifier (which starts with the same characters as the short identifier printed by the git commit command earlier), the commit’s author, when it was created, and the log message Git was given when the commit was created." -- https://swcarpentry.github.io/git-novice/04-changes.html
The command git log prints details regarding each commit for a project.
git log
Output for git log includes:
- Full commit ID and branch in which the commit was made (ie. main) in the first line
- The person who made the commit and email in the second line
- Date and time in which the commit was made in the third line
- Commit message in the fourth line
Definition
HEAD: "The most recent commit to the current checkout branch is indicated by the HEAD" -- https://www.geeksforgeeks.org/git-head/
commit 27d9e095e9144047f8a81063061be349e323825f (HEAD -> main)
Author: Joe Wu <wuz8@nih.gov>
Date: Thu Mar 28 15:43:38 2024 -0400
Start notes on Mars as a base.
Add the following line to the file mars.txt.
The two moons may be a problem for Wolfman.
git status indicates that the mars.txt file was changed but the modifications have not been committed.
git status
On branch main
Changes not staged for commit:
(use "git add <file>..." to update what will be committed)
(use "git checkout -- <file>..." to discard changes in working directory)
modified: mars.txt
no changes added to commit (use "git add" and/or "git commit -a")
To compare versions, use git diff.
git diff
In the git diff output:
- The first line indicates which files are being compared (here, two versions of mars.txt).
  - The first file is denoted by "a" or "-"
  - The second file is denoted by "b" or "+"
  - Prior to committing changes, the default order of comparison is old, new (i.e. diff --git old new). Thus, in the first line of the git diff output, "a" or "-" references the old version of mars.txt while "b" or "+" references the new version.
- The second line contains the abbreviated hashes that Git uses to identify the two versions of the file being compared (i.e. 077071c and 42d92e3), again in old, new order, followed by the file mode (100644).
- The third and fourth lines list the two files being compared.
- The fifth line contains header information enclosed by "@@". From left to right, the header reads as follows:
  - -1:
    - the negative sign denotes the version of mars.txt corresponding to 077071c (also denoted as "a")
    - the "1" indicates that the section shown starts at line 1 of this version; because no line count is given, it covers a single line
  - +1,2:
    - the positive sign denotes the version of mars.txt corresponding to 42d92e3 (also denoted as "b")
    - the "1" indicates that the section shown starts at line 1 of this version
    - the "2" indicates that the section covers 2 lines of this version
- Finally, the content of the changed section is shown. A line starting with "+" was added, a line starting with "-" was deleted, and a line with neither marker is unchanged context.
In a nutshell, git diff tells us what is being compared and what has changed (either added or deleted).
Note
See https://stackoverflow.com/questions/2529441/how-to-read-the-output-from-git-diff for an interpretation of the git diff output.
diff --git a/mars.txt b/mars.txt
index 077071c..42d92e3 100644
--- a/mars.txt
+++ b/mars.txt
@@ -1 +1,2 @@
Cold and dry, but everything is my favorite color.
+The two moons may be a problem for Wolfman.
Add the revised mars.txt to the staging area and commit the changes.
git add mars.txt
git commit -m "Add concerns about effects of Mars' moons on Wolfman"
Add the following line to mars.txt.
But the Mummy will appreciate the lack of humidity.
Add the updated mars.txt file to the staging area.
git add mars.txt
Because the changes have been staged, git diff on its own will not generate any output. Instead, use
git diff --staged
diff --git a/mars.txt b/mars.txt
index 42d92e3..2a282ee 100644
--- a/mars.txt
+++ b/mars.txt
@@ -1,2 +1,3 @@
Cold and dry, but everything is my favorite color.
The two moons may be a problem for Wolfman.
+But the Mummy will appreciate the lack of humidity.
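The difference between git diff and git diff --staged can also be checked without reading the diff text. The sketch below relies on the --quiet option, which suppresses output and makes the exit code report whether differences exist. The scratch repository and identity are assumptions made for the demonstration.

```shell
set -e
repo="$(mktemp -d)"                     # scratch repository for the demo
git init -q "$repo"
cd "$repo"
git config user.name "Example User"     # hypothetical identity, local to this repo
git config user.email "user@example.com"

echo "Cold and dry, but everything is my favorite color." > mars.txt
git add mars.txt
git commit -q -m "Start notes on Mars as a base."

# Edit the file without staging it: plain git diff sees the change.
echo "The two moons may be a problem for Wolfman." >> mars.txt
if git diff --quiet; then unstaged_before=no; else unstaged_before=yes; fi

# After staging, plain git diff is empty and --staged shows the change.
git add mars.txt
if git diff --quiet; then unstaged_after=no; else unstaged_after=yes; fi
if git diff --staged --quiet; then staged_after=no; else staged_after=yes; fi

echo "$unstaged_before $unstaged_after $staged_after"
```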
Now, commit the changes.
git commit -m "Discuss concerns about Mars' climate for Mummy"
[main 1d60e81] Discuss concerns about Mars' climate for Mummy
1 file changed, 1 insertion(+)
Examine a history of what was done.
git log
commit 1d60e81fe44f5d4d4e01d42f7461c3ff4c039131 (HEAD -> main)
Author: Joe Wu <wuz8@nih.gov>
Date: Wed Apr 3 17:21:40 2024 -0400
Discuss concerns about Mars' climate for Mummy
commit 5a54e5703bf5aaef0f9673191ceb59c6b9cc3ec7
Author: Joe Wu <wuz8@nih.gov>
Date: Wed Apr 3 17:17:42 2024 -0400
Add concerns about effects of Mars' moons on Wolfman
commit 27d9e095e9144047f8a81063061be349e323825f
Author: Joe Wu <wuz8@nih.gov>
Date: Thu Mar 28 15:43:38 2024 -0400
Start notes on Mars as a base.
Tip
Use git log --oneline to compress the output to one line per commit.
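As an illustration of the compact log, the sketch below rebuilds the three commits from this class in a scratch repository and prints them in one-line form. The temporary directory and the identity are assumptions; the commit IDs you see will differ from those shown earlier.

```shell
set -e
repo="$(mktemp -d)"                     # scratch repository for the demo
git init -q "$repo"
cd "$repo"
git config user.name "Example User"     # hypothetical identity, local to this repo
git config user.email "user@example.com"

# Recreate the three commits from this class, one line of notes per commit.
echo "Cold and dry, but everything is my favorite color." > mars.txt
git add mars.txt
git commit -q -m "Start notes on Mars as a base."
echo "The two moons may be a problem for Wolfman." >> mars.txt
git add mars.txt
git commit -q -m "Add concerns about effects of Mars' moons on Wolfman"
echo "But the Mummy will appreciate the lack of humidity." >> mars.txt
git add mars.txt
git commit -q -m "Discuss concerns about Mars' climate for Mummy"

git log --oneline                       # one line per commit, newest first
git log --oneline -2                    # only the two most recent commits

count="$(git log --oneline | wc -l | tr -d ' ')"
newest="$(git log -1 --format=%s)"      # message of the most recent commit
echo "$count commits; newest: $newest"
```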
Note
Git does not track directories but it tracks files in directories.
Add the following to mars.txt.
An ill-considered change.
git diff HEAD mars.txt
diff --git a/mars.txt b/mars.txt
index 2a282ee..1fe3425 100644
--- a/mars.txt
+++ b/mars.txt
@@ -1,3 +1,4 @@
Cold and dry, but everything is my favorite color.
The two moons may be a problem for Wolfman.
But the Mummy will appreciate the lack of humidity.
+An ill-considered change.
Because nothing is staged and mars.txt is the only modified file, git diff mars.txt and git diff HEAD will produce the same results as git diff HEAD mars.txt.
Append "~" followed by a number to HEAD to compare against a commit that many steps back. HEAD~1 indicates the commit previous to HEAD.
git diff HEAD~1 mars.txt
For two commits previous to HEAD, use ~2.
git diff HEAD~2 mars.txt
diff --git a/mars.txt b/mars.txt
index 077071c..1fe3425 100644
--- a/mars.txt
+++ b/mars.txt
@@ -1 +1,4 @@
Cold and dry, but everything is my favorite color.
+The two moons may be a problem for Wolfman.
+But the Mummy will appreciate the lack of humidity.
+An ill-considered change.
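To see how HEAD~N reaches further back with each step, the sketch below makes three commits that each add one line and then counts the added lines reported by git diff. The note text, the scratch repository, and the identity are hypothetical. The "--" separates the commit from the file name, which avoids ambiguity when a file and a branch share a name.

```shell
set -e
repo="$(mktemp -d)"                     # scratch repository for the demo
git init -q "$repo"
cd "$repo"
git config user.name "Example User"     # hypothetical identity, local to this repo
git config user.email "user@example.com"

# Three commits, each adding one line to mars.txt.
for line in "Cold and dry." "Two moons." "Low humidity."; do
  echo "$line" >> mars.txt
  git add mars.txt
  git commit -q -m "Add note: $line"
done

# Count added lines ("+" prefix), skipping the "+++ b/mars.txt" header line.
added_since_1="$(git diff HEAD~1 -- mars.txt | grep -c '^+[^+]')"
added_since_2="$(git diff HEAD~2 -- mars.txt | grep -c '^+[^+]')"
echo "$added_since_1 $added_since_2"
```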
Reverting to previous version
The last line added to mars.txt, "An ill-considered change.", has not been staged for commit yet; git status will indicate that mars.txt has been modified.
git status
On branch main
Changes not staged for commit:
(use "git add <file>..." to update what will be committed)
(use "git restore <file>..." to discard changes in working directory)
modified: mars.txt
no changes added to commit (use "git add" and/or "git commit -a")
To go back to the version of mars.txt prior to adding "An ill-considered change.", use the following.
git checkout HEAD mars.txt
cat mars.txt
Cold and dry, but everything is my favorite color.
The two moons may be a problem for Wolfman.
But the Mummy will appreciate the lack of humidity.
Definition
"git checkout checks out (i.e., restores) an old version of a file" -- https://swcarpentry.github.io/git-novice/05-history.html
Tip
To revert to a specific version, supply the commit ID to git checkout.
git checkout 27d9e09 mars.txt
cat mars.txt
Cold and dry, but everything is my favorite color.
The modifications from checking out the commit ID corresponding to version 1 of mars.txt are in the staging area, but not committed.
git status
On branch main
Changes to be committed:
(use "git restore --staged <file>..." to unstage)
modified: mars.txt
To go back to the most recent committed version of mars.txt, check out HEAD again.
git checkout HEAD mars.txt
cat mars.txt
Cold and dry, but everything is my favorite color.
The two moons may be a problem for Wolfman.
But the Mummy will appreciate the lack of humidity.
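The git status output above suggests git restore, which Git 2.23 and newer provide as a more narrowly focused alternative to git checkout for files. The sketch below assumes such a Git version, plus a scratch repository and a hypothetical identity. One difference worth noting: git restore --source changes only the working copy, so the old version is left unstaged, whereas git checkout with a commit ID also stages it.

```shell
set -e
repo="$(mktemp -d)"                     # scratch repository for the demo
git init -q "$repo"
cd "$repo"
git config user.name "Example User"     # hypothetical identity, local to this repo
git config user.email "user@example.com"

echo "Cold and dry, but everything is my favorite color." > mars.txt
git add mars.txt
git commit -q -m "Start notes on Mars as a base."
echo "The two moons may be a problem for Wolfman." >> mars.txt
git add mars.txt
git commit -q -m "Add concerns about effects of Mars' moons on Wolfman"

# 1. Discard an unstaged edit (here the same effect as: git checkout HEAD mars.txt).
echo "An ill-considered change." >> mars.txt
git restore mars.txt
after_discard="$(git status --porcelain)"       # empty: matches the last commit

# 2. Bring back the first version; the change is left unstaged.
git restore --source=HEAD~1 mars.txt
old_lines="$(wc -l < mars.txt | tr -d ' ')"
after_source="$(git status --porcelain)"        # " M mars.txt": modified, not staged

# 3. Return to the most recent committed version.
git restore mars.txt
final_lines="$(wc -l < mars.txt | tr -d ' ')"
echo "$old_lines $final_lines"
```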
Ignoring things (optional)
Sometimes you don't want Git to track certain files (e.g. backup files created by your text editor, large data files, intermediate analysis files). This can be achieved using a special configuration file, .gitignore.
Let's add some dummy files:
mkdir results
touch a.csv b.csv c.csv results/a.out results/b.out
and see what Git says:
git status
On branch main
Untracked files:
(use "git add <file>..." to include in what will be committed)
a.csv
b.csv
c.csv
results/
nothing added to commit but untracked files present (use "git add" to track)
Putting these files under version control would be a waste of disk space. What's worse, having them all listed could distract us from changes that actually matter, so let's tell Git to ignore them.
We do this by creating a file in the root directory of our project called .gitignore:
nano .gitignore
Add the following text to the .gitignore file
*.csv
results/
These patterns tell Git to ignore any file whose name ends in .csv and everything in the results directory. (If any of these files were already being tracked, Git would continue to track them.)
Once we have created this file, the output of git status is much cleaner:
git status
On branch main
Untracked files:
(use "git add <file>..." to include in what will be committed)
.gitignore
nothing added to commit but untracked files present (use "git add" to track)
The only thing Git notices now is the newly-created .gitignore file. You might think we wouldn't want to track it, but everyone we're sharing our repository with will probably want to ignore the same things that we're ignoring. Let's add and commit .gitignore:
git add .gitignore
git commit -m "Ignore data files and the results folder"
git status
On branch main
nothing to commit, working tree clean
As a bonus, using .gitignore helps us avoid accidentally adding files to the repository that we don't want to track:
git add a.csv
The following paths are ignored by one of your .gitignore files:
a.csv
Use -f if you really want to add them.
If we really want to override our ignore settings, we can use git add -f to force Git to add something. For example, git add -f a.csv. We can also always see the status of ignored files if we want:
git status --ignored
On branch main
Ignored files:
(use "git add -f <file>..." to include in what will be committed)
a.csv
b.csv
c.csv
results/
nothing to commit, working tree clean
Tip
You can set up a global .gitignore file to be used in all of your projects. You will need to set your global config to point to it: git config --global core.excludesFile '~/.gitignore'
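When it is unclear why a file is being ignored, git check-ignore can report the responsible pattern. The sketch below is an illustration in a scratch repository (an assumption for the demonstration) that reuses the two patterns from this section. The -v option prints the ignore file, the line number, and the pattern that matched.

```shell
set -e
repo="$(mktemp -d)"                     # scratch repository for the demo
git init -q "$repo"
cd "$repo"

printf '*.csv\nresults/\n' > .gitignore
mkdir results
touch a.csv results/a.out mars.txt

# Show which ignore file, line number, and pattern matched a.csv.
match="$(git check-ignore -v a.csv)"
echo "$match"

# check-ignore exits non-zero for paths that are not ignored.
if git check-ignore -q results/a.out; then out_state=ignored; else out_state=not-ignored; fi
if git check-ignore -q mars.txt; then mars_state=ignored; else mars_state=not-ignored; fi
echo "$out_state $mars_state"
```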
Remotes in Git (optional)
Note
Before you start -- this assumes you have set up an SSH private/public key pair to work with GitHub. See the instructions here.
Now that we've finished work on our project locally, we'd like to share it with our collaborators / the world. To this end we are going to create a remote repository that will be linked to our local repository.
1. Create a remote repository
Log in to GitHub, then click on the icon in the top right corner to create a new repository called planets:
Name your repository "planets" and then click "Create Repository".
Note: Since this repository will be connected to a local repository, it needs to be empty. Leave "Initialize this repository with a README" unchecked, and keep "None" as options for both "Add .gitignore" and "Add a license." See the "GitHub License and README files" exercise below for a full explanation of why the repository needs to be empty.
As soon as the repository is created, GitHub displays a page with a URL and some information on how to configure your local repository:
This effectively does the following on GitHub's servers:
mkdir planets
cd planets
git init
2. Connect local to remote repository
Now we connect the two repositories. We do this by making the GitHub repository a remote for the local repository. The home page of the repository on GitHub includes the URL string we need to identify it:
Definition
A remote (of a repository): "A version control repository connected to another, in such way that both can be kept in sync exchanging commits." -- https://swcarpentry.github.io/git-novice/reference.html#remote
Click on the 'SSH' link to change the protocol from HTTPS to SSH.
Copy that URL from the browser, go into the local planets repository, and run this command:
git remote add origin git@github.com:desireetillo/planets.git
origin is a local name used to refer to the remote repository. It could be called anything, but origin is a convention that is often used by default in Git and GitHub, so it's helpful to stick with this unless there's a reason not to.
We can check that the command has worked by running git remote -v:
git remote -v
and we should see the following output:
origin git@github.com:desireetillo/planets.git (fetch)
origin git@github.com:desireetillo/planets.git (push)
3. Push local changes to a remote
Now that authentication is set up, we can return to the remote. This command will push the changes from our local repository to the repository on GitHub:
git push origin main
If you set up a passphrase with your SSH key, you will be prompted for it.
Enumerating objects: 16, done.
Counting objects: 100% (16/16), done.
Delta compression using up to 8 threads.
Compressing objects: 100% (11/11), done.
Writing objects: 100% (16/16), 1.45 KiB | 372.00 KiB/s, done.
Total 16 (delta 2), reused 0 (delta 0)
remote: Resolving deltas: 100% (2/2), done.
To https://github.com/desireetillo/planets.git
* [new branch] main -> main
We can pull changes from the remote repository to the local one as well:
git pull origin main
From https://github.com/desireetillo/planets
* branch main -> FETCH_HEAD
Already up-to-date.
Pulling has no effect in this case because the two repositories are already synchronized. If someone else had pushed some changes to the repository on GitHub, or if you had added a file on GitHub (e.g. a LICENSE or a README file), the above command would download them to our local repository.
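The same add-remote and push steps can be practiced without a GitHub account by letting a bare repository on the local disk play the role of the remote. The following is a sketch under that assumption: the directory names and identity are hypothetical, and the branch name is read from the repository rather than assumed to be main. It uses git ls-remote, which lists the branches a remote currently holds.

```shell
set -e
work="$(mktemp -d)"                     # scratch area for the demo

# A bare repository stands in for the empty GitHub repository.
git init -q --bare "$work/remote.git"

git init -q "$work/planets"
cd "$work/planets"
git config user.name "Example User"     # hypothetical identity, local to this repo
git config user.email "user@example.com"
echo "Cold and dry, but everything is my favorite color." > mars.txt
git add mars.txt
git commit -q -m "Start notes on Mars as a base."

# Connect local to remote, then push the current branch.
git remote add origin "$work/remote.git"
git remote -v
branch="$(git symbolic-ref --short HEAD)"       # main or master, depending on your config
git push -q origin "$branch"

# The remote branch should now point at the same commit as the local HEAD.
remote_head="$(git ls-remote origin "refs/heads/$branch" | cut -f1)"
local_head="$(git rev-parse HEAD)"
echo "$remote_head"
```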
Further Reading
Git resources
- Printable Git cheatsheet. More material is available from the GitHub training website.
- An interactive one-page visualisation about the relationships between workspace, staging area, local repository, upstream repository, and the commands associated with each (with explanations).
- Open Scientific Code using Git and GitHub - A collection of explanations and short practical exercises to help researchers learn more about version control and open source software.