Introduction

Overview

Teaching: 10 min
Exercises: 0 min

Questions

What do I do when I need to make complex decisions with my git repository?

How do I collaborate on a software project with others?

Objectives

Understand the range of functionality that exists in git.

Understand the different challenges that arise with collaborative projects.

Git Refresher

Git is a version control system for tracking changes in computer files and coordinating work on those files among multiple people. It is primarily used for source code management in software development, but it can be used to track changes in files in general. It is particularly effective for tracking text-based files (e.g. source code, CSV, Markdown, HTML, CSS and TeX files).

Git has several important characteristics:

support for non-linear development allowing you and your colleagues to work on different parts of a project concurrently,
support for distributed development allowing for multiple people to be working on the same project (even the same file) at the same time,
every change recorded by Git remains part of the project history and can be retrieved at a later date, so even if you make a mistake you can revert to a point before it.

It uses a distributed version control model (the “beautiful graph theory tree model”), meaning that there is no single central repository of code. Instead, users share code back and forth to synchronise their repositories, and it is up to each project to define processes and procedures for managing the flow of changes into a stable software product.

Git is powerful and flexible enough to fit a wide range of use cases and workflows, from simple projects written by a single contributor to projects that are millions of lines long and have hundreds of co-authors. It also performs a task that is inherently complex, so many users find it challenging to navigate. For instance, while committing and sharing changes is fairly straightforward, recovering from situations such as accidental commits, pushes or bad merges is difficult without a solid understanding of the rather large and complex conceptual model. Case in point, three of the top five highest voted questions on Stack Overflow are questions about how to carry out relatively simple tasks: undoing the last commit, changing the last commit message, and deleting a remote branch.

XKCD Git

Mouse-over text: If that doesn’t fix it, git.txt contains the phone number of a friend of mine who understands git. Just wait through a few minutes of ‘It’s really pretty simple, just think of branches as…’ and eventually you’ll learn the commands that will fix everything.

With this lesson our goal is to give you a more in-depth understanding of the conceptual model of git, to guide you through increasingly complex workflows and to give you the confidence to participate in larger projects.

The diagram below shows a typical software development lifecycle with Git (in our case starting from making changes in a local branch that “tracks” a remote branch) and the commonly used commands to interact with different parts of the Git infrastructure, including:

working tree - a local directory (including any subdirectories) where your project files live and where you are currently working. It is also known as the “untracked” area of Git or “working directory”. Any changes to files will be marked by Git in the working tree. If you make changes to the working tree and do not explicitly tell Git to save them - you will likely lose those changes. Using git add filename command, you tell Git to start tracking changes to file filename within your working tree.
staging area (index) - once you tell Git to start tracking changes to files (with git add filename command), Git saves those changes in the staging area on your local machine. Each subsequent change to the same file needs to be followed by another git add filename command to tell Git to update it in the staging area. To see what is in your working tree and staging area at any moment (i.e. what changes is Git tracking), run the command git status.
local repository - stored within the .git directory of your project locally, this is where Git wraps together all your changes from the staging area and records them when you run the git commit command. Each commit is a new, permanent snapshot (checkpoint, record) of your project in time, which you can share or revert to.
remote repository - this is a version of your project that is hosted somewhere on the Internet (e.g., on GitHub, GitLab or somewhere else). While your project is nicely version-controlled in your local repository, and you have snapshots of its versions from the past, if your machine crashes - you still may lose all your work. Furthermore, you cannot share or collaborate on this local work with others easily. Working with a remote repository involves pushing your local changes remotely (using git push) and pulling other people’s changes from a remote repository to your local copy (using git fetch or git pull) to keep the two in sync in order to collaborate (with a bonus that your work also gets backed up to another machine). Note that a common best practice when collaborating with others on a shared repository is to always do a git pull before a git push, to ensure you have any latest changes before you push your own.
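The first three areas described above can be explored without any remote. The following is a minimal sketch, assuming Git 2.28 or newer (for `git init -b`) and using a throwaway directory, a hypothetical file name `notes.txt` and a placeholder identity:

```shell
# Work in a throwaway repository so nothing real is touched.
repo=$(mktemp -d)
cd "$repo"
git init -q -b main                       # -b needs Git 2.28 or newer
git config user.name "Example User"       # placeholder identity for this repo only
git config user.email "user@example.com"

echo "first draft" > notes.txt    # the change exists only in the working tree
git status --short                # "?? notes.txt": untracked

git add notes.txt                 # copy the current content into the staging area
git status --short                # "A  notes.txt": staged

echo "second line" >> notes.txt   # edit again after staging
git status --short                # "AM notes.txt": staged, plus an unstaged edit

git add notes.txt                 # restage so the commit includes both lines
git commit -q -m "Add notes"      # snapshot the staging area into the local repository
git log --oneline                 # one commit
```

The third `git status` call shows why each new edit needs another `git add`: the staging area holds the content as it was when it was last added, not the file as it is now.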

Software development lifecycle with Git

Git Version Control Tool

To test your Git installation, type:

$ git help

If your Git installation is working you should see something like:

usage: git [-v | --version] [-h | --help] [-C <path>] [-c <name>=<value>]
           [--exec-path[=<path>]] [--html-path] [--man-path] [--info-path]
           [-p | --paginate | -P | --no-pager] [--no-replace-objects] [--bare]
           [--git-dir=<path>] [--work-tree=<path>] [--namespace=<name>]
           [--config-env=<name>=<envvar>] <command> [<args>]

These are common Git commands used in various situations:

start a working area (see also: git help tutorial)
   clone     Clone a repository into a new directory
   init      Create an empty Git repository or reinitialize an existing one

work on the current change (see also: git help everyday)
   add       Add file contents to the index
   mv        Move or rename a file, a directory, or a symlink
   restore   Restore working tree files
   rm        Remove files from the working tree and from the index

examine the history and state (see also: git help revisions)
   bisect    Use binary search to find the commit that introduced a bug
   diff      Show changes between commits, commit and working tree, etc
   grep      Print lines matching a pattern
   log       Show commit logs
   show      Show various types of objects
   status    Show the working tree status

grow, mark and tweak your common history
   branch    List, create, or delete branches
   commit    Record changes to the repository
   merge     Join two or more development histories together
   rebase    Reapply commits on top of another base tip
   reset     Reset current HEAD to the specified state
   switch    Switch branches
   tag       Create, list, delete or verify a tag object signed with GPG

collaborate (see also: git help workflows)
   fetch     Download objects and refs from another repository
   pull      Fetch from and integrate with another repository or a local branch
   push      Update remote refs along with associated objects

'git help -a' and 'git help -g' list available subcommands and some
concept guides. See 'git help <command>' or 'git help <concept>'
to read about a specific subcommand or concept.
See 'git help git' for an overview of the system.

When you use Git on a machine for the first time, you need to configure a few things:

your name,
your email address (the one you used to open your GitHub account with, which will be used to uniquely identify your commits),
preferred text editor for Git to use (e.g. nano or another text editor of your choice),
whether you want to use these settings globally (i.e. for every Git project on your machine).

This can be done from the command line as follows:

$ git config --global user.name "Your Name"
$ git config --global user.email "name@example.com"
$ git config --global core.editor "nano -w"
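The `--global` flag writes to your per-user configuration. As a rough illustration of the alternative, the sketch below sets values for a single repository only by leaving the flag out; the repository is a throwaway directory and the name and email address are hypothetical:

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q

# Without --global the value is written to this repository's .git/config only.
git config user.name "Project Persona"        # hypothetical values
git config user.email "project@example.com"

git config --get user.email     # the value Git will use for commits made here
git config --local --list       # everything set at repository level
```

A repository-level value takes precedence over the global one, which can be useful when one project needs a different email address from the rest.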

GitHub Account

GitHub is a free, online host for Git repositories that you will use during the course to store your code, so you will need to open a free GitHub account unless you already have one.

Secure Access To GitHub Using Git From Command Line

In order to access GitHub using Git from your machine securely, you need to set up a way of authenticating yourself with GitHub through Git. The recommended way to do that for this course is to set up SSH authentication - a method of authentication that is more secure than sending passwords over HTTPS and which requires a pair of keys - one public that you upload to your GitHub account, and one private that remains on your machine.

GitHub provides full documentation and guides on how to generate an SSH key and add it to your GitHub account.

A short summary of the commands you need to perform is shown below.

To generate an SSH key pair, you will need to run the ssh-keygen command from your command line tool/GitBash and provide your identity for the key pair (e.g. the email address you used to register with GitHub) via the -C parameter as shown below. Note that the ssh-keygen command can be run with different parameters - e.g. to select a specific public key algorithm and key length; if you do not use them ssh-keygen will generate an RSA key pair for you by default. GitHub now recommends that you use a newer cryptographic standard (such as EdDSA variant algorithm Ed25519), so please be sure to specify it using the -t flag as shown below. It will also prompt you to answer a few questions - e.g. where to save the keys on your machine and a passphrase to use to protect your private key. Pressing ‘Enter’ on these prompts will get ssh-keygen to use the default key location (within .ssh folder in your home directory) and set the passphrase to empty.

$ ssh-keygen -t ed25519 -C "your-github-email@example.com"

Generating public/private ed25519 key pair.
Enter file in which to save the key (/Users/<YOUR_USERNAME>/.ssh/id_ed25519): 
Enter passphrase (empty for no passphrase): 
Enter same passphrase again: 
Your identification has been saved in /Users/<YOUR_USERNAME>/.ssh/id_ed25519
Your public key has been saved in /Users/<YOUR_USERNAME>/.ssh/id_ed25519.pub
The key fingerprint is:
SHA256:qjhN/iO42nnYmlpink2UTzaJpP8084yx6L2iQkVKdHk your-github-email@example.com
The key's randomart image is:
+--[ED25519 256]--+
|.. ..            |
| ..o A           |
|. o..            |
| .o.o .          |
| ..+ =  B        |
| .o = ..         |
|o..X *.          |
|++B=@.X          |
|+*XOoOo+         |
+----[SHA256]-----+

Next, you need to copy your public key (not your private key - this is important!) over to your GitHub account. The ssh-keygen command above will let you know where your public key is saved (the file should have the extension “.pub”), and you can get its contents (e.g. on a macOS system) as follows:

$ cat /Users/<YOUR_USERNAME>/.ssh/id_ed25519.pub

ssh-ed25519 AABAC3NzaC1lZDI1NTE5AAAAICWGVRsl/pZsxx85QHLwSgJWyfMB1L8RCkEvYNkP4mZC your-github-email@example.com

Copy the line of output that starts with “ssh-ed25519” and ends with your email address (it may start with a different algorithm name based on which one you used to generate the key pair and it may have gone over multiple lines if your command line window is not wide enough).

Finally, go to your GitHub Settings -> SSH and GPG keys -> Add New page to add a new SSH public key. Give your key a memorable name (e.g. the name of the computer you are working on that contains the private key counterpart), paste the public key from your clipboard into the box labelled “Key” (making sure it does not contain any line breaks), then click the “Add SSH key” button.

Now, we can check that the SSH connection is working:

$ ssh -T git@github.com

What About Passwords?

While using passwords over HTTPS for authentication is easier to set up and will allow you read access to your repository on GitHub from your machine, it alone is no longer sufficient to allow you to send changes or write to your remote repository on GitHub. This is because, on 13 August 2021, GitHub strengthened security requirements for all authenticated Git operations. This means you would need to use a personal access token instead of your password for added security each time you need to authenticate yourself to GitHub from the command line (e.g. when you want to push your local changes to your code repository on GitHub). While using an SSH key pair for authentication may seem complex, once set up, it is actually more convenient than keeping track of/caching your access token.

Key Points

Git version control records text-based differences between files.

Each git commit records a change relative to the previous state of the documents.

Git has a range of functionality that allows users to manage the changes they make.

This complex functionality is especially useful when collaborating on projects with others.

Forks

Overview

Teaching: 0 min
Exercises: 0 min

Questions

What are forks?

Objectives

Understand how forks are created.

A fork of a repository is a new repository which shares code and history with the repository it was forked from - it is a server-side copy (clone) of the original repository. This is particularly useful when you want to work on an open source project where you don’t have write permissions to the repository.

Creating a fork and a local copy of the fork

In order to try out the commands in this lesson we need to set up a repository on GitHub:

Go to https://github.com/NOC-OI/intermediate-git-test-repo
Click on the Fork button on the top right and follow the instructions. When this process is done, you will be directed to your copy of the repository on GitHub.
Click the green Code button. Copy the SSH or HTTPS path to the repository to your local machine. Do not download a ZIP file.
Create a local copy. The command will be similar to this but with your user name:
```
git clone git@github.com:<user-name>/intermediate-git-test-repo.git
```

Exercise 1: Create a fork

Follow the above instructions to create a fork of the intermediate-git-test-repo. Then create a local copy of it.

We’ll discuss forking further in the Remotes and Branching Models chapters.

Key Points

A fork is a server-side copy of a repository

A fork can be created on GitHub through the Fork button in the top right

Branches

Overview

Teaching: 0 min
Exercises: 0 min

Questions

What are branches?

How do I view the current branches?

How do I manipulate branches?

Objectives

Understand how branches are created.

Learn the key commands to view and manipulate branches.

Branching is a feature available in most modern version control systems. Branching in other version control systems can be an expensive operation in both time and disk space. In git, branches are a part of your everyday development process. When you want to add a new feature or fix a bug, no matter how big or how small, you spawn a new branch to encapsulate your changes. This makes it harder for unstable code to get merged into the main code base, and it gives you the chance to clean up your feature’s history before merging it into the main branch.

Git Branching

The diagram above visualizes a repository with two isolated lines of development, one for a little feature, and one for a longer-running feature. By developing them in branches, it’s not only possible to work on both of them in parallel, but it also keeps the main branch free from questionable code.

The implementation behind Git branches is much more lightweight than other version control system models. Instead of copying files from directory to directory, Git stores a branch as a reference to a commit. In this sense, a branch represents the tip of a series of commits—it’s not a container for commits. The history for a branch is extrapolated through the commit relationships.

(https://www.atlassian.com/git/tutorials/using-branches)

What is a branch?

In git a branch is effectively a pointer to a snapshot of your changes. It’s important to understand that branches are just pointers to commits. When you create a branch, all Git needs to do is create a new pointer; it doesn’t change the repository in any other way. If you start with a repository that looks like this:

Git Branching

Then, you create a branch using the following command:

git branch crazy-experiment

The repository history remains unchanged. All you get is a new pointer to the current commit:

Git Branching

Note that this only creates the new branch. To start adding commits to it, you need to select it with git switch, and then use the standard git add and git commit commands.
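The claim that creating a branch only adds a pointer can be checked directly. This is a small sketch in a throwaway repository, assuming Git 2.28 or newer and the default file-based ref storage; the file name and identity are placeholders:

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q -b main
git config user.name "Example User"       # placeholder identity
git config user.email "user@example.com"
echo "data" > file.txt
git add file.txt
git commit -q -m "Initial commit"

commits_before=$(git rev-list --all --count)
git branch crazy-experiment
commits_after=$(git rev-list --all --count)
echo "$commits_before $commits_after"   # same number: no commit was created

git rev-parse main crazy-experiment     # prints the same commit hash twice
cat .git/refs/heads/crazy-experiment    # with default ref storage, the branch is a file holding that hash
git branch                              # the asterisk still marks main
```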

A branch also means an independent line of development. Branches serve as an abstraction for the edit/stage/commit process. New commits are recorded in the history for the current branch, which results in a fork in the history of the project. However, it is important to remember that a commit does not itself contain the full history of changes: each commit records a snapshot of the project together with a reference to its parent commit(s), and Git reconstructs the history of a branch by following those references back from the branch tip. Therefore, while we think of a branch as a sequence of commits, each commit is an independent unit.

Branching Commands

Creating, deleting, and modifying branches is quick and easy; here’s a summary of the commands:

To list all branches:

git branch

git branch -avv

To create a new branch named <branch>, which references the same point in history as the current branch.

git branch <branch>

To create a new branch named <branch>, referencing <start-point>, which may be specified any way you like, including using a branch name or a tag name:

git branch <branch> <start-point>

To delete the branch <branch>; if the branch is not fully merged in its upstream branch or contained in the current branch, this command will fail with a warning:

git branch -d <branch>

To delete the branch <branch> irrespective of its merged status:

git branch -D <branch>

To switch to a different branch <branch>, updating the working directory to reflect the version referenced by <branch>.

git switch <branch>

To create a new branch <new> referencing <start-point>, and switch to it.

git switch -c <new> <start-point>

The special symbol "HEAD" can always be used to refer to the current branch. In fact, Git uses a file named HEAD in the .git directory to remember which branch is current:

$ cat .git/HEAD
ref: refs/heads/main
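To see HEAD follow you from branch to branch, a short sketch along these lines can be run in a throwaway repository (Git 2.28 or newer assumed; the branch name `experiment` is hypothetical):

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q -b main
git config user.name "Example User"       # placeholder identity
git config user.email "user@example.com"
echo "data" > file.txt
git add file.txt
git commit -q -m "Initial commit"

cat .git/HEAD                   # ref: refs/heads/main
git switch -q -c experiment     # create a branch and switch to it
cat .git/HEAD                   # ref: refs/heads/experiment
git rev-parse HEAD experiment   # HEAD resolves to the commit at the tip of the current branch
```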

Renaming a branch can be done with the -m option:

git branch -m <old-branch-name> <new-branch-name>
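The commands above can be strung together into one short session. The sketch below is illustrative only: it uses a throwaway repository, hypothetical branch names (`draft`, `experiment`) and assumes Git 2.28 or newer. It also shows the difference between `-d` and `-D` on a branch that has not been merged:

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q -b main
git config user.name "Example User"       # placeholder identity
git config user.email "user@example.com"
echo "v1" > file.txt
git add file.txt
git commit -q -m "Initial commit"

git switch -q -c draft              # create a branch and switch to it in one step
echo "v2" >> file.txt
git commit -q -am "Work in progress"
git branch -m draft experiment      # rename the branch
git switch -q main

# -d refuses because the commit on experiment has not been merged into main
git branch -d experiment || echo "not deleted: experiment is unmerged"
git branch -D experiment            # -D deletes regardless of merge status
git branch                          # only main is left
```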

Exercise 1: Create a feature branch

Create a feature branch in your local copy of intermediate-git-test-repo. Make a change to the plot_buoys.py script, such as changing the colour of the buoy symbols on line 28. Commit your change and push it to your fork on GitHub.
Solution
git branch feature-branch
git switch feature-branch
Make some edit to plot_buoys.py.
git add plot_buoys.py
git commit -m "Change colour of buoy marker"
git push origin feature-branch

Key Points

A branch represents an independent line of development.

git branch creates a new pointer to the current state of the repository and allows you to make subsequent changes from that state.

Subsequent changes are considered to belong to that branch.

The most recent commit on a branch is its tip; HEAD refers to the tip of the branch that is currently checked out.

Remote Repositories

Overview

Teaching: 0 min
Exercises: 0 min

Questions

How can I work in remote teams and with remotely hosted code?

Objectives

Learn to work with multiple remotes

https://www.atlassian.com/git/tutorials/syncing

Git’s distributed collaboration model gives every developer their own copy of the repository, complete with its own local history and branch structure. Users typically need to share a series of commits rather than a single “changeset”. Instead of committing a “changeset” from a working copy to the central repository, Git lets you share entire branches between repositories.

Git remote

The git remote command lets you create, view, and delete connections to other repositories. Remote connections are more like bookmarks than direct links into other repositories. Instead of providing real-time access to another repository, they serve as convenient names that can be used to reference a not-so-convenient URL.

Remote Schematic

For example, the diagram above shows two remote connections from your repo into the central repo and another developer’s repo. Instead of referencing them by their full URLs, you can pass the origin and john shortcuts to other Git commands.

The git remote command is essentially an interface for managing a list of remote entries that are stored in the repository’s ./.git/config file. The following commands are used to view the current state of the remote list.

Git is designed to give each developer an entirely isolated development environment. This means that information is not automatically passed back and forth between repositories. Instead, developers need to manually pull upstream commits into their local repository or manually push their local commits back up to the central repository. The git remote command is really just an easier way to pass URLs to these “sharing” commands.

View Remote Configuration

To list the remote connections of your repository to other repositories you can use the git remote command:

git remote

If you test this in the local repo you’ve created, you should get only one connection, origin:

origin

When you clone a repository with git clone, git automatically creates a remote connection called origin pointing back to the cloned repository. This is useful for developers creating a local copy of a central repository, since it provides an easy way to pull upstream changes or publish local commits. This behaviour is also why most Git-based projects call their central repository origin.

We can ask git for a more verbose (-v) answer which gives us the URLs for the connections:

git remote -v

For our training repository this should return:

origin	git@github.com:<user-name>/intermediate-git-test-repo.git (fetch)
origin	git@github.com:<user-name>/intermediate-git-test-repo.git (push)

As expected these point to the repository you cloned.
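The automatic creation of origin can be reproduced offline. In this sketch a local repository named `central` stands in for the hosted one (an assumption made so that no network access is needed); Git 2.28 or newer is assumed:

```shell
work=$(mktemp -d)
cd "$work"

# A local repository with one commit stands in for the hosted project.
git init -q -b main central
git -C central -c user.name="Example User" -c user.email="user@example.com" \
    commit -q --allow-empty -m "Initial commit"

git clone -q central my-copy    # clone it as you would clone from a URL
cd my-copy
git remote                      # origin
git remote -v                   # origin with its fetch and push locations
git remote get-url origin       # the location the clone came from
```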

Create and Modify Connections

The git remote command also lets you manage connections with other repositories. The following commands will modify the repo’s ./.git/config file. The result of the following commands can also be achieved by directly editing the ./.git/config file with a text editor.

Create a new connection to a remote repository. After adding a remote, you’ll be able to use <name> as a convenient shortcut for <url> in other Git commands.

git remote add <name> <url>

Remove the connection:

git remote rm <name>

Rename a connection:

git remote rename <old-name> <new-name>

To get high-level information about the remote <name>:

git remote show <name>
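Because these commands only edit entries in .git/config, they can be tried without contacting any server. A minimal sketch, using a throwaway repository and a placeholder URL that is never actually fetched:

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q

# Adding a remote only records the name and URL; nothing is downloaded.
git remote add john http://dev.example.com/john.git
url=$(git config --get remote.john.url)   # the entry as stored in .git/config
echo "$url"

git remote rename john colleague
git remote -v                   # colleague is listed for both fetch and push
git remote remove colleague     # "rm" is a shorter alias for "remove"
git remote                      # prints nothing: no connections are left
```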

Exercise 1: Add a connection to your neighbour’s repository.

Having this kind of access to individual developers’ repositories makes it possible to collaborate outside of the central repository. This can be very useful for small teams working on a large project.
Solution
git remote add john http://dev.example.com/john.git

Multiple remotes

Remember that when you create a new branch without specifying a starting point, then the starting point will be the current state and branch. In order to avoid confusion, ALWAYS branch from the stable version. Here is how you would branch from your own origin/main branch:

git fetch origin main
git branch <branch> origin/main
git switch <branch>

You must fetch first so that you have the most recent state of the repository.

If there is another “true” version/state of the project, then this connection may be set as upstream (or something else); upstream is a common name for the stable repository. The sequence will then be:

git fetch upstream main
git branch <branch> upstream/main
git switch <branch>
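The fetch-then-branch sequence can be rehearsed locally. In the sketch below, a local repository called `upstream-repo` plays the role of the stable project and `new-feature` is a hypothetical branch name; Git 2.28 or newer is assumed:

```shell
work=$(mktemp -d)
cd "$work"

# Stand-in for the stable "upstream" project (local only).
git init -q -b main upstream-repo
git -C upstream-repo -c user.name="Example User" -c user.email="user@example.com" \
    commit -q --allow-empty -m "Stable release"

# Your own repository, with the stand-in registered as the remote "upstream".
git init -q -b main mine
cd mine
git remote add upstream "$work/upstream-repo"

git fetch -q upstream main              # updates the remote-tracking branch upstream/main
git branch new-feature upstream/main    # start the branch at the stable state
git switch -q new-feature
git log --oneline -1                    # shows the "Stable release" commit
```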

Now we can set the NOC-OI version of our repository as the upstream for our local copy.

Exercise 2: Setting upstream and checking out remote branches.

Set the https://github.com/NOC-OI/intermediate-git-test-repo repo as the upstream locally. Then check which branches exist upstream and create a local branch with the starting point of the upstream branch called develop.
Solution
git remote add upstream git@github.com:NOC-OI/intermediate-git-test-repo.git
git fetch upstream
git branch -vv -a
git branch develop upstream/develop
git switch develop
Now examine the state of your repository with git branch, git branch -vva (to see all the branches), git remote -v and git remote show upstream.

We can see that we are now on a local branch develop, which is configured to track the develop branch in upstream. Running git pull and git push in this branch will automatically pull from and push to the upstream branch. We can verify this with

$ git pull -v

This branch has a small commit which is not in your origin remote.
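The tracking relationship can also be inspected directly. The following sketch assumes Git 2.28 or newer and uses a local stand-in repository (`upstream-repo`) rather than the real upstream, so it runs offline:

```shell
work=$(mktemp -d)
cd "$work"

# Stand-in for the upstream project, with develop as its only branch.
git init -q -b develop upstream-repo
git -C upstream-repo -c user.name="Example User" -c user.email="user@example.com" \
    commit -q --allow-empty -m "First commit on develop"

git init -q -b main mine
cd mine
git remote add upstream "$work/upstream-repo"
git fetch -q upstream
git branch develop upstream/develop    # starting from a remote-tracking branch sets up tracking
git switch -q develop

git rev-parse --abbrev-ref 'develop@{upstream}'   # upstream/develop
git branch -vv                                    # shows [upstream/develop] next to develop

# A new commit appears upstream; a plain pull now knows where to look.
git -C "$work/upstream-repo" -c user.name="Example User" -c user.email="user@example.com" \
    commit -q --allow-empty -m "Second commit on develop"
git pull -q --ff-only
git log --oneline                      # both commits
```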

Exercise 3: Pushing to origin.

Push the changes we’ve pulled down to your own remote repository (origin).
Solution
$ git push origin develop 

We can configure as many remotes as we like. If you work closely with friends or colleagues, it could be common for you to want to pull interesting changes from their remotes, incorporate those into your current branches, and push those changes to your remote.

Key Points

The git remote command allows us to create, view and delete connections to other repositories.

Remote connections are like bookmarks to other repositories.

Other git commands (git fetch, git push, git pull) use these bookmarks to carry out their syncing responsibilities.

Branching Models

Overview

Teaching: 0 min
Exercises: 0 min

Questions

What is a branching model?

Why do you need one?

What are the most common branching models?

Objectives

Learn about the importance of a branching model.

What is a branching model/strategy?

Branches are primarily used as a means for teams to develop features, giving them a separate workspace for their code. These branches are usually merged back to a main branch upon completion of work. (You may come across the term ‘master’, rather than ‘main’. This used to be the usual term, but is now gradually becoming less common.) In this way, features (and any bugs and bug fixes) are kept apart from each other, allowing you to fix mistakes more easily.

This means that branches protect the mainline of code and any changes made to any given branch don’t affect other developers.

A branching strategy, therefore, is the strategy that software development teams adopt when writing, merging and deploying code when using a version control system.

It is essentially a set of rules that developers can follow to stipulate how they interact with a shared codebase.

Such a strategy is necessary as it helps keep repositories organized, avoiding errors in the application and the dreaded “merge hell” that occurs when multiple developers work simultaneously and all add their changes at the same time. Frequent merge conflicts of this kind eventually discourage developers from combining their contributions.

Thus, adhering to a branching strategy will help solve this issue so that developers can work together without stepping on each other’s toes. In other words, it enables teams to work in parallel to achieve faster releases and fewer conflicts by creating a clear process when making changes to source control.

When we talk about branches, we are referring to independent lines of code that branch off the main branch, allowing developers to work independently before merging their changes back to the code base.

In this and the following episodes, we outline some of the branching strategies that teams use to organize their workflow, look at their pros and cons, and discuss which strategy to choose based on your needs, objectives and your team’s capabilities.

Why do you need a branching model?

As mentioned above, having a branching model is necessary to avoid conflicts when merging and to allow for the easier integration of changes into the main trunk.

A BRANCHING MODEL AIMS TO:

Enhance productivity by ensuring proper coordination among developers
Enable parallel development
Help organize a series of planned, structured releases
Map a clear path when making changes to software through to production
Maintain a bug-free code where developers can quickly fix issues and get these changes back to production without disrupting the development workflow

Git Branching Models

Some version control systems are Very Opinionated about the branching models that can be used. git is very much (fortunately or unfortunately) not. This means that there are many different ways to do development in a team and the team needs to explicitly agree on how and when to merge contributions to the main branch. So the first rule of git branching is: “Talk about your branching model.” The second rule is: “Talk about your branching model.” If in doubt, do what other people around you are doing. If they don’t do anything, call a friend.

That said, there are a number of established (and less so) branching models that are used with git. These include, but are not limited to:

Centralized workflow: enables all team members to make changes directly to the main branch. Every change is logged into the history. In this workflow, the contributors do not use other branches. Instead they all make changes on the main branch directly and commit to it. This works for individual developers or small teams which communicate very well, but can be tricky for larger teams: the code is in a constant state of flux and developers keep changes local until they are ready to release.
Trunk-based development: is somewhat similar to the centralized workflow. The development happens on a single branch called trunk. When changes need to be merged, each developer pulls and rebases from the trunk branch and resolves conflicts locally. This can work if small merges are made frequently and is more successful if there is CI/CD.
Feature branch workflow: every small change or “feature” gets its own branch where the developers make changes. Once the feature is done, they submit a merge/pull request and merge it into the main branch. Feature branches should be relatively short-lived. The benefit of this model is that the main branch is not polluted by unfinished features. Good for teams.
Gitflow: is a model where the main development happens in a develop branch with feature branches. When the develop branch is ready for a release (or to go into production), a team member creates a release branch which is tested and eventually merged into the develop and main branches.
GitHub flow: similar to the feature branch workflow. (Further info)
GitLab flow: is a simplified version of Gitflow. (Further info)
Oneflow: is similar to Gitflow but relies on the maintenance of one long-lived branch. It is meant to be simpler, without a develop branch but feature branches still exist. (Further info)
Forking workflow (e.g. astropy): is a model where each contributor creates a fork or a complete copy of the repository. Every contributor effectively has two repositories: their own and the main (upstream) one. Changes are made as pull requests against the main repository. This model is popular with open source projects because the vast majority of contributors do not need to have privileges in the main repository.

A longer description of some of these can be found here.

In summary, there are many different ways to collaborate on a project. Look at the pros and cons and select one that fits the needs and organization of your team and project. In the following several sections we look at some of these models in more detail.

Feature Branch Workflow

While it is technically OK to commit your changes directly to main branch, and you may often find yourself doing so for some minor changes, the best practice is to use a new branch for each separate and self-contained unit/piece of work you want to add to the project. This unit of work is also often called a feature and the branch where you develop it is called a feature branch. Each feature branch should have its own meaningful name - indicating its purpose (e.g. “issue23-fix”). If we keep making changes and pushing them directly to the main branch on GitHub, then anyone who downloads our software from there will get all of our work in progress - whether or not it is ready to use! So, working on a separate branch for each feature you are adding is good for several reasons:

it enables the main branch to remain stable while you and the team explore and test the new code on a feature branch,
it enables you to keep the untested and not-yet-functional feature branch code under version control and backed up,
you and other team members may work on several features at the same time independently from one another, and
if you decide that the feature is not working or is no longer needed - you can easily and safely discard that branch without affecting the rest of the code.
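
As a minimal sketch of the points above, the commands below build a throwaway repository and develop on a feature branch while main stays untouched. The file name analysis.txt, the identity settings and the scratch directory are illustrative assumptions, not part of the lesson repository.

```shell
# Scratch repository in a temporary directory (all names are illustrative).
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main      # call the first branch "main" on any Git version
git config user.name "Example User"
git config user.email "user@example.com"

echo "stable code" > analysis.txt
git add analysis.txt
git commit -q -m "Add stable analysis"

# Start a feature branch with a meaningful name and commit work in progress there.
git checkout -q -b issue23-fix
echo "half-finished fix" >> analysis.txt
git commit -q -a -m "Start fixing issue 23"

# main is untouched: it still has one commit, the feature branch has two.
main_commits=$(git rev-list --count main)
feature_commits=$(git rev-list --count issue23-fix)
echo "main: $main_commits commit(s), issue23-fix: $feature_commits commit(s)"

# If the feature is abandoned, the branch can be discarded without affecting main.
git checkout -q main
git branch -D issue23-fix
```

Deleting with -D is used here only because the branch was never merged; a merged branch can be removed with the safer -d.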

Gitflow Workflow

In the Gitflow workflow, we typically have a main branch which is the version of the code that is tested, stable and reliable. Then, we normally have a development branch (called develop or dev by convention) that we use for work-in-progress code. As we work on adding new features to the code, we create new feature branches that first get merged into develop after a thorough testing process. After even more testing, the develop branch gets merged into main. The points at which feature branches are merged into develop, and develop into main, depend entirely on the practice/strategy established in the team. For example, for smaller projects (e.g. if you are working alone on a project or in a very small team), feature branches sometimes get directly merged into main upon testing, skipping the develop branch step. In other projects, the merge into main happens only at the point of making a new software release. Whichever is the case for you, a good rule of thumb is: nothing that is broken should be in main.

An example is shown in the diagram below.

An example of Gitflow workflow
Adapted from Git Tutorial by sillevl (Creative Commons Attribution 4.0 International License)
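
The flow in the diagram can be sketched in a scratch repository as follows. The branch name feature/plot and the file names are hypothetical; a real team would also run its tests before each merge.

```shell
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main      # call the first branch "main" on any Git version
git config user.name "Example User"
git config user.email "user@example.com"

echo "v1" > code.txt
git add code.txt
git commit -q -m "Initial stable version"

git checkout -q -b develop                 # long-lived integration branch
git checkout -q -b feature/plot            # hypothetical feature branch off develop
echo "new plot" > plot.txt
git add plot.txt
git commit -q -m "Add plotting feature"

# After testing, the feature is merged into develop ...
git checkout -q develop
git merge -q --no-ff feature/plot -m "Merge feature/plot into develop"

# ... and, at release time, develop is merged into main.
git checkout -q main
git merge -q --no-ff develop -m "Release: merge develop into main"
git --no-pager log --graph --oneline
```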

Forking Workflow

The forking workflow is popular among open source software projects and often used in conjunction with a branching model.

The focus of this workflow is to keep the “upstream main” stable while allowing anyone to work on their own contributions independently. Contributions are then suggested and accepted via pull requests. There is not necessarily a develop branch, but you may have release branches.

GitFlow 1

Source: GitHub

In order to understand the forking workflow, let’s first take a look at some special words and roles needed (we’ve already talked about some of these today!):

upstream - Remote repository containing the “true copy”

origin - Remote repository containing the forked copy

Pull request (PR) - Merge request from fork to upstream (a request to add your suggestions to the “original copy”)

Maintainer - Someone with write access to upstream who vets PRs

Contributor - Someone who contributes to upstream via PRs

Release manager - A maintainer who also oversees releases
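
To make the upstream and origin roles concrete, here is a minimal local sketch in which two bare repositories stand in for the upstream project and the contributor's fork on a hosting platform. All directory, branch and user names are assumptions for illustration; the pull request itself is opened on the hosting platform and is not simulated here.

```shell
work=$(mktemp -d) && cd "$work"

# Seed repository, standing in for the existing project history.
git init -q seed
cd seed
git symbolic-ref HEAD refs/heads/main
git config user.name "Maintainer"
git config user.email "maintainer@example.com"
echo "project" > README.md
git add README.md
git commit -q -m "Add README"
cd "$work"

git clone -q --bare seed upstream.git        # "upstream": the true copy
git clone -q --bare upstream.git fork.git    # the contributor's fork of upstream

# The contributor clones the fork (which becomes "origin") and adds upstream as a second remote.
git clone -q fork.git contributor
cd contributor
git config user.name "Contributor"
git config user.email "contributor@example.com"
git remote add upstream "$work/upstream.git"
git fetch -q upstream

# Work on a feature branch started from upstream's main, then push it to the fork.
git checkout -q -b fix-typo upstream/main
echo "fixed" >> README.md
git commit -q -a -m "Fix typo in README"
git push -q origin fix-typo
git remote -v
```

At this point the branch exists only in the fork; upstream changes only when a maintainer accepts the pull request.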

Here is some info about workflows used in a couple of projects as real life examples:

Example release workflow for the astropy Python package

Spacetelescope (STScI) style guide for release workflow

Exercise 1: Suggest your changes via pull request

Earlier in this workshop, you pushed a feature branch up to origin in which you had made a small change to plot_buoys.py. Go to your repository (your fork) on GitHub and find the tab called “Pull requests”. Click the green “new pull request” button. Then find and click the blue link under “Compare changes” called “compare across forks”. Select your username and branch name from the right menus. Then click the big green button under the menus called “create pull request”.

Key Points

A branching model is a pre-agreed way of merging branches into the main branch.

A branching model is needed when multiple contributors are making changes to a single project.

Merging

Overview

Teaching: 0 min
Exercises: 0 min

Questions

How do I merge a branch’s changes?

Objectives

Learn about git merge.

When you are collaborating, you will have to merge branches, whether or not your branch has diverged from the main branch. Most Git hosting platforms, like GitHub or GitLab, allow you to merge a branch from their web interface, but you can also merge branches on your own machine using git merge.

There are 2 ways to merge:

non-fast-forward merge (recommended)
fast-forward merge

Merging diagram.

Reminder: when starting work on a new feature, be careful where you branch from!

git remote add upstream https://github.com/mpi-astronomy/advanced-git-training.git
git fetch upstream
git checkout -b develop upstream/develop

Non-fast-forward Merge

Merges branch by creating a merge commit. Prompts for merge commit message. Ideal for merging two branches.

git checkout main
git merge --no-ff <branch> -m "Message"

The --no-ff flag causes the merge to always create a new commit object, even if the merge could be performed with a fast-forward. This avoids losing information about the historical existence of a feature branch and groups together all commits that together added the feature.

Exercise: Creating a non-fast-forward merge.

Create a new Git repository that has the following tree.

*   69fac81 (main) Merge branch 'gitignore'
|\  
| * 5537012 (gitignore) Add .gitignore
|/  
* 6ec7c0f Add README

Solution

git init
touch README.md
git add README.md
git commit -m 'Add README'
git checkout -b gitignore
touch .gitignore
git add .gitignore
git commit -m "Add .gitignore"
git checkout main
git merge --no-ff gitignore

Fast-forward Merge

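If the main branch has not diverged from your branch (that is, main has no commits that your branch lacks), a “fast-forward” merge can be executed with the command below. This will NOT create a merge commit! The --ff-only flag aborts the merge if it cannot be done as a fast-forward. Ideal for updating a branch from a remote.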

git checkout main
git merge --ff-only <branch>

If you use a fast-forward merge, it is impossible to see from the git history which of the commit objects together implemented a feature. You would have to manually read all the log messages. Reverting a whole feature (i.e. a group of commits) is a true headache in that situation, whereas it is easily done if the --no-ff flag was used.
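
A minimal sketch of that last point, in a scratch repository with hypothetical file names: because the feature was merged with --no-ff, reverting the single merge commit backs out every commit of the feature at once.

```shell
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main      # call the first branch "main" on any Git version
git config user.name "Example User"
git config user.email "user@example.com"

echo "base" > base.txt
git add base.txt
git commit -q -m "Add base"

# A feature made of two commits.
git checkout -q -b feature
echo "one" > f1.txt
git add f1.txt
git commit -q -m "Feature part 1"
echo "two" > f2.txt
git add f2.txt
git commit -q -m "Feature part 2"

git checkout -q main
git merge -q --no-ff feature -m "Merge branch 'feature'"

# The merge commit groups both feature commits, so one revert undoes the whole feature.
# -m 1 tells git revert to treat the first parent (main) as the mainline to keep.
git revert --no-edit -m 1 HEAD
ls
```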

For a good illustration of fast-forward merge (and other concepts), see this thread.

Exercise: Creating a fast-forward merge.

Consider the following Git tree
* a78b99f (main) Add title
| * 3d88062 (remote) Add .gitignore
|/  
* 86c4247 Add README
Is it possible to run a fast-forward merge to incorporate the branch remote into main?

Solution

It is not possible to run a fast-forward merge because of commit a78b99f.
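
The situation in this exercise can be reproduced in a scratch repository to see the refusal first-hand. This is a sketch only; the commit hashes will differ from those shown in the tree above.

```shell
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main      # call the first branch "main" on any Git version
git config user.name "Example User"
git config user.email "user@example.com"

touch README.md
git add README.md
git commit -q -m "Add README"

git checkout -q -b remote
touch .gitignore
git add .gitignore
git commit -q -m "Add .gitignore"

git checkout -q main
echo "# Title" > README.md
git commit -q -a -m "Add title"

# main and remote have diverged, so a fast-forward-only merge is refused.
if git merge --ff-only remote 2> /dev/null; then
    ff_result="fast-forwarded"
else
    ff_result="refused"
fi
echo "ff-only merge: $ff_result"
```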

Three-way Merge

Similar to --no-ff, but there may be dragons. Forced upon you when there’s an intermediate change since you branched. May prompt you to manually resolve conflicts.

git merge <branch> [-s <strategy>]

See here for a zillion options (“patience”, “octopus”, etc.). But also, git is only so smart and you are probably smarter.

See here and here for some discussion of merging strategies.
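
As a minimal sketch of a three-way merge that needs manual resolution, the commands below change the same line on two branches and then resolve the resulting conflict. The file settings.txt and its contents are hypothetical.

```shell
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main      # call the first branch "main" on any Git version
git config user.name "Example User"
git config user.email "user@example.com"

echo "colour = blue" > settings.txt
git add settings.txt
git commit -q -m "Add settings"

git checkout -q -b feature
echo "colour = green" > settings.txt
git commit -q -a -m "Use green"

git checkout -q main
echo "colour = red" > settings.txt
git commit -q -a -m "Use red"

# Both branches changed the same line, so the merge stops with a conflict.
git merge feature || echo "merge stopped: conflict in settings.txt"
cat settings.txt                            # shows the <<<<<<<, ======= and >>>>>>> markers
conflict_markers=$(grep -c '^<<<<<<<' settings.txt)

# Resolve by editing the file to the desired content, then stage and commit.
echo "colour = green" > settings.txt
git add settings.txt
git commit -q -m "Merge branch 'feature', keeping green"
```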

Note: there are a number of external tools that have a graphical interface to allow for merge conflict resolution. Some of these include: kdiff3 (Windows, Mac, Linux), Meld (Windows, Linux), P4Merge (Windows, Mac, Linux), opendiff (Mac), vimdiff (for Vim users), Beyond Compare, GitHub web interface. We do not endorse any of them and use at your own risk. In any case, using a graphical interface does not substitute for understanding what is happening under the hood.

Resolving Conflicts

Squashing

Key Points

git merge --no-ff is the best way to merge changes

git merge --ff-only is a good way to pull down changes from remote

Advanced Logging

Overview

Teaching: 0 min
Exercises: 0 min

Questions

How can I find the commits I’m looking for?

Objectives

Learn to search for specific changes

The purpose of any version control system is to record changes to your code. This gives you the power to go back into your project history to see who contributed what, figure out where bugs were introduced, and revert problematic changes. But, having all of this history available is useless if you don’t know how to navigate it. That’s where the git log command comes in.

We’ve already encountered the git log command for displaying commits. But, we can alter this output by passing many different parameters to git log.

The advanced features of git log can be split into two categories: formatting how each commit is displayed, and filtering which commits are included in the output. Together, these two skills give us the power to go back into your project and find any information that you could possibly need.

Let’s grab a repository from the web to start working with

$ git clone git@github.com:sa2c/example-gitflow-modified.git ~/example-gitflow-modified

Formatting Log Output

First, let’s take a look at the many ways in which git log’s output can be formatted. Most of these come in the form of flags that let you request more or less information from git log.

If you don’t like the default git log format, you can use git config’s aliasing functionality to create a shortcut for any of the formatting options discussed below. Please see The git config Command for how to set up an alias.
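
For instance, a sketch of such an alias, defined here for a scratch repository only. The alias name lg is an arbitrary choice; adding --global to the git config call would make it available in every repository.

```shell
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"
echo "hello" > a.txt
git add a.txt
git commit -q -m "Add a.txt"

# Define a shortcut for a formatted log (repository-local configuration).
git config alias.lg "log --graph --oneline --decorate"

# "git lg" now expands to "git log --graph --oneline --decorate".
git --no-pager lg
```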

Oneline

The --oneline flag condenses each commit to a single line. By default, it displays only the commit ID and the first line of the commit message. If we type

$ git log --oneline

We’ll get output which will look something like this:

0e25143 Merge branch 'feature'
ad8621a Fix a bug in the feature
16b36c6 Add a new feature
23ad9ad Add the initial code base

This is very useful for getting a high-level overview of your project.

Diffs

The git log command includes many options for displaying diffs with each commit. Two of the most common options are --stat and -p.

The --stat option displays the number of insertions and deletions to each file altered by each commit (note that modifying a line is represented as 1 insertion and 1 deletion). Let’s try this

$ git log --stat

This is useful when you want a brief summary of the changes introduced by each commit. For example, the following commit added 67 lines to the hello.py file and removed 38 lines:

commit f2a238924e89ca1d4947662928218a06d39068c3
Author: John <john@example.com>
Date: Fri Jun 25 17:30:28 2014 -0500
Add a new feature
hello.py | 105 ++++++++++++++++++++++++-----------------
1 file changed, 67 insertions(+), 38 deletions(-)

The number of + and - signs next to the file name shows the relative number of changes to each file altered by the commit. This gives you an idea of where the changes for each commit can be found.

If you want to see the actual changes introduced by each commit, you can pass the -p option to git log.

$ git log -p

This outputs the entire patch representing that commit:

commit 16b36c697eb2d24302f89aa22d9170dfe609855b
Author: Mary <mary@example.com>
Date: Fri Jun 25 17:31:57 2014 -0500
Fix a bug in the feature
diff --git a/hello.py b/hello.py
index 18ca709..c673b40 100644
--- a/hello.py
+++ b/hello.py
@@ -13,14 +13,14 @@ B
-print("Hello, World!")
+print("Hello, Git!")

For commits with a lot of changes, the resulting output can become quite long and unwieldy. More often than not, if you’re displaying a full patch, you’re probably searching for a specific change. For this, you want to use the pickaxe option.

Graphs

The --graph option draws (using text) a graph representing the branch structure of the commit history. This is commonly used in conjunction with the --oneline and --decorate options to make it easier to see which commit belongs to which branch:

$ git log --graph --oneline --decorate

For a simple repository with just 2 branches, this will produce the following:

* 0e25143 (HEAD, master) Merge branch 'feature'
|\
| * 16b36c6 Fix a bug in the new feature
| * 23ad9ad Start a new feature
* | ad8621a Fix a critical security issue
|/
* 400e4b7 Fix typos in the documentation
* 160e224 Add the initial code base

The asterisk shows which branch the commit was on, so the above graph tells us that the 23ad9ad and 16b36c6 commits are on a topic branch and the rest are on the master branch.

While this is a nice option for simple repositories, you’re probably better off with a more full-featured visualization tool like gitk or Sourcetree for projects that are heavily branched.

Filtering the Commit History

Formatting how each commit gets displayed is only half the battle of learning git log. The other half is understanding how to navigate the commit history. The rest of this episode introduces some of the advanced ways to pick out specific commits in your project history using git log. All of these can be combined with any of the formatting options discussed above.

By Amount

The most basic filtering option for git log is to limit the number of commits that are displayed. When you’re only interested in the last few commits, this saves you the trouble of viewing all the commits in a page.

You can limit git log’s output by including the -<n> option. For example, the following command will display only the 3 most recent commits.

$ git log -3

By Date

If you’re looking for a commit from a specific time frame, you can use the --after or --before flags for filtering commits by date. These both accept a variety of date formats as a parameter. For example, the following command only shows commits that were created after July 1st, 2014 (inclusive):

$ git log --after="2014-7-1"

You can also pass in relative references like “1 week ago” and “yesterday”:

$ git log --after="yesterday"

To search for commits that were created between two dates, you can provide both a --before and --after date. For instance, to display all the commits added between July 1st, 2014 and July 4th, 2014, you would use the following:

$ git log --after="2014-7-1" --before="2014-7-4"

Alternatives

Note that the --since and --until flags are synonymous with --after and --before, respectively.
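
The date filters can be tried out on a scratch repository whose commits carry fixed dates. This is a sketch: the dates are set through the GIT_AUTHOR_DATE and GIT_COMMITTER_DATE environment variables, TZ=UTC is set only to keep the example reproducible, and git rev-list --count is used for counting because it accepts the same filters as git log.

```shell
export TZ=UTC
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

# Three commits, dated well before, inside and well after the window of interest.
for day in 2014-06-20 2014-07-02 2014-07-10; do
    echo "$day" >> log.txt
    git add log.txt
    GIT_AUTHOR_DATE="${day}T12:00:00" GIT_COMMITTER_DATE="${day}T12:00:00" \
        git commit -q -m "Work done on $day"
done

git --no-pager log --oneline --after="2014-07-01" --before="2014-07-04"
in_window=$(git rev-list --count --after="2014-07-01" --before="2014-07-04" HEAD)
since_july=$(git rev-list --count --since="2014-07-01" HEAD)
echo "in window: $in_window, since 1 July: $since_july"
```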

By Author

When you’re only looking for commits created by a particular user, use the --author flag. This accepts a regular expression, and returns all commits whose author matches that pattern. If you know exactly who you’re looking for, you can use a plain old string instead of a regular expression:

$ git log --author="John"

This displays all commits whose author includes the name John. The author name doesn’t need to be an exact match; it just needs to contain the specified phrase.

You can also use regular expressions to create more complex searches. For example, the following command searches for commits by either Mary or John.

$ git log --author="John\|Mary"

Note that the author’s email is also included with the author’s name, so you can use this option to search by email, too.

If your workflow separates committers from authors, the --committer flag operates in the same fashion.

By Message

To filter commits by their commit message, use the --grep flag. This works just like the --author flag discussed above, but it matches against the commit message instead of the author.

For example, if your team includes relevant issue numbers in each commit message, you can use something like the following to pull out all of the commits related to that issue:

$ git log --grep="JRA-224:"

You can also pass in the -i parameter to git log to make it ignore case differences while pattern matching.
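
A small sketch combining the author and message filters, using made-up authors and issue numbers. It uses the -E (extended regular expression) option so that the alternation can be written as "John|Mary"; this is an equivalent spelling chosen for the sketch, not the form used above.

```shell
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

# Three commits by three hypothetical authors.
echo 1 > f.txt
git add f.txt
git commit -q -m "JRA-224: Fix login" --author="John Smith <john@example.com>"
echo 2 >> f.txt
git commit -q -a -m "JRA-225: Update docs" --author="Mary Jones <mary@example.com>"
echo 3 >> f.txt
git commit -q -a -m "jra-224: follow-up fix" --author="Alex Doe <alex@example.com>"

git --no-pager log --format='%h %an: %s' -E --author="John|Mary"
john_or_mary=$(git rev-list --count -E --author="John|Mary" HEAD)
by_email=$(git rev-list --count --author="alex@example.com" HEAD)
exact_case=$(git rev-list --count --grep="JRA-224:" HEAD)
any_case=$(git rev-list --count -i --grep="JRA-224:" HEAD)
echo "John or Mary: $john_or_mary, by email: $by_email"
echo "JRA-224 (exact case): $exact_case, (ignoring case): $any_case"
```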

By File

Many times, you’re only interested in changes that happened to a particular file. To show the history related to a file, all you have to do is pass in the file path. For example, the following returns all commits that affected either the foo.py or the bar.py file:

$ git log -- foo.py bar.py

The -- parameter is used to tell git log that subsequent arguments are file paths and not branch names. If there’s no chance of mixing it up with a branch, you can omit the --.

By Content (Pickaxe)

It’s also possible to search for commits that introduce or remove a particular line of source code. This is called a pickaxe, and it takes the form of -S"<string>". For example, if you want to know when the string Hello, World! was added to any file in the project, you would use the following command:

$ git log -S"Hello, World!"

If you want to search using a regular expression instead of a string, you can use the -G"<regex>" flag instead.

This is a very powerful debugging tool, as it lets you locate all of the commits that affect a particular line of code. It can even show you when a line was copied or moved to another file.
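
A minimal pickaxe sketch on a hypothetical hello.py: the search reports the commit that added the string and the commit that removed it, but not the commit in between that left it alone.

```shell
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

echo 'print("Hello, World!")' > hello.py
git add hello.py
git commit -q -m "Add greeting"

echo '# a comment' >> hello.py
git commit -q -a -m "Add comment"

printf '%s\n' 'print("Hello, Git!")' '# a comment' > hello.py
git commit -q -a -m "Change greeting"

# Commits in which the number of occurrences of the string changed.
git --no-pager log --oneline -S"Hello, World!"
pickaxe_hits=$(git log --format=%s -S"Hello, World!")
```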

By Range

You can pass a range of commits to git log to show only the commits contained in that range. The range is specified in the following format, where <since> and <until> are commit references:

$ git log <since>..<until>

This command is particularly useful when you use branch references as the parameters. It’s a simple way to show the differences between 2 branches. Consider the following command:

$ git log master..feature

The master..feature range contains all of the commits that are in the feature branch, but aren’t in the master branch. In other words, this is how far feature has progressed since it forked off of master.

Detecting a fork in the history using ranges

Note that if you switch the order of the range (feature..master), you will get all of the commits in master, but not in feature. If git log outputs commits for both versions, this tells you that your history has diverged.
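
The two directions of a range can be compared in a scratch repository. In this sketch the branches are called main and feature (hypothetical names), and git rev-list --count is used to count the commits that git log would list for the same range.

```shell
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main      # call the first branch "main" on any Git version
git config user.name "Example User"
git config user.email "user@example.com"

echo "base" > base.txt
git add base.txt
git commit -q -m "Add base"

git checkout -q -b feature
echo "1" > feature.txt
git add feature.txt
git commit -q -m "Feature work 1"
echo "2" >> feature.txt
git commit -q -a -m "Feature work 2"

git checkout -q main
echo "fix" >> base.txt
git commit -q -a -m "Fix on main"

git --no-pager log --oneline main..feature     # commits only on feature
git --no-pager log --oneline feature..main     # commits only on main
only_feature=$(git rev-list --count main..feature)
only_main=$(git rev-list --count feature..main)
echo "only on feature: $only_feature, only on main: $only_main"
```

Both counts are non-zero here, which is the sign that the two histories have diverged.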

Filtering Merge Commits

By default, git log includes merge commits in its output. But, if your team has an always-merge policy (that is, you merge upstream changes into topic branches instead of rebasing the topic branch onto the upstream branch), you’ll have a lot of extraneous merge commits in your project history.

You can prevent git log from displaying these merge commits by passing the --no-merges flag:

$ git log --no-merges

On the other hand, if you’re only interested in the merge commits, you can use the --merges flag:

$ git log --merges

This returns all commits that have at least two parents.

Summary

You should now be fairly comfortable using git log’s advanced parameters to format its output and select which commits you want to display. This gives you the power to pull out exactly what you need from your project history.

These new skills are an important part of your Git toolkit, but remember that git log is often used in conjunction with other Git commands. Once you’ve found the commit you’re looking for, you typically pass it off to git checkout, git revert, or some other tool for manipulating your commit history. So, be sure to keep on learning about Git’s advanced features.

Key Points

We’ve been introduced to powerful logging commands

Identifying breaking commits

Overview

Teaching: 30 min
Exercises: 10 min

Questions

How can I use git to track down problems in code?

Objectives

Learn to identify when and in what commit problems were introduced

Episode setup

First we need to pull down some code from a remote repository. We need an example with some broken code, which can be found in the broken branch of our example repository.

$ cd ~/Desktop

and clone the code

$ git clone git@github.com:NOC-OI/intermediate-git-test-repo.git

and change into the fresh repository and switch to the broken branch.

$ cd intermediate-git-test-repo
$ git switch broken

Tracking down a broken commit

Let’s take a look at the contents of this repository

$ ls

We see a small number of files; let’s have a look inside plot_buoys.py.

$ nano plot_buoys.py

Let’s try to run the code

$ python plot_buoys.py

This clearly has a problem, as expected. Let’s look at the log history to see if we can spot it.

$ git log --oneline

If we looked at this for a while, we could probably spot the commit that might be causing the issue: the commit labelled “changing function to plot_data”. In reality however, finding the problem wouldn’t be this simple. In general, we might not know what file the problem is in, or where in that file. We may have hundreds of files with hundreds of lines each, and no idea where to start looking. Let’s start by looking at the initial commit.

$ git checkout 2890

And see if the plot_buoys.py script runs here.

$ python plot_buoys.py

The file runs with no problems at this earlier commit, so something went wrong somewhere since then. In this section, we will explore ways in which we can investigate the sources of errors. Let’s move back to the head of the broken branch.

$ git checkout broken

Tracking down broken commits with `git blame`

If we know where the problem is in the file, we might ask ourselves what (and who) introduced this problem. What commit introduced this line? Let’s try this with

$ git blame plot_buoys.py

We see that most lines were created in the same two commits, but some were modified in other commits. There are a lot of lines here, let’s focus on the range of lines 57 to 61 (the part not in a function)

$ git blame -L 57,61 plot_buoys.py

That’s better. Let’s take a closer look at the commit on line 61.

$ git show 4445

That’s interesting. We have found a change to that line, but not the one which altered the function name. Let’s try going back a bit in the history with git checkout and do this again.

$ git checkout HEAD~1
$ git blame -L 57,61 plot_buoys.py

This still hasn’t found the commit which renamed the function, let’s try going back further.

$ git checkout HEAD~1
$ git blame -L 57,61 plot_buoys.py

We can see that the problematic line was brought in during commit eecf. Multiple commits after something breaks can make git blame a little harder to use.
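
The behaviour of git blame -L can also be checked on a small, self-contained example. In this sketch the file notes.txt is hypothetical, and the --porcelain option is used only to extract the full commit hash for one line.

```shell
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

printf 'line one\nline two\nline three\n' > notes.txt
git add notes.txt
git commit -q -m "Add notes"

printf 'line one\nline 2 (edited)\nline three\n' > notes.txt
git commit -q -a -m "Edit line two"
edit_commit=$(git rev-parse HEAD)

# Restrict blame to line 2 only: it is attributed to the commit that last changed it.
git --no-pager blame -L 2,2 notes.txt
blamed=$(git blame -L 2,2 --porcelain notes.txt | head -n 1 | cut -d' ' -f1)
echo "line 2 last changed in: $blamed"
```

Lines 1 and 3 would still be attributed to the first commit, since blame only reports the most recent change to each line.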

Challenge: Using git blame across files

We can ask git blame to attempt to track changes across files. For example where code is copied and pasted from one file to another or where files are renamed using git mv. We can do this by specifying the -C option to git blame. Use git blame -C to identify which lines of plot_buoys.py came from another file. Then use git show or git checkout to examine the contents of this file.
Solution
$ git blame -C plot_buoys.py
This came from description.txt in commit 73592708. We can examine this commit with:
$ git show 7359
or
$ git checkout 7359
$ cat description.txt
$ git checkout broken #get back to the head of the branch

Binary searching with Git

We could check out each commit one at a time and test each one, but this is very time consuming. We’d have to do something like this

$ git checkout HEAD~7
$ git checkout HEAD~6
$ git checkout HEAD~5
...
$ git checkout HEAD~3
$ git checkout HEAD~2
$ git checkout HEAD~1

We can do better than this if we choose a half way point between the bad and good commit, check if that is good or bad, and keep choosing a half way point until we find the commit that causes the code to go from good to bad. Git can actually help us do this with the git bisect command. Let’s try it, first let’s make sure we have reset HEAD to the most recent commit on the broken branch.

$ git checkout broken
$ git bisect start

We mark the current commit as bad

$ git bisect bad HEAD

Then we can mark the commit from the merge as good

$ git bisect good 116c

Git will now drop us at a commit half way between the good and the bad commits, which should be commit 2890. We can verify this with

$ git log --oneline broken

We see some commits marked as bad and good, and git has placed us in the middle commit. Now we can test this commit

$ python plot_buoys.py

It works! The code wasn’t broken at this point. Let’s mark this commit as good

$ git bisect good

Great, git has moved us again. Let’s check where we are this time, it should be commit d022

$ git log --oneline broken

The markers for good and bad have moved, because we’ve given bisect more information, and HEAD has been placed between them.

$ python plot_buoys.py

This failed, let’s mark this as a bad commit

$ git bisect bad

We found a bad commit, let’s take a look at where we are now:

$ git log --oneline broken

Git has marked the good and bad commits, but it doesn’t know yet if the previous commit might have been the first bad one. It needs us to check that. Let’s go ahead and do that

$ python plot_buoys.py

This is also a bad commit, let’s mark it

$ git bisect bad

We’ve now only got one commit left so Git automatically identifies the commit which broke things as eecf. Had we marked our good commit one commit earlier then we could have used git bisect good when we came across the first good commit.

Finally, git has found the commit we were looking for and told us where it is. Let’s see where we are

$ git log --oneline broken

Git has marked the relevant commits as bad, but it hasn’t moved us to the first bad commit. It left us in this pending state. Let’s take a look at the content of the breaking commit

$ git show eecf

Git is telling us that the problem was introduced by a change that happened on line 55 of plot_buoys.py where plot_buoy_data() was changed to plot_data(). For us, this was probably a problem that is easy enough to resolve without using bisect, but for a large complex code base when we don’t know where to start, bisect can quickly point us to the change which first caused the problem. Let’s exit the bisect state and go back to the broken branch with

$ git bisect reset

This worked great, and we can go through large numbers of commits with this technique, but there was a lot of typing. Can Git do a better job? It turns out that it can. Let’s look at the return value from Python

$ python plot_buoys.py
$ echo $?

The variable $? is a special shell variable containing the exit status of the last command that was run. In this case it is non-zero, indicating an error. Let’s look at the historic commit

$ git log --oneline
$ git checkout 2890

And test the code

$ python plot_buoys.py
$ echo $?

In this case the script returns 0, indicating success. This is a common convention in Unix scripts, and you can write your own scripts that follow this convention. Git can use this convention to decide if a commit is good or bad. Let’s try it

$ git bisect start HEAD 2890

Once again, git drops us at a commit in the middle of the range. This time, instead of running python plot_buoys.py, we tell Git to run it for us

$ git bisect run python plot_buoys.py

Git does all the boring work for us. Every time it runs the command we gave and gets a zero return value, it marks the commit as good; every time it sees a non-zero value, it marks the commit as bad. It then tells us the first commit it finds which changes the state of the repository from “good” to “bad”. Now that we’re done, we exit again with

$ git bisect reset
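
The whole automated search can be rehearsed on a scratch repository without Python. In this sketch, the file status.txt and the use of grep as the test command are stand-ins for a real test script: grep exits with 0 when it finds “pass” and non-zero otherwise, which is all git bisect run needs.

```shell
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

# Eight commits; the hypothetical bug arrives in commit 5 and stays.
for i in 1 2 3 4 5 6 7 8; do
    if [ "$i" -lt 5 ]; then echo "pass" > status.txt; else echo "fail" > status.txt; fi
    echo "change $i" >> history.txt
    git add status.txt history.txt
    git commit -q -m "Commit $i"
    if [ "$i" -eq 1 ]; then known_good=$(git rev-parse HEAD); fi
    if [ "$i" -eq 5 ]; then culprit=$(git rev-parse HEAD); fi
done

git bisect start HEAD "$known_good" > /dev/null    # bad commit first, then good commit
git bisect run grep -q pass status.txt > /dev/null # exit 0 = good, non-zero = bad
found=$(git rev-parse refs/bisect/bad)             # where bisect recorded the first bad commit
git bisect reset > /dev/null 2>&1
echo "first bad commit: $(git --no-pager log -1 --format='%h %s' "$found")"
```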

One caveat

This is a very powerful debugging tool, but it relies on all your code being in a runnable state, such that Git can automatically identify when this state changes. It works best when used with a branching and merging strategy, to ensure there are no breaking commits on the main branch.

Key Points

git blame can identify when a problem line was introduced.

git bisect can be used to binary search through git history to identify the commit which first introduced a problem.

Overview

Teaching: min
Exercises: min

Questions

Objectives

Setup a repository for the episode

Before we can proceed we need to create a branch with some commits that we will undo in various different ways.

Exercise: Creating a branch.

Create a new branch called hotfix. Create a new file and make 3-4 commits in that file or create 3-4 new files. Check the log to see the hash of the last commit.
Solution
git switch -c hotfix
#or git checkout -b hotfix
touch a.txt
git add a.txt
git commit -m "1st git commit: 1 file"
touch b.txt
git add b.txt 
git commit -m "2nd git commit: 2 file"
touch c.txt
git add c.txt
git commit -m "3rd git commit: 3 file"
git status
git log --oneline 

Amending a Git Commit message

Git Revert

Reverting undoes a commit by creating a new commit. This is a safe way to undo changes, as it has no chance of re-writing the commit history. For example, the following command will figure out the changes contained in the 2nd to last commit, create a new commit undoing those changes, and tack the new commit onto the existing project.

git revert HEAD~1
ls

GitFlow 1

Note that revert only backs out the atomic changes of the ONE specific commit (by default, you can also give it a range of commits but we are not going to do that here, see the help).

git revert does not rewrite history which is why it is the preferred way of dealing with issues when the changes have already been pushed to a remote repository.
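
As a self-contained sketch of the same operation, the commands below recreate three single-file commits in a scratch repository and revert the middle one. The --no-edit option simply accepts the default revert message instead of opening an editor.

```shell
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

for f in a b c; do
    touch "$f.txt"
    git add "$f.txt"
    git commit -q -m "Add $f.txt"
done

# Undo the 2nd to last commit (the one that added b.txt) with a new commit.
git revert --no-edit HEAD~1
ls
git --no-pager log --oneline     # four commits: history is extended, not rewritten
```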

Git Reset

Resetting is a way to move the tip of a branch to a different commit. This can be used to remove commits from the current branch. For example, the following command moves the hotfix branch backwards by two commits.

git checkout hotfix
git reset HEAD~2

GitFlow 1

The two commits that were on the end of hotfix are now dangling, or orphaned commits. This means they will eventually be deleted when git performs a garbage collection (once their reflog entries have expired). In other words, you’re saying that you want to throw away these commits.

By default, git reset removes the commits from the branch but leaves their changes in the working directory as uncommitted changes.

git status
git restore b.txt

git reset is a simple way to undo changes that haven’t been shared with anyone else. It’s your go-to command when you’ve started working on a feature and find yourself thinking, “Oh no, what am I doing? I should just start over.”

Using `git reset` on uncommitted files

In addition to moving the current branch, you can also get git reset to alter the staged snapshot and/or the working directory by passing it one of the following flags:

--soft - The staged snapshot and working directory are not altered in any way.

--mixed - The staged snapshot is updated to match the specified commit, but the working directory is not affected. This is the default option.

--hard - The staged snapshot and the working directory are both updated to match the specified commit.

It’s easier to think of these modes as defining the scope of a git reset operation.
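
The three scopes can be compared side by side in a scratch repository. This sketch commits a hypothetical file new.txt, undoes the commit with each mode in turn, and records what is left in the staging area and on disk.

```shell
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"
echo "base" > base.txt
git add base.txt
git commit -q -m "Add base"

summary=""
for mode in soft mixed hard; do
    echo "data" > new.txt
    git add new.txt
    git commit -q -m "Add new.txt"
    git reset -q "--$mode" HEAD~1                # undo that commit with the given mode
    staged=$(git diff --cached --name-only)      # what is left in the staging area
    if [ -e new.txt ]; then on_disk=yes; else on_disk=no; fi
    echo "--$mode: staged=[$staged] file on disk: $on_disk"
    summary="${summary}${mode}:${staged}:${on_disk} "
done
```

With --soft the file stays staged, with --mixed it stays on disk but is no longer staged, and with --hard it is removed from disk as well.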

To unstage changes that have been staged but not yet committed:

git status
git add c.txt
git status
git reset HEAD
git status

Use `git restore` instead of `git reset` for staged but uncommitted files

The newer git restore command (as suggested by git status) can also be used here, but you need to specify which files to unstage and pass the --staged flag to say that you want to unstage them.

git status
git add c.txt
git restore --staged c.txt
git status

You can add and commit the changes or restore the file.

Reset a single committed file

git reset can also work on a single file:

Let’s first add some contents to our three text files.

echo "a" > a.txt
echo "b" > b.txt
echo "c" > c.txt
git add a.txt b.txt c.txt
git commit -m "added some file contents"

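Now if we want to reset just one of these files to its previous (empty) state, we can pass its path to git reset. Note that this updates only the staged copy of the file; the copy in the working directory keeps its contents, as git status will show.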

git reset HEAD~1 b.txt
git status
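
A minimal sketch of what a path-limited reset touches, using a scratch repository: the staged copy of b.txt goes back to the older (empty) version, while the file on disk is left as it was.

```shell
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

touch a.txt b.txt
git add a.txt b.txt
git commit -q -m "Add empty files"

echo "a" > a.txt
echo "b" > b.txt
git commit -q -a -m "added some file contents"

# Reset only b.txt, and only in the staging area, to its state one commit ago.
git reset -q HEAD~1 b.txt

staged_b=$(git show :b.txt)      # content of b.txt in the staging area (now empty)
working_b=$(cat b.txt)           # content of b.txt on disk (unchanged)
git status --short
```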

Git Checkout: A Gentle Way

We already saw that git checkout is used to move to a different branch, but it can also be used to update the state of the repository to a specific point in the project’s history.

git checkout hotfix
git checkout HEAD~2

GitFlow 1

This puts you in a detached HEAD state. AGHRRR!

Most of the time, HEAD points to a branch name. When you add a new commit, your branch reference is updated to point to it, but HEAD remains the same. When you change branches, HEAD is updated to point to the branch you’ve switched to. All of that means that, in these scenarios, HEAD is synonymous with “the last commit in the current branch.” This is the normal state, in which HEAD is attached to a branch.

The detached HEAD state is when HEAD is pointing directly to a commit instead of a branch. This is really useful because it allows you to go to a previous point in the project’s history. You can also make changes here and see how they affect the project.

echo "Welcome to the alternate timeline, Morty!" > new-file.txt
git add new-file.txt
git commit -m "Create new file"
echo "Another line" >> new-file.txt
git commit -a -m "Add a new line to the file"
git log --oneline

If we want to keep the changes we can create another branch for them.

git branch alt-history
git checkout alt-history

If we didn’t want the changes then we can discard them and recover by switching back to the hotfix branch:

git checkout hotfix
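If you are ever unsure whether HEAD is attached to a branch, git symbolic-ref can tell you. The sketch below uses a scratch repository with placeholder file names and identity; it only illustrates the check and is not part of the lesson's repository.

```shell
# Scratch repository with two commits.
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"   # placeholder identity
echo "one" > file.txt
git add file.txt
git commit -q -m "first"
echo "two" > file.txt
git commit -q -a -m "second"

# git symbolic-ref succeeds only when HEAD points at a branch.
if git symbolic-ref -q HEAD > /dev/null; then before=attached; else before=detached; fi

# Check out a commit rather than a branch name.
git checkout -q HEAD~1

if git symbolic-ref -q HEAD > /dev/null; then after=attached; else after=detached; fi
echo "before: $before, after: $after"
```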

Further reading: https://www.atlassian.com/git/tutorials/resetting-checking-out-and-reverting and http://blog.kfish.org/2010/04/git-lola.html

Exercise: Undoing Changes

Create a new branch called hotfix. Create a new file and make 3-4 commits in that file. Check the log to see the hash of the last commit.
Solution
git status
git log
Revert the last commit that we just inserted. Check the history.
Solution
git revert <hash>
git log
Completely throw away the last two commits [DANGER ZONE!!!]. Check the status and the log.
Solution
git reset HEAD~2 --hard
git status
git log
Undo another commit but leave it in the staging area. Check the status and log.
Solution
git reset --soft HEAD~1
git status
git log
Wrap it up: add and commit the changes.
Solution
git add .
git commit -m "Message"

Comparing hard, mixed and soft resets

So far we’ve been using the default “mixed” option for git reset. This resets the staging area but leaves the working directory alone, so the affected files stay in the working directory as unstaged (or untracked) changes.

As a reminder let’s recreate our three text files, commit them and then reset the commit to HEAD~1.

touch a.txt b.txt c.txt
git add a.txt b.txt c.txt
git commit -m "adding text files"
git reset --mixed HEAD~1

Let’s check what’s in the working directory and see what git status says.

ls
git status

Then let’s confirm the commit is no longer there by looking at git log --oneline.

Now let’s repeat the process but this time we’ll do a soft reset. This should leave the files staged when we run git status.

git add a.txt b.txt c.txt
git commit -m "adding text files"
git reset --soft HEAD~1
git status

Finally let’s try a hard reset. This time the files will be completely deleted from the working directory and the repository. Since the soft reset left the files staged there’s no need to do a git add this time.

git commit -m "adding text files"
git reset --hard HEAD~1
git status
ls

This time we see no evidence that these files ever existed; they are gone from the working directory, the staging area and the repository.
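The three modes can also be compared side by side. The following sketch is one way to do that in a scratch repository; the helper function try_reset and the file demo.txt are hypothetical names introduced only for this illustration.

```shell
# Scratch repository with an empty starting commit to reset back to.
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"   # placeholder identity
git commit -q --allow-empty -m "initial commit"

# Commit a file, undo the commit with the given mode,
# then report where the file ended up.
try_reset() {
    mode=$1
    echo "content" > demo.txt
    git add demo.txt
    git commit -q -m "add demo.txt"
    git reset -q "--$mode" HEAD~1
    state=$(git status --short demo.txt)
    echo "$mode: '${state:-file is gone}'"
    # Clean up so the next run starts from the same point.
    git reset -q --hard
    rm -f demo.txt
}

soft=$(try_reset soft)
mixed=$(try_reset mixed)
hard=$(try_reset hard)
printf '%s\n' "$soft" "$mixed" "$hard"
```

In the short status format, "A" marks a file that is staged and "??" marks a file that is untracked, so the soft reset leaves the file staged, the mixed reset leaves it only in the working directory, and the hard reset removes it.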

Challenge: When to use which type of reset?

Which reset should you use in the following scenarios:

You have made three commits for three small changes and would prefer they were one bigger commit.

You committed a file that you thought fixed a bug but realised soon after that you made a small mistake. You would like to fix the mistake but have the fix as a single commit.

You have committed a file which was accidentally placed in your Git working directory and should have never been there.

Solution

soft: assuming we run git reset --soft HEAD~3, the changes from all three commits are placed in the staging area and a single new commit will record them together.

mixed: we need to make some changes after resetting the commit, so we don’t want the file left in the staging area. We could instead do a soft reset followed by another git add.

hard: we don’t want to keep the file after the reset.

Back in time

You can restrict the action of reset to a file with:
git reset -- filename
Make some changes to a file, add that file to the staging area, and use git reset to undo the action of git add.
Solution

Add changes to a file with
$ git add <file>
then reset the files with
$ git reset -- <file>
or
$ git reset HEAD -- <file>
Note that if we leave out HEAD, Git assumes by default that we want to copy from the HEAD commit.

Without a HEAD

What happens if we do a hard reset, but leave out the place to copy files from, like this:
$ git reset --hard
Can you work out where the files come from? Hint: it may help to make some changes to the files in the current directory first.

Solution

If the origin of the files is not specified, it is assumed to be HEAD by default.

Checkout on files

The checkout command from earlier has an important variant when passed files as arguments. In this case it behaves very differently. Let’s reset our repository to the way it is on the remote server to begin with.

$ git reset --hard origin/main

Let’s make two changes, one to plot_buoys.py and one to README.md. In both cases add a line to the file listing yourself as an author of the file. In the Python file this will need to be a comment. Go ahead and add/commit both changes in a single commit.

$ nano README.md
$ nano plot_buoys.py
$ git add README.md plot_buoys.py
$ git commit -m "Adding author information"

Now, let’s perform a checkout, specifying that we’d like the previous version of the Python file.

$ git checkout HEAD~1 -- plot_buoys.py

What happened? Previously checkout would have moved HEAD.

$ git log --oneline

We’re still on the same commit, HEAD hasn’t moved at all this time. It doesn’t make sense to move HEAD for some files and keep it in the same place for others, that would get confusing very quickly. Only the file copy operations have been performed. Let’s see what effect this has had.

$ git status

The file plot_buoys.py has been copied from the previous commit HEAD~1 into both our working directory and the staging area. We can verify the changes with

$ git diff --staged

The file plot_buoys.py has changed and nothing else has. In this case git checkout with a file behaves very much like we would expect git reset --hard to behave with files. It overwrites the file in the staging area and working directory and discards any changes. For this reason, the following is not a valid command:

$ git reset --hard HEAD~1 -- plot_buoys.py

Git refuses to run it, since it would perform the same operation as the git checkout command.
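The behaviour of checkout with a file can be confirmed in a scratch repository. In this sketch the file script.py and its contents are placeholders standing in for plot_buoys.py.

```shell
# Scratch repository with two versions of one file.
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"   # placeholder identity
echo "v1" > script.py
git add script.py
git commit -q -m "first version"
echo "v2" > script.py
git commit -q -a -m "second version"

# Copy the previous version of the file back, recording HEAD before and after.
head_before=$(git rev-parse HEAD)
git checkout HEAD~1 -- script.py
head_after=$(git rev-parse HEAD)

if [ "$head_before" = "$head_after" ]; then echo "HEAD did not move"; fi
working=$(cat script.py)                    # working directory copy
staged=$(git diff --staged --name-only)     # files staged relative to HEAD
echo "working copy: $working, staged files: $staged"
```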

Reset with files

Using git reset with files allows us to copy specific files from a commit into the staging area, leaving the working directory unchanged. Let’s reset our repository to the way it was at the beginning of this lesson.

$ git reset --hard origin/main

Let’s make some changes to README.md

$ nano README.md

and copy them to the staging area.

$ git add README.md
$ git status

We can use git reset to copy the version in the repository back into the staging area, effectively undoing the add. We can unstage the file with

$ git reset HEAD -- README.md

More recently (as of Git 2.23 in August 2019) the git restore command has been introduced which can also be used to unstage changes and is suggested by git status. Older versions of git suggested the use of git reset.

The equivalent git restore command for the above would have been:

$ git restore --staged README.md

But you may still find a lot of Git tutorials suggesting the use of git reset in this scenario. Either command will work.
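As a quick check that the two approaches leave the repository in the same state, the sketch below unstages the same change once with git reset and once with git restore --staged. It assumes Git 2.23 or newer and uses a scratch repository with placeholder contents.

```shell
# Scratch repository with one committed file.
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"   # placeholder identity
echo "original" > README.md
git add README.md
git commit -q -m "add README"

# Modify and stage the file, then unstage it with git reset.
echo "edited" >> README.md
git add README.md
git reset -q HEAD -- README.md
after_reset=$(git status --short)

# Stage it again, then unstage it with git restore.
git add README.md
git restore --staged README.md
after_restore=$(git status --short)

echo "after reset:   '$after_reset'"
echo "after restore: '$after_restore'"
```

Both status lines should show the file as modified but not staged, and the edit itself remains in the working directory in both cases.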

The dangers of checkout

What happens if you make some modification to README.md, add these changes to the staging area with
$ git add README.md
and then try to checkout the file with
$ git checkout HEAD -- README.md
Can you guess what will happen? Is this potentially dangerous to do?
Solution

The command
$ git checkout HEAD -- <filename>
will overwrite the file <filename> in both the staging area and the working directory, even if it contains staged or unstaged changes. Be careful, as you can lose your changes in this way. Used deliberately, this command is a useful way to undo any changes you may have made to a file.

The way things were

Can you use the checkout command to create a commit which contains the file README.md as it was 3 commits ago? Hint: because some work in the history was done on a pull request HEAD~3 might not get what you expect, use the commit hash instead.
Solution

HEAD~1 actually takes us all the way back to the first commit in the repository, even though there are 3 prior commits in the history. git log --graph --oneline will reveal that some of the history came from another branch. HEAD~N follows only the first parent of each merge commit, so it skips the commits that were merged in from that branch, treating them as if they were one commit. Let’s use git log to find a commit hash instead.
$ git log --oneline
116cdda (HEAD -> main, origin/main, origin/HEAD, hotfix) Merge pull request #1 from NOC-OI/create_initial_script
7a760ff Add some basic instructions to the README and credit to the Intermediate Python Course
6c388d0 Tidy up formatting
a62d779 Write first draft of script to plot buoy locations around UK
6d4fb54 Initial commit
So let’s take commit hash a62d instead. The file can be brought into the working directory and staging area with
$ git checkout a62d -- README.md
All that remains is to create a new commit, with a command such as
$ git commit -m 'README.md as it was 3 commits ago'
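The way HEAD~N steps over merged-in commits can be reproduced with a small merge in a scratch repository. This is only a sketch: the branch name feature and the commit messages are made up for the illustration, and empty commits are used to keep it short.

```shell
# Scratch repository: one commit on the starting branch,
# two commits on a feature branch, then a merge commit.
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"   # placeholder identity
git commit -q --allow-empty -m "Initial commit"
main_branch=$(git symbolic-ref --short HEAD)

git checkout -q -b feature
git commit -q --allow-empty -m "feature work 1"
git commit -q --allow-empty -m "feature work 2"

git checkout -q "$main_branch"
git merge -q --no-ff -m "Merge feature" feature

# HEAD~1 follows the first parent; HEAD^2 is the tip of the merged branch.
first_parent=$(git log -1 --format=%s HEAD~1)
second_parent=$(git log -1 --format=%s HEAD^2)
echo "HEAD~1: $first_parent"
echo "HEAD^2: $second_parent"
git log --graph --oneline
```

Here HEAD~1 lands on the initial commit even though two feature commits appear before it in the plain log, which mirrors the situation in the lesson's repository.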

Without a HEAD

Can you work out what the following command does?
$ git checkout -- README.md
Hint: try making some changes to README.md and running the command.

Solution

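This command discards the unstaged changes to README.md by restoring the file from the staging area. If nothing is staged for the file, this is equivalent to running
$ git checkout HEAD -- README.md
If the commit is not specified, Git copies the file from the staging area, which matches HEAD unless changes have been staged.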
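The two forms only differ when something has been staged. The sketch below makes that visible by giving the commit, the staging area and the working directory three different versions of the file; the contents are placeholders chosen to label each copy.

```shell
# Scratch repository where each area holds a different version of README.md.
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"   # placeholder identity
echo "committed" > README.md
git add README.md
git commit -q -m "add README"
echo "staged" > README.md
git add README.md
echo "unstaged" > README.md

# Without a commit, the file is restored from the staging area.
git checkout -- README.md
without_commit=$(cat README.md)

# With HEAD, both the staging area and the working copy are reset to the commit.
git checkout HEAD -- README.md
with_head=$(cat README.md)

echo "without a commit: $without_commit, with HEAD: $with_head"
```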

Key Points

Issue Tracking

Overview

Teaching: 10 min
Exercises: 10 min

Questions

How do we track issues with code in GitHub?

Objectives

Understand how to track issues with code in GitHub.

Introduction

The strength of online collaboration platforms such as GitHub does not just lie in the ability to share code. They also allow us to track problems with that code, for multiple developers to work on it independently and bring their changes together and to review those changes before they are accepted.

Tracking issues with code

A key feature of GitHub (as opposed to Git itself) is the issue tracker. This provides us with a place to keep track of any problems or bugs in the code and to discuss them with other developers. Sometimes advanced users will also use issue trackers of public projects to report problems they are having (and sometimes this is misused by users seeking help using documented features of the program).

The broken branch of the code from the identifying breaking commits chapter earlier has a bug with a mismatched function name in plot_buoys.py.

Let’s go ahead and create a new issue in our forked GitHub repository to describe this problem. We can find the issue tracker on the “Issues” tab in the top left of the GitHub page for the repository.

Issue tracker missing in GitHub

Sometimes when forking a GitHub repository the issue tracker is disabled. If you do not see an “Issues” tab in your fork of the repository then you can re-enable it by:

Clicking on the Settings button (cog icon in the middle near the top)

Scrolling down to the Features section

Ticking the Issues box

An Issues tab should now appear on the toolbar near the top of the screen.

Click on this and then click on the green “New Issue” button on the right hand side of the screen. We can then enter a title and description of our issue.

A good issue description should include:

What the problem is, including any error messages that are displayed.
What version of the software it occurred with.
Any relevant information about the system running it, for example the operating system being used.
Versions of any dependent libraries.
How to reproduce it.

After the issue is created it will be assigned a sequential ID number.

Write an issue to describe our bug

Create a new issue in your repository’s issue tracker by doing the following:

Go to the GitHub webpage for your code

Click on the Issues tab

Click on the “New issue” button

Enter a title and description for the issue

Click the “Submit Issue” button to create the issue.

Discussing an issue

Once the issue is created, further discussion can take place with additional comments. These can include code snippets and file attachments such as screenshots or logfiles. We can also reference other issues by writing a # symbol and the number of the other issue. This is sometimes used to identify related issues or if an issue is a duplicate.

Closing an issue

Once an issue is solved then it can be closed. This can be done either by pressing the “Close” button in the GitHub web interface or by making a commit which includes the word “fixes”, “fixed”, “close”, “closed” or “closes” followed by a # symbol and the issue number.
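Before pushing, it can be handy to check which local commits mention a closing keyword. The sketch below searches commit messages with git log --grep in a scratch repository. The pattern covers only the keywords listed above and is an illustrative assumption, not GitHub's own matching rule; the closing itself is carried out by GitHub, not by Git.

```shell
# Scratch repository with one ordinary commit and one that references an issue.
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"   # placeholder identity
git commit -q --allow-empty -m "Tidy up formatting"
git commit -q --allow-empty -m "Correcting function name mismatch, Fixes #1"

# List the subjects of commits whose message contains a closing keyword
# followed by an issue number (case-insensitive, extended regular expression).
closing=$(git log -i -E --grep='(close|closes|closed|fixes|fixed) #[0-9]+' --format=%s)
echo "$closing"
```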

Challenge: Fix and close an issue

Fix the issue of the function name mismatch on the broken branch. You could do this either by renaming the function or changing the call. Commit your changes and add the appropriate text to your commit message to close the issue. Push your changes to your forked repository on GitHub. Check the issue tracker and ensure it has closed.
Solution
def plot_buoy_data(figure_name):
Becomes
def plot_data(figure_name):
or
    plot_data("bouys_plot.png")
Becomes
    plot_buoy_data("bouys_plot.png")
git add plot_buoys.py
git commit -m "Correcting function name mismatch, Fixes #1"
git push

Key Points

GitHub includes an issue tracker for each repository where people can describe and discuss issues with code.

Issues can be opened, commented on and closed from the GitHub web interface.

Issues can also be closed in a commit message using ‘fixes’, ‘fixed’, ‘close’, ‘closed’ or ‘closes’ followed by a # symbol and the issue number.

Odd things to know about Files

Overview

Teaching: 0 min
Exercises: 0 min

Questions

Why are (large) binary files a problem in Git?

What is Git LFS?

What are the problems with Git LFS?

Objectives

Understanding that Git is not intended for (large) binary files

Learning about the git lfs commands

Understanding the disadvantages of git lfs

Sometimes, you might want to add non-textual data to your Git repositories. Examples of such use cases in a software project include:

assets for the project documentation like images
test data for your test suite

However, such data is stored in binary formats most of the time. Git’s line-based approach of tracking changes is not suited for this type of data. While Git will work with binary data without any errors, it will internally treat each binary file as a file with one (very long) single line of content. Consequently, if you apply changes to such a file, Git will store the entire file in the commit even if there was a lot of similarity between the two versions of the file. As Git does not “forget” about previous versions of the file, doing this repeatedly and/or with very large files will quickly make your repository grow in size. At some point this will severely impact the performance of all your Git operations from git clone to even git status. It is therefore generally discouraged to use Git to track (large) binary files.

However, the problem of binary files in Git repositories cannot be completely ignored. There is a lot of value for a software project in keeping things together that belong together: documentation assets belong with the documentation they are part of. Therefore we will now explore how to integrate large file handling into Git.

The git lfs subcommand is part of an extension to Git. LFS stands for Large File Storage. It allows you to mark individual files as being large. Git does not apply its normal, line-based approach to tracking changes to these large files, instead they are stored separately and only referenced in the Git data model. During push and pull operations, large files are transmitted separately - requiring the server to support this operation.

For the sake of demonstration, we create a file called report.pdf. We assume that it is a large, binary file in order to show how to handle it with git lfs:

echo "This is a very large report." > report.pdf

Next, we tell Git that this file should be handled by LFS:

git lfs track report.pdf

Tracking "report.pdf"

Having done so, we can inspect the repository and we learn that a new file .gitattributes was added to the repository.

git status

On branch main

Untracked files:
  (use "git add <file>..." to include in what will be committed)
	.gitattributes
	report.pdf

cat .gitattributes

report.pdf filter=lfs diff=lfs merge=lfs -text

Similar to .gitignore this file is part of the repository itself in order to share it with all your collaborators on this project. We therefore craft a commit that contains it:

git add .gitattributes
git commit -m "Setup LFS tracking"
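To check which files a rule in .gitattributes applies to, Git's built-in check-attr command can be used; it does not need the LFS extension to be installed. The sketch below writes the same rule by hand in a scratch repository, and notes.txt is a made-up file name used for comparison.

```shell
# Scratch repository containing only the attribute rule shown above.
repo=$(mktemp -d) && cd "$repo"
git init -q
echo "report.pdf filter=lfs diff=lfs merge=lfs -text" > .gitattributes

# Ask Git which filter applies to each path (the files need not exist).
tracked=$(git check-attr filter -- report.pdf)
untracked=$(git check-attr filter -- notes.txt)
echo "$tracked"
echo "$untracked"
```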

Now, we are ready to add the large file to the repository the same way we would with any other file:

git add report.pdf
git commit -m "Add final report to the repository"

Pushing our commits to the remote repository, we can see in the console output that our LFS data was transferred to the remote server separately.

git push origin main

Uploading LFS objects: 100% (1/1), 17 B | 0 B/s, done.

Tracking with wildcard patterns

LFS tracking is not limited to explicitly spelled out filenames. Instead, wildcard patterns can be passed to git lfs track. However, you should be careful to quote these patterns, as they might otherwise be expanded to existing filenames by your shell. For example, tracking all PDFs with LFS could be achieved with the following command:
git lfs track "*.pdf"
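The effect of quoting can be seen without Git at all. In this sketch the helper count_args is a hypothetical function that simply reports how many arguments it received, standing in for git lfs track.

```shell
# Scratch directory with two PDF files.
dir=$(mktemp -d) && cd "$dir"
touch report.pdf slides.pdf

# Report the number of arguments the command actually receives.
count_args() { echo "$#"; }

unquoted=$(count_args *.pdf)     # the shell expands the pattern first
quoted=$(count_args "*.pdf")     # the pattern is passed through unchanged
echo "unquoted: $unquoted argument(s), quoted: $quoted argument(s)"
```

Unquoted, the command sees the two existing filenames rather than the pattern, so PDFs added later would not be covered.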

Disadvantages of Git LFS

Although git lfs by design solves the problem of storing large files in Git repositories, there are some practical hurdles that you should consider before introducing LFS into your project:

The git lfs command is a separately maintained extension to the Git core. It is therefore not part of most Git distributions, but needs to be installed separately. Using it in your project will require you to educate your users about LFS and how to install it. Depending on your target audience, you should carefully consider whether the benefits outweigh this disadvantage.

Users that do not have git lfs installed will not be notified by Git. They will see the files, but the content will be Git metadata instead of the actual content. Trying to work with those files will typically produce cryptic error messages.

Some hosting providers - most notably GitHub - apply restrictive quotas to LFS storage. On the free plan, GitHub currently allows 1 GB of storage and 1 GB of bandwidth per month. As the bandwidth quota counts every single clone by users, LFS should currently be considered unusable on the GitHub free plan.

Key Points

(Large) binary files can grow the repository size immensely and make it unusable

git lfs is an extension that stores large files outside the Git data model

Use of Git LFS is discouraged in many scenarios.

Publishing and Citing Code

Overview

Teaching: 0 min
Exercises: 0 min

Questions

How do I ensure my code is citable?

Objectives

Understand how to archive code to Zenodo and create a digital object identifier (DOI) for a software project (and include that info in CITATION.cff).

Sharing code openly promotes collaboration, transparency, and innovation by allowing others to review, use, and improve the code. It fosters knowledge exchange, accelerates scientific progress, and enhances the reproducibility of research. Additionally, open sharing encourages community contributions and can lead to better-maintained, more reliable software.

Adding a license and other metadata to our code (covered in the previous episode) are the first steps towards sharing the code publicly. There are several other important steps to consider which we will cover here.

Making the code public

By default repositories created on GitHub are private and only their creator can see them. Since we added an open source license to our repository we probably want to make sure people can actually access it.

To make your repository public, if it is not already, go to your repository on GitHub and click on the Settings link near the top right corner. Then scroll down to the bottom of the page and the “Danger Zone” settings. Click on “Change Visibility” and you should see a message saying “Change to public”. If it says “Change to private” then the repository is already public. You will then be asked to confirm that you indeed want to make the repository public and agree to the warning that the code will now be publicly visible. As a security measure, you will then have to put in your GitHub password.

Transferring to an organisation

Currently our repository is under the GitHub “namespace” of our individual user. This is OK for individual projects where we are the sole or at least the main code author, but for bigger and more complex projects it is common to use a GitHub organisation named after our project. If we are a member of an organisation and have the appropriate permissions then we can transfer a repository from our personal namespace to the organisation’s. This can be done with another option in the “Danger Zone” settings, the “Transfer ownership” button. Pressing this will then prompt us as to which organisation we want to transfer the repository to.

Archiving code to Zenodo and obtaining a DOI

Zenodo is a data archive run by CERN. Anybody can upload datasets up to 50GB to it and receive a Digital Object Identifier (DOI). Zenodo’s definition of a dataset is quite broad and can include code - which gives us a way to obtain a DOI for our software.

Let us now look into how we can archive a GitHub repository to Zenodo. Note that, instead of using the real Zenodo website, we will practice with Zenodo Sandbox.

Zenodo Sandbox

Zenodo Sandbox is a testing environment for Zenodo, a repository for research outputs, allowing users to safely experiment with its features without affecting the live system. It is a clone of Zenodo, created for testing purposes, that works exactly the same way as Zenodo. You can use it for learning, training, experimenting, and preparing uploads without impacting the primary Zenodo repository until you are ready to publish and release your code (or other research outputs) officially. Using the Sandbox also means that the test repositories we use for this course do not receive real DOIs and so do not clutter the DOI space (remember that a DOI, once created, is meant to exist forever).

We can archive our GitHub repository to Zenodo (Sandbox) by doing the following:

Go to the Zenodo Sandbox login page and choose to log in with GitHub.
Authorise Zenodo Sandbox to connect to GitHub.
Go to the GitHub page in your Zenodo Sandbox account. This can be found in the pull down menu with your user name in the top right corner of the screen.
You will now have a list of all of your GitHub repositories. Next to each will be an “On” button. If you have created a new repository you might need to press the “Sync” button to update the list of repositories Zenodo Sandbox knows about.
Press the “On” button for the repository you want to archive. If this was successful you will be told to refresh the page.
The repository should now appear in the list of “Enabled” repositories at the top of the screen, but it does not yet have a DOI. To get one we have to make a “release” on GitHub. Click on the repository and then press the green button to create a release. This will take you to GitHub’s release page where you will be asked to give a title and description of the release. You will also have to create a “tag” for your release - a way of having a friendly name for the version of some code in Git instead of using a long hash code. Often we will create a sequential version number for each release of the software and have the tag name match this, for example v1.0 or just 1.0.
If we now refresh the Zenodo Sandbox page for this repository we will see that it has been assigned a DOI.
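GitHub's release page can create the tag for you, but a tag can also be created from the command line beforehand. The sketch below shows an annotated tag in a scratch repository; the tag name v1.0 follows the example above, and the push command is left commented out because it assumes a remote named origin. Pushing a tag alone may not be enough to trigger archiving, so a release would still need to be drafted on GitHub.

```shell
# Scratch repository with a single commit to tag.
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"   # placeholder identity
git commit -q --allow-empty -m "first release"

# Create an annotated tag for the current commit and confirm it exists.
git tag -a v1.0 -m "Version 1.0"
tagged=$(git describe --tags)
echo "current commit is tagged: $tagged"

# Share the tag with the remote (assumes a remote called origin exists):
# git push origin v1.0
```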

The DOI does not just link to GitHub, Zenodo will have taken a copy (a snapshot) of our repository at the point where we tagged the release. This means that even if we delete it from GitHub or even if GitHub were ever to go away or remove it, there will still be a copy on Zenodo. Zenodo will allow people to download the entire repository (more accurately, its state at the time it was tagged for release) as a single zip file.

Zenodo will have actually created two DOIs for you. One represents the latest version of the software and will always represent the latest if you make more releases. The other is specific to the release you made and will always point to that version. We can see both of these by clicking on the DOI link in the Zenodo page for the repository.

One of the things which is displayed on this page is a badge image that you can copy the link for and add to the README file in your GitHub repository so that people can find the Zenodo version of the repository. If you click on the DOI image in the Details section of the Zenodo page then you will be shown instructions for obtaining a link to the DOI badge in various formats including Markdown. Here is the badge for this repository and the corresponding Markdown:

[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.11869450.svg)](https://doi.org/10.5281/zenodo.11869450)

Archive your repository to Zenodo (Sandbox)

Note: for this exercise, as demonstrated earlier, you should use the Sandbox Zenodo (a version of Zenodo for testing and playing with before minting a real DOI). For real software releases, you should use Zenodo.

Create an account on Zenodo Sandbox that is linked to your GitHub account.

Use Zenodo Sandbox to create a release for your repository and obtain a DOI for it.

Get the link to the DOI badge for your repository and add a link to this image to your README file in Markdown format. Check that this is the DOI for the latest version and not the DOI for a specific version, if not you will be updating this every time you make a release.

Problems with GitHub and Zenodo integration

The integration between GitHub and Zenodo does not interact well with some browser privacy features and extensions. Firefox can be particularly problematic with this and might open new tabs to login to GitHub and then give an error saying: Your browser did something unexpected. Please try again. If the error continues, try disabling all browser extensions. If this happens try disabling the extra privacy features/extensions or using another browser such as Chrome.

Adding a DOI and ORCID to the citation file

Now that we have our DOI it is good practice to include this information in our citation file. Earlier we created a CITATION.cff file with information about how to cite our code. There are a few fields we can add now which are related to the DOI; one of these is the version field, which covers the version number of the software. We can add a DOI to the file in the identifiers section with a type of doi and a value of the DOI itself. Optionally we can also add a date-released field indicating the date we released this software. Here is an updated version of our CITATION.cff from the previous episode with a version number, DOI and release date added.

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: Spacewalks
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Sarah
    family-names: Jaffa
  - given-names: Aleksandra
    family-names: Nenadic
  - given-names: Kamilla
    family-names: Kopec-Harding
repository-code: >-
  https://github.com/YOUR-REPOSITORY-URL/spacewalks.git
abstract: >-
  A Python script to analyse NASA extravehicular activity
  data
keywords:
  - NASA
  - Extravehicular activity
version: 1.0.1
identifiers:
  - type: doi
    value: 10.5281/zenodo.1234
date-released: 2024-06-01

Add a DOI to your citation file

Add the DOI you were allocated in the previous exercise to your CITATION.cff file and then commit and push the updated version to your GitHub repository. If you used the commit field in your CITATION.cff file before to point to a given version of the code - you can now remove it as using the DOI field is better for this job.

Going further with publishing code

We now have our code published online, licensed as open source, archived with Zenodo, accessible via a DOI and with a citation file to encourage people to cite it. What else might we want to do in order to improve how findable, accessible or reusable it is? One further step we could take is to publish the code with a peer reviewed journal. Some traditional journals will accept software submissions, although these are usually as supplementary material for a paper. There are also journals which specialise in research software such as the Journal of Open Research Software, The Journal of Open Source Software or SoftwareX. With these venues, the submission will be the software itself and not a paper, although a short abstract or description of the software is often required.

Key Points
