This lesson is still being designed and assembled (Pre-Alpha version)

Identifying breaking commits

Overview

Teaching: 30 min
Exercises: 10 min
Questions
  • How can I use git to track down problems in code?

Objectives
  • Learn to identify when and in what commit problems were introduced

Episode setup

First we need to pull down some code from a remote repository. We need an example with some broken code, which can be found in the broken branch of our example repository. Start by moving to the Desktop directory

$ cd ~/Desktop

and clone the code

$ git clone git@github.com:NOC-OI/intermediate-git-test-repo.git

then change into the fresh repository and switch to the broken branch.

$ cd intermediate-git-test-repo
$ git switch broken

Tracking down a broken commit

Let’s take a look at the contents of this repository

$ ls

We see a small number of files; let’s have a look inside plot_buoys.py.

$ nano plot_buoys.py

Let’s try to run the code

$ python plot_buoys.py

This clearly has a problem, as expected. Let’s look at the log history to see if we can spot it.

$ git log --oneline

If we looked at this for a while, we could probably spot the commit that might be causing the issue: the commit labelled “changing function to plot_data”. In reality, however, finding the problem wouldn’t be this simple. In general, we might not know which file the problem is in, or where in that file. We may have hundreds of files with hundreds of lines each, and no idea where to start looking. Let’s start by looking at an earlier commit.

$ git checkout 2890

And see if the plot_buoys.py script runs here.

$ python plot_buoys.py

The file runs with no problems at this earlier commit, so something went wrong somewhere after it. In this section, we will explore ways in which we can investigate the sources of errors. Let’s move back to the head of the broken branch.

$ git checkout broken

Tracking down broken commits with git blame

If we know where the problem is in the file, we might ask ourselves what (and who) introduced it: which commit last changed this line? Let’s try this with

$ git blame plot_buoys.py

We see that most lines were created in the same two commits, but some were modified in other commits. There are a lot of lines here, so let’s focus on lines 57 to 61 (the part not in a function)

$ git blame -L 57,61 plot_buoys.py

That’s better. Let’s take a closer look at the commit on line 61.

$ git show 4445

That’s interesting. We have found a change to that line, but not the one which altered the function name. Let’s try going back a bit in the history with git checkout and do this again.

$ git checkout HEAD~1
$ git blame -L 57,61 plot_buoys.py

This still hasn’t found the commit which renamed the function. Let’s try going back further.

$ git checkout HEAD~1
$ git blame -L 57,61 plot_buoys.py

We can see that the problematic line was introduced in commit eecf. When several later commits touch the same line, git blame becomes a little harder to use, because it only reports the most recent commit to change each line.

Challenge: Using git blame across files

We can ask git blame to attempt to track changes across files, for example where code is copied and pasted from one file to another, or where files are renamed using git mv. We can do this by specifying the -C option to git blame. Use git blame -C to identify which lines of plot_buoys.py came from another file. Then use git show or git checkout to examine the contents of this file.

Solution

$ git blame -C plot_buoys.py

This came from description.txt in commit 73592708. We can examine this commit with:

$ git show 7359

or

$ git checkout 7359
$ cat description.txt
$ git checkout broken # get back to the head of the branch

Binary searching with Git

We could check out each commit in turn and test it, but this is very time consuming. We’d have to step through the history one commit at a time, like this

$ git checkout HEAD~7
$ git checkout HEAD~6
$ git checkout HEAD~5
...
$ git checkout HEAD~3
$ git checkout HEAD~2
$ git checkout HEAD~1

We can do better than this if we choose a halfway point between the bad and good commits, check whether it is good or bad, and keep choosing a halfway point until we find the commit that causes the code to go from good to bad. Git can help us do this with the git bisect command. Let’s try it. First, let’s make sure HEAD is at the most recent commit on the broken branch.

$ git checkout broken
$ git bisect start

We mark the current commit as bad

$ git bisect bad HEAD

Then we can mark the commit from the merge as good

$ git bisect good 116c

Git will now drop us at a commit halfway between the good and the bad commits, which should be commit 2890. We can verify this with

$ git log --oneline broken

We see some commits marked as bad and good, and git has placed us in the middle commit. Now we can test this commit

$ python plot_buoys.py

It works! The code wasn’t broken at this point. Let’s mark this commit as good

$ git bisect good

Great, git has moved us again. Let’s check where we are this time; it should be commit d022

$ git log --oneline broken

The markers for good and bad have moved, because we’ve given bisect more information, and HEAD has been placed between them.

$ python plot_buoys.py

This failed, so let’s mark this as a bad commit

$ git bisect bad

We found a bad commit. Let’s take a look at where we are now:

$ git log --oneline broken

Git has marked the good and bad commits, but it doesn’t know yet if the previous commit might have been the first bad one. It needs us to check that. Let’s go ahead and do that

$ python plot_buoys.py

This is also a bad commit, so let’s mark it

$ git bisect bad

We’ve now only got one commit left so Git automatically identifies the commit which broke things as eecf. Had we marked our good commit one commit earlier then we could have used git bisect good when we came across the first good commit.

Finally, git has found the commit we were looking for and told us where it is. Let’s see where we are

$ git log --oneline broken

Git has marked the relevant commits as bad, but it hasn’t moved us to the first bad commit. It left us in this pending state. Let’s take a look at the content of the breaking commit

$ git show eecf

Git is telling us that the problem was introduced by a change on line 55 of plot_buoys.py, where plot_buoy_data() was changed to plot_data(). For us, this problem was probably easy enough to resolve without using bisect, but in a large, complex code base where we don’t know where to start, bisect can quickly point us to the change which first caused the problem. Let’s exit the bisect state and go back to the broken branch with

$ git bisect reset
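The halving strategy that git bisect followed above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Git is implemented: the function name first_bad_commit, the list of commit numbers and the is_bad test are all hypothetical, and the sketch assumes a linear history in which the code stays broken once it breaks.

```python
def first_bad_commit(commits, is_bad):
    """Return (first bad commit, number of commits tested).

    Assumes commits are ordered oldest to newest, the oldest is known
    to be good, the newest is known to be bad, and the code never
    recovers once it has broken.
    """
    good = 0                    # index of the newest commit known to be good
    bad = len(commits) - 1      # index of the oldest commit known to be bad
    tests = 0
    while bad - good > 1:
        middle = (good + bad) // 2      # halfway point, as bisect chooses
        tests += 1
        if is_bad(commits[middle]):
            bad = middle                # the break happened here or earlier
        else:
            good = middle               # the break happened after this commit
    return commits[bad], tests


# Hypothetical history of 1000 commits in which commit number 637 broke the code.
history = list(range(1000))
culprit, tests = first_bad_commit(history, lambda commit: commit >= 637)
print(culprit, tests)
```

Each test halves the range still under suspicion, so even this 1000-commit history needs only around ten tests rather than hundreds.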

This worked great, and we can go through large numbers of commits with this technique, but there was a lot of typing. Can Git do a better job? It turns out that it can. Let’s look at the exit status returned by Python

$ python plot_buoys.py
$ echo $?

The variable $? is a special shell variable containing the exit status of the most recently run command. In this case it is non-zero, indicating an error. Let’s look at the historic commit

$ git log --oneline
$ git checkout 2890

And test the code

$ python plot_buoys.py
$ echo $?

In this case the script returns 0, indicating success. This is a common convention in Unix scripts, and you can write your own scripts that follow this convention. Git can use this convention to decide if a commit is good or bad. Let’s try it

$ git bisect start broken 2890

Once again, git drops us at a commit in the middle of the range. This time, instead of running python plot_buoys.py ourselves, we tell Git to run it for us

$ git bisect run python plot_buoys.py

Git does all the boring work for us. Every time it runs the command we gave and gets a zero exit status, it marks the commit as good; every time it sees a non-zero value, it marks the commit as bad. It then tells us the first commit it finds which changes the state of the repository from “good” to “bad”. Now that we’re done, we exit again with

$ git bisect reset
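The exit status convention that git bisect run relies on is not specific to the shell. As a minimal sketch, the Python snippet below starts three short child processes and reports the status each one exits with. The helper name exit_status is hypothetical, and the three one-line programs merely stand in for a script such as plot_buoys.py.

```python
import subprocess
import sys


def exit_status(code):
    """Run a piece of Python code in a new process and return its exit status."""
    result = subprocess.run(
        [sys.executable, "-c", code],
        stdout=subprocess.DEVNULL,   # hide the child's output; we only want the status
        stderr=subprocess.DEVNULL,
    )
    return result.returncode


print(exit_status("print('all fine')"))         # finishes normally
print(exit_status("undefined_function()"))      # stops with an uncaught NameError
print(exit_status("import sys; sys.exit(3)"))   # chooses its own status
```

The first process exits with status 0, the second with a non-zero status because of the uncaught exception, and the third with the status it asked for. A call to a function that no longer exists, as in the broken plot_buoys.py, falls into the second case.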

One caveat

This is a very powerful debugging tool, but it relies on all your code being in a runnable state, such that Git can automatically identify when this state changes. It works best when used with a branching and merging strategy, to ensure there are no breaking commits on the main branch.
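When some commits cannot be tested at all, git bisect run offers a way out: a command that exits with status 125 tells Git to skip that commit rather than mark it good or bad. The following is a minimal sketch of a helper that uses this, assuming a commit is untestable when the script is missing; the name check_commit and that rule are illustrative assumptions, not part of the example repository.

```python
import os
import subprocess
import sys

SKIP = 125  # exit status that git bisect run interprets as "skip this commit"


def check_commit(script="plot_buoys.py"):
    """Return an exit status for git bisect run: 0 good, 1 bad, 125 skip."""
    if not os.path.exists(script):
        # Assumption: without the script there is nothing to judge in this commit.
        return SKIP
    result = subprocess.run([sys.executable, script])
    # Collapse every failure into 1 so it always counts as "bad".
    return 0 if result.returncode == 0 else 1


# In a real helper script the last line would be: sys.exit(check_commit())
```

Saved in a file of its own with that final line added, the helper could be given to git bisect run in place of python plot_buoys.py.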

Key Points

  • git blame can identify when a problem line was introduced.

  • git bisect can be used to binary search through git history to identify the commit which first introduced a problem.