(This is item 2 in the code sharing cookbook)
Joel Spolsky used what seems to be his last blog post to talk about Git and Mercurial. I like his description of their main benefit as being that they track changes rather than revisions, and like him, I don’t particularly like the classification of them as distributed version control systems. As I’ve mentioned before, the ‘distributed’ bit isn’t what makes them great. In this post, I’ll try to explain why I think that Git is a great VCS, especially for sharing code between multiple teams – I’ve never used Mercurial, so I can’t have any opinions on it. I will use SVN as the counter-example of an older version control system, but I think that in most of the places where I mention SVN, it could be replaced by any other ‘centralised’ VCS.
The by far biggest reason to use Git when sharing code is its support for branching and merging. The main issue at work here is the conflict between two needs: teams need to have complete control of their code and environments in order to be effective in developing their features, and the overall need to detect and resolve conflicting changes as quickly as possible. I’ll probably have to explain a little more clearly what I mean by that.
Assume that Team Red and Team Blue are both working on the same shared library. If they push their changes to the exact same central location, they are likely to interfere with each other. Builds will break, bugs will be introduced in parts of the code supposedly not touched, larger changes may be impossible to make and there will be schedule conflicts – what if Team Blue commits a large and broken change the day before Team Red is going to release? So you clearly want to isolate teams from each other.
On the other hand, the longer the two teams’ changes are isolated, the harder it is to find and resolve conflicting changes. Both volume of change and calendar time are important here. If the volume of changes made in isolation is large and the code doesn’t work after a merge, the volume of code to search in order to figure out the problem is large. This of course makes it a lot harder to figure out where the problem is and how to solve it. On top of that, if a large volume of code has been changed since the last merge, the risk that a lot of code has been built on top of a faulty foundation is higher, which means that you may have wasted a lot of effort on something you’ll need to rewrite.
To explain how long calendar time periods between merges are a problem, imagine that it takes maybe a couple of months before a conflict between changes is detected. At this time, the persons who were making the conflicting changes may no longer remember exactly how the features were supposed to work, so resolving the conflicts will be more complicated. In some cases, they may be working in a totally different team or even have left the company. If the code is complicated, the time when you want to detect and fix the problem is right when you’re in the middle of making the change, not even a week or two afterwards. Branches represent risk and untestable potential errors.
So there is a spectrum between zero isolation and total isolation, and it is clear that the extremes are not where you want to be. That’s very normal and means you have a curve looking something like this:
You have a cost due to team interference that is high with no isolation and is reduced by introducing isolation, and you have a corresponding cost due to the isolation itself that goes up as you isolate teams more. Obviously the exact shape of the curves is different in different situations, but in general you want to be at some point between the extremes, close to the optimum, where teams are isolated enough for comfort, yet merges happen soon enough to not allow the conflict troubles to grow too large.
So how does all that relate to Git? Well, Git enables you to fine-tune your processes on the X axis in this diagram by making merges so cheap that you can do them as often as you like, and through its various features that make it easier to deal with multiple branches (cherry-picking, the ability to identify whether or not a particular commit has gone into a branch, etc.). With SVN, for instance, the costs incurred by frequent merges are prohibitive, partly because making a single merge is harder than with Git, but probably even more because SVN can only tell that there is a difference between two branches, not where the difference comes from. This means that you cannot easily do intermediate merges, where you update a story branch with changes made on the more stable master branch in order to reduce the time and volume of change between merges.
At every SVN merge, you have to go through all the differences between the branches, whereas Git’s commit history for each branch allows you to remember choices you made about certain changes, greatly simplifying each merge. So during the second merge, at commit number 5 in the master branch, you’ll only need to figure out how to deal with (non-conflicting) changes in commits 4 and 5, and during the final merge, you only need to worry about commits 6 and 7. In all, this means that with SVN, you’re forced closer to the ‘total isolation’ extreme than you would probably want to be.
Working with Git has actually totally changed the way I think about branches – I used to say something along the lines of ‘only branch in extreme situations’. Now I think having branches is a good, normal state of being. But the old fears about branching are not entirely invalidated by Git. You still need to be very disciplined about how you use branches, and for me, the main reason is that you want to be able to quickly detect conflicts between them. So I think that branches should be short-lived, and if that isn’t feasible, that relatively frequent intermediate merges should be done. At Shopzilla, we’ve evolved a de facto branching policy over a year of using Git, and it seems to work quite well:
- Shared code with a low rate of change: a single master branch. Changes to these libraries and services are rare enough that two teams almost never make them at the same time. When they do, the second team that needs to make changes to a new release of the library creates a story branch and the two teams coordinate about how to handle merging and releasing.
- Shared code with a high rate of change: semi-permanent team-specific branches and one team has the task of coordinating releases. The teams that work on their different stories/features merge their code with the latest ‘master’ version and tell the release team which commits to pick up. The release team does the merge and update of the release branch and both teams do regression QA on the final code before release. This happens every week for our biggest site.
- Team-specific code: the practice in each team varies but I believe most teams follow similar processes. In my team, we have two permanent branches that interleave frequently: release and master, and more short-lived branches that we create on an ad-hoc basis. We do almost all of our work on the master branch. When we’re starting to prepare a release (typically every 2-3 weeks or so), we split off the release branch and do the final work on the stories to be released there. Work on stories that didn’t make it into the release goes onto the master branch as usual. It is common that we have stories that we put on story-specific branches, when we don’t believe that they will make it into the next planned release and thus shouldn’t be on master.
The diagram above shows a pretty typical state of branches for our team. Starting from the left, the work has been done on the master branch. We then split off the release branch and finalise a release there. The build that goes live will be the last one before merging release back into master. In the mean time, we started some new work on the master branch, plus two stories that we know or believe we won’t be able to finish before the next release, so they live in separate branches. For Story A, we wanted to update it with changes made on the release and master branch, so we merged them into Story A shortly before it was finished. At the time the snapshot is taken, we’ve started preparing the next release and the Story A branch has been deleted as it has been merged back into master and is no longer in use. This means that we only have three branches pointing to commits as indicated by the blueish markers.
This blog post is now far longer than I had anticipated, so I’m going to have to cut the next two advantages of Git shorter than I had planned. Maybe I’ll get back to them later. For now, suffice it to say that Git allows you to do great magic in order to fix mistakes that you make and even extracting and combining code from different repositories with full history. I remember watching Linus Torvalds’ Tech Talk about Git and that he said that the performance of Git was such that it led to a quantum change in how he worked. For me working with Git has also led to a radical shift in how I work and how I look at code management, but it’s not actually the performance that is the main thing, it is the whole conceptual model with tracking commits that makes branching and merging so easy that has led to the shift for me. That Git is also a thousand times (that may not be strictly true…) faster than SVN is of course not a bad thing either.