Centralise Your Sources!

The fact that Git doesn’t require a central repository but can function as a completely distributed system for version control is often touted as its main benefit. People who prefer having a central repository are sometimes told they just don’t get it. I think that the distributed nature of Git might be a beneficial thing for some teams of people, but I don’t see that it works very well for all situations. Maybe I don’t get it (but of course I think I do :)).

Here’s my concrete example of when I think distributed version control is a bad choice. At work, we have 5-6 teams of developers working on part of a fairly large service-oriented system. These teams work on around 50 different source projects that contain the source code for various services and shared libraries. We’re using Scrum, which means that we want to ensure every team is empowered to make the changes it needs to fully implement a front-end feature. This means that every team has the right to make modifications to any one of these 50 projects. There are of course limits to the freedom – the total number of source projects at Shopzilla is over 100 (maybe closer to 200, I never counted), and the remaining source code is outside of the things that me and my colleagues in other teams are working on. But the subset of our systems I am using as an example consists of probably 50 projects that are worked on by around 20 developers in 5 teams with different goals.

A snapshot of some of the Shopzilla Git repositories

As if this picture wasn’t complicated enough, each team typically has a sprint length of 2 weeks after which finished features go live (although one team does releases every week). Some features are larger than what we can finish in one sprint, so we normally have a few branches for the main project and a couple of libraries that are affected by a large feature. So we’ll have the release branch in case we need a hot fix, master for things going into the next release, and 0-2 story branches for longer stories – per team. Naturally the rate of change varies a lot; some libraries are very stable meaning that they will only have one branch, while the top-level projects (typically corresponding to a shopping comparison site) change all the time and are likely to have at least three branches concurrently being worked on.

Of course, in a situation with so many moving parts, you need to be able to:

reliably manage what dependencies go into what builds,
reliably manage what builds are used where,
detect conflicts between changes and teams quickly, and
detect regression bugs quickly and reliably.

All that requires quick and easy information exchange between teams about code changes. If we would use distributed version control, we would have to figure out how to consolidate differences between 25-30 developers’ repositories. That is hard and would take a lot of time away from actual coding not to mention adding lots of mistakes due to broken dependencies and configuration management mistakes.

When I say hard, I mean that partly in a mathematical sense. Fully distributed version control is actually an example of a complete graph. That means that the number of possible/necessary communication links between different repositories increases quickly as the number of repositories increases (it is n*(n-1)/2 for those who want the full formula). So, for two people, you only have one link that needs to be updated. If there are three repositories, you need two links, and four repositories leads to six links. When you hit 10 repositories, there are 45 possible ways that code can be pulled, and at 20, 190. This large number of possible ways to get code updates (and get them wrong) is unmanageable. As far as I understand, the practice in teams using distributed source control is to have “code lieutenants” whose role is essentially to divide the big graph of developer repositories into smaller sub-graphs, thereby reducing the number of links.

Distributed source control with lieutenants

In the case above, there are 3 * 5(5 – 1)/2 + 3(3-1) /2 = 33 connections between 15 developers instead of 15(15-1) / 2 = 105.

Now, the way it looks with a centralised system is much simpler as the number of developers increases. This corresponds to a star graph, where the number of connections is equal to the number of developers.

I guess it is conceivable that with say 25 developers you could handle the number of connections required to manage the source code for a few repositories even with distributed code management. But when you have a situation with 50 source projects and a number of branches, I think you need to do everything in your power to remove necessary interaction paths.

Even supposing that the communication problems are less serious than I think they are, I’m unsure of the advantages to using a distributed version control system in a corporate setting. Well, why beat about the bush: I don’t see any. Wikipedia has an article listing some more or less dubious advantages of distributed source control, where many of them relate to an implementation detail of Git rather than distributed source control as such (the use of the local repository that saves a lot of network overhead and speeds up many commands). For me, the main advantage of centralised source control in a corporate setting is that you really want to be able to quickly and reliably confirm that teams that are supposed to be working together actually are. Centralised source control lends itself to continuous integration with heavy usage of automated testing to detect conflicting changes that lead to build failures or regression bugs. Distributed source control doesn’t.

We’re using Git at Shopzilla, and I’m loving it. For me, the main reason not to use Git is that it is hard to understand. I guess one obstacle to understanding Git is that its distributed nature is new and different, but I kind of feel that the importance of that is overstated and therefore it should be deemphasised. The other main hurdle, that it has “a completely baffling user interface that makes perfect sense IF YOU’RE A VULCAN” is a harder problem to get around. But if you work with a tool 8 hours a day, you’ll learn to use it even if it is complicated. What I love about Git isn’t that it is distributed, but that:

It is super-fast.
It’s architecture is fundamentally right – a collection of immutable objects identified by keys that are derived from the objects (the hash codes), with a set of labels (branches, tags, etc.) pointing into relevant places. This just works for a VCS.
Its low-level nature and the correctness of the architecture means that when I make mistakes, I can always fix them. And there seems to be no way of getting a Git repository into a corrupt state even when you manually poke its internals.
It excels at managing branches – I think the way we’re working at Shopzilla is right for us, and I think that if we had stayed with SVN, we couldn’t do it.

Git

This entry was posted on February 25, 2010, 17:56 and is filed under Code Management, Software Development. You can follow any responses to this entry through RSS 2.0. You can leave a response, or trackback from your own site.

#1 by Jakub Narebski on February 28, 2010 - 15:38

You can use centralized workflow with distributed version control systems (DVCS) like Git. Usually it is done by having maintainer pull from leaf contributors (who do not need to push to each other, only pull from central repository).

You can set up Git in such way that more than one developer can push to a single central repository…

- #2 by pettermahlen on March 1, 2010 - 07:45
  
  Jakub: exactly – that’s the way we’re using it at Shopzilla, and that’s the way I think Git should be used (for sure in a commercial setting and based on my more limited experience of open source, in an open source setting, too).
  
#3 by Felipe Cypriano on March 25, 2010 - 11:38

Or you could use a multi-VCS approach, if centralized VCS are a must.

Like having Subversion as the main of everything and use git-svn to interact with svn. I do this because I work remotely and my company uses svn, so I have an local git repository that I push (git svn dcommit) to the company’s central svn.

It makes my work really faster than using only svn.

This article from Martin Fowler is interesting, go to the Multiple VCS section if you don’t wanna read everything.

Petter Måhlén's Blog