Archive for February, 2010

Centralise Your Sources!

The fact that Git doesn’t require a central repository but can function as a completely distributed system for version control is often touted as its main benefit. People who prefer having a central repository are sometimes told they just don’t get it. I think that the distributed nature of Git might be a beneficial thing for some teams of people, but I don’t see that it works very well for all situations. Maybe I don’t get it (but of course I think I do :)).

Here’s my concrete example of when I think distributed version control is a bad choice. At work, we have 5-6 teams of developers working on part of a fairly large service-oriented system. These teams work on around 50 different source projects that contain the source code for various services and shared libraries. We’re using Scrum, which means that we want to ensure every team is empowered to make the changes it needs to fully implement a front-end feature. This means that every team has the right to make modifications to any one of these 50 projects. There are of course limits to the freedom – the total number of source projects at Shopzilla is over 100 (maybe closer to 200, I never counted), and the remaining source code is outside of the things that me and my colleagues in other teams are working on. But the subset of our systems I am using as an example consists of probably 50 projects that are worked on by around 20 developers in 5 teams with different goals.

A snapshot of some of the Shopzilla Git repositories

As if this picture wasn’t complicated enough, each team typically has a sprint length of 2 weeks after which finished features go live (although one team does releases every week). Some features are larger than what we can finish in one sprint, so we normally have a few branches for the main project and a couple of libraries that are affected by a large feature. So we’ll have the release branch in case we need a hot fix, master for things going into the next release, and 0-2 story branches for longer stories – per team. Naturally the rate of change varies a lot; some libraries are very stable meaning that they will only have one branch, while the top-level projects (typically corresponding to a shopping comparison site) change all the time and are likely to have at least three branches concurrently being worked on.

Of course, in a situation with so many moving parts, you need to be able to:

  • reliably manage what dependencies go into what builds,
  • reliably manage what builds are used where,
  • detect conflicts between changes and teams quickly, and
  • detect regression bugs quickly and reliably.

All that requires quick and easy information exchange between teams about code changes. If we would use distributed version control, we would have to figure out how to consolidate differences between 25-30 developers’ repositories. That is hard and would take a lot of time away from actual coding not to mention adding lots of mistakes due to broken dependencies and configuration management mistakes.

When I say hard, I mean that partly in a mathematical sense. Fully distributed version control is actually an example of a complete graph. That means that the number of possible/necessary communication links between different repositories increases quickly as the number of repositories increases (it is n*(n-1)/2 for those who want the full formula). So, for two people, you only have one link that needs to be updated. If there are three repositories, you need two links, and four repositories leads to six links. When you hit 10 repositories, there are 45 possible ways that code can be pulled, and at 20, 190. This large number of possible ways to get code updates (and get them wrong) is unmanageable. As far as I understand, the practice in teams using distributed source control is to have “code lieutenants” whose role is essentially to divide the big graph of developer repositories into smaller sub-graphs, thereby reducing the number of links.

Distributed source control with lieutenants

In the case above, there are 3 * 5(5 – 1)/2 + 3(3-1) /2 = 33 connections between 15 developers instead of 15(15-1) / 2 = 105.

Now, the way it looks with a centralised system is much simpler as the number of developers increases. This corresponds to a star graph, where the number of connections is equal to the number of developers.

I guess it is conceivable that with say 25 developers you could handle the number of connections required to manage the source code for a few repositories even with distributed code management. But when you have a situation with 50 source projects and a number of branches, I think you need to do everything in your power to remove necessary interaction paths.

Even supposing that the communication problems are less serious than I think they are, I’m unsure of the advantages to using a distributed version control system in a corporate setting. Well, why beat about the bush: I don’t see any. Wikipedia has an article listing some more or less dubious advantages of distributed source control, where many of them relate to an implementation detail of Git rather than distributed source control as such (the use of the local repository that saves a lot of network overhead and speeds up many commands). For me, the main advantage of centralised source control in a corporate setting is that you really want to be able to quickly and reliably confirm that teams that are supposed to be working together actually are. Centralised source control lends itself to continuous integration with heavy usage of automated testing to detect conflicting changes that lead to build failures or regression bugs. Distributed source control doesn’t.

We’re using Git at Shopzilla, and I’m loving it. For me, the main reason not to use Git is that it is hard to understand. I guess one obstacle to understanding Git is that its distributed nature is new and different, but I kind of feel that the importance of that is overstated and therefore it should be deemphasised. The other main hurdle, that it has “a completely baffling user interface that makes perfect sense IF YOU’RE A VULCAN” is a harder problem to get around. But if you work with a tool 8 hours a day, you’ll learn to use it even if it is complicated. What I love about Git isn’t that it is distributed, but that:

  • It is super-fast.
  • It’s architecture is fundamentally right – a collection of immutable objects identified by keys that are derived from the objects (the hash codes), with a set of labels (branches, tags, etc.) pointing into relevant places. This just works for a VCS.
  • Its low-level nature and the correctness of the architecture means that when I make mistakes, I can always fix them. And there seems to be no way of getting a Git repository into a corrupt state even when you manually poke its internals.
  • It excels at managing branches – I think the way we’re working at Shopzilla is right for us, and I think that if we had stayed with SVN, we couldn’t do it.


Git and Maven

There was a recent comment to a bug I posted in the Maven Git SCM Provider that triggered some thoughts. The comment was:

“GIT is a distributed SCM. There IS NO CENTRAL repository. Accept it.

Doing a push during the release process is counter to the GIT model.”

In general, the discussions around that bug have been quite interesting and very different from what I expected when I posted it. My reason for calling it a bug was that an unqualified ‘push‘ tries to push everything in your local git repository to the origin repository. That can fail for some branch that you’ve not kept up to date even if it is a legal operation for the branch that you’re currently doing a release of. Typically, that other branch has moved a bit, so your version is a couple of commits behind. A push in that state will abort the maven release process and leave you with some pretty tricky cleaning up to do (edit: Marta has posted about how to fix that). A lot of people commenting on the bug have made comments about how Git is distributed and therefore push shouldn’t be done at all, or be made optional.

I think that the issue here is that there is an impedance mismatch between Git and Maven. While Git is a distributed version control system – that of course also supports a centralised model perfectly well – the Maven model is fundamentally a centralised one. This is one case where the two models conflict, and my opinion is that the push should indeed happen, just in a way that is less likely to break. The push should happen because when doing a Maven release, supporting Maven’s centralised model is more important than supporting Git’s distributed model.

The main reason why Maven needs to be centralised is the way that artifact versions are managed. If releasing can be done by different people from local repositories without any central coordination, there is a big risk of different people creating artifact versions that are not the same. The act of creating a Maven release is in fact saying that “This binary package is version 2.1 of this artifact, and it will never change”. There should never be two versions of 2.1. Git of course gets around this problem using hashes of the things it version controls instead of sequential numbers, and if two things are identical, they will have the same hash code = the same version number. Maven produces artifacts on a higher conceptual level, where sequential version numbers are important, so there needs to be a central location that determines what is the next version number to use and provides a ‘master’ copy of the published artifacts.

I’ve also thought a bit about centralised versus distributed version management and when the different choices might work, but I think I’ll leave that for another post at another time (EDIT: that time was now). Either way, I think that regardless of the virtues of distributed version management systems like Git, Maven artifacts need to be managed centrally. It would be interesting to think about what a distributed dependency management system would look like…



The Suddenness of Trouble

Sometimes, things go wrong in projects, no matter how carefully we work to prevent them. I’ve noticed that when a crisis hits, people often say (and I always feel) “why didn’t you notice earlier?”. Typically, in hindsight, there’s been multiple signs that things are not going in the right direction, but nobody has taken any forceful action to correct them. And suddenly, we’re in crisis mode, with everybody doing everything possible to fix what looks likely to be a disaster.

It’s a bit like what happens when you slowly pour grains of sand onto a pile. The pile keeps getting higher, then suddenly something breaks and there’s an avalanche.

I wonder if this is good, bad or neither. I think it is probably a little bit bad, it must be better to be able to detect and react to the signs of trouble while they are still signs as opposed to actual trouble.

More to the point, though, I wonder if the avalanches are avoidable. I guess the only way to do so is to always treat any kind of problem as a mini-crisis and act forcefully to deal with it. It seems like the cost of doing that – much of the time one would be chasing ghost problems – is probably too high. So it would be my best guess that having ‘step’ reactions to problems and feeling that ‘I should have realised sooner’ is pretty much the way things should be. Hopefully, with experience, the grains of sand will line up earlier so the avalanches are smaller.