Archive for category Code Management
When I was at Jadestone, one of the objectives the CEO explicitly gave me was to come up with a great way to share code between our products. I spent a fair amount of time thinking about and working on how to do that, but I left the company before most of the ideas came to be fully used throughout the company. Just over a year later, I joined Shopzilla, and found that to a very large extent, the same ideas I had introduced as concepts at Jadestone were being used or recommended in practice there. So a lot of the articles I’ve written about code sharing describe ideas and practices from Jadestone and Shopzilla, although focusing specifically on things I personally find important. This one will cover some areas where I think we still have a bit of work to do to really nail things at Shopzilla, although we probably have the beginnings of many of the practices down.
You can’t include all your code in every top-level product you build. This means you’ll need to break your ecosystem down into some sort of sub-structures and pick and choose the parts you want to use in each product. I think it makes sense to have three kinds of overlapping structures: libraries, layers and services. Libraries are the atoms of sharing, and since you use Maven, one library corresponds to a single artifact. A library will build on functionality provided by other libraries, and to manage this, libraries should be organised into layers, where libraries in one layer are more general than those in the layers above. A lot of the time, you’ll want a coarser structure than libraries, for code management, architectural and operational reasons, and that’s when you create a service that provides some sort of function that will be used in the top-level products.
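To make the layering concrete in Maven terms, a library or product in an upper layer simply declares dependencies on artifacts from the layers below it. Here is a minimal sketch – all group and artifact ids are made up for illustration:

```xml
<!-- Fragment of a top-level product's POM; the ids are hypothetical.
     The dependency graph should always point downwards in the layer
     structure, never sideways or upwards. -->
<dependencies>
  <!-- Bottom layer: super-generic infrastructure. -->
  <dependency>
    <groupId>com.mycompany.platform</groupId>
    <artifactId>monitoring-support</artifactId>
    <version>1.2</version>
  </dependency>
  <!-- Middle layer: business-central code shared by all products. -->
  <dependency>
    <groupId>com.mycompany.core</groupId>
    <artifactId>customer-model</artifactId>
    <version>3.0.1</version>
  </dependency>
</dependencies>
```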
I’ve found that a lot of the time, code sharing isn’t planned, but emerges. You start with one product, and then there is an opportunity to create a new product that is similar to the first, so you build it on parts of the first. This means that you normally haven’t got a carefully planned set of libraries with well-defined, thought-out dependencies between them, which can give rise to some problems:
- Over-large and incoherent libraries. Typical indications of this are a high rate of change and an associated high frequency of conflicting changes, forced inclusion of code you don’t really need into certain builds because stuff you do need is packaged together with unrelated code, and difficulties figuring out what dependencies you should have and where to add code for some new feature.
- Shared libraries that contain code that isn’t really a good candidate for sharing. Trying to include that code in different products typically leads to snarled code in the library, with lots of product-specific conditions, or APIs that reflect no clear decision about what their clients should actually use them for. Sometimes, the library will contain some code that is perfect for sharing, and some that definitely isn’t.
- A poor dependency structure: a library that is perfect for sharing might have a dependency on one that you would prefer to leave as product-specific, or there might be circular dependencies between libraries.
Once a library structure is in place and used by multiple teams, it is hard to change because a) it involves making many backwards-incompatible changes, b) it is work that is boring and difficult for developers, c) it gives no short-term value for business owners, and d) it requires cross-team schedule coordination. However, it’s one of those things where the longer you leave it, the higher the accumulated costs in terms of confusion and lack of synergies, so if you’re reasonably sure that the code in question will live for a long time, it is likely to eventually be a worthwhile investment to make. It is possible to do at least some parts of a library restructuring incrementally, although there are, in my experience, some cases where you’ll bring a few people to a total standstill for a month or so while cleaning up some particular mess. That sort of situation is of course particularly hard to get resolved. It requires discipline, coordination, and above all, a clear understanding of the reasons for making the change – those reasons cannot be as simple as ‘the code needs to be clean’, there should be a cost-benefit analysis of some kind if you want to be really professional about it. If you invest a man-month in cleaning up your library structure, how long will it take before you’ve recouped that man-month?
Given that it is possible to go wrong with your shared library structure, what are the characteristics of one that has gone right? Here are a couple more bullet points and a diagram:
- Libraries are coherent and of the right size. There are some opposing forces that affect what “the right library size” is: smaller libraries are awkward to work with from a source management, information management and documentation perspective, since the smaller the libraries are, the more of them you need. This means you’ll have more, or more complicated, IDE projects and build files, and a larger set of things to search in order to find out where a particular feature is implemented. On the other hand, larger libraries suffer from a lack of purpose and coherence, which makes conflicts between teams sharing them more likely, increases their rate of change, and makes it harder to describe what they exist to do. All these things make them less suitable for sharing. I think you want libraries that are as fine-grained as you can make them, without making it too hard to get an overview of which libraries are available, what they are used for, and where to find the code you want to change.
- There’s a clear definition of what type of information and logic goes where in the layered structure. At the bottom, you’ll find super-generic things like logging and monitoring code. Slightly higher up, you’ll typically have things that are very central to the business: normally anything that relates to money or customers, where you’ll want to ensure that all products work the same way. Further upwards, you can find things that are shared within a given product category, and if you have higher level libraries, they are quite likely to be product-specific, so maybe not candidates for sharing at all.
- In addition to the horizontal structure defined by the layers, there is a coarse-grained vertical structure provided by services. Services group together related functions and are usually introduced primarily for operational reasons – for instance, the need to scale a certain set of features independently of some other set. But they also add architectural clarity in that they provide a simpler view of their function set, allowing products to share implementations of certain features without linking in the same code. Services also simplify code sharing by providing isolation: you can have separate teams develop services and clients in parallel as long as you have a sufficiently well-defined service API.
Structuring your code in a way that is conducive to sharing is a good thing, but it is also hard. I particularly struggle with the fact that it is very hard to be agile about it: you can’t easily “inspect and adapt”, because of the difficulty of changing a library structure once it is in place. The best opportunity to put a good structure in place is when you start sharing some code between two products, but at that time it is very hard to foresee future developments (how is product number three going to be similar to or different from products 1 and 2?). Defining a coarse layered structure based on expected ‘genericness’ and making libraries small enough to be coherent is probably the best way to get the structure approximately right.
We’re right now in the middle of something that feels a little like an unplanned experiment in code management. Unplanned in the sense that I didn’t expect us to work the way we do, not so much in the sense that I think it is particularly risky. It started not quite a year ago with my colleague Mateusz suggesting that we should dedicate a fixed percentage of the story points in each sprint to what he called maintenance or technical stories. That soon crystallised into a backlog owned by me, where we schedule stories that are aimed at somehow improving our productivity as opposed to our product. So Paul, the ‘normal’ product owner, is in charge of the main backlog that deals with improving the products, and I am the owner of the maintenance backlog, where we improve the efficiency of how we work by improving processes, tools and removing technical debt. We’re aiming at spending about 15% of our time on technical backlog stories, and we more or less do.
Typical examples of stories that have gone onto our technical backlog are:
- A tool that allows QA to specify that outgoing service calls matching certain regular expressions should return mock data specified in a file rather than actually calling the service. This makes us a lot more productive with regard to verifying site behaviour in certain hard-to-recreate and data-dependent cases.
- Improvements to our performance monitoring systems and tools that make it easier for us to figure out where we have performance problems when we do.
- Auditing and optimising the QA server allocation in order to speed up especially our automated test scripts.
- Various refactoring stories that clean up code where functional evolution has led to the original design no longer being suitable.
That has worked out pretty much as expected: we’ve gained benefits from the productivity improvements and we continue to spend 5-6 times more effort on money-making product improvements than on engineering driven platform-building.
The unexpected thing that has happened is that we’re heading towards a situation where we have different design generations that solve similar problems. As an example, the original pattern we used to create Spring MVC controllers has broken down, so we’ve come up with a new one that is better though not yet perfect. In order to have stories small enough to complete within one iteration, we’ve had to apply this pattern on a controller-by-controller basis – each refactoring has taken about 2 weeks of calendar time so far, except the first one which took about 4, so the effort isn’t trivial. There are nearly 40 controller implementations in our site code and about 95% of the incoming traffic is handled by 4 of those. We’re now at a stage where we’ve refactored three of those four controllers. Given that the other 35 or so controllers a) don’t serve a lot of traffic, so don’t have a lot of business value and b) don’t get changed a lot because the functions they provide aren’t ones that we need to innovate in, I don’t feel like refactoring all of them. In fact, the next refactoring story in the backlog is aimed at a different area in the site, where a repeated pattern has broken down in a similar way.
My initial gut reaction was that we should apply any new pattern for a commonly occurring situation across the board, then tackle the next similar situation. That keeps the code clean and makes it easy to find your way around. But the point of refactoring is that it must be an investment that you can recoup, and if we haven’t spent more than 4 hours working on a particular controller in the last year, what are the chances of ever recouping an investment of a man-week? We would probably have to keep using the same controller for at least 20 years, assuming the refactoring made us twice as productive, and that doesn’t seem likely to happen. Given that, it seems like the best option is to focus refactoring efforts where they give return on investment, which is those parts of the code that you do most of your work in.
I think we’re more or less permanently stuck with two different generations of controller patterns. But it will be interesting to see what will happen over the next year or two – can this super-pattern of having annual growth rings of standardised solution patterns handle more than two generations? I’m not sure, but I believe it can.
Maven’s slow progress towards becoming the most accepted Java build tool seems to continue, although a lot of people are still annoyed enough with its numerous warts to prefer Ant or something else. My personal opinion is that Maven is the best build solution for Java programs that is out there, and as somebody said – I’ve been trying to find the quote, but I can’t seem to locate it – when an Ant build is complicated, you blame yourself for writing a bad build.xml, but when it is hard to get Maven to behave, you blame Maven. With Ant, you program it, so any problems are clearly due to a poorly structured program. With Maven you don’t tell it how to do things, you try to tell it what should be done, so any problems feel like the fault of the tool. The thing is, though, that Maven tries to take much more responsibility for some of the issues that lead to complex build scripts than something like Ant does.
I’ve certainly spent a lot of time cursing poorly written build scripts for Ant and other tools, and I’ve also spent a lot of time cursing Maven when it doesn’t do what I want it to. But the latter time is decreasing as Maven keeps improving as a build tool. There have been lots of attempts to create other tools that are supposed to make builds easier than Maven, but from what I have seen, nothing has yet really succeeded in providing a clearly better option (I’ve looked at Buildr and Raven, for instance). I think the truth is simply that the build process for a large system is a complex problem to solve, so one cannot expect it to be free of hassles. Maven is the best tool out there for the moment, but will surely be replaced by something better at some point.
So, using Maven isn’t going to be problem-free. But it can help with a lot of things, particularly in the context of sharing code between multiple teams. The obvious thing it helps with is the single benefit most people agree that Maven has – its way of managing dependencies, and the massive repository infrastructure and dependency database that is just available out there. On top of that, building Maven projects in Hudson is dead easy, and there is a whole slew of really nice tools with Maven plugins that give you all kinds of reports and metadata about your code. My current favourite is Sonar, which is great if you want to keep track of how your code base evolves from some kind of aggregated perspective.
Here are some things you’ll want to do if you decide to use Maven for the various projects that make up your system:
- Use Nexus as an internal repository for build artifacts.
- Use the Maven Release plugin to create releases of internal artifacts.
- Create a shared POM for the whole code base where you can define shared settings for your builds.
The word ‘repository’ is a little overloaded in Maven, so it may be confusing. Here’s a diagram that explains the concept and shows some of the things that a repository manager like Nexus can help you with:
The setup includes a Git server (because you use Git) for source control, a Hudson server (or several) that does continuous integration, a Nexus-managed artifact repository and a developer machine. The Nexus server has three repositories in it: internal releases, internal snapshots and a cache of external repositories. The latter is only there as a performance improvement. The other two are the way that you distribute Maven artifacts within your organisation. When a Maven build runs on the Hudson or developer machines, Maven will use artifacts from the local repository on the machine – by default located in a folder under the user’s home directory. If a released version of an artifact isn’t present in the local repository, it will be downloaded from Nexus, and snapshot versions will periodically be refreshed, even if present locally. In the example setup, new snapshots are typically deployed to the Nexus repository by the Hudson server, and released versions are typically deployed by the developer producing the release. Note that both Hudson and developers are likely to install snapshots to the local repository.
I’ve tried a couple of other repository managers (Archiva, Artifactory and Maven-Proxy), but Nexus has been by a pretty wide margin the best – robust, easy to use and easy to understand. It’s been a year or two since I looked at the other ones, so they may have improved since.
Having an internal repository enables code sharing by providing a uniform mechanism for distributing updated versions of internal libraries using the standard Maven deploy command. Maven has two types of artifact versions: releases and snapshots. Releases are assumed to be immutable and snapshots mutable, so updating a snapshot in the internal repository will affect any build that downloads the updated snapshot, whereas releases are supposed to be deployed to the internal repository once only – any subsequent deployments should deploy something that is identical. Snapshots are tricky, especially when branching. If you create two branches of the same library and fail to ensure that the two branches have different snapshot versions, the two branches will interfere.
There is interference between the two branches because they both create updates to the same artifact in the Maven repositories. Depending on the ordering of these updates, builds may succeed or fail seemingly at random. At Shopzilla, we typically solve this problem in two ways: for some shared projects, where we have long-lived/permanent team-specific branches, the team name is included in the version number of the artifact, and for short-lived user story branches, the story ID is included in the version number. So if I need to create a branch off of version 2.3-SNAPSHOT for story S3765, I’ll typically label the branch S3765 and change the version of the Maven artifact to 2.3-S3765-SNAPSHOT. The Maven release plugin has a command that simplifies branching, but for whatever reason, I never seem to use it. Either way, being careful about managing branches and Maven versions is necessary.
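The version-naming convention is mechanical enough to script. Here is a quick sketch of the derivation, using the version number and story ID from the example above:

```shell
# Derive a story-branch snapshot version from the trunk version by
# inserting the story ID before the -SNAPSHOT qualifier.
base="2.3-SNAPSHOT"
story="S3765"
branch_version="${base%-SNAPSHOT}-${story}-SNAPSHOT"
echo "$branch_version"   # prints 2.3-S3765-SNAPSHOT
```

The same derivation works for long-lived team branches: substitute the team name for the story ID.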
A situation where I do use the Maven Release plugin a lot is when making releases of shared libraries. I advocate a workflow where you make a new release of your top-level project every time you make a live update, and because you want to make live updates frequently and you use Scrum, that means a new Maven release with every iteration. To make a Maven release of a project, you have to eliminate all snapshot dependencies – a necessary requirement for immutability – so releasing the top-level project means making release versions of all its updated dependencies. Doing this frequently reduces the risk of interference between teams by shortening the ‘checkout, modify, checkin’ cycle.
See the example below for the hands-on pom.xml settings needed to enable the release plugin.
The final tip for code sharing using Maven is to use a shared parent POM that contains settings that should be shared between projects. The main reason is of course to reduce duplication – any build file is code, and Maven build files are not as easy to understand as one would like, so simplifying them is very valuable. Here’s some stuff that I think should go into a shared pom.xml file:
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.mycompany</groupId>
  <artifactId>shared-pom</artifactId>
  <name>Company Shared pom</name>
  <version>1.0-SNAPSHOT</version>
  <packaging>pom</packaging>

  <!-- One of the things that is necessary in order to be able to use the
       release plugin is to specify the scm/developerConnection element.
       I usually also specify the plain connection, although I think that is
       only used for generating project documentation, a Maven feature I
       don't find particularly useful personally. A section like this needs
       to be present in every project for which you want to be able to use
       the release plugin, with the project-specific Git URL. -->
  <scm>
    <connection>scm:git:git://GITHOST/GITPROJECT</connection>
    <developerConnection>scm:git:git://GITHOST/GITPROJECT</developerConnection>
  </scm>

  <build>
    <!-- Use the plugins section to define Maven plugin configurations that
         you want to share between all projects. -->
    <plugins>
      <!-- Compiler settings that are typically going to be identical in all
           projects. With a name like Måhlén, you get particularly sensitive
           to using the only useful character encoding there is... ;) -->
      <plugin>
        <artifactId>maven-compiler-plugin</artifactId>
        <configuration>
          <source>1.6</source>
          <target>1.6</target>
          <encoding>UTF-8</encoding>
        </configuration>
      </plugin>

      <!-- Tell Maven to create a source bundle artifact during the package
           phase. This is extremely useful when sharing code, as the act of
           sharing means you'll want to create a relatively large number of
           smallish artifacts, so creating IDE projects that refer directly
           to the source code is unmanageable. But the Maven integration of
           a good IDE will fetch the Maven source bundle if available, so if
           you navigate to a class that is included via Maven from your
           top-level project, you'll still see the source version - and even
           the right source version, because you'll get what corresponds to
           the binary that has been linked. -->
      <plugin>
        <artifactId>maven-source-plugin</artifactId>
        <executions>
          <execution>
            <phase>package</phase>
            <goals>
              <goal>jar</goal>
            </goals>
          </execution>
        </executions>
      </plugin>

      <!-- Ensure that a javadoc jar is generated and deployed. This is
           useful for similar reasons as source bundle generation, although
           to a lesser degree in my opinion. Javadoc is great, but the
           source is always up to date. -->
      <plugin>
        <artifactId>maven-javadoc-plugin</artifactId>
        <executions>
          <execution>
            <phase>package</phase>
            <goals>
              <goal>jar</goal>
            </goals>
          </execution>
        </executions>
      </plugin>

      <!-- The configuration below was necessary to be able to use the Maven
           Release plugin with Git as a version control system. The exact
           version numbers you want to use are likely to have changed since
           then, and it may even be that Git support is more closely
           integrated nowadays, so less explicit configuration is needed -
           I haven't tested that since maybe March 2009. -->
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-release-plugin</artifactId>
        <dependencies>
          <dependency>
            <groupId>org.apache.maven.scm</groupId>
            <artifactId>maven-scm-provider-gitexe</artifactId>
            <version>1.1</version>
          </dependency>
          <dependency>
            <groupId>org.codehaus.plexus</groupId>
            <artifactId>plexus-utils</artifactId>
            <version>1.5.7</version>
          </dependency>
        </dependencies>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-scm-plugin</artifactId>
        <version>1.1</version>
        <dependencies>
          <dependency>
            <groupId>org.apache.maven.scm</groupId>
            <artifactId>maven-scm-provider-gitexe</artifactId>
            <version>1.1</version>
          </dependency>
          <dependency>
            <groupId>org.codehaus.plexus</groupId>
            <artifactId>plexus-utils</artifactId>
            <version>1.5.7</version>
          </dependency>
        </dependencies>
      </plugin>
    </plugins>
  </build>

  <!-- Configuration of internal repositories so that the sub-projects know
       where to download internally created artifacts from. Note that due
       to a bootstrapping issue, this configuration needs to be duplicated
       in individual projects. This file, the shared POM, is available from
       the Nexus repo, but if the project POM doesn't contain the repo
       config, the project build won't know where to download the shared
       POM. -->
  <repositories>
    <!-- Internal Nexus repository for released artifacts. -->
    <repository>
      <id>internal-releases</id>
      <url>http://NEXUSHOST/nexus/content/repositories/internal-releases</url>
      <releases><enabled>true</enabled></releases>
      <snapshots><enabled>false</enabled></snapshots>
    </repository>
    <!-- Internal Nexus repository for SNAPSHOT artifacts. -->
    <repository>
      <id>internal-snapshots</id>
      <url>http://NEXUSHOST/nexus/content/repositories/internal-snapshots</url>
      <releases><enabled>false</enabled></releases>
      <snapshots><enabled>true</enabled></snapshots>
    </repository>
    <!-- Nexus repository cache for third party repositories such as
         ibiblio. This is not necessary, but is likely to be a performance
         improvement for your builds. -->
    <repository>
      <id>thirdparty</id>
      <url>http://NEXUSHOST/nexus/content/repositories/thirdparty/</url>
      <releases><enabled>true</enabled></releases>
      <snapshots><enabled>false</enabled></snapshots>
    </repository>
  </repositories>

  <distributionManagement>
    <!-- Defines where to deploy released artifacts to. -->
    <repository>
      <id>internal-repository-releases</id>
      <name>Internal release repository</name>
      <url>URL TO NEXUS RELEASES REPOSITORY</url>
    </repository>
    <!-- Defines where to deploy artifact snapshots to. -->
    <snapshotRepository>
      <id>internal-repository-snapshot</id>
      <name>Internal snapshot repository</name>
      <url>URL TO NEXUS SNAPSHOTS REPOSITORY</url>
    </snapshotRepository>
  </distributionManagement>
</project>
The less pleasant part of using Maven is that you’ll need to learn more about Maven’s internals than you’d probably like, and you’ll most likely stop trying to fix your builds not when you’ve understood the problem and solved it in the way you know is correct, but when you’ve arrived at a configuration that works through trial and error (as you can see from my comments in the example pom.xml above). The benefits you’ll get in terms of simplifying the management of build artifacts across teams and actually also simplifying the builds themselves outweigh the costs of the occasional hiccup, though. A typical top-level project at Shopzilla links in around 70 internal artifacts through various transitive dependencies – managing that number of dependencies is not easy unless you have a good tool to support you, and dependency management is where Maven shines.
(This is item 3 in the code sharing cookbook)
Today, no CV comes along without “Scrum” and “Agile” on it, and Scrum has gained acceptance so quickly that it should definitely set off any hype-warning systems. I personally think there’s a lot more to Scrum than just hype – if you do it right, it’s extremely useful for productivity. There are some aspects of Scrum done right that are particularly valuable for sharing code:
- Empowering teams to deliver end-to-end functionality.
- (Relatively) short iterations.
- Delivering potentially shippable results with every sprint.
The biggest one is empowering teams to deliver fully functioning features. The way I think that affects code sharing is that rather than having your organisation focus on developing technology horizontals, you aim at developing feature verticals.
As much as feasible, Scrum tells you to have teams whose focus is to deliver the blue verticals in the above diagram. The red horizontals will be developed and maintained as a consequence of needs driven by the products. At the other end of the spectrum, your teams are aligned along the libraries or technology components, with each team responsible for one or more services or libraries. Obviously, you can vary your focus from totally red to totally blue, and you get different advantages and disadvantages depending on where you are. At the ‘red’ end of the spectrum, with teams very much aligned along technical component lines, you get the following advantages and disadvantages:
- Teams get a very deep knowledge of their components leading to solid, strong technology.
- Teams get a strong feeling of ownership of components, and can take a long view when developing them.
- You need to align product roadmaps for different products so that they can take advantage of features and changes made in the underlying libraries.
- You will get queues and blockages in the product development, where one team is waiting for the result of another team’s work, or where one team has finished its work and the result “sits on a shelf” until the downstream team is ready to pick it up. (This is what Lean tells you to avoid.)
- Most teams are not customer-facing, meaning that their priorities tend to shift from what is important to the business to what is important to their component. This in turn increases the risk of developing technology for its own sake rather than due to a business need.
At the ‘blue’ end, on the other hand, shared code is collectively owned by multiple teams and you get a situation where:
- Teams rarely need to wait for others in order to get their features finished and launched. This is great for innovation.
- The fact that there is never any waiting and the teams are typically in control of their own destiny is energising: nobody else is to blame for failures or gets credit for successes. Work done leads to something visible. This makes it more rewarding, fun and efficient.
- Features and changes that are developed are typically well aligned with business priorities.
- There is a real risk of under-investing in shared technology. Larger restructuring tasks may never happen because no single team owns the responsibility for technology components.
- There is no roadmap for individual components, which can lead to bloat and sprawl.
- The lack of continuity increases the risk that it is unclear how some feature was intended to work and the reasons why it was implemented in a certain way are more likely to be forgotten.
- Developers (by which I don’t mean just programmers, but all team members) need to be jacks-of-all-trades and risk being masters of none.
I think that the best solution in general is to get redder the deeper you go in the technology stack, because that’s where you have more complex and general technologies that a) need deep knowledge to develop and b) typically don’t affect end-user functionality very directly. I also think that most organisations I’ve seen have been too red. Having product-specific teams that are allowed to make modifications to much of the shared codebase allows you to develop your business-driving products quickly and based on their individual needs. So there should be more focus on products and teams that can develop features end-to-end, just as Scrum tells us!
The next thing that makes Scrum great for sharing code is the combination of time-limited sprints and the focus on delivering finished code at each iteration. The ‘deliver potentially shippable code’ bit seems to be one of the hardest things about Scrum, even though it is actually quite trivial as a concept: if you don’t feel like you could launch the code demoed at the end of the sprint the day after, don’t call the story done, and don’t grant yourself any story points for it. That way, you won’t be able to let anything out of your sight until it is shippable, and your velocity will be reduced until you’re great at getting things really ready – which is exactly right! If it is difficult for your team to take things all the way to potentially shippable because of environmental or process problems, then fix those issues until it is easy.
Actually finishing things is great for productivity in general, but it is even better in a code-sharing context. To illustrate how, I’ll use one of my favourite diagrams:
Assume you have three features to complete, each of which requires three tasks to be done. Each of the tasks takes one day, and you can only do one task at a time. If, as above, you start each feature as early as possible, all the code that is touched by feature A is in a ‘being modified’ state for 7 days (from the start of A1 to the end of A3), and the same applies to features B and C. In the second version below, the corresponding time is 3 days per feature, meaning that the risk that another team will need to make modifications concurrently is less than half of the first version.
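The exposure arithmetic above can be checked with a quick sketch (three features, three one-day tasks each, one task at a time):

```shell
features=3
tasks=3
# Interleaved schedule (A1 B1 C1 A2 B2 C2 A3 B3 C3): feature A's first
# task runs on day 1 and its last on day 7, so its code is in a
# 'being modified' state for (tasks - 1) * features + 1 days.
interleaved=$(( (tasks - 1) * features + 1 ))
# Sequential schedule (A1 A2 A3 B1 B2 B3 C1 C2 C3): each feature's code
# is only in flux for the duration of its own tasks.
sequential=$tasks
echo "interleaved: $interleaved days, sequential: $sequential days"
```

With these numbers, the window in which another team risks concurrent modification shrinks from 7 days to 3 per feature.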
Also, having any updates to shared code completed quickly means that the time between merge opportunities is decreased, which decreases the risks of isolation. You really want to shorten the branch – modify – merge back cycle as much as possible, and Scrum’s insistence on getting things production-ready within the time period of a single sprint (which is supposed to be less than 4 weeks – I’ve found that 2 or 3 weeks seems to work even better in most situations) is a great support in pushing for a short such cycle.
I think Scrum is great in general, and done right, it also helps you with sharing code. Jeff Sutherland said in a presentation I attended that the requirement to produce potentially shippable code with each iteration is the hardest requirement in the Nokia test. I can reluctantly understand why that’s the case. It’s not conceptually hard and unlike some classes of technical problem, it doesn’t require any exceptional talent to succeed with. What makes it hard is that it requires discipline and an environment where it is OK to flag up problems without management seeing that as being obstructive. It’s worth doing, so don’t allow yourself any shortcuts, and fix every process or environment obstacle that stands in the way of producing shippable code with each sprint. Combine that with teams that are empowered to own and modify pretty much all the code that makes up their product, and you’ve come a long way towards a great environment for code sharing.
(This is item 2 in the code sharing cookbook)
Joel Spolsky used what seems to be his last blog post to talk about Git and Mercurial. I like his description of their main benefit as being that they track changes rather than revisions, and like him, I don’t particularly like the classification of them as distributed version control systems. As I’ve mentioned before, the ‘distributed’ bit isn’t what makes them great. In this post, I’ll try to explain why I think that Git is a great VCS, especially for sharing code between multiple teams – I’ve never used Mercurial, so I can’t have any opinions on it. I will use SVN as the counter-example of an older version control system, but I think that in most of the places where I mention SVN, it could be replaced by any other ‘centralised’ VCS.
The by far biggest reason to use Git when sharing code is its support for branching and merging. The main issue at work here is the conflict between two needs: teams need to have complete control of their code and environments in order to be effective in developing their features, and the overall need to detect and resolve conflicting changes as quickly as possible. I’ll probably have to explain a little more clearly what I mean by that.
Assume that Team Red and Team Blue are both working on the same shared library. If they push their changes to the exact same central location, they are likely to interfere with each other. Builds will break, bugs will be introduced in parts of the code supposedly not touched, larger changes may be impossible to make and there will be schedule conflicts – what if Team Blue commits a large and broken change the day before Team Red is going to release? So you clearly want to isolate teams from each other.
On the other hand, the longer the two teams’ changes are isolated, the harder it is to find and resolve conflicting changes. Both volume of change and calendar time are important here. If the volume of changes made in isolation is large and the code doesn’t work after a merge, the volume of code to search in order to figure out the problem is large. This of course makes it a lot harder to figure out where the problem is and how to solve it. On top of that, if a large volume of code has been changed since the last merge, the risk that a lot of code has been built on top of a faulty foundation is higher, which means that you may have wasted a lot of effort on something you’ll need to rewrite.
To explain how long calendar time periods between merges are a problem, imagine that it takes maybe a couple of months before a conflict between changes is detected. At this time, the persons who were making the conflicting changes may no longer remember exactly how the features were supposed to work, so resolving the conflicts will be more complicated. In some cases, they may be working in a totally different team or even have left the company. If the code is complicated, the time when you want to detect and fix the problem is right when you’re in the middle of making the change, not even a week or two afterwards. Branches represent risk and untestable potential errors.
So there is a spectrum between zero isolation and total isolation, and it is clear that the extremes are not where you want to be. That’s very normal and means you have a curve looking something like this:
You have a cost due to team interference that is high with no isolation and is reduced by introducing isolation, and you have a corresponding cost due to the isolation itself that goes up as you isolate teams more. Obviously the exact shape of the curves is different in different situations, but in general you want to be at some point between the extremes, close to the optimum, where teams are isolated enough for comfort, yet merges happen soon enough to not allow the conflict troubles to grow too large.
So how does all that relate to Git? Well, Git enables you to fine-tune your processes on the X axis in this diagram by making merges so cheap that you can do them as often as you like, and through its various features that make it easier to deal with multiple branches (cherry-picking, the ability to identify whether or not a particular commit has gone into a branch, etc.). With SVN, for instance, the costs incurred by frequent merges are prohibitive, partly because making a single merge is harder than with Git, but probably even more because SVN can only tell that there is a difference between two branches, not where the difference comes from. This means that you cannot easily do intermediate merges, where you update a story branch with changes made on the more stable master branch in order to reduce the time and volume of change between merges.
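To make the intermediate-merge idea concrete, here is a runnable sketch using a throwaway repository; the branch and file names (story-a and so on) are invented for illustration:

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.email dev@example.com
git config user.name dev

echo base > base.txt
git add base.txt && git commit -qm "initial commit"
git branch -M master                        # normalise the branch name

git checkout -qb story-a                    # a story branch splits off master
echo "story work" > story.txt
git add story.txt && git commit -qm "story A, step 1"

git checkout -q master                      # meanwhile, master moves on
echo "stabilising fix" > fix.txt
git add fix.txt && git commit -qm "fix on master"
fix_commit=$(git rev-parse HEAD)

git checkout -q story-a
git merge -q -m "intermediate merge" master # fold master in while the delta is small
git branch --contains "$fix_commit"         # lists every branch that has the fix
```

Because each commit is tracked, Git can tell you that the fix is already on story-a, so the next merge only has to consider genuinely new changes.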
At every SVN merge, you have to go through all the differences between the branches, whereas Git’s commit history for each branch allows you to remember choices you made about certain changes, greatly simplifying each merge. So during the second merge, at commit number 5 in the master branch, you’ll only need to figure out how to deal with (non-conflicting) changes in commits 4 and 5, and during the final merge, you only need to worry about commits 6 and 7. In all, this means that with SVN, you’re forced closer to the ‘total isolation’ extreme than you would probably want to be.
Working with Git has actually totally changed the way I think about branches – I used to say something along the lines of ‘only branch in extreme situations’. Now I think having branches is a good, normal state of being. But the old fears about branching are not entirely invalidated by Git. You still need to be very disciplined about how you use branches, and for me, the main reason is that you want to be able to quickly detect conflicts between them. So I think that branches should be short-lived, and if that isn’t feasible, that relatively frequent intermediate merges should be done. At Shopzilla, we’ve evolved a de facto branching policy over a year of using Git, and it seems to work quite well:
- Shared code with a low rate of change: a single master branch. Changes to these libraries and services are rare enough that two teams almost never make them at the same time. When they do, the second team that needs to make changes to a new release of the library creates a story branch and the two teams coordinate about how to handle merging and releasing.
- Shared code with a high rate of change: semi-permanent team-specific branches and one team has the task of coordinating releases. The teams that work on their different stories/features merge their code with the latest ‘master’ version and tell the release team which commits to pick up. The release team does the merge and update of the release branch and both teams do regression QA on the final code before release. This happens every week for our biggest site.
- Team-specific code: the practice in each team varies but I believe most teams follow similar processes. In my team, we have two permanent branches that interleave frequently: release and master, and more short-lived branches that we create on an ad-hoc basis. We do almost all of our work on the master branch. When we’re starting to prepare a release (typically every 2-3 weeks or so), we split off the release branch and do the final work on the stories to be released there. Work on stories that didn’t make it into the release goes onto the master branch as usual. It is common that we have stories that we put on story-specific branches, when we don’t believe that they will make it into the next planned release and thus shouldn’t be on master.
The diagram above shows a pretty typical state of branches for our team. Starting from the left, the work has been done on the master branch. We then split off the release branch and finalise a release there. The build that goes live will be the last one before merging release back into master. In the mean time, we started some new work on the master branch, plus two stories that we know or believe we won’t be able to finish before the next release, so they live in separate branches. For Story A, we wanted to update it with changes made on the release and master branch, so we merged them into Story A shortly before it was finished. At the time the snapshot is taken, we’ve started preparing the next release and the Story A branch has been deleted as it has been merged back into master and is no longer in use. This means that we only have three branches pointing to commits as indicated by the blueish markers.
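The flow in the diagram can be replayed in miniature. This is a sketch with empty placeholder commits in a throwaway repository, not our actual history; only the branch names match the ones in the text:

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.email dev@example.com
git config user.name dev
git commit -q --allow-empty -m "work on master"
git branch -M master

git checkout -qb release                    # split off the release branch
git commit -q --allow-empty -m "finalise the release"

git checkout -q master                      # new work continues on master
git commit -q --allow-empty -m "start the next story"

git checkout -qb story-a                    # a story that will miss the release
git commit -q --allow-empty -m "story A work"
git merge -q -m "update story A from release" release

git checkout -q master
git merge -q -m "merge release back into master" release
git merge -q -m "merge story A back into master" story-a
git branch -d story-a                       # fully merged, so safe to delete
git branch                                  # only master and release remain
```

Note that `git branch -d` (as opposed to `-D`) refuses to delete a branch that hasn't been merged, which makes cleaning up finished story branches a safe, routine operation.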
This blog post is now far longer than I had anticipated, so I’m going to have to cut the next two advantages of Git shorter than I had planned. Maybe I’ll get back to them later. For now, suffice it to say that Git allows you to do great magic to fix mistakes that you make, and even to extract and combine code from different repositories with full history. I remember watching Linus Torvalds’ Tech Talk about Git, where he said that the performance of Git was such that it led to a quantum change in how he worked. For me, working with Git has also led to a radical shift in how I work and how I look at code management, but it’s not actually the performance that is the main thing – it is the whole conceptual model of tracking commits, which makes branching and merging so easy. That Git is also a thousand times (that may not be strictly true…) faster than SVN is of course not a bad thing either.
(This is item 1 in the code sharing cookbook)
Since shared code leads to free features, one might think that more sharing is always better. That is actually not true. Sharing code or technology between products has some very obvious benefits and some much less obvious costs. The sneakiness of the costs leads to underestimating them, which in turn can lead to broken attempts at sharing things. In this post, I’ll try to give my picture of what characterises things that are suitable for sharing, and how to think about what not to share. Note that the perspective I have is based on an organisation whose products are some kind of service (I’ve mostly been developing consumer-oriented web services for the last few years) as opposed to shrink-wrapped products, so probably a lot of what I say isn’t applicable everywhere.
These are some of the main reasons why you want to share code:
- You get features for free – this is almost always the original reason why you end up having some shared code between different products. Product A is out there, and somebody realises that there is an opportunity to create product B which has some similarities with A. The fastest and cheapest way to get B out and try it is to build it on A, so let’s do that.
- You get bug fixes for free – of course, if product A and B share code and a bug is fixed for product A, when B starts using the fixed version of the shared code, it is fixed for B as well.
- Guaranteed consistent behaviour between products in crucial functional areas. This is typically important for backoffice-type functions, where, for instance, you want to make sure that all your products feed data into the data warehouse in a consistent way so the analysts can actually figure out how the products are doing using the same tools.
- Using proven solutions and minimising risk. Freshly baked code is more likely to have bugs in it than stuff that has been around for a while.
- Similarity of technology can typically reduce operational costs. The same skill sets and tools can be used to run most or all of your products and you can share expensive environments for performance testing, etc. This also has the effect of making it easier for staff to move between products as there is less new stuff to learn in order to get productive with your second product.
All of those reasons are very powerful and typically valid. But they need to be contrasted against some of the costs that you incur from sharing code:
- More communication overhead and slower decision making. To change a piece of code, one needs to talk to many people to ensure that it doesn’t break their planned or existing functionality. Decisions about architecture may require days instead of minutes due to the need to coordinate multiple teams.
- More complicated code. Code that needs to support a single product with a single way of doing things can be simpler than code that has to support multiple products with slight variations on how they do things. This complexity tends to increase over time. Also, every change has to be made with backwards compatibility in mind, which adds additional difficulties to working with the code.
- More configuration management overhead. Managing different branches and dependencies between different shared libraries is time consuming, as are merges when they have to happen. Similarly, you need to be good at keeping track of which versions of shared libraries are used for a particular build.
- More complex projects, especially when certain pieces of shared technology can only be modified by certain people or teams. If there are two product teams (A and B) and a team that delivers shared functionality (let’s call them ‘core’), the core team’s backlog needs to be prioritised based on both the needs of A and B. Also, both team A and B are likely to end up blocked waiting for changes to be made by the core team – during times like that, people don’t typically become idle, they just work on other things than what is really important, leading to reduced productivity and a lack of the ‘we can do anything’ energy that characterises a project that runs really well.
- More mistakes – all the above things lead to a larger number of mistakes being made, which costs in terms of frustration and time taken to develop new features.
The problem with the costs is that they are insidious and sneak up on you as you share more and more stuff between more products, whereas the benefits are there from day 1 – especially on day 1, when you release your second product which is ‘almost the same’ as the first one and you want as many free features as you can get.
So sharing code can give you lots of important or even vital benefits, but done wrong, it can also make your organisation into a slow-moving behemoth stuck in quicksand due to the dependencies it creates between the products and the teams that should develop them. The diagram below shows how while shared libraries can be building blocks for constructing products, they also create ties between the products. These ties will need management to prevent them from binding your arms behind your back.
The way I think about it, products represent things that you make money from, so changing your products should lead to making more money. Shared code makes it possible for you to develop your products at a lower cost or with a lower risk, but reduces the freedom to innovate on the product level. So in the diagram, the blue verticals represent a money-making perspective and the red horizontals a cost-saving perspective. I guess there is something smart to be said about when in a product’s or organisation’s lifecycle one is more important than the other, but I don’t really know what – probably more mature organisations or products need more sharing and less freedom to develop?
Anyway, getting back to the core message of this post: I’m saying that even if code is identical between two products, that doesn’t necessarily mean that it should be shared. The reason is that it may well start out identical, but if it is likely to be something that one or both product teams will want to modify in the future to maximise their chances of making money, both products will be slowed down by having to worry about what their changes are doing to the other team. Good candidates for sharing tend to have:
- A low rate of change – so the functionality is mature and not something you need to tweak frequently in order to tune your business or add features to your product.
- A tight coupling to other parts of the company’s ecosystem – reporting/invoicing systems, etc. This usually means tight integration into business processes that are hard to change.
- A high degree of generality – the extreme examples of such general systems are of course things like java.util.Set or log4j. Within a company, you can often find things that are very generic in the context of the business.
Of course, those three factors are related. I have found that simply looking at the first one by checking the average number of commits over some period of time gives a really good indication. If there are many changes, don’t share the code. If there are few, you might want to share it. I think the reason why it works is partly that rate of change is a very good indicator of generality and partly because if you try to share something that changes a lot, you’ll incur the costs of sharing very frequently.
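The commit-count check is easy to script. `count_recent_commits` is a made-up helper name and the six-month window is arbitrary; `git rev-list --count` does the actual work:

```shell
# A made-up helper; tune the window to your own release cadence.
count_recent_commits() {
    git -C "$1" rev-list --count --since="6 months ago" HEAD
}

# Demo on a throwaway repository with two fresh commits:
set -e
repo=$(mktemp -d)
git -C "$repo" init -q
git -C "$repo" config user.email dev@example.com
git -C "$repo" config user.name dev
git -C "$repo" commit -q --allow-empty -m "a change"
git -C "$repo" commit -q --allow-empty -m "another change"
count_recent_commits "$repo"    # a real library scoring in the hundreds here
                                # would be a poor sharing candidate
```

Running this over every candidate library once a quarter gives you a cheap, objective first filter before the harder judgement calls about generality.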
Sharing is great, but it’s definitely not something that should be maximised at all costs. Any sharing of functionality between two products creates dependencies between them in the form of feature interaction that adds to the cost of features and schedule interaction between projects/teams that get blocked by each other due to changes to something that is shared.
It is often useful to think of sharing at different levels: maybe it isn’t a great idea to create a shared library from some code that you believe will be modified frequently by two different projects. As an alternative, you can still gain free features by copying and pasting that code from one product to the other, and then letting the two versions lead their own lives. So share code, but be smart about it and don’t share everything!
If you’re in an organisation that grows or whose business is changing, you’ll soon want to add another product to the one or ones you’ve already got. Frequently, the new product idea has a lot of similarity to existing ones (because you tend to both come up with ideas in the space where you work, and because you’ll tend to want to play to your existing strengths), so there is a strong desire to reuse technology. First, to get the new product out, at least as a prototype, and later, assuming it is successful, continuing to share code in order to not have to reinvent the wheel.
Code sharing makes a lot of sense, but it is in fact a lot harder than it seems on the face of it. In what is quite possibly a more ambitious project than I will have the tenacity to complete, I’m going to try to set out some ideas on how to do code sharing in the kind of organisation that I have recent experience of: around 30 developers working on around 5 different products. The first one will be a bit theoretical, but the rest should be quite concrete with hands-on tips about how to do things.
Here’s the list of topics I’ve got planned:
- Share Code Selectively.
- Use Git.
- Use Scrum.
- Use Maven.
- Use JUnit.
- Use Hudson.
- Divide and Conquer.
- Manage Dependencies.
Over the next few weeks or months, I’ll try to write something more detailed about each of them. I would be surprised if I don’t have to go back to this post and update it based on the fact that my thinking around this will probably change as I write the posts.
The fact that Git doesn’t require a central repository but can function as a completely distributed system for version control is often touted as its main benefit. People who prefer having a central repository are sometimes told they just don’t get it. I think that the distributed nature of Git might be a beneficial thing for some teams of people, but I don’t see that it works very well for all situations. Maybe I don’t get it (but of course I think I do :)).
Here’s my concrete example of when I think distributed version control is a bad choice. At work, we have 5-6 teams of developers working on part of a fairly large service-oriented system. These teams work on around 50 different source projects that contain the source code for various services and shared libraries. We’re using Scrum, which means that we want to ensure every team is empowered to make the changes it needs to fully implement a front-end feature. This means that every team has the right to make modifications to any one of these 50 projects. There are of course limits to the freedom – the total number of source projects at Shopzilla is over 100 (maybe closer to 200, I never counted), and the remaining source code is outside of the things that my colleagues in other teams and I are working on. But the subset of our systems I am using as an example consists of probably 50 projects that are worked on by around 20 developers in 5 teams with different goals.
As if this picture wasn’t complicated enough, each team typically has a sprint length of 2 weeks after which finished features go live (although one team does releases every week). Some features are larger than what we can finish in one sprint, so we normally have a few branches for the main project and a couple of libraries that are affected by a large feature. So we’ll have the release branch in case we need a hot fix, master for things going into the next release, and 0-2 story branches for longer stories – per team. Naturally the rate of change varies a lot; some libraries are very stable meaning that they will only have one branch, while the top-level projects (typically corresponding to a shopping comparison site) change all the time and are likely to have at least three branches concurrently being worked on.
Of course, in a situation with so many moving parts, you need to be able to:
- reliably manage what dependencies go into what builds,
- reliably manage what builds are used where,
- detect conflicts between changes and teams quickly, and
- detect regression bugs quickly and reliably.
All that requires quick and easy information exchange between teams about code changes. If we used distributed version control, we would have to figure out how to consolidate differences between 25-30 developers’ repositories. That is hard and would take a lot of time away from actual coding, not to mention add lots of errors due to broken dependencies and configuration management mistakes.
When I say hard, I mean that partly in a mathematical sense. Fully distributed version control is actually an example of a complete graph. That means that the number of possible/necessary communication links between different repositories increases quickly as the number of repositories increases (it is n*(n-1)/2 for those who want the full formula). So, for two people, you only have one link that needs to be updated. If there are three repositories, you need three links, and four repositories leads to six links. When you hit 10 repositories, there are 45 possible ways that code can be pulled, and at 20, 190. This large number of possible ways to get code updates (and get them wrong) is unmanageable. As far as I understand, the practice in teams using distributed source control is to have “code lieutenants” whose role is essentially to divide the big graph of developer repositories into smaller sub-graphs, thereby reducing the number of links.
In the case above, with three lieutenant-led groups of five developers, there are 3 × 5(5 - 1)/2 + 3(3 - 1)/2 = 33 connections between 15 developers instead of 15(15 - 1)/2 = 105.
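The arithmetic is easy to check directly:

```shell
# Links in a complete graph of n nodes: n*(n-1)/2
links() { echo $(( $1 * ($1 - 1) / 2 )); }

links 15                                   # no structure: everyone pulls from everyone
echo $(( 3 * $(links 5) + $(links 3) ))    # three groups of 5, plus 3 lieutenants
```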
Now, the way it looks with a centralised system is much simpler as the number of developers increases. This corresponds to a star graph, where the number of connections is equal to the number of developers.
I guess it is conceivable that with say 25 developers you could handle the number of connections required to manage the source code for a few repositories even with distributed code management. But when you have a situation with 50 source projects and a number of branches, I think you need to do everything in your power to reduce the number of necessary interaction paths.
Even supposing that the communication problems are less serious than I think they are, I’m unsure of the advantages to using a distributed version control system in a corporate setting. Well, why beat about the bush: I don’t see any. Wikipedia has an article listing some more or less dubious advantages of distributed source control, where many of them relate to an implementation detail of Git rather than distributed source control as such (the use of the local repository that saves a lot of network overhead and speeds up many commands). For me, the main advantage of centralised source control in a corporate setting is that you really want to be able to quickly and reliably confirm that teams that are supposed to be working together actually are. Centralised source control lends itself to continuous integration with heavy usage of automated testing to detect conflicting changes that lead to build failures or regression bugs. Distributed source control doesn’t.
We’re using Git at Shopzilla, and I’m loving it. For me, the main reason not to use Git is that it is hard to understand. I guess one obstacle to understanding Git is that its distributed nature is new and different, but I kind of feel that the importance of that is overstated and therefore it should be deemphasised. The other main hurdle, that it has “a completely baffling user interface that makes perfect sense IF YOU’RE A VULCAN”, is a harder problem to get around. But if you work with a tool 8 hours a day, you’ll learn to use it even if it is complicated. What I love about Git isn’t that it is distributed, but that:
- It is super-fast.
- Its architecture is fundamentally right – a collection of immutable objects identified by keys that are derived from the objects (the hash codes), with a set of labels (branches, tags, etc.) pointing into relevant places. This just works for a VCS.
- Its low-level nature and the correctness of the architecture means that when I make mistakes, I can always fix them. And there seems to be no way of getting a Git repository into a corrupt state even when you manually poke its internals.
- It excels at managing branches – I think the way we’re working at Shopzilla is right for us, and I think that if we had stayed with SVN, we couldn’t do it.
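The immutable-objects-plus-labels model is easy to see for yourself with Git’s plumbing commands; here is a small demonstration in a throwaway repository:

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q

# Store some content in the object database and get back its key, which is
# a hash derived from the content itself:
blob=$(echo 'hello' | git hash-object -w --stdin)
git cat-file -t "$blob"             # the object's type: blob
git cat-file -p "$blob"             # the object's content: hello

# Identical content always produces the identical key, on any machine:
blob2=$(echo 'hello' | git hash-object --stdin)
test "$blob" = "$blob2" && echo "same content, same key"
```

Everything else (commits, trees, tags) is built out of the same kind of content-addressed objects, with branches being nothing more than movable labels pointing at them.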
There was a recent comment to a bug I posted in the Maven Git SCM Provider that triggered some thoughts. The comment was:
“GIT is a distributed SCM. There IS NO CENTRAL repository. Accept it.
Doing a push during the release process is counter to the GIT model.”
In general, the discussions around that bug have been quite interesting and very different from what I expected when I posted it. My reason for calling it a bug was that an unqualified ‘push’ tries to push everything in your local git repository to the origin repository. That can fail for some branch that you’ve not kept up to date, even if it is a legal operation for the branch that you’re currently doing a release of. Typically, that other branch has moved a bit, so your version is a couple of commits behind. A push in that state will abort the Maven release process and leave you with some pretty tricky cleaning up to do (edit: Marta has posted about how to fix that). A lot of people commenting on the bug have argued that since Git is distributed, the push shouldn’t be done at all, or should be made optional.
I think that the issue here is that there is an impedance mismatch between Git and Maven. While Git is a distributed version control system – that of course also supports a centralised model perfectly well – the Maven model is fundamentally a centralised one. This is one case where the two models conflict, and my opinion is that the push should indeed happen, just in a way that is less likely to break. The push should happen because when doing a Maven release, supporting Maven’s centralised model is more important than supporting Git’s distributed model.
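For what it’s worth, a less breakage-prone push is achievable on the Git side today, regardless of what the plugin does. This is my own workaround, not anything provided by the Maven Git SCM Provider:

```shell
# Make a bare `git push` push only the branch you are on, instead of every
# local branch, so a stale, unrelated branch can't abort the release:
git config push.default upstream    # 'current' behaves similarly
# Or sidestep the default entirely by always naming the branch:
git push origin master
```

Either way, the release only depends on the branch actually being released being up to date, which is exactly the behaviour I’d want the plugin to have.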
The main reason why Maven needs to be centralised is the way that artifact versions are managed. If releasing can be done by different people from local repositories without any central coordination, there is a big risk of different people creating artifact versions that are not the same. The act of creating a Maven release is in fact saying that “This binary package is version 2.1 of this artifact, and it will never change”. There should never be two versions of 2.1. Git of course gets around this problem using hashes of the things it version controls instead of sequential numbers, and if two things are identical, they will have the same hash code = the same version number. Maven produces artifacts on a higher conceptual level, where sequential version numbers are important, so there needs to be a central location that determines what is the next version number to use and provides a ‘master’ copy of the published artifacts.
I’ve also thought a bit about centralised versus distributed version management and when the different choices might work, but I think I’ll leave that for another post at another time (EDIT: that time was now). Either way, I think that regardless of the virtues of distributed version management systems like Git, Maven artifacts need to be managed centrally. It would be interesting to think about what a distributed dependency management system would look like…