Posts Tagged Maven

A complement to Object Encapsulation

As usual, I’ve been thinking about Maven and how it’s not perfect. As usual, I really want to fix it, but have no time to actually do something. This time, the issue I found made me think about something that is kind of a complement to encapsulation, but I don’t know if there’s a proper term for it. I’ve asked 5-6 people who ought to know, but none of them has come up with something, so I’ve decided to call it self-sufficiency until somebody can tell me what it is really called. :)

Let’s start with the issue that got me thinking. We have started using Clover to measure code coverage of our tests, and the first take on a standardised POM file for building a set of new components led to some weird things happening. Such as executing the ‘javadoc:jar’ goal 12 times in a single build, and so on. I never figured out exactly how that happened, but I managed to track the problem down to the fact that the Clover2 plugin calls the ‘install’ phase before executing itself. Although I think that is a less than great thing to do, there’s probably a good reason why it needs to ensure that ‘install’ has been executed, and I don’t want to spend time on that particular issue. What’s interesting is that this is a symptom of a design flaw in Maven. Maven allows and almost encourages plugins to be aware of and manipulate the build in its entirety – by registering themselves with a specific lifecycle phase, by letting them pull information out of the global project structure and like in this case, by allowing them to manipulate the build flow.

This reaching out of one’s own space into a global space, where an object makes assumptions about what the world surrounding it looks like, is what I mean by the ‘complement of encapsulation’. An object that makes no such assumptions and does no such reaching out is self-sufficient.

To give a little more meat to the idea I’m trying to describe, here’s a list of related concepts and why they’re not the same:

  • Encapsulation – it’s probably incorrect to think of self-sufficiency as the complement of encapsulation. They’re certainly not each other’s opposites, and encapsulation isn’t enough to guarantee self-sufficiency. It is perfectly possible that in the Maven example above, there is a well-encapsulated object that manages the build cycle, which the plugin is calling. The concept of encapsulation is applicable at an object or class level, whereas the concept of self-sufficiency is more of an architectural concept – what kind of interactions you decide to allow between which (sets of) objects.
  • Dependency injection – one of the main points of dependency injection or inversion of control is that it encourages and enables self-sufficiency. Without it, objects always reach out into the surrounding world, making assumptions about what they will be able to find. But again, like encapsulation, DI works at a different level, and is not quite sufficient to get self-sufficiency.
  • Side effects – there are many different definitions of side effects, but with all I’ve seen, the concept of self-sufficiency is related but not identical. Side effects are usually considered to be “something that I didn’t expect a method with name X to do”. It’s possible and not uncommon to have objects that are not self-sufficient but are side-effect-free.
  • Coupling – as I interpret the Wikipedia definition, I would say that in order for a system to be loosely coupled, it must have self-sufficient objects. However, having self-sufficient objects isn’t enough to guarantee loose coupling – the most common application of the term relates to lower-level coupling between classes, making it harder or easier to swap in and out concrete implementations of collaborators. You can have self-sufficient objects that are strongly coupled to specific implementations of their collaborating objects rather than interfaces.
  • Law of Demeter – an object that violates the Law of Demeter is less self-sufficient than one that follows it. But again, the Law of Demeter is more of a class/object design principle, and the principle of self-sufficiency is an architectural one. You can violate the principle of self-sufficiency while keeping strictly to the Law of Demeter.
  • Layering – this is very closely related. Violating the principle of self-sufficiency means you’re bridging abstraction layers. Ideally, a Maven plugin should be at a layer below the main build itself (or above, depending on which direction you prefer – in this discussion, I’m saying lower layers cannot call up into higher layers). The main build should provide the plugin with everything it needs, and the plugin should execute in isolation and without worrying about what happens above it. Self-sufficiency is a narrower and slightly different concept than layering. It has opinions on where the layer boundaries should be located. In the Maven example, there is no abstraction layer between the build as a whole and the plugins, and self-sufficiency states that there should have been one.

I’m not sure self-sufficiency is a great term, so if somebody has an idea for a better one, please feel free to make suggestions! Here are some other terms I thought of:

  • Isolation – objects that are self-sufficient can be executed in isolation, independently of the context they’re running in. However, isolation would be overloaded (with primarily the I in ACID), and it’s also a little negative. I think the term should be positively charged, as the concept represents a good thing.
  • Introvert/Extrovert – an Extrovert object reaches out into the world and makes assumptions about it whereas an Introvert one has all it needs internally. The good thing about this pair is that it is a pair. The word in-self-sufficient doesn’t work, and neither does self-insufficient. But again, the way these terms are usually used, it’s better to be an extrovert than an introvert, which is the opposite of what the term should mean in this context.

If I ever do find the time to try to fix Maven, one of the things I’ll do is make sure plugins are self-sufficient – let the overall build flow be controlled in its entirety by one layer, and let the plugins execute in another layer, in complete ignorance of other plugins and phases of the build!
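A minimal sketch of what that separation could look like. All names here (Plugin, runBuild, the "classesDir" key) are hypothetical and are not part of Maven’s API; the point is only that the plugin receives its inputs and returns a result, while ordering stays entirely with the build layer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: none of these types exist in Maven.
final class SelfSufficiencySketch {

    /** A plugin sees only the inputs it is handed and returns only its own result. */
    interface Plugin {
        String name();
        String execute(Map<String, String> inputs);
    }

    /** The build layer is the only place that knows about ordering and other plugins. */
    static List<String> runBuild(List<Plugin> orderedPlugins, Map<String, String> inputs) {
        List<String> log = new ArrayList<>();
        for (Plugin plugin : orderedPlugins) {
            log.add(plugin.name() + ": " + plugin.execute(inputs));
        }
        return log;
    }

    /** A stand-in coverage plugin that never calls back into the build. */
    static Plugin coverage() {
        return new Plugin() {
            public String name() {
                return "coverage";
            }

            public String execute(Map<String, String> inputs) {
                // No triggering of other phases, no lookups in a global project model.
                return "instrumented " + inputs.get("classesDir");
            }
        };
    }
}
```

If the coverage plugin needs installed artifacts, that requirement would be expressed to the build layer, which decides when to run what; the plugin itself has no way of re-running a phase.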


Bygg – Executing the Build

This is the third post about Bygg – I’ve been fortunate enough to find some time to do some actual implementation, and I now have a version of the tool that can do the following:

First, read a configuration of plugins to use during the build:

public class PluginConfiguration {
  public static Plugins plugins() {
    return new Plugins() {
      public List plugins() {
        return Arrays.asList(
          new ArtifactVersion(ProjectArtifacts.BYGG_TEST_PLUGIN, "1.0-SNAPSHOT"),
          new ArtifactVersion(ProjectArtifacts.GUICE, "2.0"));
      }
    };
  }
}

Second, use that set of plugins to compile a project configuration:

public class Configuration {
  public static ByggConfiguration configuration() {
    return new ByggConfiguration() {
      public TargetDAG getTargetDAG() {
        return TargetDAG.DEFAULT
           .add("plugin")                  // defines the target name when executing
           .executor(new ByggTestPlugin()) // indicates what executing the target means
           .requires("test")               // means it won't be run until after "test"
           .build();
      }
    };
  }
}

Third, actually execute that build – although in the current version, none of the target executors have an actual implementation, so all they do is create a file with their name and the current time stamp under the target directory. The default build graph that is implemented contains targets that pretend to assemble the classpaths (placeholder for downloading any necessary dependencies) for the main code and the test code, targets that compile the main and test code, a target that runs the tests, and a target that pretends to create a package. As the sample code above hints, I’ve got three projects: Bygg itself, a dummy plugin, and a test project whose build requires the test plugin to be compiled and executed.

Fourth – with no example – cleaning up the target directory. This is the only feature that is fully implemented, being of course a trivial one. On my machine, running a clean in a tiny test project is 4-5 times faster using Bygg than Maven (taking 0.4 to 0.5 seconds of user time as compared to more than 2 for Maven), so thus far, I’m on target with regard to performance improvements. A little side note on cleaning is that I’ve come to the conclusion that clean isn’t a target. You’re never ever going to want to deploy a ‘clean’, or use it for anything. It’s an optional step that might be run before any actual target. To clarify that distinction, you specify targets using their names as command line arguments, but cleaning using -c or --clean:


bygg.sh -c compile plugin

As is immediately obvious, there are a lot of rough edges here. The ones I know I will want to fix are:

  • Using annotations (instead of naming conventions) and generics (for type safety) in the configuration classes – I’m compiling and loading the configuration files using a library called Janino, which has an API that I think is great, but which by default only supports Java 4. There’s a way around it, but it seems a little awkward, so I’m planning on stealing the API design and putting in a front for JavaCompiler instead.
  • Updating the returned classes (Plugins and ByggConfiguration), as today they only contain a single element. Either they should be removed, or perhaps they will need to become a little more powerful.
  • Changing the names of stuff – TargetDAG especially is not great.
  • There’s a lot of noise, much of which is due to Java as a language, but some of which can probably be removed. The Plugins example above is 11 lines long, but only 2 lines contain useful information – and that’s not counting import statements, etc. Of course, since the number of ‘noise lines’ is pretty much constant, with realistically large builds, the signal to noise ratio will improve. Even so, I’d like it to be better.
  • I’m pretty sure it’s a good idea to move away from saying “this target requires target X” to define the order of execution towards something more like “this target requires the compiled main sources”. But there may well be reasons why you would want to keep something like “requires” or “before” in there – for instance, you might want to generate a properties file with information collected from version control and CI system before packaging your artifact. Rather than changing the predefined ‘package’ target, you might want to just say ‘run this before package’ and leave the file sitting in the right place in the target directory. I’m not quite sure how best to deal with that case yet – there’s a bit more discussion of this a little later.

Anyway, all that should be done in the light of some better understanding of what is actually needed to build something. So before I continue with improving the API, I want to take a few more steps on the path of execution.

As I’ve mentioned in a previous post, a build configuration in Bygg is a DAG (Directed Acyclic Graph). A nice thing about that is that it opens up the possibility of executing independent paths on the DAG concurrently. Tim pointed out to me that that kind of concurrent execution is an established idea called Dataflow Concurrency. In Java terms, Dataflow Concurrency essentially boils down to communicating all shared mutable state via Futures (returned by Callables executing the various tasks). What’s interesting about the Bygg version of Dataflow Concurrency is that the ‘Dataflow Variables’ can and will be parameters of the Callables executing tasks, rather than being hard-coded as is typical in the Dataflow Concurrency examples I’ve seen. So the graph will exist as a data entity as opposed to being hardwired in the code. This means that deadlock detection is as simple as detecting cycles in the graph – and since there is a requirement that the build configuration must be a DAG, builds will be deadlock free. In general, I think the ability to easily visualise the exact shape of the DAG of a build is a very desirable thing in terms of making builds easily understandable, so that should probably be a priority when continuing to improve the build configuration API.
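The following is a rough sketch of that idea, not Bygg’s actual implementation. It assumes the graph is given as a plain map from target name to the names it requires, and it uses CompletableFuture in place of the Callable/Future pair mentioned above. The class and method names are made up for the illustration. Because the graph is data, a cycle is found while wiring up the futures, before anything runs.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical sketch of dataflow-style execution over a build graph.
final class DagExecutionSketch {

    /** Runs each target once all its requirements are done; returns the completion order. */
    static List<String> execute(Map<String, List<String>> requires) {
        Map<String, CompletableFuture<Void>> futures = new LinkedHashMap<>();
        ConcurrentLinkedQueue<String> completed = new ConcurrentLinkedQueue<>();
        for (String target : requires.keySet()) {
            schedule(target, requires, futures, new ArrayList<>(), completed);
        }
        CompletableFuture.allOf(futures.values().toArray(new CompletableFuture[0])).join();
        return new ArrayList<>(completed);
    }

    private static CompletableFuture<Void> schedule(
            String target,
            Map<String, List<String>> requires,
            Map<String, CompletableFuture<Void>> futures,
            List<String> path,
            ConcurrentLinkedQueue<String> completed) {
        // A target that is already on the current path means the graph is not a DAG.
        if (path.contains(target)) {
            throw new IllegalArgumentException("Cycle in build graph: " + path + " -> " + target);
        }
        CompletableFuture<Void> existing = futures.get(target);
        if (existing != null) {
            return existing;
        }
        path.add(target);
        List<CompletableFuture<Void>> dependencies = new ArrayList<>();
        for (String required : requires.getOrDefault(target, Collections.<String>emptyList())) {
            dependencies.add(schedule(required, requires, futures, path, completed));
        }
        path.remove(path.size() - 1);
        // The "work" here is just recording the name; a real executor would go in its place.
        CompletableFuture<Void> future = CompletableFuture
                .allOf(dependencies.toArray(new CompletableFuture[0]))
                .thenRunAsync(() -> completed.add(target));
        futures.put(target, future);
        return future;
    }
}
```

Targets with no path between them (for example a packaging target and a plugin target that both only require "test") may complete in either order, which is exactly the freedom that allows them to run concurrently.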

Another idea I got from the reference to dataflow programming comes from its canonical example: a spreadsheet, where an update in one cell trickles through into updates of other cells that contain formulas that refer to the first one. That example made me change my mind about how different targets should communicate their results to each other. Initially, I had been thinking that most of the data that needs to be passed on from one target to the next should be implicitly located in a well-defined location on disk. So the test compiler would leave the test classes in a known place where the test runner knows to look for them. But that means loading source files into memory to compile them, then writing the class files to disk, then loading them into memory again. That’s a lot of I/O, and I have the feeling that I/O is often one of the things that slows builds down the most. What if there were a dataflow variable with the classes instead?

I haven’t yet looked in detail at the JavaFileManager interface, but it seems to me that it would make it possible to add an in-memory layer in front of the file system (in fact, I think that kind of optimisation is a large part of the reason why it exists). So it could be a nice optimisation to make the compiler store files in memory for test runners, packagers, etc., to pick up without having to do I/O. There would probably have to be a target (or something else, maybe) that writes the class files to disk in parallel with the rest of the execution, since the class files are nice to have as an optimisation for the next build – only recompiling what is necessary. But that write doesn’t necessarily have to slow down the test execution. All that is of course optimisation, so the first version will just use a plain file system-based JavaFileManager implementation.

Still, I think it is a good idea to only have a very limited number of targets that directly access the file system, in order to open up for that kind of optimisation. The remainder of the targets should not be aware of the exact structure of the target directory, and what data is stored there.
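As a rough illustration of the in-memory layer, here is a sketch built on the standard javax.tools API (ForwardingJavaFileManager and SimpleJavaFileObject). It is not Bygg code: the class names and the "string:" and "mem:" URI schemes are arbitrary choices for the example, and it needs to run on a JDK so that a system compiler is available.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.URI;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.tools.FileObject;
import javax.tools.ForwardingJavaFileManager;
import javax.tools.JavaCompiler;
import javax.tools.JavaFileManager;
import javax.tools.JavaFileObject;
import javax.tools.SimpleJavaFileObject;
import javax.tools.StandardJavaFileManager;
import javax.tools.ToolProvider;

// Hypothetical sketch: compile from a String and keep the class files in memory.
final class InMemoryCompileSketch {

    /** Source code held in a String rather than read from disk. */
    static final class StringSource extends SimpleJavaFileObject {
        private final String code;

        StringSource(String className, String code) {
            super(URI.create("string:///" + className.replace('.', '/') + Kind.SOURCE.extension),
                    Kind.SOURCE);
            this.code = code;
        }

        @Override
        public CharSequence getCharContent(boolean ignoreEncodingErrors) {
            return code;
        }
    }

    /** A class file whose bytes never touch the file system. */
    static final class ClassBytes extends SimpleJavaFileObject {
        final ByteArrayOutputStream bytes = new ByteArrayOutputStream();

        ClassBytes(String className) {
            super(URI.create("mem:///" + className.replace('.', '/') + Kind.CLASS.extension),
                    Kind.CLASS);
        }

        @Override
        public OutputStream openOutputStream() {
            return bytes;
        }
    }

    /** Compiles one class and returns a map of class name to bytecode. */
    static Map<String, byte[]> compile(String className, String code) {
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler(); // null without a JDK compiler
        if (compiler == null) {
            throw new IllegalStateException("No system Java compiler available");
        }
        Map<String, ClassBytes> outputs = new LinkedHashMap<>();
        try (StandardJavaFileManager standard = compiler.getStandardFileManager(null, null, null)) {
            JavaFileManager inMemory =
                    new ForwardingJavaFileManager<StandardJavaFileManager>(standard) {
                @Override
                public JavaFileObject getJavaFileForOutput(JavaFileManager.Location location,
                        String name, JavaFileObject.Kind kind, FileObject sibling)
                        throws IOException {
                    if (kind != JavaFileObject.Kind.CLASS) {
                        return super.getJavaFileForOutput(location, name, kind, sibling);
                    }
                    ClassBytes out = new ClassBytes(name);
                    outputs.put(name, out);
                    return out;
                }
            };
            Boolean ok = compiler.getTask(null, inMemory, null, null, null,
                    Collections.singletonList(new StringSource(className, code))).call();
            if (!ok) {
                throw new IllegalStateException("Compilation failed for " + className);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Map<String, byte[]> result = new LinkedHashMap<>();
        for (Map.Entry<String, ClassBytes> entry : outputs.entrySet()) {
            result.put(entry.getKey(), entry.getValue().bytes.toByteArray());
        }
        return result;
    }
}
```

The returned map is the kind of value that could be handed to a test runner or packager as a dataflow variable, with a separate target writing the same bytes to disk for the benefit of the next build.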

I’m hoping to soon be able to find some more time to try these ideas out in code. It’ll be interesting to see how hard it is to figure out a good way to combine abstracting away the ‘target’ directory with full freedom for plugins to add stuff to the build package and dataflow concurrency variables.


Bygg – Better Dependency Management

I’ve said before that the one thing that Maven does amazingly well is dependency management. This post is about how to do it better.

Declaring and Defining Dependencies

In Maven, you can declare dependencies using a <dependencyManagement /> section in either the current POM or some parent POM, and then use them – include them into the current build – using a <dependencies /> section. This is very useful because it allows you to define in a single place which version of some dependency should be used for a set of related modules. It looks something like:

<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>com.shopzilla.site2.service</groupId>
      <artifactId>common</artifactId>
      <version>${service.common.version}</version>
    </dependency>
    <dependency>
      <groupId>com.shopzilla.site2.service</groupId>
      <artifactId>common-web</artifactId>
      <version>${service.common.version}</version>
    </dependency>
    <dependency>
      <groupId>com.shopzilla.site2.core</groupId>
      <artifactId>core</artifactId>
      <version>${core.version}</version>
    </dependency>
  </dependencies>
</dependencyManagement>

<dependencies>
  <dependency>
    <groupId>com.shopzilla.site2.service</groupId>
    <artifactId>common</artifactId>
  </dependency>
  <dependency>
    <groupId>com.shopzilla.site2.core</groupId>
    <artifactId>core</artifactId>
  </dependency>
</dependencies>

To me, there are a couple of problems here:

  1. Declaring a dependency looks pretty much identical to using one – the only difference is that the declaration is enclosed inside a <dependencyManagement /> section. This makes it hard to know what it is you’re looking at if you have a large number of dependencies – is this declaring a dependency or actually using it?
  2. It’s perfectly legal to add a <version/> tag in the plain <dependencies /> section, which will happen unless all the developers touching the POM a) understand the distinction between the two <dependencies /> sections, and b) are disciplined enough to maintain it.

For large POMs and POM hierarchies in particular, the way to define shared dependencies becomes not only overly verbose but also hard to keep track of. I think it could be made much easier and nicer in Bygg, something along these lines:


// Core API defined in Bygg proper
public interface Artifact {
   String getGroupId();
   String getArtifactId();
}

// ----------------------------------------------

// This enum is in some shared code somewhere - note that the pattern of declaring artifacts
// using an enum could be used within a module as well if desired. You could use different enums
// to define, for instance, sets of artifacts to use when talking to databases, or when developing
// a web user interface, or whatever.
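public enum MyArtifacts implements Artifact {
    SERVICE_COMMON("com.shopzilla.site2.service", "common"),
    CORE("com.shopzilla.site2.core", "core"),
    JUNIT("junit", "junit");

    private final String groupId, artifactId;
    MyArtifacts(String groupId, String artifactId) { this.groupId = groupId; this.artifactId = artifactId; }
    public String getGroupId() { return groupId; }
    public String getArtifactId() { return artifactId; }
}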

// ----------------------------------------------

// This can be declared in some shared code as well, probably but not
// necessarily in the same place as the enum. Note that the type of
// the collection is Collection, indicating that it isn't ordered.
public static final Collection ARTIFACT_VERSIONS = ImmutableList.of(
        new ArtifactVersion(SERVICE_COMMON, properties.get("service.common.version")),
        new ArtifactVersion(CORE, properties.get("core.version")),
        new ArtifactVersion(JUNIT, "4.8.1"));

// ----------------------------------------------

// In the module to be built, define how the declared dependencies are used.
// Here, ordering might be significant (it indicates the order of artifacts on the
// classpath - the reason for the 'might' is I'm not sure if this ordering can carry
// over into a packaged artifact like a WAR).
List moduleDependencies = ImmutableList.of(
    new Dependency(SERVICE_COMMON, EnumSet.of(MAIN, PACKAGE)),
    new Dependency(CORE, EnumSet.of(MAIN, PACKAGE)),
    new Dependency(JUNIT, EnumSet.of(TEST)));

The combination of a Collection of ArtifactVersions and a List of Dependency objects is then used by the classpath assembler target to produce an actual classpath for use in compiling, running, etc. Although the example code shows the dependencies as a plain List, I kind of think that there may be value in having the actual dependencies be something else. Wrapping the list in an intelligent object that gives you filtering options, etc., could possibly be useful, but it’s a bit premature to decide about that until there’s code that makes it concrete.

The main ideas in the example above are:

  1. Using enums for the artifact identifiers (groupId + artifactId) gives a more succinct and harder-to-misspell way to refer to artifacts in the rest of the build configuration. Since you’re editing this code in the IDE, finding out exactly what the artifact identifier means (groupId + artifactId) is as easy as clicking on it while pressing the right key.
  2. If the build configuration is done using regular Java code, shared configuration items can trivially be made available as Maven artifacts. That makes it easy to for instance have different predefined groups of related artifacts, and opens up for composed rather than inherited shared configurations. Very nice!
  3. In the last bit, where the artifacts are actually used in the build, there is an outline of something I think might be useful. I’ve used Maven heavily for years, and one thing I’ve never quite learned is how the scopes work. A scope is basically one of a list of strings (compile, test, provided, runtime) that describes where a certain artifact should be used. So ‘compile’ means that the artifact in question will be used when compiling the main source code, when running tests, when executing the artifact being built and (if it is a WAR), when creating the final package. I think it would be far simpler to have a set of flags indicating which classpaths the artifact should be included in. So MAIN means ‘when compiling and running the main source’, TEST is ditto for the test source, and PACKAGE is ‘include in the final package’, and so on. No need to memorise what some scope means, you can just look at the set of flags.
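To make the flags idea concrete, here is a small sketch of how a classpath assembler might filter on them. It is a simplification under stated assumptions: the Dependency class below holds a plain artifact name instead of an Artifact, the enum is called Classpath, and assemble is a made-up method name. None of this is settled Bygg API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of classpath flags replacing Maven-style scopes.
final class ClasspathFlagsSketch {

    /** Each flag names one classpath an artifact can be included in. */
    enum Classpath { MAIN, TEST, PACKAGE }

    /** Simplified stand-in for the Dependency type used in the example above. */
    static final class Dependency {
        final String artifact;
        final Set<Classpath> classpaths;

        Dependency(String artifact, Set<Classpath> classpaths) {
            this.artifact = artifact;
            this.classpaths = classpaths;
        }
    }

    /** Returns the artifacts flagged for the given classpath, in declaration order. */
    static List<String> assemble(List<Dependency> dependencies, Classpath target) {
        List<String> result = new ArrayList<>();
        for (Dependency dependency : dependencies) {
            if (dependency.classpaths.contains(target)) {
                result.add(dependency.artifact);
            }
        }
        return result;
    }
}
```

Reading a dependency’s flags tells you directly where it ends up: an artifact flagged only with TEST never reaches the MAIN or PACKAGE classpath, with no scope rules to remember.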

Another idea that I think would be useful is adding an optional Repository setting for an Artifact. With Maven, you can add repositories in addition to the default one (Maven Central at Ibiblio). You can have as many as you like, which is great for some artifacts that aren’t included in Maven Central. However, adding repositories means slowing down your build by a wide margin, as Maven will check each repository defined in the build for updates to each snapshot version of an artifact. Whenever I add a repository, I do that to get access to a specific artifact. Utilising that fact by having two kinds of repositories – global and artifact-specific, maybe – should be simple and represent a performance improvement.

Better Transitive Conflict Resolution

Maven allows you to not have to worry about transitive dependencies required by libraries you include. This is, as I’ve argued before, an incredibly powerful feature. But the thing is, sometimes you do need to worry about those transitive dependencies: when they introduce binary incompatibilities. A typical example is something I ran into the other day, where a top-level project used version 3.0.3 of some Spring jars (such as ‘spring-web’), while some shared libraries eventually included another version of Spring (the ‘spring’ artifact, version 2.5.6). Both of these jars contain a definition of org.springframework.web.context.ConfigurableWebApplicationContext, and they are incompatible. This leads to runtime problems (in other words, the problem is visible too late; it should be detected at build time), and the only way to figure that out is to recognise the symptoms of the problem as a “likely binary version mismatch”, then use mvn dependency:analyze to figure out possible candidates and add exclude rules like this to your POM:

<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>com.shopzilla.site2.service</groupId>
      <artifactId>common</artifactId>
      <version>${service.common.version}</version>
      <exclusions>
        <exclusion>
          <groupId>org.springframework</groupId>
          <artifactId>spring</artifactId>
        </exclusion>
        <exclusion>
          <groupId>com.google.collections</groupId>
          <artifactId>google-collections</artifactId>
        </exclusion>
      </exclusions>
    </dependency>
    <dependency>
      <groupId>com.shopzilla.site2.core</groupId>
      <artifactId>core</artifactId>
      <version>${core.version}</version>
      <exclusions>
        <exclusion>
          <groupId>org.springframework</groupId>
          <artifactId>spring</artifactId>
        </exclusion>
        <exclusion>
          <groupId>com.google.collections</groupId>
          <artifactId>google-collections</artifactId>
        </exclusion>
      </exclusions>
    </dependency>
  </dependencies>
</dependencyManagement>

As you can tell from the example (just a small part of the POM) that I pasted in, I had a similar problem with google-collections. The top level project uses Guava, so binary incompatible versions included by the dependencies needed to be excluded. The problems here are:

  1. It’s painful to figure out what libraries cause the conflicts – sometimes, you know or can easily guess (like the fact that different versions of Spring packages can clash), but other times you need to know something a little less obvious (like the fact that Guava has superseded google-collections, something not immediately clear from the names). The tool could just tell you that you have binary incompatibilities on your classpath (I actually submitted a patch to the Maven dependency plugin to fix that, but it’s been stalled for 6 months).
  2. Once you’ve figured out what causes the problem, it’s a pain to get rid of all the places it comes from. The main tool at hand is the dependency plugin, and the way to figure out where dependencies come from is mvn dependency:tree. This lets you know a single source of a particular dependency. So for me, I wanted to find out where the spring jar came from – that meant running mvn dependency:tree, adding an exclude, running it again to find where else the spring jar was included, adding another exclude, and so on. This could be so much easier. And since it could be easier, it should be.
  3. What’s more, the problems are sometimes environment-dependent, so you’re not guaranteed that they will show up on your development machine. I’m not sure about the exact reasons, but I believe that there are differences in the order in which different class loaders load classes in a WAR. This might mean that the only place you can test if a particular problem is solved or not is your CI server, or some other environment, which again adds pain to the process.
  4. The configuration is rather verbose and you need to introduce duplicates, which makes your build files harder to understand at a glance.
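The first point, having the tool report clashes itself, could start from something as simple as the sketch below. It assumes the contents of each jar have already been listed into a map from jar name to class names (reading the jars is left out), and the names DuplicateClassSketch and findDuplicates are invented for the example. A class that appears in two jars is only a candidate for a binary incompatibility, not proof of one, but it is the kind of report that would have pointed straight at the Spring clash described above.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch: report classes that are defined by more than one jar.
final class DuplicateClassSketch {

    /**
     * Takes jar name -> class names contained in that jar, and returns
     * class name -> the jars defining it, for classes found in two or more jars.
     */
    static Map<String, Set<String>> findDuplicates(Map<String, List<String>> jarContents) {
        Map<String, Set<String>> owners = new TreeMap<>();
        for (Map.Entry<String, List<String>> jar : jarContents.entrySet()) {
            for (String className : jar.getValue()) {
                owners.computeIfAbsent(className, key -> new TreeSet<>()).add(jar.getKey());
            }
        }
        // Keep only the classes that more than one jar claims to define.
        owners.values().removeIf(jars -> jars.size() < 2);
        return owners;
    }
}
```

Because the result lists every jar that contains the class, it also addresses the second point: all sources of a clash are visible in one pass instead of one per run.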

Apart from considering binary incompatibilities to be errors (and reporting on exactly where they are found), here’s how I think exclusions should work in Bygg:


 dependencies.exclude().group("com.google.collections").artifact("google-collections")
          .exclude().group("org.springframework").artifact("spring.*").version(except("3.0.3"));

Key points above are:

  1. Making excludes a global thing, not a per-dependency thing. As soon as I’ve identified that I don’t want spring.jar version 2.5.6 in my project, I know I don’t want it from anywhere at all. I don’t care where it comes from, I just don’t want it there! I suppose there is a case for saying “I trust the core library to include google-collections for me, but not the common-web one”, so maybe global excludes aren’t enough. But they would certainly have helped me tremendously on many of the occasions when I’ve had to use Maven exclusions, and I can’t think of a case where I’ve actually wanted specifically to have an artifact-specific exclusion.
  2. Defining exclusion filters using a fluent API that includes regular expressions. With Spring in particular, you want to make sure that all your jars have the same version. It would be great to be able to say that you don’t want anything other than that.
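A sketch of how such global, regex-based exclusion rules might be evaluated. This is only one possible shape, not a Bygg design decision: versionExcept stands in for the version(except(...)) call in the example above, rules are added with separate exclude() calls rather than one long chain, and group and artifact patterns are treated as full-match regular expressions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch of global exclusion rules with regular expressions.
final class ExclusionSketch {

    /** One exclusion rule; unset parts match anything. */
    static final class Rule {
        private Pattern group = Pattern.compile(".*");
        private Pattern artifact = Pattern.compile(".*");
        private String allowedVersion; // null means every version is excluded

        Rule group(String regex) {
            group = Pattern.compile(regex);
            return this;
        }

        Rule artifact(String regex) {
            artifact = Pattern.compile(regex);
            return this;
        }

        /** Excludes every version except the given one. */
        Rule versionExcept(String version) {
            allowedVersion = version;
            return this;
        }

        boolean excludes(String groupId, String artifactId, String version) {
            return group.matcher(groupId).matches()
                    && artifact.matcher(artifactId).matches()
                    && !version.equals(allowedVersion);
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    /** Starts a new rule that applies to the whole build, whatever pulls the artifact in. */
    Rule exclude() {
        Rule rule = new Rule();
        rules.add(rule);
        return rule;
    }

    boolean isExcluded(String groupId, String artifactId, String version) {
        for (Rule rule : rules) {
            if (rule.excludes(groupId, artifactId, version)) {
                return true;
            }
        }
        return false;
    }
}
```

With a rule for group "org\\.springframework", artifact "spring.*" and versionExcept("3.0.3"), the 2.5.6 ‘spring’ jar is dropped wherever it comes from, while the 3.0.3 jars stay.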

Build Java with Java?!

I’ve gone through different phases when thinking about using Java to configure builds rather than XML. First, I thought “it’s great, because it allows you to more easily use the standard debugger for the builds and thereby resolve Maven’s huge documentation problem”. But then I realised that the debugging is enabled by ensuring that the IDE has access to the source code of the build tool and plugins that execute the build, and that how you configure it is irrelevant. So then I thought that using Java to configure is pretty good anyway, because it means developers won’t need to learn a new language (as with Buildr or Raven), and that IDE integration is a lot easier. The IDE you use for your regular Java programming wouldn’t need to be taught anything specific to deal with some more Java code. I’ve now come to the conclusion that DSL-style configuration APIs and, even more, the ability to use standard engineering principles for sharing and reusing code in build configurations are further powerful arguments in favour of using Java in the build configuration. So I’ve gone from “Java configuration is key”, to “Java configuration is OK, but not important” to “Java configuration is powerful”.


Bygg – Ideas for a Better Maven

In a previous post, I outlined some objectives for a better Maven. In this post, I want to talk about some ideas for how to achieve those objectives.

Build Java with Java

The first idea is that the default modus operandi should be that builds are runnable as a regular Java application from inside the IDE, just like the application that is being built. The IDE integration of the build tool should ensure that source code for the build tool and its core plugins is attached so that the developer can navigate through the build tool source. A not quite necessary part of this is that build configuration should be done using Java rather than XML, Ruby, or something else. This gives the following benefits:

  • Troubleshooting the build is as easy as just starting it in the debugger and setting break points. To understand how to use a plugin, you can navigate to the source implementing it from your build configuration. I think this is potentially incredibly valuable. If it is possible to integrate with IDEs to such an extent that the full source code of the libraries and plugins that are used in the build is available, that means that stepping through and debugging builds is as easy as for any library used in the product being built. And harnessing the highly trained skills that most Java developers have developed for understanding how a third-party library works is something that should make the build system much more accessible compared to having to rely on documentation.
  • It can safely be assumed that Java developers are or at least want to be very proficient in Java. Using a real programming language for build configurations is very advantageous compared to using XML, especially for those occasions when you have to customise your build a little extra. (This should be the exception rather than the rule as it increases the risk of feature creep and reduces standardisation.)
  • The IDEs that Java developers use can be expected to be great tools for writing Java code. Code completion, Javadoc popups, instant navigation to the source that implements a feature, etc. This should reduce the complexity of IDE integration.
  • It opens up for very readable DSL-style APIs, which should reduce the build script complexity and increase succinctness. Also, using Java means you could instantly see which configuration options are available for the thing you’re tweaking at the moment (through method completion, checking values of enums, etc., etc.).

There are some drawbacks, such as having to figure out a way to bootstrap the build (the build configuration needs to be compiled before a build can be run), and the fact that you have to provide a command-line tool anyway for continuous integration, etc. But I don’t think those problems are hard enough to be significant.

My first thought was that using Java for the build configuration files would be great in itself. But as I’ve been thinking about it, I’ve concluded that the exact format of the configuration files is less important than making sure that developers can navigate through the build source code in exactly the same way as through any third party library. That is something I’ve wanted to do many times when having problems with Maven builds, but it’s practically impossible today. There’s no reason why it should be harder to troubleshoot your build than your application.

Interlude: Aspects of Build Configurations

Identifying and separating things that need to be distinct and treated differently is one of the hard things to do when designing any program, and one of the things that has truly great effect if you get it right. So far, I’ve come up with a few different aspects of a build, namely:

  • The project or artifact properties – this includes things such as the artifact id, group id, version etc., and can also include useful information such as a project description, SCM links, etc.
  • The project’s dependencies – the third-party libraries that are needed in the build; either at compile time, test time or package time.
  • The build properties – any variables that you have in your build. A lot of times, you want to be able to have environment-specific variables that you use, or to refer to build-specific variables such as the output directory for compiled classes. The distinction between variables that may be overridden on a per-installation basis from variables that are determined during the build may mean that there are more than one kind of properties.
  • The steps that are taken to complete the build – compilations, copying, zipping, running tests, and so on.
  • The things that execute the various steps – the compiler, the test executor, a plugin that helps generate code from some type of resources, etc.

The point of separating these aspects of the build configuration is that it is likely that you’ll want to treat them differently. In Maven, almost everything is done in a single POM file, which grows large and becomes hard to get an overview of, and/or in a settings.xml file that adds obscurity and reduces build portability. An example of a problem with bunching all the build configuration data into a single file is IntelliJ’s excellent Maven plugin, which wants to reimport the pom.xml with every change you make. Reimporting takes a lot of time on my chronically overloaded MacBook (well, 5-15 seconds, I guess – far too much time). It’s necessary because if I’ve changed the dependencies, IntelliJ needs to update its classpath. The thing is, I think more than 50% or even 70% of the changes I make to pom files don’t affect the dependencies. If the dependencies section were separate from the rest of the build configuration, reimporting could be done only when actually needed.

I don’t think this analysis has landed yet; it feels like some pieces or nuances are still missing. But it helps as background for the rest of the ideas outlined in this post. The key point is that the steps taken during the build, and the order in which they should be taken, are separate from the things that do something during each step.

Non-linear Build Execution

The second idea is an old one: abandoning Maven’s linear build lifecycle and instead getting back to the way that make does things (which is also Ant’s way). So rather than having loads of predefined steps in the build, it’s much better to be able to specify a DAG of dependencies between steps that defines the order of execution. This is better for at least two reasons: first, it’s much easier to understand the order of execution if you say “do A before B” or “do B after A” in your build configuration than if you say “do A during the process-classes phase and B during the generate-test-sources phase”. And second, it opens up the door to do independent tasks in parallel, which in turn creates opportunities for performance improvements. So for instance, it could be possible to download dependencies for the test code in parallel with the compilation of the production code, and it should be possible to zip up the source code JAR file at the same time as JavaDocs are generated.

What this means in concrete terms is that you would write something like this in your build configuration file:


  buildSteps.add(step("copyPropertiesTemplate")
         .executor(new CopyFile("src/main/template/properties.template",
                                "${OUTPUT_DIR}/properties.txt"))
         .before("package"));

Selecting what to build would be done via the build step names – so if all you wanted to do was to copy the properties template file, you would pass “copyPropertiesTemplate” to the build. The tool would look through the build configuration and in this case probably realise that nothing needs to be run before that step, so the “copyPropertiesTemplate” step would be all that was executed. If, on the other hand, the user stated that the “package” step should be executed, the build tool would discover that lots of things have to be done before – not only “copyPropertiesTemplate” but also “performCoverageChecks”, which in turn requires “executeTests”, and so on.

As the example shows, I would like to add a feature to the make/Ant version: specifying that a certain build step should happen before another one. The reason is that I think that the build tool should come with a default set of build steps that allow the most common build tasks to be run with zero configuration (see below for more on that). So you should always be able to say “compile”, or “test”, and that should just work as long as you stick to the conventions for where you store your source code, your tests, etc. Without a “before” feature, a user would have to define a build step like the one above in isolation, and then modify the pre-existing “package” step to depend on the “copyPropertiesTemplate” one, which is awkward.

In design terms, there would have to be some sort of BuildStep entity that has a unique name (for the user interface), a set of predecessors and successors, and something that should be executed. There will also have to be a scheduler that can figure out a good order to execute steps in. I’ve made some skeleton implementations of this, and it feels like a good solution that is reasonably easy to get right. One thing I’ve not made up my mind about is the name of this entity – the two main candidates are BuildStep and Target. Build step explains well what it is from the perspective of the build tool, while Target reminds you of Ant and Make and focuses the attention on the fact that it is something a user can request the tool to do. I’ll use build step for the remainder of this post, but I’m kind of thinking that target might be a better name.
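
To make the BuildStep and scheduler idea a little more tangible, here is a minimal sketch in Java. It is not the skeleton implementation mentioned above: the class name `BuildGraph` and the methods `step`, `before` and `planFor` are hypothetical, executors are left out entirely, and the scheduler only produces one valid sequential order (a depth-first topological sort) rather than running anything in parallel.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: named build steps plus a scheduler that orders them. */
final class BuildGraph {
    // Step name -> names of the steps that must run before it.
    private final Map<String, Set<String>> predecessors = new LinkedHashMap<>();

    /** Registers a step; calling it again for the same name changes nothing. */
    BuildGraph step(String name) {
        predecessors.computeIfAbsent(name, k -> new LinkedHashSet<>());
        return this;
    }

    /** Declares that 'first' must run before 'second'. */
    BuildGraph before(String first, String second) {
        step(first);
        step(second);
        predecessors.get(second).add(first);
        return this;
    }

    /** Returns the steps needed to complete 'target', predecessors first. */
    List<String> planFor(String target) {
        if (!predecessors.containsKey(target)) {
            throw new IllegalArgumentException("Unknown step: " + target);
        }
        List<String> plan = new ArrayList<>();
        visit(target, new LinkedHashSet<>(), new LinkedHashSet<>(), plan);
        return plan;
    }

    // Depth-first traversal; a step seen again while still in progress means a cycle.
    private void visit(String name, Set<String> inProgress, Set<String> done, List<String> plan) {
        if (done.contains(name)) {
            return;
        }
        if (!inProgress.add(name)) {
            throw new IllegalStateException("Cycle involving step: " + name);
        }
        for (String predecessor : predecessors.get(name)) {
            visit(predecessor, inProgress, done, plan);
        }
        inProgress.remove(name);
        done.add(name);
        plan.add(name);
    }

    public static void main(String[] args) {
        BuildGraph graph = new BuildGraph()
                .before("executeTests", "performCoverageChecks")
                .before("performCoverageChecks", "package")
                .before("copyPropertiesTemplate", "package");
        System.out.println(graph.planFor("copyPropertiesTemplate"));
        System.out.println(graph.planFor("package"));
    }
}
```

Requesting "copyPropertiesTemplate" yields a plan containing only that step, while requesting "package" pulls in everything that was declared to come before it. A real scheduler would presumably also group steps with no mutual dependencies so that they can run concurrently.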

Gradual Buildup of DI Scope

Build steps will need to communicate results to one another. Some of this will of necessity be done via the file system – for instance, the compilation step will leave class files in a well-defined place for the test execution and packaging steps to pick up. Other results should be communicated in-memory, such as the current set of build properties and values, or the exact classpath to be used during compilation. The latter should be the output of some assembleClassPath step that checks or downloads dependencies and provides an exact list of JAR files and directories to be used by the compiler. You don’t want to store that sort of thing on the file system.

In-memory parameters should be injected into the executors of subsequent build steps that need them. This means that the build step executors will be gradually adding stuff to the set of objects that can be injected into later executors. A concrete implementation of this that I have been experimenting with is using hierarchical Guice injectors to track this. That means that each step of the build returns a (possibly empty) Guice module, which is then used to create an injector that inherits all the previous bindings from preceding steps. I think that works reasonably well in a linear context, but that merging injectors in a more complex build scenario is harder. A possibly better solution is to use the full set of modules used and created by previous steps to create a completely new injector at the start of each build step. Merging is then simply taking the union of the sets of modules used by the two merging paths through the DAG.
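
The "union of module sets" idea can be sketched without any DI framework. The code below is an illustration only and does not use Guice: `StepModule` is a hypothetical stand-in for a Guice module (just a map of named bindings), and `BuildContext.resolve()` stands in for creating a fresh injector from the full set of modules at the start of a step. How conflicting bindings should be handled is not covered in the text above, so failing fast is an assumption.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Stand-in for a DI module (not Guice): the bindings contributed by one build step. */
final class StepModule {
    final String producedBy;
    final Map<String, Object> bindings;

    StepModule(String producedBy, Map<String, ?> bindings) {
        this.producedBy = producedBy;
        this.bindings = Collections.unmodifiableMap(new LinkedHashMap<String, Object>(bindings));
    }
}

/** The set of modules visible on one path through the build DAG. */
final class BuildContext {
    private final Set<StepModule> modules;

    private BuildContext(Set<StepModule> modules) {
        this.modules = Collections.unmodifiableSet(modules);
    }

    static BuildContext empty() {
        return new BuildContext(new LinkedHashSet<StepModule>());
    }

    /** The context seen by later steps once a step has returned its module. */
    BuildContext plus(StepModule module) {
        Set<StepModule> next = new LinkedHashSet<>(modules);
        next.add(module);
        return new BuildContext(next);
    }

    /** Merging two paths is simply the union of their module sets. */
    static BuildContext merge(BuildContext left, BuildContext right) {
        Set<StepModule> union = new LinkedHashSet<>(left.modules);
        union.addAll(right.modules);
        return new BuildContext(union);
    }

    int moduleCount() {
        return modules.size();
    }

    /** Builds a fresh view of all bindings, much as a new injector would be built per step. */
    Map<String, Object> resolve() {
        Map<String, Object> all = new LinkedHashMap<>();
        for (StepModule module : modules) {
            for (Map.Entry<String, Object> binding : module.bindings.entrySet()) {
                Object previous = all.put(binding.getKey(), binding.getValue());
                if (previous != null && !previous.equals(binding.getValue())) {
                    // Assumption: two steps binding the same key differently is an error.
                    throw new IllegalStateException("Conflicting bindings for " + binding.getKey());
                }
            }
        }
        return all;
    }

    public static void main(String[] args) {
        StepModule properties = new StepModule("loadProperties",
                Collections.singletonMap("outputDir", "target/classes"));
        StepModule classpath = new StepModule("assembleClassPath",
                Collections.singletonMap("classpath", "lib/a.jar"));
        BuildContext root = BuildContext.empty().plus(properties);
        BuildContext merged = BuildContext.merge(root.plus(classpath), root);
        System.out.println(merged.resolve().keySet());
    }
}
```

Because modules shared by both paths are the same objects, the union does not duplicate them, which is what makes merging simple in this formulation.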

This idea bears some resemblance to the concept of dynamic dependency injection, but it is different in that there are parallel contexts (one for each concurrently executing path) that are mutating as each build step is executed.

I much prefer using DI to inject dependencies into plugins or build steps over exposing the entire project configuration + state to plugins, for all the usual reasons. It’s a bit hard to get right from a framework perspective, but I think it should help simplify the plugins and keep them isolated from one another.

Core Build Components Provided

One thing that Maven does really well is to support simple builds. The minimal pom is really very tiny. I think this is great both in terms of usability and as an indication of powerful build standardisation/strong conventions. In the design outlined in this post, the things that would have to come pre-configured with useful defaults are a set of build steps with correct dependencies and associated executors. So there would be a step that assembles the classpath for the main compilation, another one that assembles the classpath for the test compilation, yet another one that assembles the classpath for the test execution and probably even one that assembles the classpath to be used in the final package. These steps would probably all make use of a shared entity that knows how to assemble classpaths, and that is configured to know about a set of repositories from which it can download dependencies.

By default, the classpath assembler would know about one or a few core repositories (ibiblio.org for sure). Most commercial users will hopefully have their own internal Maven repositories, so it needs to be possible to tell the classpath assembler about these. Similarly, the compiler should have some useful defaults for source version, file encodings, etc., but they should be possible to override in a single place and then apply to all steps that use them.

Of course, the executors (classpath assembler, compiler, etc.) would be shared by build steps by default, but they shouldn’t necessarily be singletons – if one wanted to compile the test code using a different set of compiler flags, one could configure the build to have two compiler instances with different parameters.
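
As an illustration of "shared by default, but not singletons", here is a small sketch of an immutable executor configuration. Everything in it is hypothetical: the class name `CompilerConfig`, its fields, and the default values are assumptions chosen for the example, not part of any existing tool.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical compiler settings; the field names and defaults are assumptions. */
final class CompilerConfig {
    final String sourceVersion;
    final String encoding;
    final List<String> extraFlags;

    private CompilerConfig(String sourceVersion, String encoding, List<String> extraFlags) {
        this.sourceVersion = sourceVersion;
        this.encoding = encoding;
        this.extraFlags = Collections.unmodifiableList(new ArrayList<>(extraFlags));
    }

    /** Defaults shipped with the tool (arbitrary values for this sketch). */
    static CompilerConfig defaults() {
        return new CompilerConfig("1.6", "UTF-8", Collections.<String>emptyList());
    }

    /** Returns a copy with a different source version; the original is untouched. */
    CompilerConfig withSourceVersion(String version) {
        return new CompilerConfig(version, encoding, extraFlags);
    }

    /** Returns a copy with one more compiler flag; the original is untouched. */
    CompilerConfig withFlag(String flag) {
        List<String> flags = new ArrayList<>(extraFlags);
        flags.add(flag);
        return new CompilerConfig(sourceVersion, encoding, flags);
    }

    public static void main(String[] args) {
        // Override the defaults once; every step using 'shared' picks this up.
        CompilerConfig shared = CompilerConfig.defaults().withSourceVersion("1.5");
        // A second instance for the test code, differing only in its flags.
        CompilerConfig testCompiler = shared.withFlag("-Xlint:unchecked");
        System.out.println(shared.sourceVersion + " " + shared.extraFlags);
        System.out.println(testCompiler.sourceVersion + " " + testCompiler.extraFlags);
    }
}
```

Deriving the test compiler from the shared one means a single override still reaches both, while the extra flag stays local to the test compilation.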

The core set of build steps and executors should at a minimum allow you to build, test and deploy (in the Maven sense) a JAR library. Probably, there should be more stuff that is considered to be core than just that.

Naming the Tool

The final idea is a name – Bygg. Bygg is a Swedish word that means “build” as in “build X!”, not as in “a build” or “to build” (a verb in imperative form in other words). It’s probably one letter too long, but at least it’s not too hard to type the second ‘g’ when you’ve already typed the first. It’s got the right number of syllables and it means the right thing. It’s even relatively easy to pronounce if you know English (make it sound like “big” and you’ll be fine), although of course you have to have a Scandinavian background to really pronounce the “y” right.

That’s more than enough word count for this post. I have some more ideas about dependency management, APIs and flows in the tool, but I’ll have to save them for another time. Feel free to comment on these ideas, particularly about areas where I am missing the mark!

Objectives for A Better Maven

My friend Josh Slack made me aware of this post, by a guy (Kent Spillner) who is totally against Maven in almost every way. As I’ve mentioned before, I think Maven is the best tool out there for Java builds, so of course I like it better than Kent does. Still, there’s no doubt he has some points that you can’t help agreeing with. Reading his post made me think (once again) about what is great and not so great about Maven, and also of some ideas about how to fix the problems whilst retaining the great stuff (edit: I’ve started outlining these ideas here, with more to follow).

First, some of the things that are great:

  1. Dependency Management – I would go so far as to argue that Maven has done more to enable code reuse than anything else that is touted as a ‘reusability paradigm’ (such as OO itself). Before Maven and its repositories, you had to manually add every single dependency and its transitive requirements into each project, typically even into your source repository. The amount of manual effort needed to upgrade a library and its transitive dependencies from one version to another means the optimal size of a library is quite large, which makes libraries unfocused and bloated. What’s more, it also means that library designers have a strong need to reduce the number of things they allow themselves to depend on, which reduces the scope for code reuse. With Maven, libraries can be more focused as it is effortless to have a deep dependency tree. At Shopzilla, our top-level builds typically include 50-200 dependencies. Imagine adding these to your source repository and keeping them up to date with every change – completely impossible!
  2. Build standardisation. The first sentence in Kent Spillner’s post is “The best build tool is the one you write yourself”. That’s probably true from the perspective of a single project, but with a larger set of projects that are collaboratively owned by multiple teams of developers, that idea breaks quickly. Again, I’ll use Shopzilla as an example – we have more than 100 Git repositories with Java code that are co-owned by 5-6 different teams. This means we must have standardised builds, or we would waste lots of time due to having to learn about custom builds for each project. Any open source project exists in an even larger ecosystem; essentially a global one. So unless you know that the number of developers who will be building your project is always going to be small, and that these developers will only have to work with a small number of projects, your build should be “mostly declarative” and as standardised as you can make it.
  3. The wealth of plugins that allow you to do almost any build-related task. This is thanks to the focus on a plugin-based architecture right from the get-go.
  4. The close integration with IDEs that makes it easier (though not quite painless) to work with it.

Any tool that would improve on Maven has to at least do equally well on those four counts.

To get a picture of the opportunities for improvement, here’s my list of Maven major pain points:

  1. Troubleshooting is usually hard to extremely hard. When something breaks, you get very little help from Maven to figure out what it is. Enabling debug level logging on the build makes it verbose to the point of obscuring the error. If you, like me, like to use the source code to find out what is happening, it is difficult to find because you will have to jump from plugin to plugin, and most plugins have different source repositories.
  2. Even though there is a wealth of plugins that allow you to do almost anything, it is usually unreasonably hard to a) find the right plugin and b) figure out how to use it. Understanding what Maven and its plugins do is really hard, and the documentation is frequently sub-standard.
  3. A common complaint is the verbose XML configuration. That is definitely an issue: succinctness improves readability and ease of use.
  4. The main drawback of the transitive dependency management is the risk of getting incompatible versions of the same library, or even worse, class. There is very little built-in support for managing this problem in Maven (there is some in the dependency plugin, but actually using that is laborious). This means that it is not uncommon to have large numbers of ‘exclusion’ tags for some dependencies polluting your build files, and that you anyway tend to have lots of stuff that you never use in your builds.
  5. Maven is slow; there’s no doubt about that. It takes time to create various JVMs, to download files/check for updates, etc. Also, every build runs through the same build steps even if some of them are not needed.
  6. Builds can succeed or fail on different machines for unobvious reasons – typically, the problem is due to differing versions of SNAPSHOT dependencies being installed in the local repository cache. It can also be due to using different versions of Maven plugins.

There’s actually quite a lot more that could be improved, but those are probably my main gripes. When listing them like this, I’m surprised to note that despite all these issues, I still think Maven is the best Java build tool out there. I really do think it is the best, but there’s no doubt that there’s plenty of room to improve things. So I’ve found myself thinking about how I would go about building a better Maven. I am not sure if I’ll be able to actually find the time to implement it, but it is fascinating enough that I can’t let go of the idea. Here’s what I would consider a useful set of objectives for an improved Maven, in order of importance:

  1. Perfect interoperability with the existing Maven artifact management repository infrastructure. There is so much power and value in being able to with near-zero effort get access to pretty much any open source project in the Java world that it is vital to be able to tap into that. Note that the value isn’t just in managing dependencies in a similar way to how Maven does it, but actually reusing the currently available artifacts and repositories.
  2. Simplified troubleshooting. More and better consistency checks of various kinds and at earlier stages of the build. Better and more to the point reporting of problems. Great frameworks tend to make this a key part of the architecture from the get-go rather than add it on as an afterthought.
  3. A pluggable architecture that makes it easy to add custom build actions. This is one of Maven’s great success points so a new framework has to be at least as good. I think it could and should be even easier than Maven makes it.
  4. Encouraging but not enforcing standardised builds. This means sticking to the idea of “convention over configuration”. It also means that defining your build should be a “mostly declarative” thing, not an imperative thing. You should say “I want a JAR”, not “zip up the class files in directory X”. Programming your builds is sometimes a necessary evil so it must be possible, but it should be discouraged as it is a slippery slope that leads to non-standardised builds, which in turn makes it harder for anybody coming new to a project to get going with it.
  5. Great integration with IDEs, or at least support for the ability to create great integration with IDEs. This is a necessary part of giving programmers a workflow that never forces them out of the zone.
  6. Less verbose configuration. Not a show-stopper in my opinion, but definitely a good thing to improve.
  7. EDIT: While writing further posts on this topic, I’ve realised that there is one more thing that I consider very important: improving performance. Waiting for builds is a productivity-killing drag.

It’s good to specify what you want to do, but in some ways, it’s even better to specify things you’re not trying to achieve either because they’re not relevant or because you find them counter-productive. That gives a different kind of clarity. So here’s a couple of non-objectives:

  1. Using the same artifact management mechanism for build tool plugins as for project dependencies, the way Maven does. While there is some elegance to this idea, it also comes with a host of difficulties – unreproducible builds being the main one, and while earlier versions of Maven actively updated plugin versions most or all the time, Maven 3 now issues warnings if you haven’t specified the plugin versions for your build.
  2. Reimplementing all the features provided by Maven plugins. Obviously, trying to out-feature something so feature-rich as Maven would be impossible and limit the likelihood of success hugely. So one thing to do is to select a subset of build steps that represent the most common and/or most different things that are typically done in a build and then see how well the framework deals with that.
  3. Being compatible with Maven plugins. In a way, it would be great for a new build tool to be able to use any existing Maven plugin. But being able to do that would limit the architectural options and increase the complexity of the new architecture to the point of making it unlikely to succeed.
  4. Reproducing the ‘project information’ as a core part of the new tool. Producing project information was one of the core goals of Maven when it was first created. I personally find that less than useful, and therefore not worth making into a core part of a better Maven. It should of course be easy to create a plugin that does this, but it doesn’t have to be a core feature.

I’ve got some ideas for how to build a build tool that meets or is likely to meet most of those objectives. But this post is already more than long enough, and I’d anyway like to stop here to ask for some feedback. Any opinions on the strengths, weaknesses and objectives outlined here?

The Power of Standardisation

A couple of recent events made me see the value of company-internal standardisation in a way that I hadn’t before. Obviously, reusing standardised solutions for software development is a good thing, as it is easier to understand them. But I’ve always rated continuous evolution of your technologies and choosing the best tool for the problem at hand higher. I’m beginning to think that was wrong, and that company-internal standards are or can be a very important consideration when choosing solutions.

Let’s get to the examples. First, we’ve had a set of automated functional regression tests that we can run against our sites for quite some time. But since the guy who developed those tests left, it has turned out that we can’t run them any more. The reason is that they were developed using Perl, a programming language that most of the team members are not very comfortable with, and that the way you would run them involved making some manual modifications to certain configuration files, then selecting the right ‘main’ file to execute. We’ve recently started replacing those tests with new ones written in Java and controlled by TestNG. This means it was trivial for us to run the tests through Maven. Some slight cleverness (quite possibly a topic for another blog post) allows us to run the tests for different combinations of sites (our team runs 8 different sites with similar but not identical functionality) and browsers using commands like this:

mvn test -Pall-sites -Pfirefox -Pchrome -Pstage

This meant it was trivial to get Hudson to run these tests for us against the staging environment and that both developers and QA can run the tests at any time.

The second example is also related to automated testing – we’ve started creating a new framework for managing our performance tests. We’ve come to the conclusion that our team in the EU has different needs to the team in the US that maintains our current framework, and in the interest of perfect agility, we should be able to improve our productivity by owning the tools we work with. We just deployed the first couple of versions of that tool, and our QA Lead immediately told me that he felt that even though the tool is still far inferior to the one it will eventually replace, he was really happy to have a plain Java service to deploy as opposed to the current Perl/CGI-based framework. Since Java services are our bread and butter at Shopzilla (I’ve lost count, but our systems probably include about 30-50 services, most of which are written in Java and use almost RESTful HTTP+XML APIs), we have great tools that support automated deployment and monitoring of these services.

The final example was a program for batch processing of files that we need for a new feature. Our initial solution was a plain Java executable that monitored a set of directories for files to process. However, it quickly became obvious that we didn’t know how to deal with that from a configuration management/system operations perspective. So even though the development and QA were done, and the program worked, we decided to refit it as one of our standard Java services, with a feature to upload files via POST requests instead of monitoring directories.

In all of these cases, there are a few things that come to mind:

  • We’ve invested a lot in creating tools and processes around managing Java services that are deployed as WARs into Tomcat. Doing that is dead easy for us.
  • We get a lot of common features for free when reusing this deployment form: standard solutions for setting server-specific configuration parameters, logging, load balancing, debugging, build numbers, etc.
  • Every single person working here is familiar with our standard Java/Tomcat services. Developers know where to find the source code and where the entry points are. QA knows where to find log files, how to deploy builds for testing and how to check the current configuration. CM knows how to configure environments and how to set up monitoring tools, and so on.

I think there is a tendency among developers – certainly with me – to think only about their own part when choosing the tools and technologies for developing something new. So if I were an expert in, say, Ruby on Rails, it would probably be very easy for me to create some kind of database admin tool using that. But everybody else would struggle with figuring out how to deal with it – where can I find the logs, how do I build it, how is it deployed and how do I set up monitoring?

There is definitely a tradeoff to be made between being productive today with existing tools and technologies and being productive tomorrow through migrating to newer and better ones. I think I’ve not had enough understanding of how much smoother the path can be if you stay with the standard solutions compared to introducing new technologies. My preference has always been to do gradual, almost continuous migration to newer tools and technologies, to the point of having that as an explicit strategy at Jadestone a few years ago. I am now beginning to think it’s quite possible that it is better to do technology migrations as larger, more discrete events where a whole ecosystem is changed or created at the same time. In the three cases above, we’re staying on old, familiar territory. That’s the path of least resistance and most bang for the buck.

Finding Duplicate Class Definitions Using Maven

If you have a largish set of internal libraries with a complex dependency graph, chances are you’ll be including different versions of the same class via different paths. The exact version of the class that gets loaded seems to depend on the combination of JVM, class loader and operating system that happens to be used at the time. This can cause builds to fail on some systems but not others and is quite annoying. When this has been happening to me, it’s usually been for one of two reasons:

  1. We’ve been restructuring our internal artifacts, and something was moved from artifact A to B, only the project in question is still on a version of artifact A that is “pre-removal”. This often leads to binary incompatibilities if the class has evolved since being moved to artifact B.
  2. Two artifacts in the dependency graph have dependencies on artifacts that, while actually different as artifacts, contain class files for the same class. This can typically happen with libraries that provide jar distributions that include all dependencies, or where there are distributions that are partial or full.

On a couple of previous occasions, when trying to figure out how duplicate class definitions made it into projects I’ve been working on, I’ve gone through a laborious manual process to list class names defined in jars, and see which ones are repeated in more than one. I thought that a better option might be to see if that functionality could be added into the Maven dependency plugin.

My original idea was to add a new goal, something like ‘dependency:duplicate-classes’, but when looking a little more closely at the source code of the dependency plugin, I found that the dependency:analyze goal had all the information needed to figure out which classes are defined more than once. So I decided to make a version of the maven-dependency-plugin where it is possible to detect duplicate class definitions using ‘mvn dependency:analyze’.

The easiest way to run the updated plugin is like this:

mvn dependency:analyze -DcheckDuplicateClasses

The output if duplicate classes are found is something like:

[WARNING] Duplicate class definitions found:
[WARNING]    com.shopzilla.common.data.ObjectFactory defined in:
[WARNING]       com.shopzilla.site.url.c14n:model:jar:1.4:compile
[WARNING]       com.shopzilla.common.data:data-model-schema:jar:1.23:compile
[WARNING]    com.shopzilla.site.category.CategoryProvider defined in:
[WARNING]       com.shopzilla.site2.sasClient:sas-client-core:jar:5.47:compile
[WARNING]       com.shopzilla.site2.service:common-web:jar:5.50:compile
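
For comparison, the manual process described earlier (listing the class names defined in each jar and seeing which ones occur in more than one) can be sketched as a small standalone Java program. This is not the code in the forked plugin; the class name `DuplicateClassFinder` and its methods are hypothetical, and it only uses `java.util.jar.JarFile` from the standard library. It takes jar paths as command-line arguments and knows nothing about Maven coordinates or scopes.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

/** Hypothetical standalone helper; not the implementation used by the forked plugin. */
final class DuplicateClassFinder {

    /** Lists the fully qualified names of the classes defined in a jar file. */
    static List<String> classNamesIn(String jarPath) throws IOException {
        List<String> names = new ArrayList<>();
        try (JarFile jar = new JarFile(jarPath)) {
            Enumeration<JarEntry> entries = jar.entries();
            while (entries.hasMoreElements()) {
                String entry = entries.nextElement().getName();
                if (entry.endsWith(".class")) {
                    // com/example/Foo.class -> com.example.Foo
                    String withoutSuffix = entry.substring(0, entry.length() - ".class".length());
                    names.add(withoutSuffix.replace('/', '.'));
                }
            }
        }
        return names;
    }

    /** Given jar -> class names, returns class -> jars for classes found in two or more jars. */
    static Map<String, List<String>> duplicates(Map<String, List<String>> classesByJar) {
        Map<String, List<String>> jarsByClass = new TreeMap<>();
        for (Map.Entry<String, List<String>> jar : classesByJar.entrySet()) {
            for (String className : jar.getValue()) {
                jarsByClass.computeIfAbsent(className, k -> new ArrayList<>()).add(jar.getKey());
            }
        }
        // Keep only the classes that are defined in more than one jar.
        jarsByClass.values().removeIf(jars -> jars.size() < 2);
        return jarsByClass;
    }

    public static void main(String[] args) throws IOException {
        Map<String, List<String>> classesByJar = new TreeMap<>();
        for (String jarPath : args) {
            classesByJar.put(jarPath, classNamesIn(jarPath));
        }
        duplicates(classesByJar).forEach(
                (className, jars) -> System.out.println(className + " defined in: " + jars));
    }
}
```

The advantage of doing this inside the dependency plugin instead is that the plugin already knows which jars are on the project's classpath and which artifacts they came from.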

If you would like to try the updated plugin on your project, here’s how to do it:

  1. Get the forked code for the dependency analyzer goal from http://github.com/pettermahlen/maven-dependency-analyzer-fork and install it in your local Maven repo by running ‘mvn install’. (It appears that for some people, the unit tests fail during this process – I’ve not been able to reproduce this, and it’s not the tests that I wrote, so in this case my recommendation would be to simply use -DskipTests=true to ignore them).
  2. Get the forked code for the dependency plugin from http://github.com/pettermahlen/maven-dependency-plugin-fork and install it in your local Maven repo by running ‘mvn install’.
  3. Update your pom.xml file to use the forked version of the dependency plugin (it’s probably also possible to use the plugin registry, but I’ve not tested that):
<build>
  <pluginManagement>
    <plugins>
      <plugin>
        <artifactId>maven-dependency-plugin</artifactId>
        <version>2.2.PM-SNAPSHOT</version>
      </plugin>
    </plugins>
  </pluginManagement>
</build>

I’ve filed a JIRA ticket to get this feature included into the dependency plugin – if you think it would be useful, it might be a good idea to vote for it. Also, if you have any feedback about the feature, feel free to comment here!

Code sharing: Use Maven

Maven’s slow progress towards becoming the most accepted Java build tool seems to continue, although a lot of people are still annoyed enough with its numerous warts to prefer Ant or something else. My personal opinion is that Maven is the best build solution for Java programs that is out there, and as somebody said – I’ve been trying to find the quote, but I can’t seem to locate it – when an Ant build is complicated, you blame yourself for writing a bad build.xml, but when it is hard to get Maven to behave, you blame Maven. With Ant, you program it, so any problems are clearly due to a poorly structured program. With Maven you don’t tell it how to do things, you try to tell it what should be done, so any problems feel like the fault of the tool. The thing is, though, that Maven tries to take much more responsibility for some of the issues that lead to complex build scripts than something like Ant does.

I’ve certainly spent a lot of time cursing poorly written build scripts for Ant and other tools, and I’ve also spent a lot of time cursing Maven when it doesn’t do what I want it to. But the latter time is decreasing as Maven keeps improving as a build tool. There have been lots of attempts to create other tools that are supposed to make builds easier than Maven, but from what I have seen, nothing has yet really succeeded in providing a clearly better option (I’ve looked at Buildr and Raven, for instance). I think the truth is simply that the build process for a large system is a complex problem to solve, so one cannot expect it to be free of hassles. Maven is the best tool out there for the moment, but will surely be replaced by something better at some point.

So, using Maven isn’t going to be problem-free. But it can help with a lot of things, particularly in the context of sharing code between multiple teams. The obvious thing it helps with is the single benefit that most people agree that Maven has – its way of managing dependencies and the massive repository infrastructure and dependency database that is just available out there. On top of that, building Maven projects in Hudson is dead easy, and there’s a whole slew of really nice tools with Maven plugins that give you all kinds of reports and metadata about your code. My current favourite is Sonar, which is great if you want to keep track of how your code base evolves from some kind of aggregated perspective.

Here are some things you’ll want to do if you decide to use Maven for the various projects that make up your system:

  1. Use Nexus as an internal repository for build artifacts.
  2. Use the Maven Release plugin to create releases of internal artifacts.
  3. Create a shared POM for the whole code base where you can define shared settings for your builds.

The word ‘repository’ is a little overloaded in Maven, so it may be confusing. Here’s a diagram that explains the concept and shows some of the things that a repository manager like Nexus can help you with:

The setup includes a Git server (because you use Git) for source control, a Hudson server (or a set of them) that does continuous integration, a Nexus-managed artifact repository and a developer machine. The Nexus server has three repositories in it: internal releases, internal snapshots and a cache of external repositories. The latter is only there as a performance improvement. The other two are the way that you distribute Maven artifacts within your organisation. When a Maven build runs on a Hudson or developer machine, Maven will use artifacts from the local repository on the machine – by default located in a folder under the user’s home directory. If a released version of an artifact isn’t present in the local repository, it will be downloaded from Nexus, and snapshot versions will periodically be refreshed, even if present locally. In the example setup, new snapshots are typically deployed to the Nexus repository by the Hudson server, and released versions are typically deployed by the developer producing the release. Note that both Hudson and developers are likely to install snapshots to the local repository.
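
How often snapshots are refreshed can be set per repository. As a minimal sketch, assuming the internal snapshot repository is declared in a POM under the id internal-snapshots (the id and the NEXUSHOST placeholder are only there for illustration), the updatePolicy element controls how frequently Maven checks Nexus for a newer snapshot:

```xml
<!-- Fragment for the <repositories> section of a pom.xml; NEXUSHOST is a placeholder -->
<repository>
  <id>internal-snapshots</id>
  <url>http://NEXUSHOST/nexus/content/repositories/internal-snapshots</url>
  <releases><enabled>false</enabled></releases>
  <snapshots>
    <enabled>true</enabled>
    <!-- Accepted values: always, daily (the default), interval:X (X in minutes), never -->
    <updatePolicy>interval:60</updatePolicy>
  </snapshots>
</repository>
```

Whatever the policy says, passing -U on the mvn command line forces a check for updated snapshots for that one build.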

I’ve tried a couple of other repository managers (Archiva, Artifactory and Maven-Proxy), but Nexus has been by a pretty wide margin the best – robust, easy to use and easy to understand. It’s been a year or two since I looked at the other ones, so they may have improved since.

Having an internal repository opens the door to code sharing by providing a uniform mechanism for distributing updated versions of internal libraries using the standard Maven deploy command. Maven has two types of artifact versions: releases and snapshots. Releases are assumed to be immutable and snapshots mutable, so updating a snapshot in the internal repository will affect any build that downloads the updated snapshot, whereas releases are supposed to be deployed to the internal repository once only – any subsequent deployments should deploy something that is identical. Snapshots are tricky, especially when branching. If you create two branches of the same library and fail to ensure that the two branches have different snapshot versions, the two branches will interfere.

There is interference between the two branches because they both create updates to the same artifact in the Maven repositories. Depending on the ordering of these updates, builds may succeed or fail seemingly at random. At Shopzilla, we typically solve this problem in two ways: for some shared projects, where we have long-lived/permanent team-specific branches, the team name is included in the version number of the artifact, and for short-lived user story branches, the story ID is included in the version number. So if I need to create a branch off of version 2.3-SNAPSHOT for story S3765, I’ll typically label the branch S3765 and change the version of the Maven artifact to 2.3-S3765-SNAPSHOT. The Maven release plugin has a command that simplifies branching, but for whatever reason, I never seem to use it. Either way, being careful about managing branches and Maven versions is necessary.
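
To make the story-branch convention concrete, here is a rough sketch of what the POM on the S3765 branch might look like. The groupId and artifactId are invented for illustration; only the version scheme comes from the description above.

```xml
<!-- pom.xml on the S3765 story branch; groupId and artifactId are hypothetical -->
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.mycompany</groupId>
  <artifactId>shared-library</artifactId>
  <!-- The main line stays on 2.3-SNAPSHOT; the branch gets its own snapshot version -->
  <version>2.3-S3765-SNAPSHOT</version>
</project>
```

Because the version still ends in -SNAPSHOT, Maven keeps treating it as a mutable snapshot, but deployments from the branch no longer overwrite the 2.3-SNAPSHOT artifact that other builds depend on.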

A situation where I do use the Maven release plugin a lot is when making releases of shared libraries. I advocate a workflow where you make a new release of your top-level project every time you make a live update, and because you want to make live updates frequently and you use scrum, that means a new Maven release with every iteration. To make a Maven release of a project, you have to eliminate all snapshot dependencies – this is a necessary requirement for immutability – so releasing the top-level project means making release versions of all its updated dependencies. Doing this frequently reduces the risk of interference between teams by shortening the ‘checkout, modify, checkin’ cycle.
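
As a small illustration of what eliminating snapshot dependencies means in practice, here is a sketch of the dependency section of a top-level project before and after one of its libraries has been released. The coordinates are hypothetical.

```xml
<dependencies>
  <!--
    Before: the top-level project depends on a mutable snapshot, which
    the release plugin refuses to accept when preparing a release.

    <dependency>
      <groupId>com.mycompany</groupId>
      <artifactId>shared-library</artifactId>
      <version>2.3-SNAPSHOT</version>
    </dependency>
  -->

  <!-- After: shared-library 2.3 has been released, so the dependency is immutable -->
  <dependency>
    <groupId>com.mycompany</groupId>
    <artifactId>shared-library</artifactId>
    <version>2.3</version>
  </dependency>
</dependencies>
```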

See the pom file example below for some hands-on pom.xml settings that are needed to enable using the release plugin.

The final tip for code sharing using Maven that I wanted to give is to use a shared parent POM that contains settings that should be shared between projects. The main reason is to reduce code duplication – any build file is code, and Maven build files are not as easy to understand as one would like, so simplifying them is very valuable. Here’s some stuff that I think should go into a shared pom.xml file:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
   <modelVersion>4.0.0</modelVersion>
   <groupId>com.mycompany</groupId>
   <artifactId>shared-pom</artifactId>
   <name>Company Shared pom</name>
   <version>1.0-SNAPSHOT</version>
   <packaging>pom</packaging>

   <!--
      One of the things that is necessary in order to be able to use the
      release plugin is to specify the scm/developerConnection element.
      I usually also specify the plain connection, although
      I think that is only used for generating project
      documentation, a Maven feature I don't find particularly useful
      personally.

      A section like this needs to be present in every project for which
      you want to be able to use the release plugin, with the project-
      specific Git URL.
     -->
   <scm>
     <connection>scm:git:git://GITHOST/GITPROJECT</connection>
     <developerConnection>scm:git:git://GITHOST/GITPROJECT</developerConnection>
   </scm>

   <build>
     <!--
        Use the plugins section to define Maven plugin configurations that
        you want to share between all projects.
       -->
     <plugins>
       <!--
          Compiler settings that are typically going to be identical in all
          projects. With a name like Måhlén, you get particularly sensitive
          to using the only useful character encoding there is.. ;)
         -->
       <plugin>
         <artifactId>maven-compiler-plugin</artifactId>
         <configuration>
           <source>1.6</source>
           <target>1.6</target>
           <encoding>UTF-8</encoding>
         </configuration>
       </plugin>

       <!--
         Tell Maven to create a source bundle artifact during the package
         phase. This is extremely useful when sharing code, as the act of
         sharing means you'll want to create a relatively large number of
         smallish artifacts, so creating IDE projects that refer directly
         to the source code is unmanageable. But the Maven integration of
         a good IDE will fetch the Maven source bundle if available, so if
         you navigate to a class that is included via Maven from your
         top-level project, you'll still see the source version - and even
         the right source version, because you'll get what corresponds
         to the binary that has been linked.
         -->
       <plugin>
         <artifactId>maven-source-plugin</artifactId>
         <executions>
           <execution>
             <phase>package</phase>
             <goals>
               <goal>jar</goal>
             </goals>
           </execution>
         </executions>
       </plugin>

       <!--
         Ensure that a javadoc jar is being generated and deployed. This
         is useful for similar reasons as source bundle generation,
         although to a lesser degree in my opinion. Javadoc is great, but
         the source is always up to date.
         -->
       <plugin>
         <artifactId>maven-javadoc-plugin</artifactId>
         <executions>
           <execution>
             <phase>package</phase>
             <goals>
               <goal>jar</goal>
             </goals>
           </execution>
         </executions>
       </plugin>

       <!--
         The below configuration information was necessary to ensure that
         you can use the maven release plugin with Git as a version control
         system. The exact version numbers that you want to use are likely
         to have changed since then, and it may even be that Git support is
         more closely integrated nowadays, so less explicit configuration
         is needed - I haven't tested that since maybe March 2009.
         -->
       <plugin>
         <groupId>org.apache.maven.plugins</groupId>
         <artifactId>maven-release-plugin</artifactId>
         <dependencies>
           <dependency>
             <groupId>org.apache.maven.scm</groupId>
             <artifactId>maven-scm-provider-gitexe</artifactId>
             <version>1.1</version>
           </dependency>
           <dependency>
             <groupId>org.codehaus.plexus</groupId>
             <artifactId>plexus-utils</artifactId>
             <version>1.5.7</version>
           </dependency>
         </dependencies>
       </plugin>
       <plugin>
         <groupId>org.apache.maven.plugins</groupId>
         <artifactId>maven-scm-plugin</artifactId>
         <version>1.1</version>
         <dependencies>
           <dependency>
             <groupId>org.apache.maven.scm</groupId>
             <artifactId>maven-scm-provider-gitexe</artifactId>
             <version>1.1</version>
           </dependency>
           <dependency>
             <groupId>org.codehaus.plexus</groupId>
             <artifactId>plexus-utils</artifactId>
             <version>1.5.7</version>
           </dependency>
         </dependencies>
       </plugin>
     </plugins>
   </build>

   <!--
      Configuration of internal repositories so that the sub-projects
      know where to download internally created artifacts from. Note
      that due to a bootstrapping issue, this configuration needs to
      be duplicated in individual projects. This file, the shared POM,
      is available from the Nexus repo, but if the project POM doesn't
      contain the repo config, the project build won't know where to
      download the shared POM.
     -->
   <repositories>
     <!-- internal Nexus repository for released artifacts -->
     <repository>
       <id>internal-releases</id>
       <url>http://NEXUSHOST/nexus/content/repositories/internal-releases</url>
       <releases><enabled>true</enabled></releases>
       <snapshots><enabled>false</enabled></snapshots>
     </repository>
     <!-- internal Nexus repository for SNAPSHOT artifacts -->
     <repository>
       <id>internal-snapshots</id>
       <url>http://NEXUSHOST/nexus/content/repositories/internal-snapshots</url>
       <releases><enabled>false</enabled></releases>
       <snapshots><enabled>true</enabled></snapshots>
     </repository>

     <!--
       Nexus repository cache for third party repositories such as
       ibiblio. This is not necessary, but is likely to be a
       performance improvement for your builds.
       -->
     <repository>
       <id>3rd party</id>
       <url>http://NEXUSHOST/nexus/content/repositories/thirdparty/</url>
       <releases><enabled>true</enabled></releases>
       <snapshots><enabled>false</enabled></snapshots>
     </repository>

   </repositories>

   <distributionManagement>

     <!-- Defines where to deploy released artifacts to -->
     <repository>
       <id>internal-repository-releases</id>
       <name>Internal release repository</name>
       <url>URL TO NEXUS RELEASES REPOSITORY</url>
     </repository>

     <!-- Defines where to deploy artifact snapshot to -->
     <snapshotRepository>
       <id>internal-repository-snapshot</id>
       <name>Internal snapshot repository</name>
       <url>URL TO NEXUS SNAPSHOTS REPOSITORY</url>
     </snapshotRepository>

   </distributionManagement>

</project>
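
For completeness, here is a rough sketch of what a project that inherits from the shared POM might look like. The artifactId, version and Git path are made up for illustration, and GITHOST and NEXUSHOST are placeholders as before. It repeats the two things the comments above say every project needs: its own scm section and the repository configuration that lets Maven find the shared POM in the first place.

```xml
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>

  <!-- Inherit compiler, source, javadoc and release settings from the shared POM -->
  <parent>
    <groupId>com.mycompany</groupId>
    <artifactId>shared-pom</artifactId>
    <version>1.0-SNAPSHOT</version>
  </parent>

  <!-- Hypothetical project coordinates; groupId is inherited from the parent -->
  <artifactId>shared-library</artifactId>
  <version>2.3-SNAPSHOT</version>
  <packaging>jar</packaging>

  <!-- Project-specific Git URL, needed by the release plugin -->
  <scm>
    <connection>scm:git:git://GITHOST/shared-library</connection>
    <developerConnection>scm:git:git://GITHOST/shared-library</developerConnection>
  </scm>

  <!-- Duplicated from the shared POM so that the parent itself can be downloaded -->
  <repositories>
    <repository>
      <id>internal-releases</id>
      <url>http://NEXUSHOST/nexus/content/repositories/internal-releases</url>
      <releases><enabled>true</enabled></releases>
      <snapshots><enabled>false</enabled></snapshots>
    </repository>
    <repository>
      <id>internal-snapshots</id>
      <url>http://NEXUSHOST/nexus/content/repositories/internal-snapshots</url>
      <releases><enabled>false</enabled></releases>
      <snapshots><enabled>true</enabled></snapshots>
    </repository>
  </repositories>
</project>
```

Note that the parent reference is itself a snapshot here, simply because the shared POM above has version 1.0-SNAPSHOT. Since a release may not depend on snapshots, the shared POM would need a released version of its own before a project like this one can be released.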

The less pleasant part of using Maven is that you’ll need to learn more about Maven’s internals than you’d probably like, and you’ll most likely stop trying to fix your builds not when you’ve understood the problem and solved it in the way you know is correct, but when you’ve arrived at a configuration that works through trial and error (as you can see from my comments in the example pom.xml above). The benefits you’ll get in terms of simplifying the management of build artifacts across teams and actually also simplifying the builds themselves outweigh the costs of the occasional hiccup, though. A typical top-level project at Shopzilla links in around 70 internal artifacts through various transitive dependencies – managing that number of dependencies is not easy unless you have a good tool to support you, and dependency management is where Maven shines.


Git and Maven

There was a recent comment to a bug I posted in the Maven Git SCM Provider that triggered some thoughts. The comment was:

“GIT is a distributed SCM. There IS NO CENTRAL repository. Accept it.

Doing a push during the release process is counter to the GIT model.”

In general, the discussions around that bug have been quite interesting and very different from what I expected when I posted it. My reason for calling it a bug was that an unqualified ‘push’ tries to push everything in your local git repository to the origin repository. That can fail for some branch that you’ve not kept up to date even if it is a legal operation for the branch that you’re currently doing a release of. Typically, that other branch has moved a bit, so your version is a couple of commits behind. A push in that state will abort the maven release process and leave you with some pretty tricky cleaning up to do (edit: Marta has posted about how to fix that). A lot of people commenting on the bug have argued that because Git is distributed, the push shouldn’t be done at all, or should be made optional.

I think that the issue here is that there is an impedance mismatch between Git and Maven. While Git is a distributed version control system – that of course also supports a centralised model perfectly well – the Maven model is fundamentally a centralised one. This is one case where the two models conflict, and my opinion is that the push should indeed happen, just in a way that is less likely to break. The push should happen because when doing a Maven release, supporting Maven’s centralised model is more important than supporting Git’s distributed model.
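
For reference, later versions of the release plugin did make the push configurable through a pushChanges parameter. The sketch below is not part of the setup described in these posts; it only shows the setting with its default value, which keeps the push and therefore matches the centralised model argued for above.

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-release-plugin</artifactId>
  <configuration>
    <!--
      true is the default: release commits and the tag are pushed to the
      remote repository. Setting this to false leaves them in the local
      Git repository only.
    -->
    <pushChanges>true</pushChanges>
  </configuration>
</plugin>
```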

The main reason why Maven needs to be centralised is the way that artifact versions are managed. If releasing can be done by different people from local repositories without any central coordination, there is a big risk of different people creating artifacts that carry the same version number but are not the same. The act of creating a Maven release is in fact saying that “This binary package is version 2.1 of this artifact, and it will never change”. There should never be two different artifacts that both claim to be version 2.1. Git of course gets around this problem by using hashes of the things it version controls instead of sequential numbers: if two things are identical, they will have the same hash, which means the same version identifier. Maven produces artifacts on a higher conceptual level, where sequential version numbers are important, so there needs to be a central location that determines what is the next version number to use and provides a ‘master’ copy of the published artifacts.

I’ve also thought a bit about centralised versus distributed version management and when the different choices might work, but I think I’ll leave that for another post at another time (EDIT: that time was now). Either way, I think that regardless of the virtues of distributed version management systems like Git, Maven artifacts need to be managed centrally. It would be interesting to think about what a distributed dependency management system would look like…
