Posts Tagged Builds

Hardcode Behaviour!

Some years ago, when I was learning Git, I watched a presentation by Linus Torvalds, and in passing, he made one of those points that just fits with stuff you’ve been thinking but haven’t yet verbalised and so don’t fully understand. He was talking about how he had obsessed over the performance of some operation in Git (merges if I remember right), because with gradual improvements in performance, there’s a quantum leap where the pain of doing something goes away. And when it’s not painful, you can do it often and that can completely change the way you work, opening up avenues that used to be closed.

This thought can be generalised to a lot of areas – gradual improvements that suddenly lead to a change in what you can do. One such scenario that I’ve been thinking about a little lately is the pattern of using databases (as in something external to the code – XML files, properties files, whatever) for configuration data. The rationale for that is to make change easier and quicker, and the pattern comes out of a situation where the next release is some months away, but a database change can be done in minutes. These days, though, that situation should no longer apply when building web services. If you do things right, it should be possible to do the next release within minutes or at least hours, and this means that you can hardcode some of your configuration data instead of using a database.

Hardcoding configurable options gives the following benefits:

  • Consistency across environments – with databases, there’s a risk (or a guarantee, more or less) that environments will differ. This will lead to surprises and/or wasted effort when behaviour changes from one environment to another.
  • Better testability – you can more easily prove that your application does what it should do if its behaviour is entirely defined by the code rather than by some external data.
  • Simpler ‘physical form’ of the system – a single deployable unit rather than one code unit and a database unit. Among other things, this leads to easier deployments – no, or at least less frequent, need for database updates.

Of course, this idea doesn’t apply to all kinds of configuration options. It’s useful primarily for those that change the system behaviour – feature toggles, business rules for data normalisation, URL rewrite rules, that sort of thing. Data such as the addresses of downstream services, databases (!), etc, of course needs to be configurable on a per-environment basis rather than hardwired into the build.
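To make the distinction concrete, here is a minimal sketch of what it could look like in Java. All names in it (the Feature enum, the rewrite rules, the INVENTORY_SERVICE_URL variable) are made up for illustration: behaviour is compiled into the deployable unit, while the address of a downstream service is still looked up per environment.

```java
import java.util.Map;

/** Sketch: behaviour is compiled in, environment-specific addresses are looked up at startup. */
final class HardcodedBehaviour {

    /** Hypothetical feature toggles; they live in code, so every environment behaves the same. */
    enum Feature {
        NEW_CHECKOUT_FLOW(true),
        LEGACY_SEARCH(false);

        private final boolean enabled;

        Feature(boolean enabled) {
            this.enabled = enabled;
        }

        boolean isEnabled() {
            return enabled;
        }
    }

    /** Hypothetical URL rewrite rules; changing one means making a new release. */
    static final Map<String, String> URL_REWRITES = Map.of(
            "/old-products", "/products",
            "/old-about", "/about");

    /** Returns the rewritten path, or the path itself if no rule applies. */
    static String rewrite(String path) {
        return URL_REWRITES.getOrDefault(path, path);
    }

    /** Downstream addresses differ per environment, so they are read from outside the build. */
    static String downstreamServiceUrl(Map<String, String> environment) {
        String url = environment.get("INVENTORY_SERVICE_URL"); // assumed variable name
        if (url == null) {
            throw new IllegalStateException("INVENTORY_SERVICE_URL must be set for this environment");
        }
        return url;
    }

    public static void main(String[] args) {
        System.out.println(rewrite("/old-products"));
        System.out.println(Feature.NEW_CHECKOUT_FLOW.isEnabled());
    }
}
```

In a real service the environment map would typically be System.getenv(); passing it in as a parameter just keeps the sketch easy to test.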

This is yet another (though pretty minor) reason to work towards making frequent releases easy and painless: the possibility of a change in architecture and process that will allow you to spend less time doing regression testing and also helps speed up the deployments themselves.


Bygg – Executing the Build

This is the third post about Bygg – I’ve been fortunate enough to find some time to do some actual implementation, and I now have a version of the tool that can do the following:

First, read a configuration of plugins to use during the build:

public class PluginConfiguration {
  public static Plugins plugins() {
    return new Plugins() {
      public List plugins() {
        return Arrays.asList(
          new ArtifactVersion(ProjectArtifacts.BYGG_TEST_PLUGIN, "1.0-SNAPSHOT"),
          new ArtifactVersion(ProjectArtifacts.GUICE, "2.0"));
      }
    };
  }
}

Second, use that set of plugins to compile a project configuration:

public class Configuration {
  public static ByggConfiguration configuration() {
    return new ByggConfiguration() {
      public TargetDAG getTargetDAG() {
        return TargetDAG.DEFAULT
           .add("plugin")                  // defines the target name when executing
           .executor(new ByggTestPlugin()) // indicates what executing the target means
           .requires("test")               // means it won't be run until after "test"
           .build();
      }
    };
  }
}

Third, actually execute that build – although in the current version, none of the target executors have an actual implementation, so all they do is create a file with their name and the current time stamp under the target directory. The default build graph that is implemented contains targets that pretend to assemble the classpaths (placeholder for downloading any necessary dependencies) for the main code and the test code, targets that compile the main and test code, a target that runs the tests, and a target that pretends to create a package. As the sample code above hints, I’ve got three projects: Bygg itself, a dummy plugin, and a test project whose build requires the test plugin to be compiled and executed.

Fourth – with no example – cleaning up the target directory. This is the only feature that is fully implemented, being of course a trivial one. On my machine, running a clean in a tiny test project is 4-5 times faster using Bygg than Maven (taking 0.4 to 0.5 seconds of user time as compared to more than 2 for Maven), so thus far, I’m on target with regard to performance improvements. A little side note on cleaning is that I’ve come to the conclusion that clean isn’t a target. You’re never ever going to want to deploy a ‘clean’, or use it for anything. It’s an optional step that might be run before any actual target. To clarify that distinction, you specify targets using their names as command line arguments, but cleaning using -c or --clean:


bygg.sh -c compile plugin

As is immediately obvious, there are a lot of rough edges here. The ones I know I will want to fix are:

  • Using annotations (instead of naming conventions) and generics (for type safety) in the configuration classes – I’m compiling and loading the configuration files using a library called Janino, which has an API that I think is great, but which by default only supports Java 4. There’s a way around it, but it seems a little awkward, so I’m planning on stealing the API design and putting in a front for JavaCompiler instead.
  • Updating the returned classes (Plugins and ByggConfiguration), as today they only contain a single element. Either they should be removed, or perhaps they will need to become a little more powerful.
  • Changing the names of stuff – TargetDAG especially is not great.
  • There’s a lot of noise, much of which is due to Java as a language, but some of which can probably be removed. The Plugins example above is 11 lines long, but only 2 lines contain useful information – and that’s not counting import statements, etc. Of course, since the number of ‘noise lines’ is pretty much constant, with realistically large builds, the signal to noise ratio will improve. Even so, I’d like it to be better.
  • I’m pretty sure it’s a good idea to move away from saying “this target requires target X” to define the order of execution towards something more like “this target requires the compiled main sources”. But there may well be reasons why you would want to keep something like “requires” or “before” in there – for instance, you might want to generate a properties file with information collected from version control and CI system before packaging your artifact. Rather than changing the predefined ‘package’ target, you might want to just say ‘run this before package’ and leave the file sitting in the right place in the target directory. I’m not quite sure how best to deal with that case yet – there’s a bit more discussion of this a little later.

Anyway, all that should be done in the light of some better understanding of what is actually needed to build something. So before I continue with improving the API, I want to take a few more steps on the path of execution.

As I’ve mentioned in a previous post, a build configuration in Bygg is a DAG (Directed Acyclic Graph). A nice thing about that is that it opens up the possibility of executing independent paths on the DAG concurrently. Tim pointed out to me that that kind of concurrent execution is an established idea called Dataflow Concurrency. In Java terms, Dataflow Concurrency essentially boils down to communicating all shared mutable state via Futures (returned by Callables executing the various tasks). What’s interesting about the Bygg version of Dataflow Concurrency is that the ‘Dataflow Variables’ can and will be parameters of the Callables executing tasks, rather than being hard-coded as is typical in the Dataflow Concurrency examples I’ve seen. So the graph will exist as a data entity as opposed to being hardwired in the code. This means that deadlock detection is as simple as detecting cycles in the graph – and since there is a requirement that the build configuration must be a DAG, builds will be deadlock free. In general, I think the ability to easily visualise the exact shape of the DAG of a build is a very desirable thing in terms of making builds easily understandable, so that should probably be a priority when continuing to improve the build configuration API.
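As a rough illustration of the idea (not Bygg’s actual implementation), the sketch below takes the graph as plain data, a map from target name to the targets it requires, and wires the targets together with CompletableFutures. The class and method names are invented for this example, and ‘executing’ a target just records its name. Cycles are rejected while the graph is being wired up, before anything runs, which is what makes the deadlock-freedom argument work.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;

/** Sketch: runs a build graph given as data, starting each target once its requirements are done. */
final class DataflowBuild {

    /** Executes all targets; returns the names in the order in which they completed. */
    static List<String> execute(Map<String, List<String>> requires) {
        List<String> executionLog = new CopyOnWriteArrayList<>();
        Map<String, CompletableFuture<Void>> futures = new HashMap<>();
        // Nothing runs until the whole graph has been wired up and checked for cycles.
        CompletableFuture<Void> start = new CompletableFuture<>();
        for (String target : requires.keySet()) {
            schedule(target, requires, futures, new ArrayList<>(), start, executionLog);
        }
        start.complete(null);
        CompletableFuture.allOf(futures.values().toArray(new CompletableFuture<?>[0])).join();
        return executionLog;
    }

    private static CompletableFuture<Void> schedule(
            String target,
            Map<String, List<String>> requires,
            Map<String, CompletableFuture<Void>> futures,
            List<String> path,
            CompletableFuture<Void> start,
            List<String> executionLog) {
        if (path.contains(target)) {
            throw new IllegalArgumentException("Cycle in build graph: " + path + " -> " + target);
        }
        CompletableFuture<Void> existing = futures.get(target);
        if (existing != null) {
            return existing;
        }
        path.add(target);
        List<CompletableFuture<Void>> inputs = new ArrayList<>();
        inputs.add(start);
        for (String required : requires.getOrDefault(target, List.of())) {
            inputs.add(schedule(required, requires, futures, path, start, executionLog));
        }
        path.remove(path.size() - 1);
        // Stand-in for real work: a real executor would compile, test, package and so on.
        CompletableFuture<Void> future = CompletableFuture
                .allOf(inputs.toArray(new CompletableFuture<?>[0]))
                .thenRunAsync(() -> executionLog.add(target));
        futures.put(target, future);
        return future;
    }

    public static void main(String[] args) {
        System.out.println(execute(Map.of(
                "compile", List.of(),
                "test", List.of("compile"),
                "plugin", List.of("test"))));
    }
}
```

Targets on independent paths of the graph can run at the same time on the common pool, while targets on the same path always run in dependency order.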

Another idea I had from the reference to dataflow programming is that the canonical example of dataflow programming is a spreadsheet, where an update in one cell trickles through into updates of other cells that contain formulas that refer to the first one. That example made me change my mind about how different targets should communicate their results to each other. Initially, I had been thinking that most of the data that needs to be passed on from one target to the next should be implicitly located in a well-defined location on disk. So the test compiler would leave the test classes in a known place where the test runner knows to look for them. But that means loading source files into memory to compile them, then writing the class files to disk, then loading them into memory again. That’s a lot of I/O, and I have the feeling that I/O is often one of the things that slows builds down the most.

What if there were a dataflow variable with the classes instead? I haven’t yet looked in detail at the JavaFileManager interface, but it seems to me that it would make it possible to add an in-memory layer in front of the file system (in fact, I think that kind of optimisation is a large part of the reason why it exists). So it could be a nice optimisation to make the compiler store files in memory for test runners, packagers, etc., to pick up without having to do I/O. There would probably have to be a target (or something else, maybe) that writes the class files to disk in parallel with the rest of the execution, since the class files are nice to have as an optimisation for the next build – only recompiling what is necessary. But that write doesn’t necessarily have to slow down the test execution. All that is of course optimisation, so the first version will just use a plain file system-based JavaFileManager implementation.

Still, I think it is a good idea to only have a very limited number of targets that directly access the file system, in order to open up for that kind of optimisation. The remainder of the targets should not be aware of the exact structure of the target directory, and what data is stored there.
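For what it’s worth, here is a small sketch of what such an in-memory layer could look like using the standard javax.tools API. The class names (InMemoryCompilation, InMemoryClassFiles, SourceString) are invented for the example, and it needs a JDK rather than a plain JRE to run. The compiler’s class file output is captured in a map instead of being written to the target directory.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.URI;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import javax.tools.FileObject;
import javax.tools.ForwardingJavaFileManager;
import javax.tools.JavaCompiler;
import javax.tools.JavaFileManager;
import javax.tools.JavaFileObject;
import javax.tools.SimpleJavaFileObject;
import javax.tools.StandardJavaFileManager;
import javax.tools.ToolProvider;

/** Sketch: compile a source string and keep the resulting class files in memory. */
final class InMemoryCompilation {

    /** Delegates everything to the standard file manager except class file output. */
    static final class InMemoryClassFiles extends ForwardingJavaFileManager<StandardJavaFileManager> {
        final Map<String, ByteArrayOutputStream> classBytes = new HashMap<>();

        InMemoryClassFiles(StandardJavaFileManager delegate) {
            super(delegate);
        }

        @Override
        public JavaFileObject getJavaFileForOutput(JavaFileManager.Location location, String className,
                                                   JavaFileObject.Kind kind, FileObject sibling) {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            classBytes.put(className, bytes);
            URI uri = URI.create("mem:///" + className.replace('.', '/') + kind.extension);
            return new SimpleJavaFileObject(uri, kind) {
                @Override
                public OutputStream openOutputStream() {
                    return bytes; // the compiler writes the class file here, not to disk
                }
            };
        }
    }

    /** A source file whose content is a string rather than a file on disk. */
    static final class SourceString extends SimpleJavaFileObject {
        private final String code;

        SourceString(String className, String code) {
            super(URI.create("string:///" + className.replace('.', '/') + Kind.SOURCE.extension), Kind.SOURCE);
            this.code = code;
        }

        @Override
        public CharSequence getCharContent(boolean ignoreEncodingErrors) {
            return code;
        }
    }

    /** Compiles one class and returns class name to bytecode, without touching the file system. */
    static Map<String, byte[]> compile(String className, String source) {
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        if (compiler == null) {
            throw new IllegalStateException("No system Java compiler available; a JDK is required");
        }
        try (InMemoryClassFiles files =
                     new InMemoryClassFiles(compiler.getStandardFileManager(null, null, null))) {
            List<JavaFileObject> sources = List.of(new SourceString(className, source));
            boolean succeeded = compiler.getTask(null, files, null, null, null, sources).call();
            if (!succeeded) {
                throw new IllegalArgumentException("Compilation failed for " + className);
            }
            Map<String, byte[]> result = new TreeMap<>();
            files.classBytes.forEach((name, bytes) -> result.put(name, bytes.toByteArray()));
            return result;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, byte[]> classes = compile("Hello", "public class Hello { }");
        classes.forEach((name, bytes) -> System.out.println(name + ": " + bytes.length + " bytes"));
    }
}
```

The returned map is the kind of value that could be handed to a test runner or packager as a dataflow variable, with a separate target writing it to disk for the benefit of the next build.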

I’m hoping to soon be able to find some more time to try these ideas out in code. It’ll be interesting to see how hard it is to figure out a good way to combine abstracting away the ‘target’ directory with full freedom for plugins to add stuff to the build package and dataflow concurrency variables.


Bygg – Better Dependency Management

I’ve said before that the one thing that Maven does amazingly well is dependency management. This post is about how to do it better.

Declaring and Defining Dependencies

In Maven, you can declare dependencies using a <dependencyManagement /> section in either the current POM or some parent POM, and then use them – include them into the current build – using a <dependencies /> section. This is very useful because it allows you to define in a single place which version of some dependency should be used for a set of related modules. It looks something like:

<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>com.shopzilla.site2.service</groupId>
      <artifactId>common</artifactId>
      <version>${service.common.version}</version>
    </dependency>
    <dependency>
      <groupId>com.shopzilla.site2.service</groupId>
      <artifactId>common-web</artifactId>
      <version>${service.common.version}</version>
    </dependency>
    <dependency>
      <groupId>com.shopzilla.site2.core</groupId>
      <artifactId>core</artifactId>
      <version>${core.version}</version>
    </dependency>
  </dependencies>
</dependencyManagement>

<dependencies>
  <dependency>
    <groupId>com.shopzilla.site2.service</groupId>
    <artifactId>common</artifactId>
  </dependency>
  <dependency>
    <groupId>com.shopzilla.site2.core</groupId>
    <artifactId>core</artifactId>
  </dependency>
</dependencies>

To me, there are a couple of problems here:

  1. Declaring a dependency looks pretty much identical to using one – the only difference is that the declaration is enclosed inside a <dependencyManagement /> section. This makes it hard to know what it is you’re looking at if you have a large number of dependencies – is this declaring a dependency or actually using it?
  2. It’s perfectly legal to add a <version/> tag in the plain <dependencies /> section, which will happen unless all the developers touching the POM a) understand the distinction between the two <dependencies /> sections, and b) are disciplined enough to maintain it.

For large POMs and POM hierarchies in particular, the way to define shared dependencies becomes not only overly verbose but also hard to keep track of. I think it could be made much easier and nicer in Bygg, something along these lines:


// Core API defined in Bygg proper
public interface Artifact {
   String getGroupId();
   String getArtifactId();
}

// ----------------------------------------------

// This enum is in some shared code somewhere - note that the pattern of declaring artifacts
// using an enum could be used within a module as well if desired. You could use different enums
// to define, for instance, sets of artifacts to use when talking to databases, or when developing
// a web user interface, or whatever.
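public enum MyArtifacts implements Artifact {
    SERVICE_COMMON("com.shopzilla.site2.service", "common"),
    CORE("com.shopzilla.site2.core", "core"),
    JUNIT("junit", "junit");

    private final String groupId;
    private final String artifactId;

    MyArtifacts(String groupId, String artifactId) {
        this.groupId = groupId;
        this.artifactId = artifactId;
    }

    public String getGroupId() { return groupId; }
    public String getArtifactId() { return artifactId; }
}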

// ----------------------------------------------

// This can be declared in some shared code as well, probably but not
// necessarily in the same place as the enum. Note that the type of
// the collection is Collection, indicating that it isn't ordered.
public static final Collection ARTIFACT_VERSIONS = ImmutableList.of(
        new ArtifactVersion(SERVICE_COMMON, properties.get("service.common.version")),
        new ArtifactVersion(CORE, properties.get("core.version")),
        new ArtifactVersion(JUNIT, "4.8.1"));

// ----------------------------------------------

// In the module to be built, define how the declared dependencies are used.
// Here, ordering might be significant (it indicates the order of artifacts on the
// classpath - the reason for the 'might' is I'm not sure if this ordering can carry
// over into a packaged artifact like a WAR).
List moduleDependencies = ImmutableList.of(
    new Dependency(SERVICE_COMMON, EnumSet.of(MAIN, PACKAGE)),
    new Dependency(CORE, EnumSet.of(MAIN, PACKAGE)),
    new Dependency(JUNIT, EnumSet.of(TEST)));

The combination of a Collection of ArtifactVersions and a List of Dependency objects is then used by the classpath assembler target to produce an actual classpath for use in compiling, running, etc. Although the example code shows the dependencies as a plain List, I kind of think that there may be value in having the actual dependencies be something else. Wrapping the list in an intelligent object that gives you filtering options, etc., could possibly be useful, but it’s a bit premature to decide about that until there’s code that makes it concrete.

The main ideas in the example above are:

  1. Using enums for the artifact identifiers (groupId + artifactId) gives a more succinct and harder-to-misspell way to refer to artifacts in the rest of the build configuration. Since you’re editing this code in the IDE, finding out exactly what the artifact identifier means (groupId + artifactId) is as easy as clicking on it while pressing the right key.
  2. If the build configuration is done using regular Java code, shared configuration items can trivially be made available as Maven artifacts. That makes it easy to for instance have different predefined groups of related artifacts, and opens up for composed rather than inherited shared configurations. Very nice!
  3. In the last bit, where the artifacts are actually used in the build, there is an outline of something I think might be useful. I’ve used Maven heavily for years, and something I’ve never quite learned is how the scopes work. A scope is basically one of a list of strings (compile, test, provided, runtime) that describes where a certain artifact should be used. So ‘compile’ means that the artifact in question will be used when compiling the main source code, when running tests, when executing the artifact being built and (if it is a WAR), when creating the final package. I think it would be far simpler to have a set of flags indicating which classpaths the artifact should be included in. So MAIN means ‘when compiling and running the main source’, TEST is ditto for the test source, and PACKAGE is ‘include in the final package’, and so on. No need to memorise what some scope means, you can just look at the set of flags.
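A minimal sketch of how a classpath assembler could combine the two pieces, under a few assumptions: the nested types are simplified stand-ins for the ones outlined above, the versions other than JUnit’s are made-up placeholders, and a classpath entry is just a groupId:artifactId:version string rather than a resolved file. The sketch filters literally on the flags; whether the TEST classpath should also pull in everything flagged MAIN is left open.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

/** Sketch: pick the artifacts flagged for one classpath and attach their declared versions. */
final class ClasspathAssembly {

    /** Flags saying which classpaths an artifact should be included in. */
    enum Classpath { MAIN, TEST, PACKAGE }

    /** Artifact identifiers declared once, as an enum. */
    enum MyArtifacts {
        SERVICE_COMMON("com.shopzilla.site2.service", "common"),
        CORE("com.shopzilla.site2.core", "core"),
        JUNIT("junit", "junit");

        final String groupId;
        final String artifactId;

        MyArtifacts(String groupId, String artifactId) {
            this.groupId = groupId;
            this.artifactId = artifactId;
        }
    }

    static final class ArtifactVersion {
        final MyArtifacts artifact;
        final String version;

        ArtifactVersion(MyArtifacts artifact, String version) {
            this.artifact = artifact;
            this.version = version;
        }
    }

    static final class Dependency {
        final MyArtifacts artifact;
        final Set<Classpath> classpaths;

        Dependency(MyArtifacts artifact, Set<Classpath> classpaths) {
            this.artifact = artifact;
            this.classpaths = classpaths;
        }
    }

    /** Returns the entries for one classpath, in the order the dependencies were listed. */
    static List<String> assemble(Classpath wanted,
                                 Collection<ArtifactVersion> versions,
                                 List<Dependency> dependencies) {
        List<String> entries = new ArrayList<>();
        for (Dependency dependency : dependencies) {
            if (dependency.classpaths.contains(wanted)) {
                MyArtifacts artifact = dependency.artifact;
                entries.add(artifact.groupId + ":" + artifact.artifactId + ":"
                        + versionOf(artifact, versions));
            }
        }
        return entries;
    }

    private static String versionOf(MyArtifacts artifact, Collection<ArtifactVersion> versions) {
        for (ArtifactVersion candidate : versions) {
            if (candidate.artifact == artifact) {
                return candidate.version;
            }
        }
        throw new IllegalStateException("No version declared for " + artifact);
    }

    public static void main(String[] args) {
        List<ArtifactVersion> versions = List.of(
                new ArtifactVersion(MyArtifacts.SERVICE_COMMON, "1.0"), // placeholder version
                new ArtifactVersion(MyArtifacts.CORE, "1.0"),           // placeholder version
                new ArtifactVersion(MyArtifacts.JUNIT, "4.8.1"));
        List<Dependency> dependencies = List.of(
                new Dependency(MyArtifacts.SERVICE_COMMON, EnumSet.of(Classpath.MAIN, Classpath.PACKAGE)),
                new Dependency(MyArtifacts.CORE, EnumSet.of(Classpath.MAIN, Classpath.PACKAGE)),
                new Dependency(MyArtifacts.JUNIT, EnumSet.of(Classpath.TEST)));
        System.out.println(assemble(Classpath.MAIN, versions, dependencies));
    }
}
```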

Another idea that I think would be useful is adding an optional Repository setting for an Artifact. With Maven, you can add repositories in addition to the default one (Maven Central at Ibiblio). You can have as many as you like, which is great for some artifacts that aren’t included in Maven Central. However, adding repositories means slowing down your build by a wide margin, as Maven will check each repository defined in the build for updates to each snapshot version of an artifact. Whenever I add a repository, I do that to get access to a specific artifact. Utilising that fact by having two kinds of repositories – global and artifact-specific, maybe – should be simple and represent a performance improvement.

Better Transitive Conflict Resolution

Maven allows you to not have to worry about transitive dependencies required by libraries you include. This is, as I’ve argued before, an incredibly powerful feature. But the thing is, sometimes you do need to worry about those transitive dependencies: when they introduce binary incompatibilities. A typical example is something I ran into the other day, where a top-level project used version 3.0.3 of some Spring jars  (such as ‘spring-web’), while some shared libraries eventually included another version of Spring (the  ‘spring’ artifact, version 2.5.6). Both of these jars contain a definition of org.springframework.web.context.ConfigurableWebApplicationContext, and they are incompatible. This leads to runtime problems (in other words, the problem is visible too late; it should be detected at build time), and the only way to figure that out is to recognise the symptoms of the problem as a “likely binary version mismatch”, then use mvn dependency:analyze to figure out possible candidates and add exclude rules like this to your POM:

<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>com.shopzilla.site2.service</groupId>
      <artifactId>common</artifactId>
      <version>${service.common.version}</version>
      <exclusions>
        <exclusion>
          <groupId>org.springframework</groupId>
          <artifactId>spring</artifactId>
        </exclusion>
        <exclusion>
          <groupId>com.google.collections</groupId>
          <artifactId>google-collections</artifactId>
        </exclusion>
      </exclusions>
    </dependency>
    <dependency>
      <groupId>com.shopzilla.site2.core</groupId>
      <artifactId>core</artifactId>
      <version>${core.version}</version>
      <exclusions>
        <exclusion>
          <groupId>org.springframework</groupId>
          <artifactId>spring</artifactId>
        </exclusion>
        <exclusion>
          <groupId>com.google.collections</groupId>
          <artifactId>google-collections</artifactId>
        </exclusion>
      </exclusions>
    </dependency>
  </dependencies>
</dependencyManagement>

As you can tell from the example (just a small part of the POM) that I pasted in, I had a similar problem with google-collections. The top level project uses Guava, so binary incompatible versions included by the dependencies needed to be excluded. The problems here are:

  1. It’s painful to figure out what libraries cause the conflicts – sometimes, you know or can easily guess (like the fact that different versions of Spring packages can clash), but other times you need to know something a little less obvious (like the fact that Guava has superseded google-collections, something not immediately clear from the names). The tool could just tell you that you have binary incompatibilities on your classpath (I actually submitted a patch to the Maven dependency plugin to fix that, but it’s been stalled for 6 months).
  2. Once you’ve figured out what causes the problem, it’s a pain to get rid of all the places it comes from. The main tool at hand is the dependency plugin, and the way to figure out where dependencies come from is mvn dependency:tree. This lets you know a single source of a particular dependency. So for me, I wanted to find out where the spring jar came from – that meant running mvn dependency:tree, adding an exclude, running it again to find where else the spring jar was included, adding another exclude, and so on. This could be so much easier. And since it could be easier, it should be.
  3. What’s more, the problems are sometimes environment-dependent, so you’re not guaranteed that they will show up on your development machine. I’m not sure about the exact reasons, but I believe that there are differences in the order in which different class loaders load classes in a WAR. This might mean that the only place you can test if a particular problem is solved or not is your CI server, or some other environment, which again adds pain to the process.
  4. The configuration is rather verbose and you need to introduce duplicates, which makes your build files harder to understand at a glance.

Apart from considering binary incompatibilities to be errors (and reporting on exactly where they are found), here’s how I think exclusions should work in Bygg:


 dependencies.exclude().group("com.google.collections").artifact("google-collections")
          .exclude().group("org.springframework").artifact("spring.*").version(except("3.0.3"));

Key points above are:

  1. Making excludes a global thing, not a per-dependency thing. As soon as I’ve identified that I don’t want spring.jar version 2.5.6 in my project, I know I don’t want it from anywhere at all. I don’t care where it comes from, I just don’t want it there! I suppose there is a case for saying “I trust the core library to include google-collections for me, but not the common-web one”, so maybe global excludes aren’t enough. But they would certainly have helped me tremendously a lot of the times I’ve had to use Maven exclusions, and I can’t think of a case where I’ve actually wanted specifically to have an artifact-specific exclusion.
  2. Defining exclusion filters using a fluent API that includes regular expressions. With Spring in particular, you want to make sure that all your jars have the same version. It would be great to be able to say that you don’t want anything other than that.
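To check that the idea holds together, here is a simplified sketch of global, regex-based exclusion rules. It does not reproduce the fluent chain from the example; the method names (exclude, excludeExceptVersion, isExcluded) are assumptions made for brevity. Each rule applies to every artifact on the classpath regardless of which dependency pulled it in.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Sketch: global exclusion rules matched by regular expression against every artifact. */
final class Exclusions {

    private static final class Rule {
        final Pattern group;
        final Pattern artifact;
        final String keptVersion; // null means: exclude every version

        Rule(String groupRegex, String artifactRegex, String keptVersion) {
            this.group = Pattern.compile(groupRegex);
            this.artifact = Pattern.compile(artifactRegex);
            this.keptVersion = keptVersion;
        }

        boolean excludes(String groupId, String artifactId, String version) {
            return group.matcher(groupId).matches()
                    && artifact.matcher(artifactId).matches()
                    && (keptVersion == null || !keptVersion.equals(version));
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    /** Excludes all versions of the matching artifacts. */
    Exclusions exclude(String groupRegex, String artifactRegex) {
        rules.add(new Rule(groupRegex, artifactRegex, null));
        return this;
    }

    /** Excludes the matching artifacts unless they have exactly the given version. */
    Exclusions excludeExceptVersion(String groupRegex, String artifactRegex, String keptVersion) {
        rules.add(new Rule(groupRegex, artifactRegex, keptVersion));
        return this;
    }

    /** True if any rule excludes the artifact, no matter which dependency brought it in. */
    boolean isExcluded(String groupId, String artifactId, String version) {
        for (Rule rule : rules) {
            if (rule.excludes(groupId, artifactId, version)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Dots are escaped because the group and artifact arguments are regular expressions.
        Exclusions exclusions = new Exclusions()
                .exclude("com\\.google\\.collections", "google-collections")
                .excludeExceptVersion("org\\.springframework", "spring.*", "3.0.3");
        System.out.println(exclusions.isExcluded("org.springframework", "spring", "2.5.6"));
        System.out.println(exclusions.isExcluded("org.springframework", "spring-web", "3.0.3"));
    }
}
```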

Build Java with Java?!

I’ve gone through different phases when thinking about using Java to configure builds rather than XML. First, I thought “it’s great, because it allows you to more easily use the standard debugger for the builds and thereby resolve Maven’s huge documentation problem”. But then I realised that the debugging is enabled by ensuring that the IDE has access to the source code of the build tool and plugins that execute the build, and that how you configure it is irrelevant. So then I thought that using Java to configure is pretty good anyway, because it means developers won’t need to learn a new language (as with Buildr or Raven), and that IDE integration is a lot easier. The IDE you use for your regular Java programming wouldn’t need to be taught anything specific to deal with some more Java code. I’ve now come to the conclusion that DSL-style configuration APIs, and even more, using the standard engineering principles for sharing and reusing code for build configurations is another powerful argument in favour of using Java in the build configuration. So I’ve gone from “Java configuration is key”, to “Java configuration is OK, but not important” to “Java configuration is powerful”.


Bygg – Ideas for a Better Maven

In a previous post, I outlined some objectives for a better Maven. In this post, I want to talk about some ideas for how to achieve those objectives.

Build Java with Java

The first idea is that the default modus operandi should be that builds are runnable as a regular Java application from inside the IDE, just like the application that is being built. The IDE integration of the build tool should ensure that source code for the build tool and its core plugins is attached so that the developer can navigate through the build tool source. A not quite necessary part of this is that build configuration should be done using Java rather than XML, Ruby, or something else. This gives the following benefits:

  • Troubleshooting the build is as easy as just starting it in the debugger and setting break points. To understand how to use a plugin, you can navigate to the source implementing it from your build configuration. I think this is potentially incredibly valuable. If it is possible to integrate with IDEs to such an extent that the full source code of the libraries and plugins that are used in the build is available, that means that stepping through and debugging builds is as easy as for any library used in the product being built. And harnessing the highly trained skills that most Java developers have developed for understanding how a third-party library works is something that should make the build system much more accessible compared to having to rely on documentation.
  • It can safely be assumed that Java developers are or at least want to be very proficient in Java. Using a real programming language for build configurations is very advantageous compared to using XML, especially for those occasions when you have to customise your build a little extra. (This should be the exception rather than the rule as it increases the risk of feature creep and reduces standardisation.)
  • The IDEs that Java developers use can be expected to be great tools for writing Java code. Code completion, Javadoc popups, instant navigation to the source that implements a feature, etc. This should reduce the complexity of IDE integration.
  • It opens up for very readable DSL-style APIs, which should reduce the build script complexity and increase succinctness. Also, using Java means you could instantly see which configuration options are available for the thing you’re tweaking at the moment (through method completion, checking values of enums, etc., etc.).

There are some drawbacks, such as having to figure out a way to bootstrap the build (the build configuration needs to be compiled before a build can be run), and the fact that you have to provide a command-line tool anyway for continuous integration, etc. But I don’t think those problems are hard enough to be significant.

My first thought was that using Java for the build configuration files would be great in itself. But as I’ve been thinking about it, I’ve concluded that the exact format of the configuration files is less important than making sure that developers can navigate through the build source code in exactly the same way as through any third party library. That is something I’ve wanted to do many times when having problems with Maven builds, but it’s practically impossible today. There’s no reason why it should be harder to troubleshoot your build than your application.

Interlude: Aspects of Build Configurations

Identifying and separating things that need to be distinct and treated differently is one of the hard things to do when designing any program, and one of the things that has truly great effect if you get it right. So far, I’ve come up with a few different aspects of a build, namely:

  • The project or artifact properties – this includes things such as the artifact id, group id, version etc., and can also include useful information such as a project description, SCM links, etc.
  • The project’s dependencies – the third-party libraries that are needed in the build; either at compile time, test time or package time.
  • The build properties – any variables that you have in your build. A lot of times, you want to be able to have environment-specific variables that you use, or to refer to build-specific variables such as the output directory for compiled classes. The distinction between variables that may be overridden on a per-installation basis and variables that are determined during the build may mean that there is more than one kind of property.
  • The steps that are taken to complete the build – compilations, copying, zipping, running tests, and so on.
  • The things that execute the various steps – the compiler, the test executor, a plugin that helps generate code from some type of resources, etc.

The point of separating these aspects of the build configuration is that it is likely that you’ll want to treat them differently. In Maven, almost everything is done in a single POM file, which grows large and hard to get an overview of, and/or in a settings.xml file that adds obscurity and reduces build portability. An example of a problem with bunching all the build configuration data into a single file is IntelliJ’s excellent Maven plugin, which wants to reimport the pom.xml with every change you make. Reimporting takes a lot of time on my chronically overloaded MacBook (well, 5-15 seconds, I guess – far too much time). It’s necessary because if I’ve changed the dependencies, IntelliJ needs to update its classpath. The thing is, I think more than 50% or even 70% of the changes I make to POM files don’t affect the dependencies. If the dependencies section were separate from the rest of the build configuration, reimporting could be done only when actually needed.

I don’t think this analysis has landed yet; it feels like some pieces or nuances are still missing. But it helps as background for the rest of the ideas outlined in this post. The steps taken during the build, and the order in which they should be taken, are separate from the things that do something during each step.

Non-linear Build Execution

The second idea is an old one: abandoning Maven’s linear build lifecycle and instead getting back to the way that make does things (which is also Ant’s way). So rather than having loads of predefined steps in the build, it’s much better to be able to specify a DAG of dependencies between steps that defines the order of execution. This is better for at least two reasons: first, it’s much easier to understand the order of execution if you say “do A before B” or “do B after A” in your build configuration than if you say “do A during the process-classes phase and B during the generate-test-sources phase”. And second, it opens up the door to do independent tasks in parallel, which in turn creates opportunities for performance improvements. So for instance, it could be possible to download dependencies for the test code in parallel with the compilation of the production code, and it should be possible to zip up the source code JAR file at the same time as JavaDocs are generated.

What this means in concrete terms is that you would write something like this in your build configuration file:


  buildSteps.add(step("copyPropertiesTemplate")
         .executor(new CopyFile("src/main/template/properties.template",
                                "${OUTPUT_DIR}/properties.txt"))
         .before("package"));

Selecting what to build would be done via the build step names – so if all you wanted to do was to copy the properties template file, you would pass “copyPropertiesTemplate” to the build. The tool would look through the build configuration and in this case probably realise that nothing needs to be run before that step, so the “copyPropertiesTemplate” step would be all that was executed. If, on the other hand, the user stated that the “package” step should be executed, the build tool would discover that lots of things have to be done before – not only “copyPropertiesTemplate” but also “performCoverageChecks”, which in turn requires “executeTests”, and so on.
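To make the selection logic a bit more concrete, here is a minimal sketch of how a tool might work out which steps a requested step pulls in. It is only an illustration under assumptions: the class name StepSelection, the method stepsFor and the representation of the build configuration as a map from step name to predecessor names are all made up for this example, and the “compile” predecessor of “executeTests” is a guess at what “and so on” would include.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper: works out which steps a requested step pulls in. */
final class StepSelection {

    /**
     * Returns the requested step and everything that must run before it,
     * ordered so that each step appears after all of its predecessors.
     */
    static List<String> stepsFor(String requested, Map<String, List<String>> predecessors) {
        Set<String> done = new LinkedHashSet<>();
        visit(requested, predecessors, done, new LinkedHashSet<>());
        return new ArrayList<>(done);
    }

    private static void visit(String step, Map<String, List<String>> predecessors,
                              Set<String> done, Set<String> inProgress) {
        if (done.contains(step)) {
            return; // already scheduled via another path through the graph
        }
        if (!inProgress.add(step)) {
            throw new IllegalStateException("Cycle in build steps at: " + step);
        }
        for (String before : predecessors.getOrDefault(step, List.of())) {
            visit(before, predecessors, done, inProgress);
        }
        inProgress.remove(step);
        done.add(step); // post-order: all predecessors are already in 'done'
    }

    public static void main(String[] args) {
        // Assumed configuration, using the step names from the text.
        Map<String, List<String>> predecessors = Map.of(
                "package", List.of("copyPropertiesTemplate", "performCoverageChecks"),
                "performCoverageChecks", List.of("executeTests"),
                "executeTests", List.of("compile"));

        // prints [copyPropertiesTemplate]
        System.out.println(stepsFor("copyPropertiesTemplate", predecessors));
        // prints [copyPropertiesTemplate, compile, executeTests, performCoverageChecks, package]
        System.out.println(stepsFor("package", predecessors));
    }
}
```

A depth-first walk like this only produces one valid linear order; it says nothing yet about which steps could run at the same time.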

As the example shows, I would like to add a feature to the make/Ant version: specifying that a certain build step should happen before another one. The reason is that I think that the build tool should come with a default set of build steps that allow the most common build tasks to be run with zero configuration (see below for more on that). So you should always be able to say “compile”, or “test”, and that should just work as long as you stick to the conventions for where you store your source code, your tests, etc. This makes it awkward for a user to define a build step like the one above in isolation, and then after that have to modify the pre-existing “package” step to depend on the “copyPropertiesTemplate” one.

In design terms, there would have to be some sort of BuildStep entity that has a unique name (for the user interface), a set of predecessors and successors, and something that should be executed. There will also have to be a scheduler that can figure out a good order to execute steps in. I’ve made some skeleton implementations of this, and it feels like a good solution that is reasonably easy to get right. One thing I’ve not made up my mind about is the name of this entity – the two main candidates are BuildStep and Target. Build step explains well what it is from the perspective of the build tool, while Target reminds you of Ant and Make and focuses the attention on the fact that it is something a user can request the tool to do. I’ll use build step for the remainder of this post, but I’m kind of thinking that target might be a better name.
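As a rough illustration of that design (not the skeleton implementation mentioned above, which isn’t shown in this post), the sketch below models a build step with a name, predecessors, successors and an executor, plus a very simple scheduler. Everything in it is an assumption made for the example: the names WaveScheduler, after, before and waves, the use of Runnable as the executor type, and the idea of grouping steps into “waves” whose members could run in parallel.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical scheduler sketch; not taken from any existing tool. */
final class WaveScheduler {

    /** A named step with ordering constraints and something to execute. */
    static final class BuildStep {
        final String name;
        final Set<String> predecessors = new LinkedHashSet<>();
        final Set<String> successors = new LinkedHashSet<>();
        final Runnable executor; // not invoked by this sketch, only carried along

        BuildStep(String name, Runnable executor) {
            this.name = name;
            this.executor = executor;
        }

        BuildStep after(String other) { predecessors.add(other); return this; }

        BuildStep before(String other) { successors.add(other); return this; }
    }

    /**
     * Groups steps into waves: every step in a wave has all of its
     * predecessors in earlier waves, so the steps within a wave are
     * independent of each other.
     */
    static List<List<String>> waves(List<BuildStep> steps) {
        Map<String, Set<String>> waitingOn = new LinkedHashMap<>();
        for (BuildStep step : steps) {
            waitingOn.put(step.name, new LinkedHashSet<>(step.predecessors));
        }
        // Normalise "A before B" into "B after A" so only one kind of edge remains.
        for (BuildStep step : steps) {
            for (String successor : step.successors) {
                Set<String> deps = waitingOn.get(successor);
                if (deps == null) {
                    throw new IllegalArgumentException("Unknown step: " + successor);
                }
                deps.add(step.name);
            }
        }
        List<List<String>> result = new ArrayList<>();
        while (!waitingOn.isEmpty()) {
            List<String> ready = new ArrayList<>();
            for (Map.Entry<String, Set<String>> entry : waitingOn.entrySet()) {
                if (entry.getValue().isEmpty()) {
                    ready.add(entry.getKey());
                }
            }
            if (ready.isEmpty()) {
                throw new IllegalStateException(
                        "Cycle or unknown predecessor among: " + waitingOn.keySet());
            }
            for (String name : ready) {
                waitingOn.remove(name);
            }
            for (Set<String> deps : waitingOn.values()) {
                deps.removeAll(ready);
            }
            result.add(ready);
        }
        return result;
    }
}
```

With steps “compile”, “assembleTestClasspath” (a made-up name), “executeTests” after both of those, “copyPropertiesTemplate” before “package”, and “package” after “executeTests”, this yields three waves: the first contains “compile”, “assembleTestClasspath” and “copyPropertiesTemplate”, the second “executeTests” and the third “package”. A real scheduler would probably start each step as soon as its own predecessors finish rather than waiting for a whole wave, but the waves make the available parallelism easy to see.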

Gradual Buildup of DI Scope

Build steps will need to communicate results to one another. Some of this will of necessity be done via the file system – for instance, the compilation step will leave class files in a well-defined place for the test execution and packaging steps to pick up. Other results should be communicated in-memory, such as the current set of build properties and values, or the exact classpath to be used during compilation. The latter should be the output of some assembleClassPath step that checks or downloads dependencies and provides an exact list of JAR files and directories to be used by the compiler. You don’t want to store that sort of thing on the file system.

In-memory parameters should be injected into the executors of subsequent build steps that need them. This means that the build step executors will be gradually adding stuff to the set of objects that can be injected into later executors. A concrete implementation of this that I have been experimenting with is using hierarchical Guice injectors to track this. That means that each step of the build returns a (possibly empty) Guice module, which is then used to create an injector that inherits all the previous bindings from preceding steps. I think that works reasonably well in a linear context, but that merging injectors in a more complex build scenario is harder. A possibly better solution is to use the full set of modules used and created by previous steps to create a completely new injector at the start of each build step. Merging is then simply taking the union of the sets of modules used by the two merging paths through the DAG.
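The “union of modules” idea can be sketched without any DI framework at all. The code below deliberately does not use Guice; it is a plain-Java stand-in in which a hypothetical Module class holds type-to-instance bindings, merge takes the union of the module sets seen along two paths, and freshScope rebuilds the lookup table from scratch the way a new injector would be created at the start of each build step. All names here are invented for the illustration.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Plain-Java stand-in for the module-union idea; this is not Guice. */
final class ModuleUnion {

    /** Stand-in for a DI module: a named bundle of type-to-instance bindings. */
    static final class Module {
        final String name;
        final Map<Class<?>, Object> bindings = new LinkedHashMap<>();

        Module(String name) {
            this.name = name;
        }

        <T> Module bind(Class<T> type, T instance) {
            bindings.put(type, instance);
            return this;
        }
    }

    /** Merging two paths through the DAG: the union of the modules each has seen. */
    static Set<Module> merge(Set<Module> left, Set<Module> right) {
        Set<Module> union = new LinkedHashSet<>(left);
        union.addAll(right); // modules shared by both paths are kept only once
        return union;
    }

    /** Builds a completely new lookup table from the full set of modules. */
    static Map<Class<?>, Object> freshScope(Set<Module> modules) {
        Map<Class<?>, Object> scope = new LinkedHashMap<>();
        for (Module module : modules) {
            for (Map.Entry<Class<?>, Object> binding : module.bindings.entrySet()) {
                Object previous = scope.putIfAbsent(binding.getKey(), binding.getValue());
                if (previous != null && previous != binding.getValue()) {
                    throw new IllegalStateException(
                            "Conflicting bindings for " + binding.getKey().getName());
                }
            }
        }
        return scope;
    }
}
```

Because a module produced by a common ancestor step is the same object on both paths, the union removes the duplicate automatically. What the sketch leaves open is what should happen when two parallel paths bind the same type to different instances; here that is simply treated as an error.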

This idea bears some resemblance to the concept of dynamic dependency injection, but it is different in that there are parallel contexts (one for each concurrently executing path) that are mutating as each build step is executed.

I much prefer using DI to inject dependencies into plugins or build steps over exposing the entire project configuration + state to plugins, for all the usual reasons. It’s a bit hard to get right from a framework perspective, but I think it should help simplify the plugins and keep them isolated from one another.

Core Build Components Provided

One thing that Maven does really well is to support simple builds. The minimal pom is really very tiny. I think this is great both in terms of usability and as an indication of powerful build standardisation/strong conventions. In the design outlined in this post, the things that would have to come pre-configured with useful defaults are a set of build steps with correct dependencies and associated executors. So there would be a step that assembles the classpath for the main compilation, another one that assembles the classpath for the test compilation, yet another one that assembles the classpath for the test execution and probably even one that assembles the classpath to be used in the final package. These steps would probably all make use of a shared entity that knows how to assemble classpaths, and that is configured to know about a set of repositories from which it can download dependencies.

By default, the classpath assembler would know about one or a few core repositories (ibiblio.org for sure). Most commercial users will hopefully have their own internal Maven repositories, so it needs to be possible to tell the classpath assembler about these. Similarly, the compiler should have some useful defaults for source version, file encodings, etc., but they should be possible to override in a single place and then apply to all steps that use them.

Of course, the executors (classpath assembler, compiler, etc.) would be shared by build steps by default, but they shouldn’t necessarily be singletons – if one wanted to compile the test code using a different set of compiler flags, one could configure the build to have two compiler instances with different parameters.
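As a small sketch of what “shared by default, but not a singleton” could look like for the compiler, the code below uses an immutable, hypothetical CompilerSettings class. The default values are placeholders rather than a proposal, and the only real-world names are the javac options -source, -encoding and -Xlint:unchecked.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch of overridable compiler defaults. */
final class CompilerDefaults {

    /** Immutable settings for one compiler instance. */
    static final class CompilerSettings {
        final String sourceVersion;
        final String encoding;
        final List<String> extraFlags;

        CompilerSettings(String sourceVersion, String encoding, List<String> extraFlags) {
            this.sourceVersion = sourceVersion;
            this.encoding = encoding;
            this.extraFlags = Collections.unmodifiableList(new ArrayList<>(extraFlags));
        }

        /** Placeholder defaults; the real values would be the tool's decision. */
        static CompilerSettings defaults() {
            return new CompilerSettings("1.6", "UTF-8", Collections.emptyList());
        }

        CompilerSettings withSourceVersion(String version) {
            return new CompilerSettings(version, encoding, extraFlags);
        }

        CompilerSettings withExtraFlags(String... flags) {
            return new CompilerSettings(sourceVersion, encoding, Arrays.asList(flags));
        }

        /** Renders the settings as javac-style command line arguments. */
        List<String> toArguments() {
            List<String> arguments = new ArrayList<>(Arrays.asList(
                    "-source", sourceVersion, "-encoding", encoding));
            arguments.addAll(extraFlags);
            return arguments;
        }
    }

    public static void main(String[] args) {
        // Overridden once, then shared by every step that compiles something.
        CompilerSettings shared = CompilerSettings.defaults().withSourceVersion("1.7");
        // A second instance for the test code, differing only in its flags.
        CompilerSettings forTests = shared.withExtraFlags("-Xlint:unchecked");

        // prints [-source, 1.7, -encoding, UTF-8]
        System.out.println(shared.toArguments());
        // prints [-source, 1.7, -encoding, UTF-8, -Xlint:unchecked]
        System.out.println(forTests.toArguments());
    }
}
```

Making the settings immutable means the override in one place can’t be changed behind the back of the steps that share it, while a derived copy gives the test compilation its own flags.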

The core set of build steps and executors should at a minimum allow you to build, test and deploy (in the Maven sense) a JAR library. Probably, there should be more stuff that is considered to be core than just that.

Naming the Tool

The final idea is a name – Bygg. Bygg is a Swedish word that means “build” as in “build X!”, not as in “a build” or “to build” (a verb in imperative form, in other words). It’s probably one letter too long, but at least it’s not too hard to type the second ‘g’ when you’ve already typed the first. It’s got the right number of syllables and it means the right thing. It’s even relatively easy to pronounce if you know English (make it sound like “big” and you’ll be fine), although of course you have to have a Scandinavian background to really pronounce the “y” right.

That’s more than enough word count for this post. I have some more ideas about dependency management, APIs and flows in the tool, but I’ll have to save them for another time. Feel free to comment on these ideas, particularly about areas where I am missing the mark!

Objectives for A Better Maven

My friend Josh Slack made me aware of this post, by a guy (Kent Spillner) who is totally against Maven in almost every way. As I’ve mentioned before, I think Maven is the best tool out there for Java builds, so of course I like it better than Kent does. Still, there’s no doubt he has some points that you can’t help agreeing with. Reading his post made me think (once again) about what is great and not so great about Maven, and also of some ideas about how to fix the problems whilst retaining the great stuff (edit: I’ve started outlining these ideas here, with more to follow).

First, some of the things that are great:

  1. Dependency Management – I would go so far as to argue that Maven has done more to enable code reuse than anything else that is touted as a ‘reusability paradigm’ (such as OO itself). Before Maven and its repositories, you had to manually add every single dependency and its transitive requirements into each project, typically even into your source repository. The amount of manual effort needed to upgrade from one version of a library, and its transitive dependencies, to another means the optimal size of a library is quite large, making libraries unfocused and bloated. What’s more, it also means that library designers have a strong need to reduce the number of things they allow themselves to depend on, which reduces the scope for code reuse. With Maven, libraries can be more focused as it is effortless to have a deep dependency tree. At Shopzilla, our top-level builds typically include 50-200 dependencies. Imagine adding these to your source repository and keeping them up to date with every change – completely impossible!
  2. Build standardisation. The first sentence in Kent Spillner’s post is “The best build tool is the one you write yourself”. That’s probably true from the perspective of a single project, but with a larger set of projects that are collaboratively owned by multiple teams of developers, that idea breaks quickly. Again, I’ll use Shopzilla as an example – we have more than 100 Git repositories with Java code that are co-owned by 5-6 different teams. This means we must have standardised builds, or we would waste lots of time due to having to learn about custom builds for each project. Any open source project exists in an even larger ecosystem; essentially a global one. So unless you know that the number of developers who will be building your project is always going to be small, and that these developers will only have to work with a small number of projects, your build should be “mostly declarative” and as standardised as you can make it.
  3. The wealth of plugins that allow you to do almost any build-related task. This is thanks to the focus on a plugin-based architecture right from the get-go.
  4. The close integration with IDEs that makes it easier (though not quite painless) to work with it.

Any tool that would improve on Maven has to at least do equally well on those four counts.

To get a picture of the opportunities for improvement, here’s my list of Maven major pain points:

  1. Troubleshooting is usually hard to extremely hard. When something breaks, you get very little help from Maven to figure out what it is. Enabling debug level logging on the build makes it verbose to the point of obscuring the error. If you, like me, like to use the source code to find out what is happening, it is difficult to find because you will have to jump from plugin to plugin, and most plugins have different source repositories.
  2. Even though there is a wealth of plugins that allow you to do almost anything, it is usually unreasonably hard to a) find the right plugin and b) figure out how to use it. Understanding what Maven and its plugins do is really hard, and the documentation is frequently sub-standard.
  3. A common complaint is the verbose XML configuration. That is definitely an issue: succinctness improves readability and ease of use.
  4. The main drawback of the transitive dependency management is the risk of getting incompatible versions of the same library, or even worse, class. There is very little built-in support for managing this problem in Maven (there is some in the dependency plugin, but actually using that is laborious). This means that it is not uncommon to have large numbers of ‘exclusion’ tags for some dependencies polluting your build files, and that you anyway tend to have lots of stuff that you never use in your builds.
  5. Maven is slow, there’s no doubt about that. It takes time to create various JVMs, to download files/check for updates, etc. Also, every build runs through the same build steps even if some of them are not needed.
  6. Builds can succeed or fail on different machines for unobvious reasons – typically, the problem is due to differing versions of SNAPSHOT dependencies being installed in the local repository cache. It can also be due to using different versions of Maven plugins.

There’s actually quite a lot more that could be improved, but those are probably my main gripes. When listing them like this, I’m surprised to note that despite all these issues, I still think Maven is the best Java build tool out there. I really do think it is the best, but there’s no doubt that there’s plenty of room to improve things. So I’ve found myself thinking about how I would go about building a better Maven. I am not sure if I’ll be able to actually find the time to implement it, but it is fascinating enough that I can’t let go of the idea. Here’s what I would consider a useful set of objectives for an improved Maven, in order of importance:

  1. Perfect interoperability with the existing Maven artifact management repository infrastructure. There is so much power and value in being able to with near-zero effort get access to pretty much any open source project in the Java world that it is vital to be able to tap into that. Note that the value isn’t just in managing dependencies in a similar way to how Maven does it, but actually reusing the currently available artifacts and repositories.
  2. Simplified troubleshooting. More and better consistency checks of various kinds and at earlier stages of the build. Better and more to the point reporting of problems. Great frameworks tend to make this a key part of the architecture from the get-go rather than add it on as an afterthought.
  3. A pluggable architecture that makes it easy to add custom build actions. This is one of Maven’s great success points so a new framework has to be at least as good. I think it could and should be even easier than Maven makes it.
  4. Encouraging but not enforcing standardised builds. This means sticking to the idea of “convention over configuration”. It also means that defining your build should be a “mostly declarative” thing, not an imperative thing. You should say “I want a JAR”, not “zip up the class files in directory X”. Programming your builds is sometimes a necessary evil so it must be possible, but it should be discouraged as it is a slippery slope that leads to non-standardised builds, which in turn means making it harder for anybody coming new to a project to get going with it.
  5. Great integration with IDEs, or at least support for the ability to create great integration with IDEs. This is a necessary part of giving programmers a workflow that never forces them out of the zone.
  6. Less verbose configuration. Not a show-stopper in my opinion, but definitely a good thing to improve.
  7. EDIT: While writing further posts on this topic, I’ve realised that there is one more thing that I consider very important: improving performance. Waiting for builds is a productivity-killing drag.

It’s good to specify what you want to do, but in some ways, it’s even better to specify things you’re not trying to achieve either because they’re not relevant or because you find them counter-productive. That gives a different kind of clarity. So here’s a couple of non-objectives:

  1. Using the same artifact management mechanism for build tool plugins as for project dependencies, the way Maven does. While there is some elegance to this idea, it also comes with a host of difficulties – unreproducible builds being the main one, and while earlier versions of Maven actively updated plugin versions most or all the time, Maven 3 now issues warnings if you haven’t specified the plugin versions for your build.
  2. Reimplementing all the features provided by Maven plugins. Obviously, trying to out-feature something so feature-rich as Maven would be impossible and limit the likelihood of success hugely. So one thing to do is to select a subset of build steps that represent the most common and/or most different things that are typically done in a build and then see how well the framework deals with that.
  3. Being compatible with Maven plugins. In a way, it would be great for a new build tool to be able to use any existing Maven plugin. But being able to do that would limit the architectural options and increase the complexity of the new architecture to the point of making it unlikely to succeed.
  4. Reproducing the ‘project information’ as a core part of the new tool. Producing project information was one of the core goals of Maven when it was first created. I personally find that less than useful, and therefore not worth making into a core part of a better Maven. It should of course be easy to create a plugin that does this, but it doesn’t have to be a core feature.

I’ve got some ideas for how to build a build tool that meets or is likely to meet most of those objectives. But this post is already more than long enough, and I’d anyway like to stop here to ask for some feedback. Any opinions on the strengths, weaknesses and objectives outlined here?
