Archive for January, 2011

Bygg – Ideas for a Better Maven

In a previous post, I outlined some objectives for a better Maven. In this post, I want to talk about some ideas for how to achieve those objectives.

Build Java with Java

The first idea is that the default modus operandi should be that builds are runnable as a regular Java application from inside the IDE, just like the application that is being built. The IDE integration of the build tool should ensure that source code for the build tool and its core plugins is attached so that the developer can navigate through the build tool source. A not quite necessary part of this is that build configuration should be done using Java rather than XML, Ruby, or something else. This gives the following benefits:

  • Troubleshooting the build is as easy as starting it in the debugger and setting breakpoints. To understand how to use a plugin, you can navigate to the source implementing it from your build configuration. I think this is potentially incredibly valuable. If it is possible to integrate with IDEs to such an extent that the full source code of the libraries and plugins that are used in the build is available, then stepping through and debugging builds is as easy as for any library used in the product being built. And harnessing the highly trained skills that most Java developers have developed for understanding how a third-party library works should make the build system much more accessible than having to rely on documentation.
  • It can safely be assumed that Java developers are or at least want to be very proficient in Java. Using a real programming language for build configurations is very advantageous compared to using XML, especially for those occasions when you have to customise your build a little extra. (This should be the exception rather than the rule as it increases the risk of feature creep and reduces standardisation.)
  • The IDEs that Java developers use can be expected to be great tools for writing Java code: code completion, Javadoc popups, instant navigation to the source that implements a feature, etc. This should reduce the complexity of IDE integration.
  • It opens up for very readable DSL-style APIs, which should reduce the build script complexity and increase succinctness. Also, using Java means you could instantly see which configuration options are available for the thing you’re tweaking at the moment (through method completion, checking values of enums, etc., etc.).

There are some drawbacks, such as having to figure out a way to bootstrap the build (the build configuration needs to be compiled before a build can be run), and the fact that you have to provide a command-line tool anyway for continuous integration, etc. But I don’t think those problems are hard enough to be significant.

My first idea was to use Java for the build configuration files, and I thought that would be great. But as I’ve been thinking about it, I’ve concluded that the exact format of the configuration files is less important than making sure that developers can navigate through the build source code in exactly the same way as through any third party library. That is something I’ve wanted to do many times when having problems with Maven builds, but it’s practically impossible today. There’s no reason why it should be harder to troubleshoot your build than your application.

Interlude: Aspects of Build Configurations

Identifying and separating things that need to be distinct and treated differently is one of the hard things to do when designing any program, and one of the things that has truly great effect if you get it right. So far, I’ve come up with a few different aspects of a build, namely:

  • The project or artifact properties – this includes things such as the artifact id, group id, version etc., and can also include useful information such as a project description, SCM links, etc.
  • The project’s dependencies – the third-party libraries that are needed in the build; either at compile time, test time or package time.
  • The build properties – any variables that you have in your build. A lot of times, you want to be able to have environment-specific variables that you use, or to refer to build-specific variables such as the output directory for compiled classes. The distinction between variables that may be overridden on a per-installation basis and variables that are determined during the build may mean that there is more than one kind of property.
  • The steps that are taken to complete the build – compilations, copying, zipping, running tests, and so on.
  • The things that execute the various steps – the compiler, the test executor, a plugin that helps generate code from some type of resources, etc.

The point of separating these aspects of the build configuration is that it is likely that you’ll want to treat them differently. In Maven, almost everything is done in a single POM file, which grows large and becomes hard to get an overview of, and/or in a settings.xml file that adds obscurity and reduces build portability. An example of a problem with bunching all the build configuration data into a single file is IntelliJ’s excellent Maven plugin, which wants to reimport the pom.xml with every change you make. Reimporting takes a lot of time on my chronically overloaded Macbook (well, 5-15 seconds, I guess – far too much time). It’s necessary because if I’ve changed the dependencies, IntelliJ needs to update its classpath. The thing is, I think more than 50% or even 70% of the changes I make to pom files don’t affect the dependencies. If the dependencies section were separate from the rest of the build configuration, reimporting could be done only when actually needed.

I don’t think this analysis has landed yet; it feels like some pieces or nuances are still missing. But it helps as background for the rest of the ideas outlined in this post. The steps taken during the build, and the order in which they should be taken, are separate from the things that do something during each step.

Non-linear Build Execution

The second idea is an old one: abandoning Maven’s linear build lifecycle and instead getting back to the way that make does things (which is also Ant’s way). So rather than having loads of predefined steps in the build, it’s much better to be able to specify a DAG of dependencies between steps that defines the order of execution. This is better for at least two reasons: first, it’s much easier to understand the order of execution if you say “do A before B” or “do B after A” in your build configuration than if you say “do A during the process-classes phase and B during the generate-test-sources phase”. And second, it opens up the door to do independent tasks in parallel, which in turn creates opportunities for performance improvements. So for instance, it could be possible to download dependencies for the test code in parallel with the compilation of the production code, and it should be possible to zip up the source code JAR file at the same time as JavaDocs are generated.
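To make the parallelism point a bit more tangible, here is a minimal sketch using only the JDK's CompletableFuture. The class name ParallelStepsSketch and the step names are invented for illustration, and the "steps" just record their names rather than doing real work; it is not code from any actual build tool.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical illustration: two independent steps run concurrently,
// and a third step starts only once both have finished.
final class ParallelStepsSketch {

    static List<String> run() {
        // Thread-safe log of the order in which steps completed.
        List<String> log = new CopyOnWriteArrayList<>();

        // These two steps do not depend on each other, so they may overlap.
        CompletableFuture<Void> compileMain =
                CompletableFuture.runAsync(() -> log.add("compileMain"));
        CompletableFuture<Void> downloadTestDeps =
                CompletableFuture.runAsync(() -> log.add("downloadTestDependencies"));

        // Compiling the tests needs both the production classes and the test dependencies.
        CompletableFuture<Void> compileTests =
                CompletableFuture.allOf(compileMain, downloadTestDeps)
                        .thenRun(() -> log.add("compileTests"));

        compileTests.join(); // wait for the whole mini-build to finish
        return log;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

The first two entries in the log can appear in either order, but compileTests always comes last, which is all that the dependency declarations promise.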

What this means in concrete terms is that you would write something like this in your build configuration file:


  buildSteps.add(step("copyPropertiesTemplate")
         .executor(new CopyFile("src/main/template/properties.template",
                                "${OUTPUT_DIR}/properties.txt"))
         .before("package"));

Selecting what to build would be done via the build step names – so if all you wanted to do was to copy the properties template file, you would pass “copyPropertiesTemplate” to the build. The tool would look through the build configuration and in this case probably realise that nothing needs to be run before that step, so the “copyPropertiesTemplate” step would be all that was executed. If, on the other hand, the user stated that the “package” step should be executed, the build tool would discover that lots of things have to be done before – not only “copyPropertiesTemplate” but also “performCoverageChecks”, which in turn requires “executeTests”, and so on.

As the example shows, I would like to add a feature to the make/Ant version: specifying that a certain build step should happen before another one. The reason is that I think that the build tool should come with a default set of build steps that allow the most common build tasks to be run with zero configuration (see below for more on that). So you should always be able to say “compile”, or “test”, and that should just work as long as you stick to the conventions for where you store your source code, your tests, etc. This makes it awkward for a user to define a build step like the one above in isolation, and then after that have to modify the pre-existing “package” step to depend on the “copyPropertiesTemplate” one.
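The selection behaviour described above, together with the “before” declaration, can be sketched in a few lines. This is only an illustration under my own assumptions: StepSelectionSketch and its after/before/required methods are hypothetical names, and the default chain of steps is made up from the step names mentioned in the text.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration of picking the steps needed for a requested step.
final class StepSelectionSketch {
    // step name -> names of the steps that must run before it
    private final Map<String, Set<String>> predecessors = new HashMap<>();

    // make/Ant style: "step" runs after "predecessor".
    StepSelectionSketch after(String step, String predecessor) {
        predecessors.computeIfAbsent(step, k -> new LinkedHashSet<>()).add(predecessor);
        predecessors.computeIfAbsent(predecessor, k -> new LinkedHashSet<>());
        return this;
    }

    // The added feature: "step" runs before "successor". It is stored as the
    // same kind of edge, so the pre-existing successor does not need editing.
    StepSelectionSketch before(String step, String successor) {
        return after(successor, step);
    }

    // The requested step plus everything that transitively has to run before it.
    Set<String> required(String requested) {
        Set<String> seen = new TreeSet<>();
        Deque<String> pending = new ArrayDeque<>();
        pending.push(requested);
        while (!pending.isEmpty()) {
            String step = pending.pop();
            if (seen.add(step)) {
                for (String p : predecessors.getOrDefault(step, Set.of())) {
                    pending.push(p);
                }
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        StepSelectionSketch build = new StepSelectionSketch()
                .after("executeTests", "compile")
                .after("performCoverageChecks", "executeTests")
                .after("package", "performCoverageChecks")
                .before("copyPropertiesTemplate", "package");
        System.out.println(build.required("copyPropertiesTemplate"));
        System.out.println(build.required("package"));
    }
}
```

Asking for “copyPropertiesTemplate” yields just that step, while asking for “package” pulls in the whole chain, including the user-defined step that was attached with before.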

In design terms, there would have to be some sort of BuildStep entity that has a unique name (for the user interface), a set of predecessors and successors, and something that should be executed. There will also have to be a scheduler that can figure out a good order to execute steps in. I’ve made some skeleton implementations of this, and it feels like a good solution that is reasonably easy to get right. One thing I’ve not made up my mind about is the name of this entity – the two main candidates are BuildStep and Target. Build step explains well what it is from the perspective of the build tool, while Target reminds you of Ant and Make and focuses the attention on the fact that it is something a user can request the tool to do. I’ll use build step for the remainder of this post, but I’m kind of thinking that target might be a better name.
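As a rough idea of what such an entity and scheduler might look like, here is a small sketch. It is not the skeleton implementation mentioned above; the names SchedulerSketch, Step and order are assumptions made for this example, and the scheduler is a deliberately naive topological sort that produces one valid sequential order.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration of a build step entity and a simple scheduler.
final class SchedulerSketch {

    // A named step with predecessors, successors and something to execute.
    static final class Step {
        final String name;
        final Runnable executor;
        final Set<String> predecessors = new LinkedHashSet<>();
        final Set<String> successors = new LinkedHashSet<>();

        Step(String name, Runnable executor) {
            this.name = name;
            this.executor = executor;
        }

        Step after(String other) { predecessors.add(other); return this; }
        Step before(String other) { successors.add(other); return this; }
    }

    // Returns the step names in an order that respects all before/after declarations.
    static List<String> order(List<Step> steps) {
        // step name -> names of steps it is still waiting for
        Map<String, Set<String>> waitingOn = new LinkedHashMap<>();
        for (Step s : steps) {
            waitingOn.put(s.name, new LinkedHashSet<>(s.predecessors));
        }
        for (Step s : steps) {
            for (String successor : s.successors) {
                Set<String> waits = waitingOn.get(successor);
                if (waits == null) {
                    throw new IllegalArgumentException("Unknown step: " + successor);
                }
                waits.add(s.name);
            }
        }
        List<String> ordered = new ArrayList<>();
        while (!waitingOn.isEmpty()) {
            String ready = null;
            for (Map.Entry<String, Set<String>> e : waitingOn.entrySet()) {
                if (e.getValue().isEmpty()) { ready = e.getKey(); break; }
            }
            if (ready == null) {
                throw new IllegalStateException("Cycle or unknown predecessor among " + waitingOn.keySet());
            }
            waitingOn.remove(ready);
            for (Set<String> waits : waitingOn.values()) {
                waits.remove(ready);
            }
            ordered.add(ready);
        }
        return ordered;
    }

    public static void main(String[] args) {
        List<Step> steps = List.of(
                new Step("package", () -> {}).after("compile"),
                new Step("compile", () -> {}),
                new Step("copyPropertiesTemplate", () -> {}).before("package"));
        System.out.println(order(steps));
    }
}
```

A real scheduler would presumably also run the executors and exploit parallelism, but even this version shows that the entity needs little more than a name, two sets of neighbours and a Runnable-like executor.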

Gradual Buildup of DI Scope

Build steps will need to communicate results to one another. Some of this will of necessity be done via the file system – for instance, the compilation step will leave class files in a well-defined place for the test execution and packaging steps to pick up. Other results should be communicated in-memory, such as the current set of build properties and values, or the exact classpath to be used during compilation. The latter should be the output of some assembleClassPath step that checks or downloads dependencies and provides an exact list of JAR files and directories to be used by the compiler. You don’t want to store that sort of thing on the file system.

In-memory parameters should be injected into the executors of subsequent build steps that need them. This means that the build step executors will be gradually adding stuff to the set of objects that can be injected into later executors. A concrete implementation of this that I have been experimenting with is using hierarchical Guice injectors to track this. That means that each step of the build returns a (possibly empty) Guice module, which is then used to create an injector that inherits all the previous bindings from preceding steps. I think that works reasonably well in a linear context, but that merging injectors in a more complex build scenario is harder. A possibly better solution is to use the full set of modules used and created by previous steps to create a completely new injector at the start of each build step. Merging is then simply taking the union of the sets of modules used by the two merging paths through the DAG.
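The “union of module sets” idea can be illustrated without Guice at all. In the sketch below, Module is a plain stand-in class (not Guice's Module interface), and newInjector builds a simple lookup map where a Guice-based version would presumably pass the merged set to Guice.createInjector. All names here are assumptions made for the example.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Properties;
import java.util.Set;

// Hypothetical, Guice-free illustration of merging module sets at a join in the DAG.
final class ModuleUnionSketch {

    // Stand-in for a DI module: the bindings contributed by one build step.
    static final class Module {
        final String producedBy;
        final Map<Class<?>, Object> bindings = new HashMap<>();

        Module(String producedBy) { this.producedBy = producedBy; }

        <T> Module bind(Class<T> type, T instance) {
            bindings.put(type, instance);
            return this;
        }
    }

    // Merging two paths through the DAG: the union of the modules seen on either path.
    // Modules compare by identity, so a module shared by both paths appears once.
    static Set<Module> merge(Set<Module> left, Set<Module> right) {
        Set<Module> union = new LinkedHashSet<>(left);
        union.addAll(right);
        return union;
    }

    // Stand-in for creating a completely new injector at the start of a build step.
    static Map<Class<?>, Object> newInjector(Set<Module> modules) {
        Map<Class<?>, Object> injector = new HashMap<>();
        for (Module m : modules) {
            for (Map.Entry<Class<?>, Object> e : m.bindings.entrySet()) {
                if (injector.putIfAbsent(e.getKey(), e.getValue()) != null) {
                    throw new IllegalStateException("Duplicate binding for " + e.getKey().getName());
                }
            }
        }
        return injector;
    }

    public static void main(String[] args) {
        Module properties = new Module("loadProperties").bind(Properties.class, new Properties());
        Module classPath = new Module("assembleClassPath").bind(StringBuilder.class, new StringBuilder("lib/a.jar"));
        Module generated = new Module("generateSources").bind(Path.class, Paths.get("target/generated"));

        Set<Module> pathA = new LinkedHashSet<>(Set.of(properties, classPath));
        Set<Module> pathB = new LinkedHashSet<>(Set.of(properties, generated));
        System.out.println(newInjector(merge(pathA, pathB)).keySet());
    }
}
```

Because the shared properties module is the same object on both paths, the union contains it only once, and the duplicate-binding check is reserved for genuinely conflicting steps.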

This idea bears some resemblance to the concept of dynamic dependency injection, but it is different in that there are parallel contexts (one for each concurrently executing path) that are mutating as each build step is executed.

I much prefer using DI to inject dependencies into plugins or build steps over exposing the entire project configuration + state to plugins, for all the usual reasons. It’s a bit hard to get right from a framework perspective, but I think it should help simplify the plugins and keep them isolated from one another.

Core Build Components Provided

One thing that Maven does really well is to support simple builds. The minimal pom is really very tiny. I think this is great both in terms of usability and as an indication of powerful build standardisation/strong conventions. In the design outlined in this post, the things that would have to come pre-configured with useful defaults are a set of build steps with correct dependencies and associated executors. So there would be a step that assembles the classpath for the main compilation, another one that assembles the classpath for the test compilation, yet another one that assembles the classpath for the test execution and probably even one that assembles the classpath to be used in the final package. These steps would probably all make use of a shared entity that knows how to assemble classpaths, and that is configured to know about a set of repositories from which it can download dependencies.

By default, the classpath assembler would know about one or a few core repositories (ibiblio.org for sure). Most commercial users will hopefully have their own internal Maven repositories, so it needs to be possible to tell the classpath assembler about these. Similarly, the compiler should have some useful defaults for source version, file encodings, etc., but they should be possible to override in a single place and then apply to all steps that use them.

Of course, the executors (classpath assembler, compiler, etc.) would be shared by build steps by default, but they shouldn’t necessarily be singletons – if one wanted to compile the test code using a different set of compiler flags, one could configure the build to have two compiler instances with different parameters.
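One way to picture “shared by default, but not singletons” is an immutable compiler configuration that can be copied with tweaks. CompilerSketch and its methods are hypothetical names for this example, and the default source version is just a placeholder; the flags themselves (-source, -encoding, -g) are ordinary javac options.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration: an immutable compiler executor that build steps
// share by default, but that can be cloned with different parameters.
final class CompilerSketch {
    private final String sourceVersion;
    private final String encoding;
    private final List<String> extraFlags;

    private CompilerSketch(String sourceVersion, String encoding, List<String> extraFlags) {
        this.sourceVersion = sourceVersion;
        this.encoding = encoding;
        this.extraFlags = Collections.unmodifiableList(new ArrayList<>(extraFlags));
    }

    // Defaults that would be overridden in a single place (values are placeholders).
    static CompilerSketch defaults() {
        return new CompilerSketch("1.6", "UTF-8", Collections.<String>emptyList());
    }

    // Returns a new instance; the original stays untouched for the steps that share it.
    CompilerSketch withFlags(String... flags) {
        List<String> combined = new ArrayList<>(extraFlags);
        combined.addAll(Arrays.asList(flags));
        return new CompilerSketch(sourceVersion, encoding, combined);
    }

    // The javac-style argument list this instance would use.
    List<String> arguments() {
        List<String> args = new ArrayList<>(Arrays.asList("-source", sourceVersion, "-encoding", encoding));
        args.addAll(extraFlags);
        return args;
    }

    public static void main(String[] args) {
        CompilerSketch mainCompiler = defaults();
        CompilerSketch testCompiler = mainCompiler.withFlags("-g");
        System.out.println(mainCompiler.arguments());
        System.out.println(testCompiler.arguments());
    }
}
```

The main compilation step and any other step that is happy with the defaults would share mainCompiler, while the test compilation step would be wired to testCompiler.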

The core set of build steps and executors should at a minimum allow you to build, test and deploy (in the Maven sense) a JAR library. Probably, there should be more stuff that is considered to be core than just that.

Naming the Tool

The final idea is a name – Bygg. Bygg is a Swedish word that means “build” as in “build X!”, not as in “a build” or “to build” (in other words, a verb in imperative form). It’s probably one letter too long, but at least it’s not too hard to type the second ‘g’ when you’ve already typed the first. It’s got the right number of syllables and it means the right thing. It’s even relatively easy to pronounce if you know English (make it sound like “big” and you’ll be fine), although of course you have to have a Scandinavian background to really pronounce the “y” right.

That’s more than enough word count for this post. I have some more ideas about dependency management, APIs and flows in the tool, but I’ll have to save them for another time. Feel free to comment on these ideas, particularly about areas where I am missing the mark!


Objectives for A Better Maven

My friend Josh Slack made me aware of this post, by a guy (Kent Spillner) who is totally against Maven in almost every way. As I’ve mentioned before, I think Maven is the best tool out there for Java builds, so of course I like it better than Kent does. Still, there’s no doubt he has some points that you can’t help agreeing with. Reading his post made me think (once again) about what is great and not so great about Maven, and also of some ideas about how to fix the problems whilst retaining the great stuff (edit: I’ve started outlining these ideas here, with more to follow).

First, some of the things that are great:

  1. Dependency Management – I would go so far as to argue that Maven has done more to enable code reuse than anything else that is touted as a ‘reusability paradigm’ (such as OO itself). Before Maven and its repositories, you had to manually add every single dependency and their transitive requirements into each project, typically even into your source repository. The amount of manual effort needed to upgrade a library and its transitive dependencies from one version to another means the optimal size of a library is quite large, making libraries unfocused and bloated. What’s more, it also means that library designers have a strong need to reduce the number of things they allow themselves to depend on, which reduces the scope for code reuse. With Maven, libraries can be more focused as it is effortless to have a deep dependency tree. At Shopzilla, our top-level builds typically include 50-200 dependencies. Imagine adding these to your source repository and keeping them up to date with every change – completely impossible!
  2. Build standardisation. The first sentence in Kent Spillner’s post is “The best build tool is the one you write yourself”. That’s probably true from the perspective of a single project, but with a larger set of projects that are collaboratively owned by multiple teams of developers, that idea breaks quickly. Again, I’ll use Shopzilla as an example – we have more than 100 Git repositories with Java code that are co-owned by 5-6 different teams. This means we must have standardised builds, or we would waste lots of time due to having to learn about custom builds for each project. Any open source project exists in an even larger ecosystem; essentially a global one. So unless you know that the number of developers who will be building your project is always going to be small, and that these developers will only have to work with a small number of projects, your build should be “mostly declarative” and as standardised as you can make it.
  3. The wealth of plugins that allow you to do almost any build-related task. This is thanks to the focus on a plugin-based architecture right from the get-go.
  4. The close integration with IDEs that makes it easier (though not quite painless) to work with it.

Any tool that would improve on Maven has to at least do equally well on those four counts.

To get a picture of the opportunities for improvement, here’s my list of Maven major pain points:

  1. Troubleshooting is usually hard to extremely hard. When something breaks, you get very little help from Maven to figure out what it is. Enabling debug level logging on the build makes it verbose to the point of obscuring the error. If you, like me, like to use the source code to find out what is happening, it is difficult to find because you will have to jump from plugin to plugin, and most plugins have different source repositories.
  2. Even though there is a wealth of plugins that allow you to do almost anything, it is usually unreasonably hard to a) find the right plugin and b) figure out how to use it. Understanding what Maven and its plugins do is really hard, and the documentation is frequently sub-standard.
  3. A common complaint is the verbose XML configuration. That is definitely an issue, since succinctness improves readability and ease of use.
  4. The main drawback of the transitive dependency management is the risk of getting incompatible versions of the same library, or even worse, class. There is very little built-in support for managing this problem in Maven (there is some in the dependency plugin, but actually using that is laborious). This means that it is not uncommon to have large numbers of ‘exclusion’ tags for some dependencies polluting your build files, and that you anyway tend to have lots of stuff that you never use in your builds.
  5. Maven is slow; there’s no doubt about that. It takes time to create various JVMs, to download files/check for updates, etc. Also, every build runs through the same build steps even if some of them are not needed.
  6. Builds can succeed or fail on different machines for unobvious reasons – typically, the problem is due to differing versions of SNAPSHOT dependencies being installed in the local repository cache. It can also be due to using different versions of Maven plugins.

There’s actually quite a lot more that could be improved, but those are probably my main gripes. When listing them like this, I’m surprised to note that despite all these issues, I still think Maven is the best Java build tool out there. I really do think it is the best, but there’s no doubt that there’s plenty of room to improve things. So I’ve found myself thinking about how I would go about building a better Maven. I am not sure if I’ll be able to actually find the time to implement it, but it is fascinating enough that I can’t let go of the idea. Here’s what I would consider a useful set of objectives for an improved Maven, in order of importance:

  1. Perfect interoperability with the existing Maven artifact management repository infrastructure. There is so much power and value in being able to with near-zero effort get access to pretty much any open source project in the Java world that it is vital to be able to tap into that. Note that the value isn’t just in managing dependencies in a similar way to how Maven does it, but actually reusing the currently available artifacts and repositories.
  2. Simplified troubleshooting. More and better consistency checks of various kinds and at earlier stages of the build. Better and more to the point reporting of problems. Great frameworks tend to make this a key part of the architecture from the get-go rather than add it on as an afterthought.
  3. A pluggable architecture that makes it easy to add custom build actions. This is one of Maven’s great success points so a new framework has to be at least as good. I think it could and should be even easier than Maven makes it.
  4. Encouraging but not enforcing standardised builds. This means sticking to the idea of “convention over configuration”. It also means that defining your build should be a “mostly declarative” thing, not an imperative thing. You should say “I want a JAR”, not “zip up the class files in directory X”. Programming your builds is sometimes a necessary evil so it must be possible, but it should be discouraged as it is a slippery slope that leads to non-standardised builds, which in turn means making it harder for anybody coming new to a project to get going with it.
  5. Great integration with IDEs, or at least support for the ability to create great integration with IDEs. This is a necessary part of giving programmers a workflow that never forces them out of the zone.
  6. Less verbose configuration. Not a show-stopper in my opinion, but definitely a good thing to improve.
  7. EDIT: While writing further posts on this topic, I’ve realised that there is one more thing that I consider very important: improving performance. Waiting for builds is a productivity-killing drag.

It’s good to specify what you want to do, but in some ways, it’s even better to specify things you’re not trying to achieve either because they’re not relevant or because you find them counter-productive. That gives a different kind of clarity. So here’s a couple of non-objectives:

  1. Using the same artifact management mechanism for build tool plugins as for project dependencies, the way Maven does. While there is some elegance to this idea, it also comes with a host of difficulties, unreproducible builds being the main one. Earlier versions of Maven actively updated plugin versions most or all the time, whereas Maven 3 now issues warnings if you haven’t specified the plugin versions for your build.
  2. Reimplementing all the features provided by Maven plugins. Obviously, trying to out-feature something so feature-rich as Maven would be impossible and limit the likelihood of success hugely. So one thing to do is to select a subset of build steps that represent the most common and/or most different things that are typically done in a build and then see how well the framework deals with that.
  3. Being compatible with Maven plugins. In a way, it would be great for a new build tool to be able to use any existing Maven plugin. But being able to do that would limit the architectural options and increase the complexity of the new architecture to the point of making it unlikely to succeed.
  4. Reproducing the ‘project information’ as a core part of the new tool. Producing project information was one of the core goals of Maven when it was first created. I personally find that less than useful, and therefore not worth making into a core part of a better Maven. It should of course be easy to create a plugin that does this, but it doesn’t have to be a core feature.

I’ve got some ideas for how to build a build tool that meets or is likely to meet most of those objectives. But this post is already more than long enough, and I’d anyway like to stop here to ask for some feedback. Any opinions on the strengths, weaknesses and objectives outlined here?
