Archive for March, 2011

Bygg – Executing the Build

This is the third post about Bygg – I’ve been fortunate enough to find some time to do some actual implementation, and I now have a version of the tool that can do the following:

First, read a configuration of plugins to use during the build:

public class PluginConfiguration {
  public static Plugins plugins() {
    return new Plugins() {
      public List plugins() {
        return Arrays.asList(
          new ArtifactVersion(ProjectArtifacts.BYGG_TEST_PLUGIN, "1.0-SNAPSHOT"),
          new ArtifactVersion(ProjectArtifacts.GUICE, "2.0"));
      }
    };
  }
}

Second, use that set of plugins to compile a project configuration:

public class Configuration {
  public static ByggConfiguration configuration() {
    return new ByggConfiguration() {
      public TargetDAG getTargetDAG() {
        return TargetDAG.DEFAULT
           .add("plugin")                  // defines the target name when executing
           .executor(new ByggTestPlugin()) // indicates what executing the target means
           .requires("test")               // means it won't be run until after "test"
           .build();
      }
    };
  }
}

Third, actually execute that build – although in the current version, none of the target executors have an actual implementation, so all they do is create a file with their name and the current time stamp under the target directory. The default build graph that is implemented contains targets that pretend to assemble the classpaths (placeholder for downloading any necessary dependencies) for the main code and the test code, targets that compile the main and test code, a target that runs the tests, and a target that pretends to create a package. As the sample code above hints, I’ve got three projects: Bygg itself, a dummy plugin, and a test project whose build requires the test plugin to be compiled and executed.

Fourth – with no example – cleaning up the target directory. This is the only feature that is fully implemented, being of course a trivial one. On my machine, running a clean in a tiny test project is 4-5 times faster using Bygg than Maven (taking 0.4 to 0.5 seconds of user time as compared to more than 2 for Maven), so thus far, I’m on target with regard to performance improvements. A little side note on cleaning is that I’ve come to the conclusion that clean isn’t a target. You’re never ever going to want to deploy a ‘clean’, or use it for anything. It’s an optional step that might be run before any actual target. To clarify that distinction, you specify targets using their names as command line arguments, but cleaning using -c or –clean:


bygg.sh -c compile plugin

As is immediately obvious, there’s a lot of rough edges here. The ones I know I will want to fix are:

  • Using annotations (instead of naming conventions) and generics (for type safety) in the configuration classes – I’m compiling and loading the configuration files using a library called Janino, which has an API that I think is great, but which by default only supports Java 4. There’s a way around it, but it seems a little awkward, so I’m planning on stealing the API design and putting in a front for JavaCompiler instead.
  • Updating the returned classes (Plugins and ByggConfiguration), as today they only contain a single element. Either they should be removed, or perhaps they will need to become a little more powerful.
  • Changing the names of stuff – TargetDAG especially is not great.
  • There’s a lot of noise, much of which is due to Java as a language, but some of which can probably be removed. The Plugins example above is 11 lines long, but only 2 lines contain useful information – and that’s not counting import statements, etc. Of course, since the number of ‘noise lines’ is pretty much constant, with realistically large builds, the signal to noise ratio will improve. Even so, I’d like it to be better.
  • I’m pretty sure it’s a good idea to move away from saying “this target requires target X” to define the order of execution towards something more like “this target requires the compiled main sources”. But there may well be reasons why you would want to keep something like “requires” or “before” in there – for instance, you might want to generate a properties file with information collected from version control and CI system before packaging your artifact. Rather than changing the predefined ‘package’ target, you might want to just say ‘run this before package’ and leave the file sitting in the right place in the target directory. I’m not quite sure how best to deal with that case yet – there’s a bit more discussion of this a little later.

Anyway, all that should be done in the light of some better understanding of what is actually needed to build something. So before I continue with improving the API, I want to take a few more steps on the path of execution.

As I’ve mentioned in a previous post, a build configuration in Bygg is a DAG (Directed Acyclic Graph). A nice thing about that is that it opens up the possibility of executing independent paths on the DAG concurrently. Tim pointed out to me that that kind of concurrent execution is an established idea called Dataflow Concurrency. In Java terms, Dataflow Concurrency essentially boils down to communicating all shared mutable state via Futures (returned by Callables executing the various tasks). What’s interesting about the Bygg version of Dataflow Concurrency is that the ‘Dataflow Variables’ can and will be parameters of the Callables executing tasks, rather than being hard-coded as is typical in the Dataflow Concurrency examples I’ve seen. So the graph will exist as a data entity as opposed to being hardwired in the code. This means that deadlock detection is as simple as detecting cycles in the graph – and since there is a requirement that the build configuration must be a DAG, builds will be deadlock free. In general, I think the ability to easily visualise the exact shape of the DAG of a build is a very desirable thing in terms of making builds easily understandable, so that should probably be a priority when continuing to improve the build configuration API.

Another idea I had from the reference to dataflow programming is that the canonical example of dataflow programming is a spreadsheet, where an update in one cell trickles through into updates of other cells that contain formulas that refer to the first one. That example made me change my mind about how different targets should communicate their results to each other. Initially, I had been thinking that most of the data that needs to be passed on from one target to the next should be implicitly located in a well-defined location on disk. So the test compiler would leave the test classes in a known place where the test runner knows to look for them. But that means loading source files into memory to compile them, then writing the class files to disk, then loading them into memory again. That’s a lot of I/O, and I have the feeling that I/O is often one of the things that slows builds down the most. What if there would be a dataflow variable with the classes instead? I haven’t yet looked in detail at the JavaFileManager interface, but it seems to me that it would make it possible to add an in-memory layer in front of the file system (in fact, I think that kind of optimisation is a large part of the reason why it exists). So it could be a nice optimisation to make the compiler store files in memory for test runners, packagers, etc., to pick up without having to do I/O. There would probably have to be a target (or something else, maybe) that writes the class files to disk in parallel with the rest of the execution, since the class files are nice to have as an optimisation for the next build – only recompiling what is necessary. But that write doesn’t necessarily have to slow down the test execution. All that is of course optimisation, so the first version will just use a plain file system-based JavaFileManager implementation. Still, I think it is a good idea to only have a very limited number of targets that directly access the file system, in order to open up for that kind of optimisation. The remainder of the targets should not be aware of the exact structure of the target directory, and what data is stored there.

I’m hoping to soon be able to find some more time to try these ideas out in code. It’ll be interesting to see how hard it is to figure out a good way to combine abstracting away the ‘target’ directory with full freedom for plugins to add stuff to the build package and dataflow concurrency variables.

, , ,

Leave a comment