
Exceptional iterators using Guava

Some years ago, I came across a little gem of a blog post about a class called AbstractIterator that I hadn’t heard of before. I immediately loved it and started using it, and in general began using the APIs defined in the JDK collections classes a lot more frequently. I’m still doing that today, but for the first time I came across a situation where AbstractIterator (nowadays in Guava, not Google Collections, of course) let me down.

In the system I’m currently working on, we need to process tens of thousands of feeds that are generated ‘a little randomly’, often by people without deep technical skills. This means we cannot know for sure that entries in the feeds will be consistent; we may be able to parse the first four entries, but not the fifth, and so on. The quantity of feeds, combined with the fact that we need to be forgiving of less-than-perfect input means that we need to be able to continue ingesting data even if a few entries are bad.

Ideally, we wanted to write code similar to this:


Iterator<FeedEntry> iterator = feedParser.iterator();

while (iterator.hasNext()) {
  try {
    FeedEntry entry = iterator.next();

    // ... process the entry
  }
  catch (BadEntryException e) {
    // .. track the error and continue processing
  }
}
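
For this loop to compile, BadEntryException has to be an unchecked exception, since Iterator.next() is not declared to throw anything checked. The definition below is an assumption (the exception isn’t shown in the original code), but it would have to look something like this:

// Assumed definition: the exception must extend RuntimeException,
// because Iterator.next() cannot declare checked exceptions.
public class BadEntryException extends RuntimeException {
  public BadEntryException(Throwable cause) {
    super(cause);
  }
}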

Now, we also wanted to use AbstractIterator as the base class for the parser to use, maybe something like:


class FeedEntryIterator extends AbstractIterator<FeedEntry> {
  @Override
  protected FeedEntry computeNext() {
    if (noMoreInput()) {
      return endOfData();
    }

    try {
      return parseInput();
    }
    catch (ParseException e) {
      throw new BadEntryException(e);
    }
  }
}

This, however, doesn’t work for two reasons:

  1. The BadEntryException will be thrown during the execution of hasNext(), since AbstractIterator calls computeNext() to check whether any more data is available, storing the result in an intermediate member field until next() is called.
  2. If computeNext() throws an exception, the AbstractIterator is left in the FAILED state, from which it cannot recover. The sketch below shows why.
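
To make both failure modes concrete, here is a simplified paraphrase of what happens inside AbstractIterator.hasNext(), written from memory rather than copied from the Guava source:

// Simplified paraphrase of AbstractIterator's internals; see the
// actual Guava source for the full details.
public final boolean hasNext() {
  checkState(state != State.FAILED);  // fails permanently after an exception
  if (state == State.DONE) {
    return false;
  }
  if (state == State.READY) {
    return true;
  }

  state = State.FAILED;  // 'temporary pessimism': if computeNext() throws,
  next = computeNext();  // the state stays FAILED for good
  if (state == State.DONE) {
    return false;        // computeNext() called endOfData()
  }
  state = State.READY;
  return true;
}

Both problems are visible here: the call to computeNext() happens inside hasNext(), and the FAILED state is set before the call and never cleared if an exception escapes.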

One workaround I considered was delegation. Something like this works for many scenarios:


public class ForwardingTransformingIterator<F, T> implements Iterator<T> {
    private final Iterator<F> source;
    private final Function<F, T> transformer;

    public ForwardingTransformingIterator(Iterator<F> source, Function<F, T> transformer) {
        this.source = source;
        this.transformer = transformer;
    }

    @Override
    public boolean hasNext() {
        return source.hasNext();
    }

    @Override
    public T next() {
        return transformer.apply(source.next());
    }
}

The transformation function could then safely throw exceptions for entries it fails to parse. There’s even a shorthand for the above code in Guava: Iterators.transform(). The only issue is that the delegate iterator needs to return chunks of data of the same size as what the parser/transform needs, so if the parser needs to work like a SAX parser, for instance, you’re in trouble.

When I had gotten that far, I figured the only solution was to modify the AbstractIterator class so that it can deal with exceptions directly. The source is a little too long to include in an already source-heavy post, but there’s a gist here. The essence of the change is to make the base class store exceptions of a particular type that are thrown by computeNext(), and re-throw them instead of returning a value when next() is called. It kind of works (we’re using this implementation right now), but it would be nice to find a better solution.
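
To give a flavour of the change, here is a simplified, standalone sketch of the idea (the names are mine and the details differ from the gist):

import java.util.Iterator;
import java.util.NoSuchElementException;

// Sketch: exceptions of a particular type thrown by computeNext() are
// cached and re-thrown from next(), instead of escaping from hasNext()
// and permanently breaking the iterator.
public abstract class ExceptionTolerantIterator<T> implements Iterator<T> {
  private T next;
  private BadEntryException pending;
  private boolean ready;
  private boolean done;

  // Return the next element, call endOfData() when input is exhausted,
  // or throw BadEntryException for an entry that cannot be parsed.
  protected abstract T computeNext();

  protected final T endOfData() {
    done = true;
    return null;
  }

  @Override
  public boolean hasNext() {
    if (done) {
      return false;
    }
    if (ready || pending != null) {
      return true;
    }
    try {
      next = computeNext();
      ready = !done;
    } catch (BadEntryException e) {
      pending = e;  // cache the exception; next() will re-throw it
    }
    return ready || pending != null;
  }

  @Override
  public T next() {
    if (!hasNext()) {
      throw new NoSuchElementException();
    }
    if (pending != null) {
      BadEntryException e = pending;
      pending = null;  // unlike FAILED, the iterator stays usable
      throw e;
    }
    ready = false;
    T value = next;
    next = null;
    return value;
  }

  @Override
  public void remove() {
    throw new UnsupportedOperationException();
  }
}

With something like this in place, the try/catch loop from the top of the post works as intended: a bad entry surfaces as an exception from next(), and iteration can continue with the following entry.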

I’ve created a Guava issue for this, so if you think this would be useful, go ahead and vote for it. And if you have suggestions for how to improve the code, feel free to make them here or at code.google.com!


Hardcode Behaviour!

Some years ago, when I was learning Git, I watched a presentation by Linus Torvalds, and in passing he made one of those points that fit with things you’ve been thinking but haven’t yet verbalised, and so don’t fully understand. He was talking about how he had obsessed over the performance of some operation in Git (merges, if I remember right), because with gradual improvements in performance there’s a quantum leap where the pain of doing something goes away. And when it’s not painful, you can do it often, and that can completely change the way you work, opening up avenues that used to be closed.

This thought can be generalised to a lot of areas – gradual improvements that suddenly lead to a change in what you can do. One such scenario that I’ve been thinking about a little lately is the pattern of using databases (as in something external to the code – XML files, properties files, whatever) for configuration data. The rationale for that is to make change easier and quicker, and the pattern comes out of a situation where the next release is some months away, but a database change can be done in minutes. These days, though, that situation should no longer apply when building web services. If you do things right, it should be possible to do the next release within minutes or at least hours, and this means that you can hardcode some of your configuration data instead of keeping it in a database.

Hardcoding configurable options gives the following benefits:

  • Consistency across environments – with databases, there’s a risk (or a guarantee, more or less) that environments will differ. This will lead to surprises and/or wasted effort when behaviour changes from one environment to another.
  • Better testability – you can more easily prove that your application does what it should do if its behaviour is entirely defined by the code rather than by some external data.
  • Simpler ‘physical form’ of the system – a single deployable unit rather than one code unit and a database unit. Among other things, this leads to easier deployments – no, or at least less frequent, need for database updates.

Of course, this idea doesn’t apply to all kinds of configuration options. It’s useful primarily for those that change the system behaviour – feature toggles, business rules for data normalisation, URL rewrite rules, that sort of thing. Data such as the addresses of downstream services, databases (!), etc, of course needs to be configurable on a per-environment basis rather than hardwired into the build.
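
To illustrate the split (the class and names below are invented for the example), behaviour lives in the code while per-environment addresses stay external:

// Illustration only: the names are invented for this example.
public class IngestionConfig {
  // Behaviour: hardcoded, versioned with the code, identical in every
  // environment, and exercised directly by the test suite.
  public static final boolean NORMALISE_TITLES = true;
  public static final int MAX_BAD_ENTRIES_PER_FEED = 100;

  // Environment: differs per deployment, so it comes from outside the
  // build (a system property here, but any external source works).
  public static String feedServiceUrl() {
    return System.getProperty("feed.service.url");
  }
}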

This is yet another (though pretty minor) reason to work towards making frequent releases easy and painless: it opens the door to a change in architecture and process that lets you spend less time doing regression testing and also speeds up the deployments themselves.
