sixlegs.com / blog / java / optimizing-xml-parsing.html

Root Beer Logo Root Beer

Chris Nokleberg's Fizzy Weblog

October 2003
Su M Tu W Th F Sa
      1 2 3 4
6 7 8 9 10 11
12 13 14 15 17 18
19 20 21 22 23 24 25
26 27 29 30 31
Previous  |  Next  |  More...
#  Optimizing XML parsing with code generation

I was looking at some parsers using SAX recently, and was struck by the huge chains of if/elseif/else within the startElement and endElement callbacks. For example:

public void startElement(String uri, String name /* ... */) {
  if (name.equals("foo")) {
     // process foo element
  } else if (name.equals("bar")) {
     // process bar element
  } else if (name.equals("baz")) {
    // etc.
  }
}

One way this could be improved is to use the string-interning SAX feature, which instructs the XML parser to intern all strings it hands back. This allows one to rewrite the above method as:

public void startElement(String uri, String name /* ... */) {
  if (name == "foo") {
     // process foo element
  } else if (name == "bar") {
     // process bar element
  } else if (name == "baz") {
    // etc.
  }
}

Unfortunately, Xerces does not support this feature! [Emoticon] For my personal projects I try to use gnujaxp, which does, but applications that need to work with an arbitrary SAX parser cannot rely on interning.

But even with lame parsers we can now use the string-switching capabilities of CGLIB to gain back most of the speed of interning. I have checked net.sf.cglib.util.StringSwitcher into CVS. The API is very simple:

private static final int FOO = 0, BAR = 1, BAZ = 2;
private static final StringSwitcher SWITCHER =
  StringSwitcher.create(new String[]{ "foo", "bar", "baz" }, 
                        new int[]{ FOO, BAR, BAZ },
                        true);

Of course the string and integer arrays can be as long as you want (it would not be worth it for only three). The third parameter is a boolean that indicates whether the switcher will only be called with strings from the argument array. For validated documents this is safe and gives a 3X speedup.

Then the startElement method becomes:

public void startElement(String uri, String name, ...) {
  switch (SWITCHER.intValue(name)) {
  case FOO:
     // process foo element
     break;
  case BAR:
     // process bar element
     break;
  case BAZ:
     // etc.
  }
}

One nice thing about StringSwitcher is that the speed is O(log N) where N is the number of elements. Both of the other methods are O(N), so at some point switching will even be faster than interning. Even for a modest number of elements the overhead of StringSwitcher is respectably small. This is for 1000 repetitions over 10000 randomly chosen elements, from a vocabulary of 23 distinct elements:

MethodTime(ms)
equals3319
intern351
switch547

[Powered By FreeMarker]  [Valid Atom 1.0]  [Weblog Commenting by HaloScan.com]