Chris Nokleberg's Fizzy Weblog
Over the last few months I've been investigating adding support for the MS Office 2007 file formats (aka Office Open XML) to our enterprise products. Compared to supporting the old binary file formats it is almost trivial, but there is still a fair bit of code involved!
The first step is to support the underlying packaging format, which is known as the Open Packaging Conventions. It is essentially a ZIP file with some XML files which describe the relationships between the various entries, their content types, etc. After playing with it for a little while it seems pretty well thought out, although of course I am biased because of a scarring exposure to OLE internals. The spec is available in a number of places, but the version I prefer is Part 2 of the ECMA TC45 Final Draft, available here.
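To make the "ZIP file with some XML files" idea concrete, here is a minimal sketch that uses only java.util.zip to list the parts of a package and print its content-types part. The class OpcPeek and its helper methods are hypothetical names of mine, not part of any OPC library. The sample package is a tiny hand-made stand-in: the entry names [Content_Types].xml and _rels/.rels follow the OPC conventions, but the XML bodies are simplified placeholders, not a valid Office document.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper class for poking at an OPC package as a plain ZIP.
class OpcPeek {

    // Returns the names of all ZIP entries, in the order they appear in the stream.
    static List<String> listParts(byte[] pkg) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(pkg))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    // Returns the UTF-8 text of the named entry, or null if there is no such entry.
    static String readPart(byte[] pkg, String name) {
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(pkg))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!entry.getName().equals(name)) {
                    continue;
                }
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buf = new byte[4096];
                int n;
                while ((n = zip.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
                return new String(out.toByteArray(), StandardCharsets.UTF_8);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return null;
    }

    // Builds a tiny in-memory package with placeholder XML (an assumption for the demo).
    static byte[] samplePackage() {
        String[][] parts = {
            {"[Content_Types].xml",
             "<Types xmlns='http://schemas.openxmlformats.org/package/2006/content-types'>"
                 + "<Default Extension='xml' ContentType='application/xml'/></Types>"},
            {"_rels/.rels", "<Relationships/>"},
            {"word/document.xml", "<document/>"},
        };
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (String[] part : parts) {
                zip.putNextEntry(new ZipEntry(part[0]));
                zip.write(part[1].getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] pkg = samplePackage();
        System.out.println(listParts(pkg));
        System.out.println(readPart(pkg, "[Content_Types].xml"));
    }
}
```

A real OPC library would go on to parse the content types and relationship parts into objects; this only shows that the container itself is ordinary ZIP.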
To manipulate OPC documents, Microsoft provides the .NET System.IO.Packaging API. When I first looked around there was no Java API, so I created one for internal use, patterned after the .NET version. In the meantime Julien Chable posted what looks like a complete Java-based API, along with an article describing its use, on the http://openxmldeveloper.org/ site. I'm not sure if the code is open-source, though, because I can't find a license listed. Another useful tool is diffopc, which shows a graphical diff between the contents of two ZIP files (it can also be used for other ZIP-based formats like ODF).
One shared drawback of Julien's library and mine is that neither can update an existing document in place. Instead, after you have made changes, you need to write out a new file. The primary reason for this (at least in my library) is that the Java ZIP APIs do not support in-place updates. In fact, that is currently the #20 most-requested enhancement (see the full list). If you feel like helping me out, adding a vote to this bug couldn't hurt.
Ant also has some ZIP code, but it is basically identical to the Java API. However, it is licensed under ASL 2.0 and could be the basis for a more full-featured library. There is also the TrueZIP project, which is very full-featured, almost to a fault: it supports a bunch of other non-ZIP formats and weighs in at over 200K, but it does not yet support some necessary extensions (e.g. zip64). In the meantime I've decided to just stick with the java.util.zip API and live with the limitations.
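For anyone wondering what living with the limitation looks like, here is a rough sketch of the write-a-new-file workaround using plain java.util.zip: stream every entry from the old archive into a new one and swap in fresh bytes for the entry being changed. This is not the code from my library; ZipRewrite, replaceEntry, zipOf, and read are hypothetical names, and the sketch works on byte arrays only to stay self-contained. With real files you would presumably write to a temporary file and rename it over the original.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper: "updates" a ZIP by copying it into a brand-new archive.
class ZipRewrite {

    // Returns a new archive where the named entry holds the replacement bytes
    // and every other entry is copied unchanged. The original is not modified.
    static byte[] replaceEntry(byte[] original, String name, byte[] replacement) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipInputStream zin = new ZipInputStream(new ByteArrayInputStream(original));
             ZipOutputStream zout = new ZipOutputStream(out)) {
            byte[] buf = new byte[8192];
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                // Use a fresh ZipEntry so sizes and CRC are recomputed on write.
                zout.putNextEntry(new ZipEntry(entry.getName()));
                if (entry.getName().equals(name)) {
                    zout.write(replacement);
                } else {
                    int n;
                    while ((n = zin.read(buf)) != -1) {
                        zout.write(buf, 0, n);
                    }
                }
                zout.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Both streams are closed here, so the central directory has been written.
        return out.toByteArray();
    }

    // Demo helper: builds an archive from alternating name/content strings.
    static byte[] zipOf(String... namesAndContents) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zout = new ZipOutputStream(out)) {
            for (int i = 0; i + 1 < namesAndContents.length; i += 2) {
                zout.putNextEntry(new ZipEntry(namesAndContents[i]));
                zout.write(namesAndContents[i + 1].getBytes(StandardCharsets.UTF_8));
                zout.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Demo helper: returns the UTF-8 text of the named entry, or null if absent.
    static String read(byte[] zip, String name) {
        try (ZipInputStream zin = new ZipInputStream(new ByteArrayInputStream(zip))) {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                if (!entry.getName().equals(name)) {
                    continue;
                }
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buf = new byte[4096];
                int n;
                while ((n = zin.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
                return new String(out.toByteArray(), StandardCharsets.UTF_8);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return null;
    }

    public static void main(String[] args) {
        byte[] original = zipOf("word/document.xml", "<old/>", "_rels/.rels", "<Relationships/>");
        byte[] updated = replaceEntry(original, "word/document.xml",
                "<new/>".getBytes(StandardCharsets.UTF_8));
        System.out.println(read(updated, "word/document.xml"));
    }
}
```

The obvious cost is that every entry gets decompressed and recompressed even when only one part changed, which is exactly why in-place update support in the ZIP API would be nice.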
p.s. my Gmail spam folder is over 34000 now (> 1000/day), a personal record—hooray.