Java Tip 128: Create a quick-and-dirty XML parser

Parse valid XML using minimal code

XML is a popular data format for several reasons: it is human readable, self-describing, and portable. Unfortunately, many Java-based XML parsers are very large; for example, Sun Microsystems’ jaxp.jar and parser.jar libraries are 1.4 MB each. If you are running with limited memory (for example, in a J2ME (Java 2 Platform, Micro Edition) environment), or bandwidth is at a premium (for example, in an applet), using those large parsers might not be a viable solution.

Those libraries’ large size is partly due to having a lot of functionality, perhaps more than you require. They validate XML against DTDs (document type definitions), possibly schemas, and more. However, you might already know that your application will receive valid XML. You might also have decided that you need only the UTF-8 character set. In that case, all you really want is event-based processing of XML elements and translation of the standard XML entities: a nonvalidating parser.

Note: You can download this article’s source code in Resources.

Why not just use SAX?

You could implement the SAX (Simple API for XML) interfaces with limited functionality, throwing an exception named NotImplemented whenever you encounter something you don't need.

Undoubtedly, you could develop something much smaller than the 1.4 MB jaxp.jar/parser.jar libraries. But instead, you can cut down the code size even more by defining your own classes. In fact, the package we construct here will be considerably smaller than the jar file containing the SAX interface definitions.

Our quick-and-dirty parser is event-based like the SAX parser. Also like the SAX parser, it lets you implement an interface to catch and process events corresponding to attributes and start/end element tags. Hopefully, those of you who have used SAX will find this parser familiar.

Limit XML functionality

Many people want XML’s simple, self-describing textual data format. They want to easily pick out elements, attributes and their values, and elements’ textual content. With that in mind, let’s consider what functionality we need to preserve.

Our simple parsing package has just one class, QDParser, and one interface, DocHandler. The QDParser itself has one public static method, parse(DocHandler,Reader), which we will implement as a finite state machine.
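To make the finite-state-machine idea concrete, here is a minimal sketch. It is not the article's QDParser: MiniHandler and MiniParser are hypothetical names, and the machine handles only start tags, end tags, and text (no attributes, comments, CDATA, or entities). It only illustrates how a parser can step through states one character at a time and fire events.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical callback interface, loosely modeled on the DocHandler idea.
interface MiniHandler {
    void startElement(String tag);
    void endElement(String tag);
    void text(String str);
}

// Hypothetical, heavily simplified state machine (not the real QDParser).
class MiniParser {
    private static final int TEXT = 0, OPEN = 1, START_TAG = 2, END_TAG = 3;

    static void parse(MiniHandler h, Reader r) throws IOException {
        int state = TEXT;
        StringBuilder buf = new StringBuilder();
        int c;
        while ((c = r.read()) != -1) {
            char ch = (char) c;
            switch (state) {
                case TEXT:
                    if (ch == '<') {
                        // Flush any pending text before entering a tag.
                        if (buf.length() > 0) { h.text(buf.toString()); buf.setLength(0); }
                        state = OPEN;
                    } else {
                        buf.append(ch);
                    }
                    break;
                case OPEN:
                    // The first character after '<' decides the tag kind.
                    if (ch == '/') { state = END_TAG; }
                    else { buf.append(ch); state = START_TAG; }
                    break;
                case START_TAG:
                    if (ch == '>') { h.startElement(buf.toString()); buf.setLength(0); state = TEXT; }
                    else { buf.append(ch); }
                    break;
                case END_TAG:
                    if (ch == '>') { h.endElement(buf.toString()); buf.setLength(0); state = TEXT; }
                    else { buf.append(ch); }
                    break;
            }
        }
        // Flush trailing text, if any.
        if (state == TEXT && buf.length() > 0) { h.text(buf.toString()); }
    }

    public static void main(String[] args) throws IOException {
        parse(new MiniHandler() {
            public void startElement(String tag) { System.out.println("start: " + tag); }
            public void endElement(String tag) { System.out.println("end: " + tag); }
            public void text(String str) { System.out.println("text: " + str); }
        }, new StringReader("<greeting>hello</greeting>"));
    }
}
```

A real parser would add states for attributes, quoted values, comments, and CDATA sections, but the loop-and-switch structure stays the same.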

Our limited functionality parser treats the DTD <!DOCTYPE> and processing instructions <?xml version="1.0"?> simply as comments, so it won’t be confused by their presence nor use their content.

Because we won’t process DOCTYPE, our parser cannot read custom entity definitions. We will have only the standard ones available: &amp;, &lt;, &gt;, &apos;, and &quot;. If this is a problem, you can insert code to expand custom definitions, as the source code shows. Alternatively, you could preprocess the document, replacing custom entity references with their expanded text before handing the document to the QDParser.
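As an illustration of standard entity translation, here is a minimal sketch. EntitySketch and expand are hypothetical names, not taken from the article's source code, and the sketch throws an unchecked IllegalArgumentException for brevity where the real parser is described as throwing an Exception. It covers the five predefined entities plus decimal and hexadecimal numeric references.

```java
import java.util.Hashtable;

// Hypothetical helper, not part of the article's QDParser.
class EntitySketch {
    // The five entities predefined by the XML specification.
    private static final Hashtable<String, String> ENTITIES = new Hashtable<String, String>();
    static {
        ENTITIES.put("amp", "&");
        ENTITIES.put("lt", "<");
        ENTITIES.put("gt", ">");
        ENTITIES.put("apos", "'");
        ENTITIES.put("quot", "\"");
        // Custom entity definitions could be registered here.
    }

    // Replaces each &name; or &#NNN; / &#xHHH; reference with its text.
    static String expand(String s) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            char ch = s.charAt(i);
            if (ch != '&') {
                out.append(ch);
                i++;
                continue;
            }
            int semi = s.indexOf(';', i);
            if (semi < 0) {
                throw new IllegalArgumentException("Unterminated entity at index " + i);
            }
            String name = s.substring(i + 1, semi);
            if (name.startsWith("#x")) {
                out.appendCodePoint(Integer.parseInt(name.substring(2), 16));
            } else if (name.startsWith("#")) {
                out.appendCodePoint(Integer.parseInt(name.substring(1)));
            } else {
                String value = ENTITIES.get(name);
                if (value == null) {
                    throw new IllegalArgumentException("Unknown entity: &" + name + ";");
                }
                out.append(value);
            }
            i = semi + 1;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(expand("1 &lt; 2 &amp;&amp; 3 &gt; 2"));
    }
}
```

Adding a custom entity under this design is a single extra put call in the static initializer.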

Our parser also cannot support conditional sections; for example, <![INCLUDE[ ... ]]> or <![IGNORE[ ... ]]>. Without the ability to define custom entity definitions in DOCTYPE, we don’t really need this functionality anyway. We could process such sections, if any, before the data is sent to our limited-space application.

Because we won’t process any attribute declarations, the XML specification requires that we consider all attribute types to be CDATA. Thus, we can simply use java.util.Hashtable instead of org.xml.sax.AttributeList to hold an element’s attribute list. We have only name/value information to use in Hashtable, but we don’t need a getType() method because it would always return CDATA anyway.
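To show what holding attributes as plain name/value pairs can look like, here is a minimal sketch. AttrSketch and parseAttributes are hypothetical names, and the input is assumed to be the part of a start tag that follows the element name. It accepts values in single or double quotes and does no entity expansion.

```java
import java.util.Hashtable;

// Hypothetical helper, not part of the article's QDParser.
class AttrSketch {
    // Parses text such as: a="1" b='two' into a name/value table.
    static Hashtable<String, String> parseAttributes(String s) {
        Hashtable<String, String> attrs = new Hashtable<String, String>();
        int i = 0;
        int n = s.length();
        while (true) {
            // Skip white space between attributes.
            while (i < n && Character.isWhitespace(s.charAt(i))) i++;
            if (i >= n) break;

            int eq = s.indexOf('=', i);
            if (eq < 0) {
                throw new IllegalArgumentException("Missing '=' after index " + i);
            }
            String name = s.substring(i, eq).trim();

            // Find the opening quote, which may be ' or ".
            int q = eq + 1;
            while (q < n && Character.isWhitespace(s.charAt(q))) q++;
            if (q >= n || (s.charAt(q) != '"' && s.charAt(q) != '\'')) {
                throw new IllegalArgumentException("Attribute value must be quoted: " + name);
            }
            char quote = s.charAt(q);

            // The value runs up to the matching closing quote.
            int end = s.indexOf(quote, q + 1);
            if (end < 0) {
                throw new IllegalArgumentException("Unterminated value for " + name);
            }
            attrs.put(name, s.substring(q + 1, end));
            i = end + 1;
        }
        return attrs;
    }

    public static void main(String[] args) {
        System.out.println(parseAttributes(" host=\"localhost\" port='8080'"));
    }
}
```

Because every value is treated as CDATA, a lookup is just attrs.get(name); there is no type information to consult.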

The lack of attribute declarations has other consequences as well. For example, the parser won’t supply default attribute values. In addition, we can’t automatically normalize white space using an NMTOKENS declaration. However, we could handle both issues when preparing our XML document, so the extra programming could be excluded from the application using the parser.

In fact, all the missing functionality can be compensated for by preparing the document appropriately. You can offload all the work associated with the missing features (if you want them) from the quick-and-dirty parser to the document preparation step.

Parser functionality

Enough about what the parser cannot do. What can it do?

It recognizes all the elements’ start tags and end tags
It lists attributes, where attribute values can be enclosed in single or double quotes
It recognizes the <![CDATA[ ... ]]> construct
It recognizes the standard entities: &amp;, &lt;, &gt;, &quot;, and &apos;, as well as numeric entities
It maps line endings of \r\n and \r to \n on input, in accordance with the XML Specification, Section 2.11
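The end-of-line rule in the last item (a carriage return followed by a line feed, or a lone carriage return, becomes a single line feed) can be sketched as follows. NewlineSketch and normalize are hypothetical names. The sketch works on a whole String for clarity, whereas a streaming parser would more likely apply the same rule character by character while reading.

```java
// Hypothetical helper, not part of the article's QDParser.
class NewlineSketch {
    // Maps "\r\n" and a lone "\r" to "\n"; leaves "\n" unchanged.
    static String normalize(String in) {
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); i++) {
            char ch = in.charAt(i);
            if (ch == '\r') {
                out.append('\n');
                // Swallow the line feed of a "\r\n" pair.
                if (i + 1 < in.length() && in.charAt(i + 1) == '\n') {
                    i++;
                }
            } else {
                out.append(ch);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String normalized = normalize("one\r\ntwo\rthree\n");
        System.out.println(normalized.indexOf('\r') < 0);
    }
}
```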

The parser does only minimal error checking and throws an Exception if it encounters unexpected syntax, such as unknown entities. Again, however, this parser does not validate; it assumes the XML document it receives is valid.

How to use this package

Using the quick-and-dirty XML parser is simple. First, implement the DocHandler interface. Then parse a file, here one named config.xml, in two lines:

  DocHandler doc = new MyDocHandler();
  QDParser.parse(doc,new FileReader("config.xml"));

The source code includes two examples that provide full DocHandler implementations. The first DocHandler, called Reporter, simply reports all events to System.out as it reads them. You can test the Reporter with the sample XML file (config.xml).
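For readers without the download at hand, here is a rough sketch of what a Reporter-style handler might look like. The interface below is an assumption: this article does not list DocHandler's methods, so SketchDocHandler is a hypothetical stand-in with one callback per event type, and ReporterSketch is likewise a hypothetical name. Check the downloaded source for the real signatures.

```java
import java.io.PrintStream;
import java.util.Enumeration;
import java.util.Hashtable;

// ASSUMED shape of a DocHandler-like interface; the real one may differ.
interface SketchDocHandler {
    void startDocument() throws Exception;
    void endDocument() throws Exception;
    void startElement(String tag, Hashtable<String, String> attrs) throws Exception;
    void endElement(String tag) throws Exception;
    void text(String str) throws Exception;
}

// Hypothetical handler that prints every event it receives.
class ReporterSketch implements SketchDocHandler {
    private final PrintStream out;

    ReporterSketch(PrintStream out) {
        this.out = out;
    }

    public void startDocument() {
        out.println("start document");
    }

    public void endDocument() {
        out.println("end document");
    }

    public void startElement(String tag, Hashtable<String, String> attrs) {
        out.println("start element: " + tag);
        // Attributes arrive as plain name/value pairs.
        Enumeration<String> names = attrs.keys();
        while (names.hasMoreElements()) {
            String name = names.nextElement();
            out.println("  attribute: " + name + "=" + attrs.get(name));
        }
    }

    public void endElement(String tag) {
        out.println("end element: " + tag);
    }

    public void text(String str) {
        out.println("text: " + str);
    }
}
```

Constructing the handler with System.out gives the report-to-console behavior described above; passing a different PrintStream makes the handler easy to test.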

The second and more complex example, Conf, updates fields on an existing data structure that resides in memory. Conf uses the java.lang.reflect package to locate fields and objects described in config.xml. If you run this program, it will print diagnostic information telling you what objects it is updating and how. It prints error messages if the config file asks it to update nonexistent fields.
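The reflection technique can be sketched briefly. This is not the article's Conf class: ConfSketch, Settings, and setField are hypothetical names, and the sketch handles only public int, boolean, and String fields. It shows how java.lang.reflect can locate a field by name, convert a string value from a config file, and report an error when the field does not exist.

```java
import java.lang.reflect.Field;

// Hypothetical illustration of reflection-based configuration.
class ConfSketch {
    // Hypothetical in-memory structure to be updated.
    static class Settings {
        public int port = 80;
        public String host = "localhost";
        public boolean verbose = false;
    }

    // Sets target.<name> from a string value; returns false on any problem.
    static boolean setField(Object target, String name, String value) {
        try {
            Field f = target.getClass().getField(name); // public fields only
            Class<?> type = f.getType();
            if (type == int.class) {
                f.setInt(target, Integer.parseInt(value));
            } else if (type == boolean.class) {
                f.setBoolean(target, Boolean.parseBoolean(value));
            } else if (type == String.class) {
                f.set(target, value);
            } else {
                System.err.println("Unsupported field type: " + type.getName());
                return false;
            }
            System.out.println("Set " + name + " = " + value);
            return true;
        } catch (NoSuchFieldException e) {
            System.err.println("No such field: " + name);
            return false;
        } catch (IllegalAccessException e) {
            System.err.println("Cannot access field: " + name);
            return false;
        } catch (NumberFormatException e) {
            System.err.println("Bad number for " + name + ": " + value);
            return false;
        }
    }

    public static void main(String[] args) {
        Settings s = new Settings();
        setField(s, "port", "8080");
        setField(s, "missing", "x");
    }
}
```

A handler's startElement callback could call a method like setField once per attribute, using the attribute name as the field name.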

Modify this package

You’ll likely want to modify this package for your own application. You might add custom entity definitions—line 180 in QDParser.java contains an “Insert custom entity definitions here” comment.

You could also add to the finite state machine’s functionality, restoring functionality I have excluded here. If so, the source code’s small size should make this task relatively easy.

Keep it small

The QDParser class occupies around 3 KB after you compile and pack it into a jar file. The source code itself, with comments, is just over 300 lines. This should be small enough for most space-constrained applications, while retaining enough of the XML specification to offer most of its useful features.

Steven Brandt has a PhD in computational astrophysics and is the owner of Stevesoft, a company that sells regular expression software for Java.

Source: www.infoworld.com