Programming XML in Java, Part 2

Experience the joy of SAX, LAX, and DTDs

If you read last month’s article, you already understand how you can use SAX (the Simple API for XML) to process XML documents. (If you haven’t read it yet, you may want to start there; see “Read the Whole Series!” below.) In that article, I explained how application writers implement the SAX DocumentHandler interface, whose methods take a specific action when a particular condition (such as the start of a tag) occurs during the parsing of an XML document. But what good is that function? Read on.

Programming XML in Java: Read the whole series!

You’ll also remember that an XML parser checks that the document is well formed (meaning, roughly, that all of the open and close tags match and don’t overlap in nonsensical ways). But even well-formed documents can contain meaningless data or have a senseless structure. How can such conditions be detected and reported?

This article answers both questions through an illustrative example. I’ll start with the latter question: once the document is parsed, how do you ensure that the XML your program is processing actually makes sense? Then I’ll demonstrate a small layer on top of SAX that I call LAX (the Lazy API for XML), which makes writing handlers for SAX events even easier. Finally, I’ll tie all of the themes together and demonstrate the technology’s usefulness with a small example that produces both formatted recipes and shopping lists from the same XML document.

Garbage in, garbage out

One thing you may have heard about XML is that it lets the system developer define custom tags. With a nonvalidating parser (discussed in Part 1 of this series), you certainly have that ability. You can make up any tag you want and, as long as you balance your open and close tags and don’t overlap them in absurd ways, the nonvalidating SAX parser will parse the document without any problems. For example, a nonvalidating SAX parser would correctly parse and fire events for the document in Listing 1.

Listing 1. A well-formed, meaningless document

001 <?xml version="1.0"?>
002 <Art CENTURY="20">
003   <Dada>
004     <Author CENTURY="18" NOMDEPLUME="Voltaire">
005       François-Marie Arouet
006     </Author>
007     <Tree SPECIES="Maple">
008       <Yes/>
009       <Book AUTHOR="Musashi, Miyamoto">
010         <Title LANG="English">The Book of Five Rings</Title>
011         <Title LANG="Nihongo">Go Rin No Sho</Title>
012         <Filter POLY="Chebyshev" POLES="2"/>
013         <Title LANG="Espanol">El Libro de Cinco Anillos</Title>
014         <Title LANG="Francais">Le Livre de Cinq Bagues</Title>
015       </Book>
016       <Bahrain FORMAT="MP3">
017         <Cathedral CITTA="Firenze">
018           <Nome>Santa Maria del Fiore</Nome>
019           <Architetto>Brunelleschi, Filippo (1377-1466)</Architetto>
020           <Ora FORMAT="DMY24">22032000134591</Ora>
021         </Cathedral>
022       </Bahrain>
023       <Phobias>
024         <Herbs NAME="Ma Huang"/>
025         <Appliance COLOR="Harvest Gold">Yuck</Appliance>
026       </Phobias>
027     </Tree>
028   </Dada>
029 </Art>

A nonvalidating SAX parser would produce a valid event stream for the document in Listing 1 because the input document is well formed. It’s really stupid input, but it is well formed. Every opening tag has a corresponding close tag, and the tags don’t overlap (meaning there are no combinations of tags like <A><B></A></B>). So a nonvalidating SAX parser will have no problem with Listing 1.
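If you’d like to see the line between well formed and not well formed for yourself, here is a minimal sketch (it is not part of the source archive for this article). It assumes the JAXP parser bundled with a current JDK and uses DefaultHandler, the SAX2 successor to HandlerBase; the class name WellFormedCheck and the sample strings are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: asks a nonvalidating parser whether a string is well formed.
class WellFormedCheck {

    static boolean isWellFormed(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(false); // well-formedness only, no DTD checks
            SAXParser parser = factory.newSAXParser();
            // DefaultHandler ignores every event; only a fatal error matters here
            parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXException ex) {
            return false; // the parser reported a fatal (well-formedness) error
        } catch (Exception ex) {
            throw new IllegalStateException("Parser setup or I/O failed", ex);
        }
    }

    public static void main(String[] args) {
        // Nonsense tags, but properly nested
        System.out.println(isWellFormed("<Tree><Yes/><Phobias>Yuck</Phobias></Tree>"));
        // Overlapping tags
        System.out.println(isWellFormed("<A><B></A></B>"));
    }
}
```

The first call should print true and the second false: the parser has no opinion about whether a Tree may contain Phobias, only about whether the tags nest.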

Unfortunately, if you write a program that, for example, summarizes museum collections, formats architectural information, or prints multilingual card catalogs for libraries, your program could read this really stupid XML and produce really stupid output, because it might pull out tags it recognizes (like <Dada>, <Cathedral>, or <Book>). As the saying goes, “Garbage in, garbage out.”

To minimize the chance that your program produces garbage, you should devise a way to detect and reject garbage in the input. Then, given meaningful input, you can focus on creating reasonable output.

Think of a document as having three levels of correctness: lexical, syntactic, and semantic. Lexical correctness is what I mean when I say “well formed”: the basic structure of the document is reasonable and correct, but nothing about the content of the tags is checked. Any tag can occur inside any other tag any number of times, any tag can take any attribute, and attributes can take on any value. So, Listing 1 is well formed, but it makes no sense, because there is no control over what tags and attributes appear in the structure, and where.

Syntactic correctness means that the document is not only well formed, but that it also contains certain tags, in certain combinations. An XML document can include a section, called a document type definition (DTD), that specifies the rules for syntactic correctness.

A DTD lets a system designer create a custom markup language, a dialect of XML. A DTD indicates which tags may (or must) occur inside other specified tags, what attributes a tag may have, the required order of the tags, and so on. A validating parser uses a DTD to check the document it is parsing for syntactic correctness. The parser reports errors and warnings for any problems it finds, so that any document that doesn’t conform to the DTD can be rejected. The application programmer can then write code assuming that the structure of the document is correct, because the parser already checked it.

So, for example, in Listing 1 a designer might write a DTD that defines a <Book> tag as containing only one or more <Title> tags. The parser would report the presence of the <Filter> tag in line 12 as an error, because the DTD doesn’t allow it.

A DTD is also an excellent way to specify the input to your program. An XML input document either corresponds to a particular DTD or it doesn’t. Your program can correctly process any input that conforms to a given DTD. A DTD also lets you test your application for correctness or completeness; if an input document conforms to the DTD, but your program doesn’t process it properly, then you have a bug or a missing feature.

XML parsers don’t provide much in the way of checking for semantic correctness. Semantic correctness means that the actual instance data is true for the purposes of the application. A validating parser could report an error when it finds a FORMAT attribute on a <Bahrain> tag (as occurs in line 16, Listing 1). But it’s a lot to ask any parser to check whether the Cathedral of Santa Maria del Fiore is in Bahrain or in Italy. Semantic correctness remains the domain of your application: it’s up to you to add meaning to the XML document you’ve defined. A validating XML parser and a DTD help to automate the detection of gross lexical and syntactic errors in the input to your program, allowing you to focus on the data’s meaning.

As a side note, the HTML used to create Web pages is specified in an SGML DTD, which is considerably more complex and powerful than an XML DTD. XML DTDs are essentially a subset of these SGML DTDs, with some minor notational differences. The HTML DTD clearly specifies what kind of input an HTML-processing program can accept. XHTML, an XML-compatible version of HTML that specifies an XML DTD for HTML, has just been released by the World Wide Web Consortium (W3C).

In the next section, I’ll create a DTD for a small XML dialect for describing recipes.

Parlez-vous DTD?

Two people generally can’t talk to one another unless they speak a mutually understood language. Likewise, two programs can’t communicate via XML unless the programs agree on the XML language they use. A DTD defines a set of rules for the allowable tags and attributes in an XML document, and the order and cardinality of the tags. Programs using the DTD must still agree on what the tags mean (semantics again), but a DTD defines the words (or, the tags) and the grammatical rules for a particular XML dialect.

Listing 2 shows a simple DTD for a tiny XML language I call Recipe XML.

Listing 2. The DTD for Recipe XML

001 <!ELEMENT Recipe (Name, Description?, Ingredients?, Instructions?)>
002
003 <!ELEMENT Name (#PCDATA)>
004
005 <!ELEMENT Description (#PCDATA)>
006
007 <!ELEMENT Ingredients (Ingredient)*>
008
009 <!ELEMENT Ingredient (Qty, Item)>
010 <!ATTLIST Ingredient
011   vegetarian CDATA "true">
012
013 <!ELEMENT Qty (#PCDATA)>
014 <!ATTLIST Qty
015   unit CDATA #IMPLIED>
016
017 <!ELEMENT Item (#PCDATA)>
018 <!ATTLIST Item
019   optional CDATA "0">
020
021 <!ELEMENT Instructions (Step)+>
022 <!ELEMENT Step (#PCDATA)>

The DTD in Listing 2 defines a complete, tiny language for transmitting recipes. Programs that use this DTD can count on the structure of conforming files to match the rules in the DTD.

I’ll go over this file, line by line:

001 <!ELEMENT Recipe (Name, Description?, Ingredients?, Instructions?)>

This line defines a tag using <!ELEMENT. The entire line from the opening <!ELEMENT to the closing > is called an element type declaration. The declaration says that a Recipe is composed of a Name, followed by the optional occurrence of a Description, Ingredients, and Instructions. The comma operator (,) separates the tags that the defined tag may contain, and fixes the order in which those tags must appear. The question mark operator (?) indicates that the item to its left is optional. Since Name is followed only by a comma, with no question mark, a Recipe must have precisely one Name. The parentheses are for grouping, and don’t appear in the input document.

Therefore, the sequence:

<Recipe><Name>Zabaglione</Name></Recipe>

is a valid Recipe, because it matches the DTD (that is, it consists of a <Name>, and the optional <Description>, <Ingredients>, and <Instructions> are simply absent). However:

<Recipe>
   <Description>Italian dessert</Description>
   <Name>Zabaglione</Name>
</Recipe>

is not a valid Recipe, because the Description comes before the Name.
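To check those two snippets against the DTD yourself, something like the following sketch should do. It wraps the DTD from Listing 2 in a <!DOCTYPE declaration at the top of each document and collects whatever validity errors the parser reports. The class and method names are hypothetical, and the sketch uses the SAX2 DefaultHandler rather than HandlerBase.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: validates a Recipe against the DTD from Listing 2.
class RecipeOrderCheck {

    // The declarations from Listing 2, wrapped in a DOCTYPE declaration
    static final String DOCTYPE =
          "<!DOCTYPE Recipe [\n"
        + "<!ELEMENT Recipe (Name, Description?, Ingredients?, Instructions?)>\n"
        + "<!ELEMENT Name (#PCDATA)>\n"
        + "<!ELEMENT Description (#PCDATA)>\n"
        + "<!ELEMENT Ingredients (Ingredient)*>\n"
        + "<!ELEMENT Ingredient (Qty, Item)>\n"
        + "<!ATTLIST Ingredient vegetarian CDATA \"true\">\n"
        + "<!ELEMENT Qty (#PCDATA)>\n"
        + "<!ATTLIST Qty unit CDATA #IMPLIED>\n"
        + "<!ELEMENT Item (#PCDATA)>\n"
        + "<!ATTLIST Item optional CDATA \"0\">\n"
        + "<!ELEMENT Instructions (Step)+>\n"
        + "<!ELEMENT Step (#PCDATA)>\n"
        + "]>\n";

    // Returns the validity errors the parser reports for one Recipe.
    static List<String> validityErrors(String recipe) {
        final List<String> errors = new ArrayList<>();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(true); // check the document against its DTD
            factory.newSAXParser().parse(
                new InputSource(new StringReader(DOCTYPE + recipe)),
                new DefaultHandler() {
                    @Override
                    public void error(SAXParseException ex) {
                        errors.add(ex.getMessage()); // record it and keep parsing
                    }
                });
        } catch (Exception ex) {
            throw new IllegalStateException("Parser setup or parse failed", ex);
        }
        return errors;
    }

    public static void main(String[] args) {
        String nameOnly = "<Recipe><Name>Zabaglione</Name></Recipe>";
        String reversed = "<Recipe><Description>Italian dessert</Description>"
                        + "<Name>Zabaglione</Name></Recipe>";
        System.out.println("Name only: " + validityErrors(nameOnly));
        System.out.println("Reversed:  " + validityErrors(reversed));
    }
}
```

The first document should come back with an empty list, and the second with at least one complaint about the content of Recipe. The exact wording of the message depends on the parser.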

003 <!ELEMENT Name (#PCDATA)>

This line states that a Name tag (or element) contains no other tag types, and may contain text between its open and close tags (#PCDATA stands for parsed character data). A validating parser will mark any tag within a Name tag as an error.

007 <!ELEMENT Ingredients (Ingredient)*>

This line states that an Ingredients tag may contain zero or more Ingredient tags. The asterisk or star operator (*) indicates the tag’s zero-or-more cardinality.

010 <!ATTLIST Ingredient
011   vegetarian CDATA "true">

An attribute list declaration, which uses <!ATTLIST, defines the attributes for a tag. Only attributes within the attribute list declaration for a tag are allowed. This line says that the Ingredient tag previously defined has a single attribute, vegetarian, which is character data (CDATA), and whose default value is "true". Attribute list declarations all follow this pattern; one may define multiple attributes, each with a type and default value, following the tag name.

014 <!ATTLIST Qty
015   unit CDATA #IMPLIED>

This attribute list declaration gives the unit attribute the keyword #IMPLIED in place of a default value. That means that the attribute may or may not appear with the tag; if it doesn’t appear, the parser supplies no value, and it is up to the application to decide what to use. This is how you create an optional attribute.

021 <!ELEMENT Instructions (Step)+>

This line states that an Instructions tag, if present, must contain at least one Step. The plus-sign operator (+) indicates one or more occurrences of the item to its left.

DTDs have more operators and conventions, but this example covers the basics. (You can find out the whole scoop on DTDs in XML in the XML recommendation; see Resources.)
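The difference between a default value and #IMPLIED is easiest to see from the receiving end. The following sketch parses a single Ingredient that carries no attributes at all and records what the handler is told about vegetarian, unit, and optional. It is an illustration only: the class name, the sample ingredient, and the use of the SAX2 Attributes interface (instead of AttributeList) are my assumptions, not code from the source archive.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo: shows which attribute values the parser fills in from the DTD.
class AttributeDefaults {

    // Ingredient-related declarations from Listing 2, plus a document
    // that sets none of the three attributes explicitly
    static final String DOC =
          "<!DOCTYPE Ingredient [\n"
        + "<!ELEMENT Ingredient (Qty, Item)>\n"
        + "<!ATTLIST Ingredient vegetarian CDATA \"true\">\n"
        + "<!ELEMENT Qty (#PCDATA)>\n"
        + "<!ATTLIST Qty unit CDATA #IMPLIED>\n"
        + "<!ELEMENT Item (#PCDATA)>\n"
        + "<!ATTLIST Item optional CDATA \"0\">\n"
        + "]>\n"
        + "<Ingredient><Qty>6</Qty><Item>Egg yolks</Item></Ingredient>";

    static List<String> reportAttributes() {
        final List<String> seen = new ArrayList<>();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(true);
            factory.newSAXParser().parse(
                new InputSource(new StringReader(DOC)),
                new DefaultHandler() {
                    @Override
                    public void startElement(String uri, String localName,
                                             String qName, Attributes attrs) {
                        switch (qName) {
                            case "Ingredient":
                                seen.add("vegetarian=" + attrs.getValue("vegetarian"));
                                break;
                            case "Qty":
                                // getValue() returns null for an absent #IMPLIED attribute
                                seen.add("unit=" + attrs.getValue("unit"));
                                break;
                            case "Item":
                                seen.add("optional=" + attrs.getValue("optional"));
                                break;
                            default:
                                break;
                        }
                    }
                });
        } catch (Exception ex) {
            throw new IllegalStateException("Parser setup or parse failed", ex);
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(reportAttributes());
    }
}
```

The defaulted attributes arrive as if they had been typed into the document ("true" and "0"), while unit arrives as null, so the handler has to decide what a missing unit means.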

DTDs are meta-information; that is, they are information about information. You may already be familiar with this concept. A table in a relational database has a schema describing such things as the column names, data types, sizes, and default values for its data. But the table description doesn’t contain data values, it contains a description of the values. Likewise, a DTD is a simple sort of schema that defines what may be in a particular document type. (There is currently an effort underway to create an XML schema that is much more like a database schema; see Resources.)

DTDs are also a bit like BNF, or Backus-Naur Form (see Resources for a discussion), which describes transformation rules for grammars; however, BNF can express structures that XML DTDs cannot.

An XML document declares its DTD with a <!DOCTYPE declaration, as shown in Listing 3. The document type declaration specifies the external DTD used to validate the document. The top-level tag of the document must be the same as the document type named in the <!DOCTYPE declaration (in this case, it’s Recipe).

Listing 3. Using an external DOCTYPE declaration

001 <?xml version="1.0"?>
002
003 <!DOCTYPE Recipe SYSTEM "example.dtd">
004
005 <Recipe>
006    <Name>Lime Jell-O Marshmallow Cottage Cheese Surprise</Name>
007    <Description>My grandma's favorite (may she rest in peace.)</Description>
008    <Ingredients>
...<!-- and so on... -->

Line 3 in Listing 3 states that the document that follows must conform to the DTD contained in the given file; if it doesn’t, the document is syntactically invalid.

A DTD may also be specified internally, as shown in Listing 4. Note that in Listing 4, the DTD is terminated by the ]> in line 19.

Listing 4. Internal document type declaration

001 <?xml version="1.0"?>
002
003 <!DOCTYPE Recipe [
004 <!ELEMENT Recipe (Name, Description?, Ingredients?, Instructions?)>
005 <!ELEMENT Name (#PCDATA)>
006 <!ELEMENT Description (#PCDATA)>
007 <!ELEMENT Ingredients (Ingredient)*>
008 <!ELEMENT Ingredient (Qty, Item)>
009 <!ATTLIST Ingredient
010         vegetarian CDATA "true">
011 <!ELEMENT Qty (#PCDATA)>
012 <!ATTLIST Qty
013         unit CDATA #IMPLIED>
014 <!ELEMENT Item (#PCDATA)>
015 <!ATTLIST Item
016         optional CDATA "0">
017 <!ELEMENT Instructions (Step)+>
018 <!ELEMENT Step (#PCDATA)>
019 ]>
020 <Recipe>
021    <Name>Lime Jell-O Marshmallow Cottage Cheese Surprise</Name>
022    <Description>My grandma's favorite (may she rest in peace.)</Description>
...<!-- and so on... -->

The full-text versions of these sample files, which I’ll use later, are included in the source code archives for this article. Download the source files from Resources and experiment with the SimpleValidatingSaxReporter class, which creates a validating SAX parser and then parses and validates the document against the DTD. The main program for this class appears in Listing 5.

Listing 5. Using a validating SAX parser

083 public static void main(String args[]) {
084         SimpleValidatingSaxReporter ssr = new SimpleValidatingSaxReporter();
085         try {
086                 ssr.parseDocument(true, ssr, args[0]);
087         } catch (Exception ex) {
088                 System.err.println(ex);
089         }
090 }
...
099 protected void parseDocument(boolean isValidating, HandlerBase handler, String sFilename) {
100         try {
101                 // Get a "parser factory", an object that creates parsers
102                 SAXParserFactory saxParserFactory = SAXParserFactory.newInstance();
103
104                 // Set up the factory to create the appropriate type of parser
105                 saxParserFactory.setValidating(isValidating);
106                 saxParserFactory.setNamespaceAware(false); // Not this month...
107
108                 SAXParser parser = saxParserFactory.newSAXParser();
109
110                 parser.parse(new File(sFilename), handler);
111         } catch (Exception ex) {
112                 System.err.println("Exception: " + ex);
113                 System.exit(2);
114         }
115 }

Line 102 in Listing 5 creates a SAXParserFactory, an object that creates parsers. Lines 105 to 106 tell the parser factory what kind of parser to create, and line 108 creates the parser. Line 110 then tells the parser to parse the file, using the handler passed from main() to handle the events. The handler is the SimpleValidatingSaxReporter itself, since that class extends HandlerBase. The result is a stream of SAX events, as long as the input is valid with respect to its DTD. Experiment with the code by adding and deleting items from the sample XML files in the source archive.

You’ll notice that there are errors in example2.xml, as the parser reports:

e: file:C:/mj-java/XMLSAX2/example2.xml: line 30: org.xml.sax.SAXParseException: Element "Ingredient" does not allow "Item" here.

The parser recognizes that the order of the Qty and Item tags is reversed. If you remove the <!DOCTYPE declaration from example2.xml, you get the following warning and error, printed by the error handler:

w: file:C:/mj-java/XMLSAX2/example2.xml: line 6: org.xml.sax.SAXParseException: Valid documents must have a <!DOCTYPE declaration.
e: file:C:/mj-java/XMLSAX2/example2.xml: line 6: org.xml.sax.SAXParseException: Element type "Recipe" is not declared.

The parser reports these problems because, without a <!DOCTYPE declaration, it has no DTD to check against, so it can’t find the definition of the Recipe tag. Play around with this class to get a feel for the kind of errors the parser can catch.

Notice that since the ErrorHandler instance you’re using merely reports errors, and doesn’t exit when it receives them, the parser continues to try parsing the file. Whoever writes the error handler (that’s you) is responsible for deciding what to do when errors occur.
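Here is one way to picture that decision, as a sketch under a few assumptions: the input is a made-up Ingredient with Item and Qty reversed (not the actual example2.xml), the handlers extend the SAX2 DefaultHandler rather than HandlerBase, and all of the class names are hypothetical. One handler reports the error and carries on; the other rethrows it, which stops the parse.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo of two error-handling policies for validity errors.
class ErrorPolicyDemo {

    // Invalid on purpose: the DTD requires Qty before Item
    static final String DOC =
          "<!DOCTYPE Ingredient [\n"
        + "<!ELEMENT Ingredient (Qty, Item)>\n"
        + "<!ELEMENT Qty (#PCDATA)>\n"
        + "<!ELEMENT Item (#PCDATA)>\n"
        + "]>\n"
        + "<Ingredient><Item>Lime Jell-O</Item><Qty>1</Qty></Ingredient>";

    // Policy 1: report validity errors, but let the parse continue.
    static class Lenient extends DefaultHandler {
        int errorCount = 0;

        @Override
        public void error(SAXParseException ex) {
            errorCount++;
            System.err.println("e: line " + ex.getLineNumber() + ": " + ex.getMessage());
        }
    }

    // Policy 2: rethrow the first validity error, which aborts the parse.
    static class Strict extends DefaultHandler {
        @Override
        public void error(SAXParseException ex) throws SAXException {
            throw ex;
        }
    }

    // Returns true if the parser reached the end of the document.
    static boolean parsesToEnd(DefaultHandler handler) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(true);
            factory.newSAXParser().parse(new InputSource(new StringReader(DOC)), handler);
            return true;
        } catch (SAXException ex) {
            return false; // the handler (or the parser) stopped the parse
        } catch (Exception ex) {
            throw new IllegalStateException("Parser setup or I/O failed", ex);
        }
    }

    public static void main(String[] args) {
        Lenient lenient = new Lenient();
        System.out.println("lenient finished: " + parsesToEnd(lenient)
            + ", errors seen: " + lenient.errorCount);
        System.out.println("strict finished: " + parsesToEnd(new Strict()));
    }
}
```

Which policy is right depends on the application: a reporting tool probably wants every error in one run, while a program that acts on the data probably wants to stop at the first one.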

Now you finally know enough about using SAX and a validating parser to create an XML application. I decided to make the process easy on myself, so I created LAX, which I’ll explain next.

LAX: The Lazy API for XML

Writing a document handler for a SAX parser is pretty easy: just subclass HandlerBase, override the appropriate methods, and do whatever you like in response to the events coming from the parser. Being a lazy, and therefore virtuous, programmer (see Part 1 for an explanation), I decided to do some extra work only once in order to simplify programming in SAX for subsequent projects. Writing SAX handlers is just too much work.

You see, when you override startElement(), endElement(), or characters(), you always have to check the tag name to decide what to do. So, these methods typically become large if-then-else blocks. It requires a lot of typing, which always opens the door for errors, plus I simply can’t be bothered to do it. So, I created LAX, the Lazy API for XML.
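To make the complaint concrete, here is a sketch of the kind of handler I mean, cut down to three tags of Recipe XML. It is written against the SAX2 DefaultHandler (an assumption on my part; the same shape applies to HandlerBase), and the class name and output format are invented for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler showing the if-then-else style that LAX is meant to avoid.
class IfElseHandlerDemo extends DefaultHandler {
    private final StringBuilder out = new StringBuilder();
    private String currentTag = "";
    private String unit;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        currentTag = qName;
        if (qName.equals("Name")) {
            out.append("Recipe: ");
        } else if (qName.equals("Qty")) {
            unit = attrs.getValue("unit"); // null when the attribute is absent
            out.append("\n- ");
        } else if (qName.equals("Item")) {
            out.append(unit == null ? " " : " " + unit + " ");
        }
        // ...and one more branch for every other tag you care about
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        currentTag = ""; // text between tags belongs to no tag of interest
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // The same tag-name checks again, this time for text
        if (currentTag.equals("Name") || currentTag.equals("Qty")
                || currentTag.equals("Item")) {
            out.append(ch, start, length);
        }
    }

    static String format(String xml) {
        IfElseHandlerDemo handler = new IfElseHandlerDemo();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception ex) {
            throw new IllegalStateException("Parse failed", ex);
        }
        return handler.out.toString();
    }

    public static void main(String[] args) {
        System.out.println(format(
              "<Recipe><Name>Zabaglione</Name><Ingredients><Ingredient>"
            + "<Qty unit=\"each\">6</Qty><Item>Egg yolks</Item>"
            + "</Ingredient></Ingredients></Recipe>"));
    }
}
```

Three tags already need three branches in one method and a compound test in another; a handler for the whole Recipe DTD would need a branch per tag in each.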

LAX lets any class use only naming conventions to handle SAX events, in much the same way JavaBeans introspection identifies properties and event sets by examining a class’s method signatures. A class can become an event handler simply by defining methods with the appropriate name and signature. There’s no need for a class to extend HandlerBase, since LAX does that for you. To use LAX, simply create a LAX object, register it as a document handler for the parser, then register your handler objects with LAX. LAX translates the stream of XML events into method calls on your objects.

LAX uses Java reflection to find methods in your classes to handle tags in the XML being parsed. When LAX encounters a tag called, say, <Tag>, it searches through all of its handlers (instances of classes that you’ve written), looking for objects that have either method void startTag() or void startTag(AttributeList list), and calls that method on any such object it finds. When it encounters the end tag </Tag>, it searches all of its handlers for a method called void endTag(), and calls any such methods it finds. When LAX encounters characters (in its own characters() method), it remembers the current tag, and searches for and calls all methods with the signature void textOfTag(String string).

As a result, you don’t need to write huge if-then-else statements, implement DocumentHandler, or extend HandlerBase. Simply write methods with the appropriate signatures, register an instance of your class with LAX, and parse the input document with the LAX object as a document handler. What could be easier?
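The dispatch scheme described above fits in a few dozen lines. What follows is not the Lax.java from the source archive but a stripped-down sketch of the same idea, which I’ll call MiniLax here. It assumes the SAX2 DefaultHandler and Attributes types in place of HandlerBase and AttributeList, and it assumes that tag names are legal as parts of Java method names. The two collector classes are made up to show that handlers need no base class.

```java
import java.io.StringReader;
import java.lang.reflect.Method;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// MiniLax: a hypothetical, stripped-down dispatcher in the spirit of LAX.
class MiniLax extends DefaultHandler {
    private final List<Object> handlers = new ArrayList<>();
    private final Deque<String> openTags = new ArrayDeque<>(); // innermost tag on top

    void addHandler(Object handler) {
        handlers.add(handler);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        openTags.push(qName);
        for (Object handler : handlers) {
            // Try start<Tag>(Attributes) first, then the no-argument start<Tag>()
            if (!call(handler, "start" + qName, new Class<?>[] {Attributes.class}, attrs)) {
                call(handler, "start" + qName, new Class<?>[0]);
            }
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        openTags.pop();
        for (Object handler : handlers) {
            call(handler, "end" + qName, new Class<?>[0]);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (openTags.isEmpty()) {
            return;
        }
        String text = new String(ch, start, length);
        for (Object handler : handlers) {
            call(handler, "textOf" + openTags.peek(), new Class<?>[] {String.class}, text);
        }
    }

    // Calls the named public method if the handler has one; returns false if not.
    private static boolean call(Object target, String name, Class<?>[] types, Object... args) {
        try {
            Method method = target.getClass().getMethod(name, types);
            method.invoke(target, args);
            return true;
        } catch (NoSuchMethodException ex) {
            return false; // this handler does not care about this event
        } catch (ReflectiveOperationException ex) {
            throw new IllegalStateException("Handler method " + name + " failed", ex);
        }
    }

    // Two example handlers: no base class, only naming conventions.
    static class NameCollector {
        final StringBuilder out = new StringBuilder();
        public void textOfName(String text) { out.append(text); }
        public void endName() { out.append('\n'); }
    }

    static class UnitCollector {
        final List<String> units = new ArrayList<>();
        public void startQty(Attributes attrs) { units.add(attrs.getValue("unit")); }
    }

    // Parses the text once, dispatching every event to all of the given handlers.
    static void parse(String xml, Object... handlerObjects) {
        MiniLax lax = new MiniLax();
        for (Object handler : handlerObjects) {
            lax.addHandler(handler);
        }
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), lax);
        } catch (Exception ex) {
            throw new IllegalStateException("Parse failed", ex);
        }
    }

    public static void main(String[] args) {
        NameCollector names = new NameCollector();
        UnitCollector units = new UnitCollector();
        parse("<Recipe><Name>Zabaglione</Name><Ingredients><Ingredient>"
            + "<Qty unit=\"each\">6</Qty><Item>Egg yolks</Item>"
            + "</Ingredient></Ingredients></Recipe>", names, units);
        System.out.print(names.out);
        System.out.println(units.units);
    }
}
```

Both collectors are fed from a single pass over the document, and neither one contains a tag-name comparison. The price is a reflective lookup per event per handler, which a real implementation might want to cache.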

The source code for LAX is in the source archive for this article, available in Resources below. Now I’ll develop a sample program using LAX.

Managing recipes with LAX

I often get email from readers who want to know how to use Java to process XML into HTML (or other formats) for display. The following example shows one way to use the same XML to create HTML files for different purposes, with different formatting, in a single processing step. The popularity of CGI, ASP (Active Server Pages), and JSP (JavaServer Pages) notwithstanding, I oppose writing any code that has hard-coded print statements spitting out HTML. Style languages such as CSS (Cascading Style Sheets), DSSSL (Document Style Semantics and Specification Language, SGML’s style language), and XSL (Extensible Stylesheet Language) are more appropriate for the task of transforming data into something presentable. (Why that’s true is material for another time; it’s a big topic.) Nevertheless, I understand that generating HTML from program code is common practice, and it makes for an enlightening example of using SAX to turn XML into some other useful form.

For this example, I have two valid Recipe XML files: example4.xml, my standard heinous lime Jell-O creation; and a new recipe for Nanner Pah, example3.xml, a big hit at all the Lutheran church dinners I went to as a kid.

I decided I wanted to use LAX to write a program that produces two files: a well-formatted recipe page for a cookbook, and a shopping list for the recipe, also attractively formatted. To accomplish that, I created two classes: RecipeWriter and ShoppingListWriter. I’ll go over each class in turn, and then show how you can use them both with LAX.

Formatting a recipe

The RecipeWriter class has start, end, and textOf methods for each tag type in the Recipe DTD. I’ll discuss how a couple of them work so you can get a feel for what the class does. You can follow along in the source code for RecipeWriter.java.

The RecipeWriter constructor, which takes a filename as an argument, creates the named file and opens it for writing. Subsequent method calls cause HTML to be written to the output file, and endRecipe() eventually closes it.

The top-level tag of the Recipe XML is <Recipe>, but RecipeWriter doesn’t have a startRecipe() method, so that event is skipped. When LAX encounters characters inside a <Name> tag, though, it finds the RecipeWriter’s textOfName() method, which it calls with the text of the recipe name. textOfName() calls titlePrint(), which sets up the HTML page, sets the body background image, and opens up a TABLE (which will be closed by endRecipe()). startDescription(), startIngredients(), and startInstructions() all produce rows in the table with attractive background colors and large header text.

This cookbook is designed to be used by both vegetarians and nonvegetarians, so notice that RecipeWriter has a boolean variable called _isVegetarian, which is set to false if any nonvegetarian ingredient is encountered by startIngredients(). After parsing is completed, endRecipe() checks this flag, and places an indication after the recipe of whether the recipe is vegetarian. Likewise, startItem() checks for the optional attribute, and prints “(optional)” after each optional ingredient.

You can see the results of running the RecipeWriter on example4.xml in example4-recipe.html and on example3.xml in example3-recipe.html.

Formatting a shopping list

At the same time that the recipe is being formatted by RecipeWriter, LAX also maintains an instance of ShoppingListWriter, which is creating a different file. You can follow along in the source code for ShoppingListWriter.java.

Like RecipeWriter, ShoppingListWriter creates and opens its output file in its constructor. Since a shopping list is primarily concerned with <Ingredients>, it doesn’t print anything until LAX calls startIngredients() (startName() saves the name in an instance field for use in startIngredients()). The program builds an HTML table on top of a spiral-notebook background, and prints all optional ingredients in red (so if you don’t bring enough money to the grocery store, you’ll know what you can do without).

You can see the results of the ShoppingListWriter in example4-list.html and example3-list.html. Currently, you can’t merge, sort, or add the contents of the two lists: n recipes give you n lists. But there’s no reason you couldn’t write a class that does any or all of those things.

The main LAX program

Listing 6 shows the main() method for LAX. You can read the full source code in Lax.java.

Listing 6. LAX main() method

133 public static void main(String args[]) {
134         if (args.length < 1) {
135                 System.err.println("Usage: lax inputFile.xml [parserClass]");
136                 System.exit(1);
137         }
138
139         String sInputFile = args[0];
140         String sRecipeFile;
141         String sShoppingListFile;
142         String sBase = sInputFile;
143
144         if (sBase.length() > 4 && sBase.toLowerCase().endsWith(".xml")) {
145                 sBase = sBase.substring(0, sBase.length() - 4);
146         } else {
147                 sInputFile = sBase + ".xml";
148         }
149         sRecipeFile = sBase + "-recipe.html";
150         sShoppingListFile = sBase + "-list.html";
151
152         Lax lax = new Lax();
153
154         ShoppingListWriter slw = new ShoppingListWriter(sShoppingListFile);
155         lax.addHandler(slw);
156
157         RecipeWriter rw = new RecipeWriter(sRecipeFile);
158         lax.addHandler(rw);
159
160         lax.parseDocument(true, lax, sInputFile);
161
162
163 }

Lines 139 through 150 build the output filenames, which are used to create the ShoppingListWriter (line 154) and the RecipeWriter (line 157) for LAX. Line 152 creates an instance of LAX, and lines 155 and 158 register the ShoppingListWriter and RecipeWriter objects with it. Line 160 then parses the file with a copy of parseDocument() taken directly from SimpleValidatingSaxReporter.

As LAX receives each event, it searches both of the handler objects for appropriate methods, and calls the methods when they are found. As a result, both output files are written in one pass of the parser. That’s all there is to it! All of the application logic (to create the HTML) occurs in the handler objects, and LAX handles dispatching the tag names in the events to the methods in the handlers.

Using XML and SAX in this way opens a lot of doors. With a little imagination, it’s easy to envision a servlet that reads a directory of XML files and creates pages of links to the formatted recipes and shopping lists. The formatted recipes and shopping lists could even be created on the fly by the servlet from the XML. Updating an XML file with new information would then automatically update both the recipe and the shopping list — they would never get out of sync. This data consistency is one of the benefits of using XML to represent information and then styling the XML for various presentations.

Final notes about XML and DTDs

Note that, in this Recipe example, the recipe’s content was separate from its presentation; that is, the XML represented the information in the recipe, while the LAX handler classes formatted and displayed that information. A different LAX class could display the document in an entirely different way, or even read it aloud, yet the underlying XML document would not have changed at all. This separation of content and presentation is one of the key themes in the architecture of modern document-processing systems.

In addition, while creating a DTD might seem like a simple proposition, it’s actually one of the most difficult parts of creating large integrated document-management systems. The syntax of DTDs isn’t too difficult once you get accustomed to it, but getting the DTD right requires a great deal of analysis, and the consequences of poor DTD design can haunt a project forever. DTD design has quite a lot in common with database design, especially in terms of normalization and denormalization of the information being represented.

DTDs are particularly useful for describing standard document formats for information interchange between open systems. Many groups and consortia are currently working out XML DTDs for everything from vector graphics to chemical formulas and molecules to shoe inventories.

Conclusion

This article has covered a lot of ground: you’ve learned what DTDs are and how they work, you’ve discovered LAX (which I think will make SAX even easier for you to use), and you’ve seen how you can use a single XML document in different contexts. I hope these small examples get you thinking about how you can use the technology.

SAX is an excellent way to process XML for many applications, but for complex transformations of XML data, it’s sometimes necessary to get at nodes all over the document “tree.” You can also use XML as a serialization mechanism; that is, for creating documents that represent arbitrary structures of objects, and recreating those object structures from documents. For those purposes, the Document Object Model (DOM) can be useful. In the next article in this series, you’ll learn how to use the DOM to do more sophisticated processing of XML documents.

Mark Johnson works as a designer and developer for OrganicNet in Fort Collins, Colo., by day, and as a JavaWorld columnist by night, very late at night.

Source: www.infoworld.com