Programming XML in Java, Part 2
Experience the joy of SAX, LAX, and DTDs
If you read last month’s article, you already understand how you can use SAX (the Simple API for XML) to process XML documents. (If you haven’t read it yet, you may want to start there; see “Read the whole series!” below.) In that article, I explained how application writers implement the SAX DocumentHandler interface, which takes a specific action when a particular condition (such as the start of a tag) occurs during the parsing of an XML document. But what good is that function? Read on.
Programming XML in Java: Read the whole series!
- Part 1. Use the Simple API for XML (SAX) to process XML in Java easily
- Part 2. Learn about SAX and XML validation through illustrative examples
- Part 3. DOMination: Take control of structured documents with the Document Object Model
You’ll also remember that an XML parser checks that the document is well formed (meaning, roughly, that all of the open and close tags match and don’t overlap in nonsensical ways). But even well-formed documents can contain meaningless data or have a senseless structure. How can such conditions be detected and reported?
This article answers both questions through an illustrative example. I’ll start with the latter question: once the document is parsed, how do you ensure that the XML your program is processing actually makes sense? Then I’ll demonstrate a small layer on top of SAX that I call LAX (the Lazy API for XML), which makes writing handlers for SAX events even easier. Finally, I’ll tie all of the themes together and demonstrate the technology’s usefulness with a small example that produces both formatted recipes and shopping lists from the same XML document.
Garbage in, garbage out
One thing you may have heard about XML is that it lets the system developer define custom tags. With a nonvalidating parser (discussed in Part 1 of this series), you certainly have that ability. You can make up any tag you want and, as long as you balance your open and close tags and don’t overlap them in absurd ways, the nonvalidating SAX parser will parse the document without any problems. For example, a nonvalidating SAX parser would correctly parse and fire events for the document in Listing 1.
Listing 1. A well-formed, meaningless document
001 <?xml version="1.0"?>
002 <Art CENTURY="20">
003   <Dada>
004     <Author CENTURY="18" NOMDEPLUME="Voltaire">
005       François-Marie Arouet
006     </Author>
007     <Tree SPECIES="Maple">
008       <Yes/>
009       <Book AUTHOR="Musashi, Miyamoto">
010         <Title LANG="English">The Book of Five Rings</Title>
011         <Title LANG="Nihongo">Go Rin No Sho</Title>
012         <Filter POLY="Chebyshev" POLES="2"/>
013         <Title LANG="Espanol">El Libro de Cinco Anillos</Title>
014         <Title LANG="Francais">Le Livre de Cinq Bagues</Title>
015       </Book>
016       <Bahrain FORMAT="MP3">
017         <Cathedral CITTA="Firenze">
018           <Nome>Santa Maria del Fiore</Nome>
019           <Architetto>Brunelleschi, Filippo (1377-1466)</Architetto>
020           <Ora FORMAT="DMY24">22032000134591</Ora>
021         </Cathedral>
022       </Bahrain>
023       <Phobias>
024         <Herbs NAME="Ma Huang"/>
025         <Appliance COLOR="Harvest Gold">Yuck</Appliance>
026       </Phobias>
027     </Tree>
028   </Dada>
029 </Art>
A nonvalidating SAX parser would produce a valid event stream for the document in Listing 1 because the input document is well formed. It’s really stupid input, but it is well formed. Every opening tag has a corresponding close tag, and the tags don’t overlap (meaning there are no combinations of tags like <A><B></A></B>). So a nonvalidating SAX parser will have no problem with Listing 1.
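The well-formedness rule is easy to try out. The following is a minimal sketch, not code from this article’s source archive: the class name WellFormedCheck and its method are made up for illustration, and it uses the SAX2 DefaultHandler that ships with current JDKs rather than the HandlerBase class used elsewhere in this article.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: asks a nonvalidating parser whether text is well formed.
final class WellFormedCheck {

    static boolean isWellFormed(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(false); // check well-formedness only, no DTD rules
            factory.newSAXParser().parse(
                new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXException ex) {
            // A well-formedness violation is a fatal error, which ends the parse.
            return false;
        } catch (Exception ex) {
            // Configuration or I/O trouble is not a statement about the document.
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<Art><Dada><Yes/></Dada></Art>")); // true
        System.out.println(isWellFormed("<A><B></A></B>"));                 // false
    }
}
```

Any combination of made-up tags passes as long as the tags nest properly; only the overlapping pair is rejected.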
Unfortunately, if you write a program that, for example, summarizes museum collections, formats architectural information, or prints multilingual card catalogs for libraries, your program could read this really stupid XML and produce really stupid output, because it might pull out tags it recognizes (like <Dada>, <Cathedral>, or <Book>). As the saying goes, “Garbage in, garbage out.”
To minimize the chance that your program produces garbage, you should devise a way to detect and reject garbage in the input. Then, given meaningful input, you can focus on creating reasonable output.
Think of a document as having three levels of correctness: lexical, syntactic, and semantic. Lexical correctness is what I mean when I say “well formed”: the basic structure of the document is reasonable and correct, but nothing about the content of the tags is checked. Any tag can occur inside any other tag any number of times, any tag can take any attribute, and attributes can take on any value. So, Listing 1 is well formed, but it makes no sense, because there is no control over what tags and attributes appear in the structure, and where.
Syntactic correctness means that the document is not only well formed, but that it also contains certain tags, in certain combinations. An XML document can include a section, called a document type definition (DTD), that specifies the rules for syntactic correctness.
A DTD lets a system designer create a custom markup language, a dialect of XML. A DTD indicates which tags may (or must) occur inside other specified tags, what attributes a tag may have, the required order of the tags, and so on. A validating parser uses a DTD to check the document it is parsing for syntactic correctness. The parser prints error and warning messages for any problems it finds, and then rejects any document that doesn’t conform to the DTD. The application programmer can then write code assuming that the structure of the document is correct, because the parser already checked it.
So, for example, in Listing 1 a designer might write a DTD that defines a <Book> tag as containing only one or more <Title> tags. The parser would report the presence of the <Filter> tag in line 12 as an error, because the DTD doesn’t allow it.
A DTD is also an excellent way to specify the input to your program. An XML input document either corresponds to a particular DTD or it doesn’t. Your program can correctly process any input that conforms to a given DTD. A DTD also lets you test your application for correctness or completeness; if an input document conforms to the DTD, but your program doesn’t process it properly, then you have a bug or a missing feature.
XML parsers don’t provide much in the way of checking for semantic correctness. Semantic correctness means that the actual instance data is true for the purposes of the application. A validating parser could report an error when it finds a FORMAT attribute on a <Bahrain> tag (as occurs in line 16, Listing 1). But it’s a lot to ask any parser to check whether the Cathedral of Santa Maria del Fiore is in Bahrain or in Italy. Semantic correctness remains the domain of your application: it’s up to you to add meaning to the XML document you’ve defined. A validating XML parser and a DTD help to automate the detection of gross lexical and syntactic errors in the input to your program, allowing you to focus on the data’s meaning.
As a side note, the HTML used to create Web pages is specified in an SGML DTD, which is considerably more complex and powerful than an XML DTD. XML DTDs are essentially a subset of these SGML DTDs, with some minor notational differences. The HTML DTD clearly specifies what kind of input an HTML-processing program can accept. XHTML, an XML-compatible version of HTML, specifies an XML DTD for HTML. It has just been released by the World Wide Web Consortium (W3C).
In the next section, I’ll create a DTD for a small XML dialect for describing recipes.
Parlez-vous DTD?
Two people generally can’t talk to one another unless they speak a mutually understood language. Likewise, two programs can’t communicate via XML unless the programs agree on the XML language they use. A DTD defines a set of rules for the allowable tags and attributes in an XML document, and the order and cardinality of the tags. Programs using the DTD must still agree on what the tags mean (semantics again), but a DTD defines the words (or, the tags) and the grammatical rules for a particular XML dialect.
Listing 2 shows a simple DTD for a tiny XML language I call Recipe XML.
Listing 2. The DTD for Recipe XML
001 <!ELEMENT Recipe (Name, Description?, Ingredients?, Instructions?)>
002
003 <!ELEMENT Name (#PCDATA)>
004
005 <!ELEMENT Description (#PCDATA)>
006
007 <!ELEMENT Ingredients (Ingredient)*>
008
009 <!ELEMENT Ingredient (Qty, Item)>
010 <!ATTLIST Ingredient
011   vegetarian CDATA "true">
012
013 <!ELEMENT Qty (#PCDATA)>
014 <!ATTLIST Qty
015   unit CDATA #IMPLIED>
016
017 <!ELEMENT Item (#PCDATA)>
018 <!ATTLIST Item
019   optional CDATA "0">
020
021 <!ELEMENT Instructions (Step)+>
022 <!ELEMENT Step (#PCDATA)>
The DTD in Listing 2 defines a complete, tiny language for transmitting recipes. Programs that use this DTD can count on the structure of conforming files to match the rules in the DTD.
I’ll go over this file, line by line:
001 <!ELEMENT Recipe (Name, Description?, Ingredients?, Instructions?)>
This line defines a tag using <!ELEMENT. The entire line from the opening <!ELEMENT to the closing > is called an element type declaration. The declaration says that a Recipe is composed of a Name, followed by the optional occurrence of a Description, Ingredients, and Instructions. The comma operator (,) indicates the valid tags the defined tag may contain, and the order in which those tags must appear. The question mark operator (?) indicates that the item to its left is optional. Since Name has only a comma operator after it, a Recipe must have precisely one Name. The parentheses are for grouping, and don’t appear in the input document.
Therefore, the sequence:
<Recipe><Name>Zabaglione</Name></Recipe>
is a valid Recipe, because it matches the DTD (that is, it consists of a <Name>, and the optional <Description>, <Ingredients>, and <Instructions> that may follow it are simply absent). However:
<Recipe>
<Description>Italian dessert</Description>
<Name>Zabaglione</Name>
</Recipe>
is not a valid Recipe, because the Description comes before the Name.
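To see a validating parser make that distinction, here is a minimal sketch under a few assumptions: the class RecipeOrderCheck is hypothetical, the DTD is embedded as an internal subset containing only the element declarations from Listing 2 (the attribute lists are left out for brevity), and the SAX2 DefaultHandler stands in for HandlerBase. It collects the validity errors the parser reports for each fragment.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: validates a Recipe fragment against the Listing 2 elements.
final class RecipeOrderCheck {

    // Element declarations copied from Listing 2 (attribute lists omitted).
    static final String DTD =
        "<!DOCTYPE Recipe [\n"
        + "<!ELEMENT Recipe (Name, Description?, Ingredients?, Instructions?)>\n"
        + "<!ELEMENT Name (#PCDATA)>\n"
        + "<!ELEMENT Description (#PCDATA)>\n"
        + "<!ELEMENT Ingredients (Ingredient)*>\n"
        + "<!ELEMENT Ingredient (Qty, Item)>\n"
        + "<!ELEMENT Qty (#PCDATA)>\n"
        + "<!ELEMENT Item (#PCDATA)>\n"
        + "<!ELEMENT Instructions (Step)+>\n"
        + "<!ELEMENT Step (#PCDATA)>\n"
        + "]>\n";

    /** Parses with validation switched on and returns the validity errors. */
    static List<String> validate(String recipeXml) {
        final List<String> errors = new ArrayList<>();
        DefaultHandler collector = new DefaultHandler() {
            @Override
            public void error(SAXParseException ex) {
                errors.add(ex.getMessage()); // record the error and keep parsing
            }
        };
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(true);
            factory.newSAXParser().parse(
                new InputSource(new StringReader(DTD + recipeXml)), collector);
        } catch (Exception ex) {
            throw new IllegalStateException(ex);
        }
        return errors;
    }

    public static void main(String[] args) {
        System.out.println(validate("<Recipe><Name>Zabaglione</Name></Recipe>"));
        System.out.println(validate("<Recipe><Description>Italian dessert</Description>"
            + "<Name>Zabaglione</Name></Recipe>"));
    }
}
```

The first call should come back with an empty list and the second with at least one error; the exact wording of the message depends on the parser implementation.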
003 <!ELEMENT Name (#PCDATA)>
This line states that a Name tag (or element) contains no other tag types, and may contain text between its open and close tags. A validating parser will mark any tag within a Name tag as an error.
007 <!ELEMENT Ingredients (Ingredient)*>
This line states that an Ingredients tag may contain zero or more Ingredient tags. The asterisk or star operator (*) indicates the tag’s zero-or-more cardinality.
010 <!ATTLIST Ingredient
011   vegetarian CDATA "true">
An attribute list declaration, which uses <!ATTLIST, defines the attributes for a tag. Only attributes within the attribute list declaration for a tag are allowed. This line says that the Ingredient tag previously defined has a single attribute, vegetarian, which is character data (CDATA), and whose default value is "true". Attribute list declarations all follow this pattern; one may define multiple attributes, each with a type and default value, following the tag name.
014 <!ATTLIST Qty
015   unit CDATA #IMPLIED>
This attribute list declaration gives #IMPLIED in place of a default value for the unit attribute. That means that the attribute may or may not appear with the tag; if it doesn’t appear, the parser supplies no value, and the application decides what to assume. This is how you create an optional attribute.
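The difference between a declared default and #IMPLIED shows up in what the handler receives. The sketch below is illustrative only: the class AttributeDefaults is hypothetical, the ingredient data is made up, the declarations are those from Listing 2, and the handler uses the SAX2 Attributes interface rather than the AttributeList of the original SAX. None of the three tags in the document carries an attribute.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo: which attribute values reach the handler when none are written?
final class AttributeDefaults {

    static final String DOC =
        "<!DOCTYPE Ingredient [\n"
        + "<!ELEMENT Ingredient (Qty, Item)>\n"
        + "<!ATTLIST Ingredient vegetarian CDATA \"true\">\n"
        + "<!ELEMENT Qty (#PCDATA)>\n"
        + "<!ATTLIST Qty unit CDATA #IMPLIED>\n"
        + "<!ELEMENT Item (#PCDATA)>\n"
        + "<!ATTLIST Item optional CDATA \"0\">\n"
        + "]>\n"
        + "<Ingredient><Qty>6</Qty><Item>egg yolks</Item></Ingredient>";

    /** Maps "Tag/attribute" to the value the handler saw (null if none). */
    static Map<String, String> seenAttributes() {
        final Map<String, String> seen = new LinkedHashMap<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                switch (qName) {
                    case "Ingredient":
                        seen.put("Ingredient/vegetarian", atts.getValue("vegetarian"));
                        break;
                    case "Qty":
                        seen.put("Qty/unit", atts.getValue("unit"));
                        break;
                    case "Item":
                        seen.put("Item/optional", atts.getValue("optional"));
                        break;
                    default:
                        break;
                }
            }
        };
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(true);
            factory.newSAXParser().parse(new InputSource(new StringReader(DOC)), handler);
        } catch (Exception ex) {
            throw new IllegalStateException(ex);
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(seenAttributes());
    }
}
```

The declared defaults ("true" and "0") arrive as if they had been typed into the document, while the #IMPLIED unit attribute comes back as null, leaving the choice of unit to the application.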
021 <!ELEMENT Instructions (Step)+>
This line states that an Instructions tag, if present, must contain at least one Step. The plus-sign operator (+) indicates one or more occurrences of the item to its left.
DTDs have more operators and conventions, but this example covers the basics. (You can find out the whole scoop on DTDs in XML in the XML recommendation; see Resources.)
DTDs are meta-information; that is, they are information about information. You may already be familiar with this concept. A table in a relational database has a schema describing such things as the column names, data types, sizes, and default values for its data. But the table description doesn’t contain data values, it contains a description of the values. Likewise, a DTD is a simple sort of schema that defines what may be in a particular document type. (There is currently an effort underway to create an XML schema that is much more like a database schema; see Resources.)
DTDs are also a bit like BNF, or Backus-Naur Form (see Resources for a discussion), which describes the production rules of a grammar; however, BNF can express structures that XML DTDs cannot.
An XML document declares its DTD with a <!DOCTYPE declaration, as shown in Listing 3. The document type declaration specifies the external DTD used to validate the document. The top-level tag of the document must be the same as the document type named in the <!DOCTYPE (in this case, it’s Recipe).
Listing 3. Using an external DOCTYPE declaration
001 <?xml version="1.0"?>
002
003 <!DOCTYPE Recipe SYSTEM "example.dtd">
004
005 <Recipe>
006   <Name>Lime Jell-O Marshmallow Cottage Cheese Surprise</Name>
007   <Description>My grandma's favorite (may she rest in peace.)</Description>
008   <Ingredients>
... <!-- and so on... -->
Line 3 in Listing 3 states that the document that follows must conform to the DTD contained in the given file; otherwise, the document is syntactically invalid.
A DTD may also be specified internally, as shown in Listing 4. Note that in Listing 4, the DTD is terminated by the ]> in line 19.
Listing 4. Internal document type declaration
001 <?xml version="1.0"?>
002
003 <!DOCTYPE Recipe [
004   <!ELEMENT Recipe (Name, Description?, Ingredients?, Instructions?)>
005   <!ELEMENT Name (#PCDATA)>
006   <!ELEMENT Description (#PCDATA)>
007   <!ELEMENT Ingredients (Ingredient)*>
008   <!ELEMENT Ingredient (Qty, Item)>
009   <!ATTLIST Ingredient
010     vegetarian CDATA "true">
011   <!ELEMENT Qty (#PCDATA)>
012   <!ATTLIST Qty
013     unit CDATA #IMPLIED>
014   <!ELEMENT Item (#PCDATA)>
015   <!ATTLIST Item
016     optional CDATA "0">
017   <!ELEMENT Instructions (Step)+>
018   <!ELEMENT Step (#PCDATA)>
019 ]>
020 <Recipe>
021   <Name>Lime Jell-O Marshmallow Cottage Cheese Surprise</Name>
022   <Description>My grandma's favorite (may she rest in peace.)</Description>
... <!-- and so on... -->
The full-text versions of these sample files, which I’ll use later, are included in the source code archives for this article. Download the source files from Resources and experiment with the SimpleValidatingSaxReporter class, which creates a validating SAX parser and then parses and validates the document against the DTD. The main program for this class appears in Listing 5.
Listing 5. Using a validating SAX parser
083 public static void main(String args[]) {
084   SimpleValidatingSaxReporter ssr = new SimpleValidatingSaxReporter();
085   try {
086     ssr.parseDocument(true, ssr, args[0]);
087   } catch (Exception ex) {
088     System.err.println(ex);
089   }
090 }
...
099 protected void parseDocument(boolean isValidating, HandlerBase handler, String sFilename) {
100   try {
101     // Get a "parser factory", an object that creates parsers
102     SAXParserFactory saxParserFactory = SAXParserFactory.newInstance();
103
104     // Set up the factory to create the appropriate type of parser
105     saxParserFactory.setValidating(isValidating);
106     saxParserFactory.setNamespaceAware(false); // Not this month...
107
108     SAXParser parser = saxParserFactory.newSAXParser();
109
110     parser.parse(new File(sFilename), handler);
111   } catch (Exception ex) {
112     System.err.println("Exception: " + ex);
113     System.exit(2);
114   }
115 }
Line 102 in Listing 5 creates a SAXParserFactory, an object that creates parsers. Lines 105 to 106 tell the parser factory what kind of parser to create, and line 108 creates the parser. Line 110 then tells the parser to parse the file, using the handler passed from main() to handle the events. The handler is the SimpleValidatingSaxReporter itself, since that class extends HandlerBase. The result is a stream of SAX events, as long as the input is valid with respect to its DTD. Experiment with the code by adding and deleting items from the sample XML files in the source archive.
You’ll notice that there are errors in example2.xml, as the parser reports:
e: file:C:/mj-java/XMLSAX2/example2.xml: line 30: org.xml.sax.SAXParseException: Element "Ingredient" does not allow "Item" here.
The parser recognizes that the order of the Qty and Item tags is reversed. If you remove the <!DOCTYPE declaration from example2.xml, you get the following warning and error messages, printed by the error handler:
w: file:C:/mj-java/XMLSAX2/example2.xml: line 6: org.xml.sax.SAXParseException: Valid documents must have a <!DOCTYPE declaration.
e: file:C:/mj-java/XMLSAX2/example2.xml: line 6: org.xml.sax.SAXParseException: Element type "Recipe" is not declared.
It prints these messages because it has no DTD to check against, so it can’t find the definition of the Recipe tag. Play around with this class to get a feel for the kind of errors the parser can catch.
Notice that since the ErrorHandler instance you’re using merely reports errors, and doesn’t exit when it receives them, the parser continues to try parsing the file. Whoever writes the error handler (that’s you) is responsible for deciding what to do when errors occur.
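The two obvious policies, carry on or give up, differ by a single line in the error handler. The sketch below is an assumption-laden illustration rather than code from the source archive: the class StopOnFirstError and the sample ingredient are made up, and it uses the SAX2 DefaultHandler. The document reverses Qty and Item on purpose, which makes it invalid against its DTD.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo of two error-handling policies for validity errors.
final class StopOnFirstError {

    // Item comes before Qty, which the content model (Qty, Item) forbids.
    static final String DOC =
        "<!DOCTYPE Ingredient [\n"
        + "<!ELEMENT Ingredient (Qty, Item)>\n"
        + "<!ELEMENT Qty (#PCDATA)>\n"
        + "<!ELEMENT Item (#PCDATA)>\n"
        + "]>\n"
        + "<Ingredient><Item>lime gelatin</Item><Qty>1</Qty></Ingredient>";

    /** Returns true if the parse ran to the end of the document. */
    static boolean parseCompletes(final boolean abortOnError) {
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void error(SAXParseException ex) throws SAXException {
                if (abortOnError) {
                    throw ex; // policy 1: treat a validity error as fatal
                }
                // policy 2: ignore the error and let the parser continue
            }
        };
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(true);
            factory.newSAXParser().parse(new InputSource(new StringReader(DOC)), handler);
            return true;
        } catch (SAXException ex) {
            return false; // the handler (or the parser) stopped the parse
        } catch (Exception ex) {
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) {
        System.out.println(parseCompletes(false)); // lenient handler
        System.out.println(parseCompletes(true));  // strict handler
    }
}
```

A production handler would more likely log the error with its line number before deciding; the point here is only that the decision lives in your code, not in the parser.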
Now you finally know enough about using SAX and a validating parser to create an XML application. I decided to make the process easy on myself, so I created LAX, which I’ll explain next.
LAX: The Lazy API for XML
Writing a document handler for a SAX parser is pretty easy: just subclass HandlerBase, override the appropriate methods, and do whatever you like in response to the events coming from the parser. Being a lazy, and therefore virtuous, programmer (see Part 1 for an explanation), I decided to do some extra work only once in order to simplify programming in SAX for subsequent projects. Writing SAX handlers is just too much work.
You see, when you override startElement(), endElement(), or characters(), you always have to check the tag name to decide what to do. So, these methods typically become large if-then-else blocks. It requires a lot of typing, which always opens the door for errors, plus I simply can’t be bothered to do it. So, I created LAX, the Lazy API for XML.
LAX lets any class use only naming conventions to handle SAX events, in much the same way JavaBeans introspection identifies properties and event sets by examining a class’s method signatures. A class can become an event handler simply by defining methods with the appropriate name and signature. There’s no need for a class to extend HandlerBase, since LAX does that for you. To use LAX, simply create a LAX object, register it as a document handler for the parser, then register your handler objects with LAX. LAX translates the stream of XML events into method calls on your objects.
LAX uses Java reflection to find methods in your classes to handle tags in the XML being parsed. When LAX encounters a tag called, say, <Tag>, it searches through all of its handlers (instances of classes that you’ve written), looking for objects that have either method void startTag() or void startTag(AttributeList list), and calls that method on any such object it finds. When it encounters the end tag </Tag>, it searches all of its handlers for a method called void endTag(), and calls any such methods it finds. When LAX encounters characters (in its own characters() method), it remembers the current tag, and searches for and calls all methods with the signature void textOfTag(String string).
As a result, you don’t need to write huge if-then-else statements, implement DocumentHandler, or extend HandlerBase. Simply write methods with the appropriate signatures, register an instance of your class with LAX, and parse the input document with the LAX object as a document handler. What could be easier?
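The naming-convention dispatch can be sketched in a few dozen lines. What follows is not the LAX source from the archive; it is a simplified, hypothetical MiniLax written against the SAX2 DefaultHandler and Attributes types instead of HandlerBase and AttributeList. Two further assumptions: handler methods must be public so that reflection can find them, and the sketch keeps a stack of open tags so that it always knows the current one.

```java
import java.io.StringReader;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical, cut-down dispatcher in the spirit of LAX.
final class MiniLax extends DefaultHandler {
    private final List<Object> handlers = new ArrayList<>();
    private final List<String> openTags = new ArrayList<>(); // used as a stack

    void addHandler(Object handler) { handlers.add(handler); }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        openTags.add(qName);
        for (Object handler : handlers) {
            // Prefer startTag(Attributes); fall back to startTag().
            if (!call(handler, "start" + qName, new Class<?>[] {Attributes.class}, atts)) {
                call(handler, "start" + qName, new Class<?>[0]);
            }
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        openTags.remove(openTags.size() - 1);
        for (Object handler : handlers) {
            call(handler, "end" + qName, new Class<?>[0]);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (openTags.isEmpty()) return;
        String tag = openTags.get(openTags.size() - 1);
        String text = new String(ch, start, length);
        for (Object handler : handlers) {
            call(handler, "textOf" + tag, new Class<?>[] {String.class}, text);
        }
    }

    /** Invokes the named public method if it exists; returns whether it did. */
    private static boolean call(Object target, String name, Class<?>[] types, Object... args) {
        try {
            Method method = target.getClass().getMethod(name, types);
            method.invoke(target, args);
            return true;
        } catch (NoSuchMethodException ex) {
            return false; // this handler is not interested in the event
        } catch (ReflectiveOperationException ex) {
            throw new IllegalStateException(ex);
        }
    }

    /** Parses the text, dispatching events to every handler object given. */
    static void parse(String xml, Object... handlerObjects) {
        MiniLax lax = new MiniLax();
        for (Object handler : handlerObjects) lax.addHandler(handler);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), lax);
        } catch (Exception ex) {
            throw new IllegalStateException(ex);
        }
    }
}

// A handler object: no interface, no superclass, just conventionally named methods.
final class NameLogger {
    final List<String> log = new ArrayList<>();
    public void startRecipe() { log.add("start Recipe"); }
    public void textOfName(String text) { log.add("name=" + text); }
    public void endRecipe() { log.add("end Recipe"); }
}
```

Calling MiniLax.parse("<Recipe><Name>Zabaglione</Name></Recipe>", new NameLogger()) fills the logger with the start, name, and end entries in document order. Keep in mind that a SAX parser may deliver long runs of text in several characters() calls, so a real handler should be prepared to accumulate text.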
The source code for LAX is in the source archive for this article, available in Resources below. Now I’ll develop a sample program using LAX.
Managing recipes with LAX
I often get email from readers who want to know how to use Java to process XML into HTML (or other formats) for display. The following example shows one way to use the same XML to create HTML files for different purposes, with different formatting, in a single processing step. The popularity of CGI, ASP (Active Server Pages), and JSP (JavaServer Pages) notwithstanding, I oppose writing any code that has hard-coded print statements spitting out HTML. Style languages such as CSS (Cascading Style Sheets), DSSSL (Document Style Semantics and Specification Language, SGML’s style language), and XSL (Extensible Stylesheet Language) are more appropriate for the task of transforming data into something presentable. (Why that’s true is material for another time; it’s a big topic.) Nevertheless, I understand that generating HTML from program code is common practice, and it makes for an enlightening example of using SAX to turn XML into some other useful form.
For this example, I have two valid Recipe XML files: example4.xml, my standard heinous lime Jell-O creation; and a new recipe for Nanner Pah, example3.xml, a big hit at all the Lutheran church dinners I went to as a kid.
I decided I wanted to use LAX to write a program that produces two files: a well-formatted recipe page for a cookbook, and a shopping list for the recipe, also attractively formatted. To accomplish that, I created two classes: RecipeWriter and ShoppingListWriter. I’ll go over each class in turn, and then show how you can use them both with LAX.
Formatting a recipe
The RecipeWriter class has start, end, and textOf methods for each tag type in the Recipe DTD. I’ll discuss how a couple of them work so you can get a feel for what the class does. You can follow along in the source code for RecipeWriter.java.
The RecipeWriter constructor, which takes a filename as an argument, creates the named file and opens it for writing. Subsequent method calls cause HTML to be written to the output file, and endRecipe() eventually closes it.
The top-level tag of the Recipe XML is <Recipe>, but RecipeWriter doesn’t have a startRecipe() method, so that event is skipped. When LAX encounters characters inside a <Name> tag, though, it finds the RecipeWriter’s textOfName() method, which it calls with the text of the recipe name. textOfName() calls titlePrint(), which sets up the HTML page, sets the body background image, and opens up a TABLE (which will be closed by endRecipe()). startDescription(), startIngredients(), and startInstructions() all produce rows in the table with attractive background colors and large header text.
This cookbook is designed to be used by both vegetarians and nonvegetarians, so notice that RecipeWriter has a boolean variable called _isVegetarian, which is set to “false” if startIngredient() encounters any nonvegetarian ingredient. After parsing is completed, endRecipe() checks this flag, and places an indication after the recipe of whether the recipe is vegetarian. Likewise, startItem() checks for the optional attribute, and prints “(optional)” after each optional ingredient.
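The flag logic is simple enough to sketch on its own. This is not RecipeWriter.java; it is a hypothetical RecipeFlags class with LAX-style method names, written against the SAX2 Attributes interface. Because the vegetarian attribute is declared on the Ingredient tag, the sketch hooks startIngredient(). It also assumes that any vegetarian value other than "true" marks a nonvegetarian ingredient, and that any optional value other than the default "0" marks an optional item; the DTD leaves both conventions open.

```java
import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical handler fragment tracking the two attributes from the Recipe DTD.
final class RecipeFlags {
    private boolean isVegetarian = true; // stays true until proven otherwise

    /** Called once per Ingredient tag, LAX style. */
    public void startIngredient(Attributes atts) {
        String value = atts.getValue("vegetarian");
        // Assumption: anything but "true" means the ingredient is not vegetarian.
        if (value != null && !"true".equals(value)) {
            isVegetarian = false;
        }
    }

    public boolean isVegetarian() { return isVegetarian; }

    /** Appends "(optional)" when the Item's optional attribute is switched on. */
    static String itemLabel(String item, Attributes atts) {
        String optional = atts.getValue("optional");
        // Assumption: the default "0" means required; any other value means optional.
        boolean isOptional = optional != null && !"0".equals(optional);
        return isOptional ? item + " (optional)" : item;
    }

    /** Test helper: builds a one-attribute list without running a parser. */
    static Attributes attribute(String name, String value) {
        AttributesImpl atts = new AttributesImpl();
        atts.addAttribute("", "", name, "CDATA", value);
        return atts;
    }
}
```

One nonvegetarian ingredient anywhere in the recipe flips the flag for good, which is why the check of the final value belongs in the end-of-recipe handler.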
You can see the results of running the RecipeWriter on example4.xml in example4-recipe.html and on example3.xml in example3-recipe.html.
Formatting a shopping list
At the same time that the recipe is being formatted by RecipeWriter, LAX also maintains an instance of ShoppingListWriter, which is creating a different file. You can follow along in the source code for ShoppingListWriter.java.
Like RecipeWriter, ShoppingListWriter creates and opens its output file in its constructor. Since a shopping list is primarily concerned with <Ingredients>, it doesn’t print anything until LAX calls startIngredients() (textOfName() saves the name in an instance field for use in startIngredients()). The program builds an HTML table on top of a spiral-notebook background, and prints all optional items in red (so if you don’t bring enough money to the grocery store, you’ll know what you can do without).
You can see the results of the ShoppingListWriter in example4-list.html and example3-list.html. Currently, you can’t merge, sort, or add the contents of the two lists; n recipes gives you n lists. But there’s no reason you couldn’t write a class that does any or all of those things.
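As a rough idea of what such a class might look like, here is a small sketch. Everything in it is an assumption: the ShoppingListMerger name, the decision to key totals by item and unit, and the premise that the Qty text has already been converted to a number. It merges entries from any number of recipes, adds up matching quantities, and keeps the result sorted.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical merger: combines ingredients from several recipes into one list.
final class ShoppingListMerger {
    // TreeMap keeps the merged list sorted alphabetically by its key.
    private final Map<String, Double> totals = new TreeMap<>();

    /** Adds one ingredient; quantities with the same item and unit are summed. */
    void add(String item, String unit, double quantity) {
        String key = item + " (" + unit + ")";
        totals.merge(key, quantity, Double::sum);
    }

    Map<String, Double> totals() { return totals; }

    public static void main(String[] args) {
        ShoppingListMerger merger = new ShoppingListMerger();
        merger.add("sugar", "cup", 0.5); // from one recipe
        merger.add("sugar", "cup", 1.0); // from another recipe
        merger.add("eggs", "each", 2);
        System.out.println(merger.totals()); // {eggs (each)=2.0, sugar (cup)=1.5}
    }
}
```

A handler built on this idea could call add() from its Item and Qty methods and write the HTML only after the last recipe has been parsed. Converting between units (cups and tablespoons, say) is a semantic problem that neither the DTD nor the parser will solve for you.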
The main LAX program
Listing 6 shows the main() method for LAX. You can read the full source code in Lax.java.
Listing 6. LAX main() method
133 public static void main(String args[]) {
134   if (args.length < 1) {
135     System.err.println("Usage: lax inputFile.xml [parserClass]");
136     System.exit(1);
137   }
138
139   String sInputFile = args[0];
140   String sRecipeFile;
141   String sShoppingListFile;
142   String sBase = sInputFile;
143
144   if (sBase.length() > 4 && sBase.toLowerCase().endsWith(".xml")) {
145     sBase = sBase.substring(0, sBase.length() - 4);
146   } else {
147     sInputFile = sBase + ".xml";
148   }
149   sRecipeFile = sBase + "-recipe.html";
150   sShoppingListFile = sBase + "-list.html";
151
152   Lax lax = new Lax();
153
154   ShoppingListWriter slw = new ShoppingListWriter(sShoppingListFile);
155   lax.addHandler(slw);
156
157   RecipeWriter rw = new RecipeWriter(sRecipeFile);
158   lax.addHandler(rw);
159
160   lax.parseDocument(true, lax, sInputFile);
161
162
163 }
Lines 139 through 150 build the output filenames, which are used to create the ShoppingListWriter (line 154) and the RecipeWriter (line 157) for LAX. Line 152 creates an instance of LAX, and in lines 155 and 158, LAX gets the ShoppingListWriter and RecipeWriter objects. Line 160 then parses the file with a copy of parseDocument() taken directly from SimpleValidatingSaxReporter.
As LAX receives each event, it searches both of the handler objects for appropriate methods, and calls the methods when they are found. As a result, both output files are written in one pass of the parser. That’s all there is to it! All of the application logic (to create the HTML) occurs in the handler objects, and LAX handles dispatching the tag names in the events to the methods in the handlers.
Using XML and SAX in this way opens a lot of doors. With a little imagination, it’s easy to envision a servlet that reads a directory of XML files and creates pages of links to the formatted recipes and shopping lists. The formatted recipes and shopping lists could even be created on the fly by the servlet from the XML. Updating an XML file with new information would then automatically update both the recipe and the shopping list — they would never get out of sync. This data consistency is one of the benefits of using XML to represent information and then styling the XML for various presentations.
Final notes about XML and DTDs
Note that, in this Recipe example, the recipe’s content was separate from its presentation; that is, the XML represented the information in the recipe, while the LAX handler classes formatted and displayed that information. A different LAX class could display the document in an entirely different way, or even read it aloud, yet the underlying XML document would not have changed at all. This separation of content and presentation is one of the key themes in the architecture of modern document-processing systems.
In addition, while creating a DTD might seem like a simple proposition, it’s actually one of the most difficult parts of creating large integrated document-management systems. The syntax of DTDs isn’t too difficult once you get accustomed to it, but getting the DTD right requires a great deal of analysis, and the consequences of poor DTD design can haunt a project forever. DTD design has quite a lot in common with database design, especially in terms of normalization and denormalization of the information being represented.
DTDs are particularly useful for describing standard document formats for information interchange between open systems. Many groups and consortia are currently working out XML DTDs for everything from vector graphics to chemical formulas and molecules to shoe inventories.
Conclusion
This article has covered a lot of ground: you’ve learned what DTDs are and how they work, you’ve discovered LAX (which I think will make SAX even easier for you to use), and you’ve seen how you can use a single XML document in different contexts. I hope these small examples get you thinking about how you can use the technology.
SAX is an excellent way to process XML for many applications, but for complex transformations of XML data, it’s sometimes necessary to get at nodes all over the document “tree.” You can also use XML as a serialization mechanism; that is, for creating documents that represent arbitrary structures of objects, and recreating those object structures from documents. For those purposes, the Document Object Model (DOM) can be useful. In the next article in this series, you’ll learn how to use the DOM to do more sophisticated processing of XML documents.