Programming XML in Java, Part 3

DOMination: Take control of structured documents with the Document Object Model

The Simple API for XML (SAX) is an excellent interface for many XML applications. It is intuitive, extremely easy to learn, and, as its name implies, simple. Any Java programmer can, in just an hour or two, learn to use and develop an application using SAX. It is especially useful in situations where the data in an XML file is already in a form that is structurally similar to the desired output. For instance, the recipe example in Part 2 of this series formatted Recipe XML into an HTML representation of a recipe page and a shopping list. The structure of the output HTML was very similar to the structure of the input XML. The ingredients in the Recipe XML were grouped together in an <Ingredients> element; the ingredients in the output HTML were grouped together in an unordered list (<ul>). The tags were somewhat different, but the basic structure was the same.

In real data-processing situations, however, the structure of the input data often differs greatly from the eventual output structure. Since SAX passes SAX events to a programmer-defined handler in the order in which they appear in the input XML, as the programmer you are responsible for any data restructuring or reordering. Also, if the same data is to be used in more than one place in the output, you must either perform multiple passes over the XML or arrange for the handler to “remember” that data while producing output. One example of this was the recipe title in Part 2, which the handler maintained in an internal variable for use both in the browser title bar and in the Web page.
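
The "remembering" described above can be sketched roughly as follows. This is a minimal illustration, not the handler from Part 2: the Title element name, the sample recipe text, and the TitleRememberingHandler class are all assumptions made for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: stores the text of <Title> so it can be reused later.
class TitleRememberingHandler extends DefaultHandler {
    private final StringBuilder buffer = new StringBuilder();
    private boolean inTitle = false;
    private String title = "";

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("Title".equals(qName)) {
            inTitle = true;
            buffer.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // SAX may deliver text in several chunks, so accumulate it.
        if (inTitle) {
            buffer.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("Title".equals(qName)) {
            inTitle = false;
            title = buffer.toString().trim();
        }
    }

    String getTitle() {
        return title;
    }

    // Parses the given XML text and returns the remembered title.
    static String titleOf(String xml) throws Exception {
        TitleRememberingHandler handler = new TitleRememberingHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.getTitle();
    }

    public static void main(String[] args) throws Exception {
        String title = titleOf("<Recipe><Title>Tomato Soup</Title></Recipe>");
        // The same remembered value feeds two places in the output.
        System.out.println("<title>" + title + "</title>");
        System.out.println("<h1>" + title + "</h1>");
    }
}
```

Even in this tiny case the handler needs three fields just to carry one value forward, which hints at how quickly such bookkeeping grows.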

For tasks of low and intermediate complexity, SAX works just fine. As an application’s complexity (and functionality) increases, however, the SAX handler code can become extremely difficult to understand. SAX code can spend most of its time storing information from the input in an internal form usable for producing the desired output. When using SAX, you are generally responsible for creating an internal object model of your application’s information.

DOM to the rescue

The Document Object Model, or DOM, is a standardized object model for XML documents. DOM is a set of interfaces describing an abstract structure for an XML document. Programs that access document structures through the DOM interface can arbitrarily insert, delete, and rearrange the nodes of an XML document programmatically.

DOM and SAX parsers work in different ways. A SAX parser processes the XML document as it parses the XML input stream, passing SAX events to a programmer-defined handler method. A DOM parser, on the other hand, parses the entire input XML stream and returns a Document object. Document is the programmatic, language-neutral interface that represents a document. The Document returned by the DOM parser has an API that lets you manipulate a (virtual) tree of Node objects; this tree represents the structure of the input XML. Figures 1 and 2 illustrate this difference between the APIs.

Figure 1. The SAX parser calling programmer-defined handler routines
Figure 2. The DOM parser returning a Document object

In Figure 1, you see the SAX parser calling programmer-defined handler routines for each tag in the XML document. In Figure 2, the DOM parser returns a Document object, which represents the hierarchical structure of the tags (and such other informational elements as attributes, text blocks, and so on) in the original XML. When the parse has completed, you use the methods that are in the Document API to access the contents of the XML tree.
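
As a rough sketch of the DOM side of this comparison, the following program parses a small document with the standard JAXP DocumentBuilder and then queries the returned Document. The class name DomParseSketch and the sample XML are assumptions made for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class DomParseSketch {
    // Parses XML text and returns the whole document as a DOM tree.
    static Document parse(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<greeting lang=\"en\">hello</greeting>");
        // No handler callbacks: the finished tree is queried after the parse.
        System.out.println(doc.getDocumentElement().getTagName());
        System.out.println(doc.getDocumentElement().getAttribute("lang"));
    }
}
```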

One major benefit of the DOM parser is that it provides random access to the structures inside the XML tree. Imagine, for example, that you are writing a genealogy application that could show any individual’s relatives from that individual’s point of view. The original XML document representing your family would include you as the child of two parents and possibly a parent of one or more children. Now let’s say you want to create a program that could print a personal report for any person in the tree. If you were to write that program using SAX, you’d have two tasks. First, you’d probably need to build a representation of your family tree in memory, so you could access any node in the tree and print that node’s relatives. Your second task, after the parse was complete, would be to print the genealogy report starting at a specified node in the tree.

A DOM parser would relieve you of the first task, building the family tree, by actually building a tree of objects for you, as shown in Figure 2. You could produce an identical report, but you’d do half as much work (or even less).
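
To make the genealogy idea concrete, here is a minimal sketch that starts at an arbitrary person in a parsed tree and walks both up and down. The XML format is a deliberately simplified assumption (each hypothetical person element nests its children, so only one parent is recorded), and the class and method names are invented for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class GenealogySketch {
    // Hypothetical, simplified format: each <person> nests its children.
    static final String FAMILY =
        "<person name=\"Ann\">"
      +   "<person name=\"Bob\">"
      +     "<person name=\"Cy\"/>"
      +     "<person name=\"Di\"/>"
      +   "</person>"
      + "</person>";

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Finds the first <person> with the given name, or null if there is none.
    static Element find(Document doc, String name) {
        NodeList people = doc.getElementsByTagName("person");
        for (int i = 0; i < people.getLength(); i++) {
            Element person = (Element) people.item(i);
            if (name.equals(person.getAttribute("name"))) {
                return person;
            }
        }
        return null;
    }

    // Walks up: the parent element's name, or null at the top of the tree.
    static String parentOf(Element person) {
        Node parent = person.getParentNode();
        return (parent instanceof Element) ? ((Element) parent).getAttribute("name") : null;
    }

    // Walks down: the names of the direct child <person> elements.
    static List<String> childrenOf(Element person) {
        List<String> names = new ArrayList<>();
        NodeList kids = person.getChildNodes();
        for (int i = 0; i < kids.getLength(); i++) {
            if (kids.item(i) instanceof Element) {
                names.add(((Element) kids.item(i)).getAttribute("name"));
            }
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        Element bob = find(parse(FAMILY), "Bob");
        System.out.println("parent:   " + parentOf(bob));
        System.out.println("children: " + childrenOf(bob));
    }
}
```

The program never builds its own family-tree structure; the report for any person comes straight from the tree the parser returned.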

The origins of SAX and DOM are different as well. SAX, originally an interface for writing XML parsers, was created by a group of people on the XML-DEV mailing list. DOM was created and is maintained by the members of the W3C (World Wide Web Consortium) DOM working group as a standard API for accessing XML structures. In fact, many DOM parsers use a SAX parser to create the document tree that the parser returns.

It would be incorrect to say that DOM is superior to SAX. DOM provides an information model that is richer and correspondingly more complex than the one provided by SAX. With a SAX parser, the handler object receives a stream of tokens only once. A DOM parser lets you look at any node in the tree as many times as you like, manipulate the tree, write the tree out in different formats, and pass the tree to other pieces of software that understand the DOM interfaces.

So far, I’ve told you that a DOM document is made up of Node objects, but I haven’t told you precisely what a Node object is. Of exactly what kinds of objects is this document tree composed? The answer, it turns out, is that any object can appear in the tree of DOM nodes, as long as that object implements one of the DOM interfaces. I’ll look at the types of DOM interfaces in the next section.

Anatomy of a document

Figure 3 below illustrates the inheritance graph of the DOM Level 1 interfaces. (DOM Level 1 is the first, simplest version of the DOM specification from the W3C. DOM Levels 2 and 3 are currently under development. See Resources for a link to the official documentation.) As you can see, just about everything in a document tree is a Node. Most DOM interfaces are descended from Node.

Figure 3. The inheritance graph of the DOM Level 1 interfaces

DOM defines a document as a tree of objects that implement the interfaces in the DOM package. Every object in the tree implements Node, because every interface that represents a part of the tree is a subinterface of Node. Element, for example, inherits the methods of Node, as well as additional methods necessary to represent a single tag in a structured document (which is its role).

Note that the DOM package does not consist of classes; rather, it contains only interfaces (with one exception). This is because DOM is a specification of interfaces between pieces of software, not a particular implementation of DOM document Nodes. This is powerful partly because the interface specification defines what the program does, and different vendors can provide various implementations for the interfaces. In fact, most DOM parsers include implementation classes that implement all of the interfaces in the package. DOM parsers generally return trees of these implementation classes, but all the application programmer knows about these returned objects is that they implement the appropriate interface.

The Node interface represents the general node in a DOM tree.

For any particular node, the interface has methods for accessing the node’s child nodes, its parent node, and the Document node at the top of the tree in which the node lives — essentially, all of the methods needed to access and manipulate the tree of nodes. Elements, Comments, Text, and so on are all types of Nodes.
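
A short sketch of these navigation methods follows. It uses only getChildNodes(), getParentNode(), getOwnerDocument(), and getNodeName() from the Node interface; the class name NodeWalkSketch and the sample XML are assumptions made for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class NodeWalkSketch {
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Returns an indented outline of the subtree rooted at the given node.
    static String outline(Node node, int depth) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < depth; i++) {
            sb.append("  ");
        }
        sb.append(node.getNodeName()).append('\n');
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            sb.append(outline(children.item(i), depth + 1));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<a><b>hi</b><c/></a>");
        System.out.print(outline(doc, 0));

        Node b = doc.getDocumentElement().getFirstChild();
        // From any node you can reach its parent and the owning Document.
        System.out.println(b.getParentNode().getNodeName());
        System.out.println(b.getOwnerDocument() == doc);
    }
}
```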

Here are the subinterfaces of Node that form the document tree:

  • Element: The Element interface represents a single tag in an XML document. (There are interfaces for such objects in the DOM for HTML as well, but I’ll limit this discussion to XML.) This interface inherits all of Node’s methods; it also adds methods for manipulating the Element’s attributes and for accessing all sub-Elements with a particular tag name.
  • CharacterData: The CharacterData interface represents (what else?) character data. Its subinterfaces are Text, CDATASection, and Comment (see below for descriptions). The CharacterData interface provides methods for adding, deleting, inserting, and otherwise manipulating the text data in the node.
  • Text: This subinterface of CharacterData is a representation of character data content within an element or attribute. The text inside a Text node contains no markup. Any entity, comment, or other text that contains markup will appear in separate nodes.
  • CDATASection: CDATASection is a subinterface of Text that can contain markup. The markup within a CDATASection is not interpreted by the XML parser. This makes it easier to create text in the document that contains many characters that might be misinterpreted as markup. A CDATASection in an XML document begins with the markup <![CDATA[ and ends with ]]>. So, for example, the following CDATASection:

    <![CDATA[Markup &amp; Mayhem]]>

    represents the text:

    Markup &amp; Mayhem

    It would represent:

    Markup & Mayhem

    outside the context of the CDATASection.

  • Attr: Attr nodes contain those variable = value pairs that you see within element tags. In the tag:

    <a href="
    

    the attribute is href, and the attribute value is the URL. Attributes are a little strange. Although they inherit the methods of their superinterface Node, they do not exist outside the context of a particular tag; they do not have their own identity. Therefore, the Node methods having to do with accessing ordered nodes in the DOM tree, such as getParentNode(), getPreviousSibling(), and getNextSibling(), simply return null when called. Attrs implement the Node interface, but aren’t truly document nodes.

  • Comment: This node simply contains a comment. One difference between SAX and DOM is that DOM preserves comments, while SAX does not (at least not in version 1). If you are writing a system that requires that comments be preserved, or that uses the contents of comments in some way, you’ll have to choose either DOM or SAX 2 (the new version of SAX).
  • Entity: Entity represents an entity in the XML document. The &amp; example above is an example of an entity. In this case, it is very similar to a #define in a C program. Since any entity may be defined outside of the XML file being parsed, entities are read-only.
  • ProcessingInstruction: Processing instructions most often appear at the top of the XML file. The most familiar construct of this form is the XML declaration, <?xml version="1.0"?>, which shares the processing-instruction syntax (although the XML specification does not formally classify it as a processing instruction). You can use the other processing instructions to control external programs, the XML parser, or other processing steps. Processing instructions are roughly similar to #pragma in C.
  • DocumentType: The document type node is a placeholder for the DTD in the document tree. The document type node contains little more than a list of entities and notations defined for the document.
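
Several of the behaviors in the list above can be checked with a short program. The following sketch assumes the default JAXP parser settings (comments kept, CDATA sections not merged into neighboring text); the class name NodeKindsSketch and the sample XML are invented for illustration. Note that getTextContent() comes from DOM Level 3 rather than Level 1.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class NodeKindsSketch {
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(
            "<note id=\"n1\"><!-- draft --><![CDATA[Markup &amp; Mayhem]]></note>");
        Element note = doc.getDocumentElement();

        Node comment = note.getFirstChild();    // DOM keeps the comment
        Node cdata = note.getLastChild();       // CDATA content is not interpreted
        Attr id = note.getAttributeNode("id");  // attributes sit outside the child list

        System.out.println(comment.getNodeType() == Node.COMMENT_NODE);     // true
        System.out.println(cdata.getNodeType() == Node.CDATA_SECTION_NODE); // true
        System.out.println(cdata.getNodeValue());   // Markup &amp; Mayhem
        System.out.println(id.getParentNode());     // null
        System.out.println(id.getNextSibling());    // null

        // Outside a CDATA section the parser resolves the entity reference.
        Document plain = parse("<note>Markup &amp; Mayhem</note>");
        System.out.println(plain.getDocumentElement().getTextContent()); // Markup & Mayhem
    }
}
```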

There are other interfaces in the DOM package, which you can explore by reading the documentation (see Resources). The interfaces described here are sufficient for the purposes of this discussion, however.

Now that you understand what the various DOM node interfaces do, I’ll provide an example of an XML document that has been parsed into a tree of object instances that implement these interfaces. Imagine you had the following simple XML document:

Listing 1. A simple XML document describing a word in a vocabulary

001 <?xml version="1.0" encoding="UTF-8"?>
002 <!-- This document contains a small vocabulary -->
003 <!DOCTYPE VOCABULARY SYSTEM "vocabulary.dtd">
004 <VOCABULARY>
005   <WORD GENDER="m">
006     <ENGLISH>suspicion</ENGLISH>
007     <FRENCH>soup&ccedil;on</FRENCH>
008   </WORD>
009 </VOCABULARY>

If you were to give this document to a DOM parser, the Document you would receive as the result would have a structure something like Figure 4:

Figure 4. The DOM tree corresponding to Listing 1

As you can see in Figure 4, the DOM tree directly reflects the structure of the document shown in Listing 1. At the top of the tree is a Document node, which has two children. The first child of the top Document node corresponds to the DOCTYPE declaration on line 3. The second child of the Document node is the VOCABULARY element, which contains the WORD node. The WORD node, in turn, contains the ENGLISH and FRENCH versions of the vocabulary word. (A parser that preserves comments also reports the comment on line 2 as a child of the Document node.) You’ll notice that the FRENCH version of the word contains an odd symbol, &ccedil;, which signifies the French ç. This construction is known as a character entity, and is defined in the DTD file declared in line 3. (See Resources for a link to the DTD file.)
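
The following sketch parses a document like Listing 1 and inspects the top of the resulting tree. It makes one change that is purely an assumption for this example: because vocabulary.dtd is not available here, the ccedil entity is declared in an internal DTD subset. The class and method names are also invented for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class VocabularyTreeSketch {
    // Listing 1, with the entity declared inline instead of in vocabulary.dtd.
    static final String XML =
        "<?xml version=\"1.0\"?>"
      + "<!-- This document contains a small vocabulary -->"
      + "<!DOCTYPE VOCABULARY [<!ENTITY ccedil \"&#231;\">]>"
      + "<VOCABULARY><WORD GENDER=\"m\">"
      + "<ENGLISH>suspicion</ENGLISH><FRENCH>soup&ccedil;on</FRENCH>"
      + "</WORD></VOCABULARY>";

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    static String kind(Node node) {
        switch (node.getNodeType()) {
            case Node.COMMENT_NODE:       return "Comment";
            case Node.DOCUMENT_TYPE_NODE: return "DocumentType";
            case Node.ELEMENT_NODE:       return "Element";
            default:                      return "other";
        }
    }

    // Describes the direct children of the Document node, in document order.
    static List<String> topLevelKinds(Document doc) {
        List<String> kinds = new ArrayList<>();
        for (Node n = doc.getFirstChild(); n != null; n = n.getNextSibling()) {
            kinds.add(kind(n));
        }
        return kinds;
    }

    // Returns the text of the first FRENCH element, with the entity expanded.
    static String french(Document doc) {
        return doc.getElementsByTagName("FRENCH").item(0).getTextContent();
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(XML);
        System.out.println(topLevelKinds(doc));
        System.out.println(french(doc));
    }
}
```

With the default JAXP settings the comment shows up as a third child of the Document, alongside the DocumentType and the root Element.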

Once parsed into a DOM tree, the entire document can be accessed randomly in memory. This is useful in cases where random access to the DOM nodes is necessary. As a sample program, this article will use a multiple-language flash card program, which loads an entire vocabulary from an XML file into a DOM tree using a DOM parser. The program then quizzes the user on the vocabulary encoded in the DOM tree, selecting words from the language the user has chosen. Let’s have a look at the sample code.

A DOM example

The main class in the example program is called FlashCard. To display the words to the user, a simple JavaBean, called FlashCardCanvas, is embedded in FlashCard’s main panel. The FlashCardCanvas JavaBean is responsible for showing the value of its single property, string, in its drawable area in the font size that fits the canvas width. It also draws red and blue lines that make the FlashCardCanvas area look like an index card.

The input vocabulary is controlled by the input file, vocab.xml. The XML in vocab.xml is described by vocabulary.dtd.

Both of these files are included in the source files in the Resources section at the end of this article.

The running program looks like the image in Figure 5:

Figure 5. The FlashCard program in action

Almost all of the program logic and XML processing code occurs in the FlashCard class, so I’ll focus on that. (The source code for the entire example can be downloaded from Resources. You can find the source code for the FlashCardCanvas class in the archive.)

FlashCard’s method vParseArgs(), called by FlashCard’s main method, sets the vocabulary for the program and the language to quiz. These two parameters are based on the two command-line arguments, which are the vocabulary filename (or URL) and one of the languages available (Italian, German, English, French, or Spanish). The program displays English words and prompts for the translation in the given language. The setLanguage() method simply sets the internal class field _sLanguage, but the real work occurs in the setVocabulary method, where the XML file is parsed and the DOM tree is built. Listing 2 below is the setVocabulary method for the FlashCard class.

Listing 2. The setVocabulary method parses the XML file

925 /**
926  * Set vocabulary by parsing XML file.
927  * @param sFilename java.lang.String
928  */
929 public void setVocabulary(String sFilename) {
930     // Create an instance of the DocumentBuilderFactory
931     DocumentBuilderFactory docBuilderFactory = DocumentBuilderFactory.newInstance();
932     docBuilderFactory.setValidating(true);
933     try {
934             // Get the DocumentBuilder from the factory that we just got above.
935             DocumentBuilder docBuilder = docBuilderFactory.newDocumentBuilder();
936             docBuilder.setErrorHandler(new org.xml.sax.ErrorHandler() { // ignore fatal errors (an exception is guaranteed)
937                     public void fatalError(SAXParseException exception) throws SAXException {
938                     }
939 
940                     // treat validation errors as fatal
941                     public void error(SAXParseException e) throws SAXParseException {
942                             throw e;
943                     }
944 
945                     // dump warnings too
946                     public void warning(SAXParseException err) throws SAXParseException {
947                             System.out.println("** Warning" + ", line " + err.getLineNumber() + ", uri " + err.getSystemId());
948                             System.out.println("   " + err.getMessage());
949                     }
950             });
951 
952             // turn it into an in-memory object
953             setVocabulary(docBuilder.parse(new File(sFilename)));
954 
955     } catch (Exception ex) {
956             System.err.println("Exception: " + ex);
957             System.exit(2);
958     }
959 
960     // Missed assignment of _vocabulary AND size if parse fails.
961     if (_iVocSize < 0) {
962             System.out.println("Parse successful");
963     } else {
964             System.out.println("Parse of file " + sFilename + " failed");
965     }
966 }
967 /**
968  * Set the vocabulary as a DOM Document
969  * @param vocabulary org.w3c.dom.Document
970  */
971 public void setVocabulary(Document vocabulary) {
972     _vocabulary = vocabulary;
973     _iVocSize = -1;
974     showFirstCueWord();
975 }

As you can see, the setVocabulary method parses the XML file into a DOM tree. The sample program uses the Sun XML parser to create the DOM structure. If you want to use a different parser, you may need to take other steps to create it. DocumentBuilderFactory is a factory class: its static newInstance() method (line 931) returns a factory object that you can use to create parsers. Line 935 in Listing 2 shows a DocumentBuilder created by the DocumentBuilderFactory.

Lines 936 through 950 register an anonymous error handler with the parser. Line 953 then calls the overloaded setVocabulary(Document) method, which sets the FlashCard class’ vocabulary property to the DOM Document returned by the XML parser. Lines 961 through 965 check whether the vocabulary was loaded from the input XML file by testing the word count field _iVocSize, which setVocabulary(Document) resets to -1. Notice that the parse of the XML file occurs in the expression that is the argument to setVocabulary. Parsing the input file and building the tree are handled entirely by the call to docBuilder.parse().
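
The effect of a validating parse with an error handler can be seen in isolation in the sketch below. It is not the FlashCard code: unlike Listing 2, this handler rethrows fatal errors as well as validation errors and stays silent on warnings, and it uses a small inline DTD as an assumption so the example is self-contained. The class name ValidatingParseSketch is invented.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

class ValidatingParseSketch {
    // Returns true if the document is well formed and valid against its DTD.
    static boolean isValid(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        builder.setErrorHandler(new ErrorHandler() {
            public void warning(SAXParseException e) {
                // Warnings do not affect validity in this sketch.
            }
            public void error(SAXParseException e) throws SAXException {
                throw e; // validation errors arrive here
            }
            public void fatalError(SAXParseException e) throws SAXException {
                throw e; // well-formedness errors arrive here
            }
        });
        try {
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        String dtd = "<!DOCTYPE a [<!ELEMENT a EMPTY>]>";
        System.out.println(isValid(dtd + "<a/>"));
        System.out.println(isValid(dtd + "<a><b/></a>"));
    }
}
```

Without the rethrow in error(), a validation problem would be reported to the handler but the parse would still return a Document.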

The setVocabulary method in lines 971 to 975 sets an internal field _vocabulary to the Document loaded by the parser. The other methods in this class are implemented in terms of this document tree. For an example, see the method getVocabularySize in Listing 3.

Listing 3. The method getVocabularySize() counts the number of WORD nodes in the vocabulary

744 /**
745  * Return the number of words in this vocabulary.
746  * @return int
747  */
748 public int getVocabularySize() {
749     if (_iVocSize >= 0)
750         return _iVocSize;
751
752     Element voc = getVocabulary();
753     if (voc == null)
754         return 0;
755     NodeList nl = voc.getElementsByTagName("WORD");
756     _iVocSize = nl.getLength();
757     return _iVocSize;
758 }

The operation of getVocabularySize() is really quite simple. The method calculates the number of words in the vocabulary by counting the number of WORD nodes that appear under the vocabulary tag. Line 755 calls the Element method getElementsByTagName(), which returns a NodeList of all of the descendant elements whose tag name matches the argument. Since WORD occurs once for every word in the vocabulary, the length of the NodeList (acquired in line 756) accounts for the number of words in the vocabulary. The NodeList returned by getElementsByTagName() is itself a DOM interface. A NodeList, reasonably enough, represents a list of Nodes. It has two methods: getLength(), which returns the number of nodes in the list, and item(), which returns a list element by index, starting with 0.

Listing 4. The getVocabularyWord() method returns a specific word

/**
 * Return the ith vocabulary word, where 0 <= i < getVocabularySize()
 * @param i index of the word to return
 * @return VocabularyWord, or null if there is no vocabulary or i is out of range
 */
public VocabularyWord getVocabularyWord(int i) {
    Element voc = getVocabulary();
    if (voc == null)
        return null;
    NodeList nl = voc.getElementsByTagName("WORD");

    if (i < 0 || i >= nl.getLength())
        return null;

    Element eWord = (Element)(nl.item(i));
    VocabularyWord vwResult = new VocabularyWord(eWord);
    return vwResult;
}

getVocabularyWord() in Listing 4 uses the item() method of NodeList. getVocabularyWord() is very similar to the method in Listing 3; however, instead of returning the list length, it returns the item at index i, wrapped in a VocabularyWord. You’ll notice that the application methods are being written in terms of the DOM tree.
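
For readers who want to try the two NodeList methods without the rest of the FlashCard program, here is a standalone sketch. The class name WordListSketch, its method names, and the two-word sample vocabulary are assumptions made for this example; the FlashCard class itself is not reproduced.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class WordListSketch {
    static final String XML =
        "<VOCABULARY>"
      + "<WORD><ENGLISH>cat</ENGLISH><FRENCH>chat</FRENCH></WORD>"
      + "<WORD><ENGLISH>dog</ENGLISH><FRENCH>chien</FRENCH></WORD>"
      + "</VOCABULARY>";

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // getLength(): one WORD element per vocabulary word.
    static int wordCount(Document doc) {
        return doc.getElementsByTagName("WORD").getLength();
    }

    // item(i): returns the English text of the ith WORD, or null if out of range.
    static String englishAt(Document doc, int i) {
        NodeList words = doc.getElementsByTagName("WORD");
        if (i < 0 || i >= words.getLength()) {
            return null;
        }
        Element word = (Element) words.item(i);
        NodeList english = word.getElementsByTagName("ENGLISH");
        return english.getLength() == 0 ? null : english.item(0).getTextContent();
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(XML);
        for (int i = 0; i < wordCount(doc); i++) {
            System.out.println(i + ": " + englishAt(doc, i));
        }
    }
}
```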

VocabularyWord is a convenience class I created for this application. It represents a single word in a vocabulary. It is constructed with a single Element, which must be a WORD element. Listing 5 below shows the method VocabularyWord.getLanguages(), an excellent example of DOM programming.

Listing 5. Using DOM to implement VocabularyWord.getLanguages()

021 /**
022  * Return the list of strings for translation languages of current word.
023  * Return null if none.
024  * @return java.lang.String[]
025  */
026 public String[] getLanguages() {
027     String[] asResult = null;
028     if (_eWord != null) {
029         NodeList nl = _eWord.getChildNodes();
030         int iLanguages = 0;
031         for (int i = 0; i < nl.getLength(); i++) {
032             if (nl.item(i) instanceof Element) {
033                 iLanguages++;
034             }
035         }
036         asResult = new String[iLanguages];
037         int j = 0;
038         for (int i = 0; i < nl.getLength(); i++) {
039             if (nl.item(i) instanceof Element) {
040                 asResult[j++] = ((Element)(nl.item(i))).getTagName();
041             }
042         }
043     }
044     return asResult;
045 }

The getLanguages() method in Listing 5 returns an array of strings, which is the list of available translations for a particular VocabularyWord. Line 29 gets the child nodes for the Element upon which the VocabularyWord is based. Lines 31 through 35 count those child nodes of the word Element that are themselves Elements. This is necessary because some of the child nodes of the word Element may be Comment nodes, Text nodes containing white space, and so on. (You could, of course, check to be sure that each of these child Elements has a tag whose value is a valid language. But this is unnecessary because the DTD enforces the rule that a word’s subelements are limited to valid language names like French, German, and so on. This is a great example of how using a DTD can free you from a lot of unnecessary coding: here, the DTD does the checking for you.)

Line 36 allocates space for the result string list, and lines 38 through 42 collect the tag names of all Elements under the word Element and place them in the result list. Notice that in line 40 the result of the expression nl.item(i) has been cast to Element. Line 39 has just verified that item i is an Element. Why the cast? The item() method of NodeList returns a Node, not an Element, and so the cast is necessary in order to call getTagName(), which appears only in Element. I know the cast will succeed because of line 39.
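
The same filtering can also be done in a single pass. The sketch below is an alternative to Listing 5, not the article's method: it swaps the count-then-fill array for a List and walks the children with getFirstChild() and getNextSibling(). The class name LanguagesSketch and the sample WORD element are assumptions made for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class LanguagesSketch {
    static Element parseRoot(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
    }

    // Collects the tag names of the child Elements, skipping comments and text.
    static List<String> languagesOf(Element word) {
        List<String> result = new ArrayList<>();
        for (Node child = word.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                // Safe cast: the node type was checked on the line above.
                result.add(((Element) child).getTagName());
            }
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        Element word = parseRoot(
            "<WORD>\n"
          + "  <!-- a comment and white space sit between the elements -->\n"
          + "  <ENGLISH>cat</ENGLISH>\n"
          + "  <FRENCH>chat</FRENCH>\n"
          + "</WORD>");
        System.out.println(languagesOf(word));
    }
}
```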

Most of the code in this example is generated by my IDE and associates user events with method calls. The methods I’ve shown here demonstrate how to use the DOM interface calls to manipulate data from an XML document represented as a DOM tree in memory. Additional examples of how to manipulate data in a DOM tree appear in the static methods of the DomUtils convenience class (available in the source archive, which is downloadable in Resources). As you read the source code for DomUtils, try to figure out how the methods work by referring to the DOM documentation (also linked in Resources).

While the DOM is a general way to access XML structures, it also has some limitations. Let’s take a quick look at some of them.

A word about JDOM

If you read JavaWorld regularly, you’ve probably read “Easy Java/XML integration with JDOM, Part 1,” by Jason Hunter and Brett McLaughlin (see Resources). Hunter and McLaughlin found DOM difficult to use, so they created an open source Java API called JDOM that is designed to simplify XML manipulation in Java. DOM is a rather clunky API because it was designed to be usable from several different programming languages. JDOM, on the other hand, is specifically designed to take advantage of Java language features. You should understand DOM, since many systems use it, but you may find that JDOM is a much easier tool to use for manipulating XML in Java. Definitely check it out, and watch JavaWorld for upcoming articles on JDOM.

Limits to the Document Object Model

One of the most common complaints about DOM concerns the fact that it requires an entire document to be in memory. Though this is widely believed, it is not strictly true, because the DOM is only a package of interfaces, not implementations. Keeping all of the data from an XML file in memory is not the only way to implement the DOM tree. For example, a parser could scan the input document and internally save just the tag names, attributes, and other elements that define the document structure, as well as seek positions of the document’s text blocks. This would allow random access to all of the text blocks in the original XML document, albeit with a performance penalty caused by the need to seek and lazily read requested text blocks. Still, the most common implementations of DOM interfaces do indeed slurp the whole file into memory. This approach can cause performance and/or memory problems for large input documents.

A more interesting objection to DOM use is that parsers return objects that implement the DOM interface and do little else. If you look at my sample code, for example, my VocabularyWord class wraps a DOM Element object instance, essentially extending the Element’s functionality. But why should my VocabularyWord object have to contain some other object whose only purpose in life is to be an Element? In the worst case, you will end up with two trees: a tree of DOM-implementing objects, and a parallel tree of application objects. This is indeed a rather tedious way to code.

Fortunately, it is possible to write application objects (such as VocabularyWord) that implement DOM interfaces directly and then configure the parser to return trees of application objects instead of trees of DOM-implementing objects. You can also build DOM trees directly in memory, in code, without an XML source document. I’ll show how to do both of those things in the next article in this series on advanced DOM programming. Until then, enjoy experimenting with programming DOM in Java.
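
As a small preview of the second idea, building a tree in code with no XML source, the following sketch uses the standard newDocument(), createElement(), createTextNode(), and appendChild() calls. It is only one possible approach, and the class name BuildDomSketch and its helper method are assumptions made for illustration.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class BuildDomSketch {
    // Builds the vocabulary from Listing 1 directly in memory.
    static Document buildVocabulary() throws ParserConfigurationException {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();

        Element vocabulary = doc.createElement("VOCABULARY");
        doc.appendChild(vocabulary);

        Element word = doc.createElement("WORD");
        word.setAttribute("GENDER", "m");
        vocabulary.appendChild(word);

        word.appendChild(textElement(doc, "ENGLISH", "suspicion"));
        word.appendChild(textElement(doc, "FRENCH", "soup\u00e7on"));
        return doc;
    }

    // Hypothetical helper: creates <tag>text</tag> owned by the given document.
    static Element textElement(Document doc, String tag, String text) {
        Element element = doc.createElement(tag);
        element.appendChild(doc.createTextNode(text));
        return element;
    }

    public static void main(String[] args) throws Exception {
        Document doc = buildVocabulary();
        Element word = (Element) doc.getElementsByTagName("WORD").item(0);
        System.out.println(word.getAttribute("GENDER"));
        System.out.println(word.getChildNodes().getLength());
    }
}
```

Nodes must be created by the Document that will own them, which is why every create call goes through doc.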

Mark Johnson works as a designer and developer
for OrganicNet in Fort Collins, Colo., by day, and as a
JavaWorld columnist by night — very late at night.

Source: www.infoworld.com