Programming XML in Java, Part 3
DOMination: Take control of structured documents with the Document Object Model
The Simple API for XML (SAX) is an excellent interface for many XML applications. It is intuitive, extremely easy to learn, and, as its name implies, simple. Any Java programmer can, in just an hour or two, learn to use and develop an application using SAX. It is especially useful in situations where the data in an XML file is already in a form that is structurally similar to the desired output. For instance, the recipe example in Part 2 of this series formatted Recipe XML into an HTML representation of a recipe page and a shopping list. The structure of the output HTML was very similar to the structure of the input XML. The ingredients in the Recipe XML were grouped together in an <Ingredients> element; the ingredients in the output HTML were grouped together in an unordered list (<ul>). The tags were somewhat different, but the basic structure was the same.
Programming XML in Java: Read the whole series!
- Part 1. Use the Simple API for XML (SAX) to process XML in Java easily
- Part 2. Learn about SAX and XML validation through illustrative examples
- Part 3. DOMination: Take control of structured documents with the Document Object Model
In real data-processing situations, however, the structure of the input data often differs greatly from the eventual output structure. Since SAX passes SAX events to a programmer-defined handler in the order in which they appear in the input XML, as the programmer you are responsible for any data restructuring or reordering. Also, if the same data is to be used in more than one place in the output, you must either perform multiple passes over the XML or arrange for the handler to “remember” that data while producing output. One example of this was the recipe title in Part 2, which the handler maintained in an internal variable for use both in the browser title bar and in the Webpage.
For tasks of low and intermediate complexity, SAX works just fine. As an application’s complexity (and functionality) increases, however, the SAX handler code can become extremely difficult to understand. SAX code can spend most of its time storing information from the input in an internal form usable for producing the desired output. When using SAX, you are generally responsible for creating an internal object model of your application’s information.
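The “remembering” described above can be sketched as follows. This is a minimal illustration rather than the Part 2 code: the Title element name, the TitleHandler class, and the pageFor() helper are all hypothetical names chosen for the sketch. The handler stores the title text while the parse runs so that it can be used twice in the output.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class TitleRememberingSketch {

    /** Hypothetical handler that remembers the text of a Title element. */
    static class TitleHandler extends DefaultHandler {
        private final StringBuilder title = new StringBuilder();
        private boolean inTitle = false;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attrs) {
            if ("Title".equals(qName)) inTitle = true;
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("Title".equals(qName)) inTitle = false;
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // SAX may deliver text in several chunks, so append rather than assign.
            if (inTitle) title.append(ch, start, length);
        }

        String getTitle() {
            return title.toString();
        }
    }

    /** Parses the XML once and uses the remembered title in two places. */
    static String pageFor(String xml) {
        try {
            TitleHandler handler = new TitleHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            String t = handler.getTitle();
            return "<html><head><title>" + t + "</title></head><body><h1>" + t
                    + "</h1></body></html>";
        } catch (Exception e) {
            throw new IllegalStateException("SAX parse failed", e);
        }
    }
}
```

Even in this tiny case the handler needs a flag and a buffer just to carry one string forward; each additional piece of remembered data adds more state of the same kind.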
DOM to the rescue
The Document Object Model, or DOM, is a standardized object model for XML documents. DOM is a set of interfaces describing an abstract structure for an XML document. Programs that access document structures through the DOM interface can arbitrarily insert, delete, and rearrange the nodes of an XML document programmatically.
DOM and SAX parsers work in different ways. A SAX parser processes the XML document as it parses the XML input stream, passing SAX events to the methods of a programmer-defined handler. A DOM parser, on the other hand, parses the entire input XML stream and returns a Document object. Document is the programmatic, language-neutral interface that represents a document. The Document returned by the DOM parser has an API that lets you manipulate a (virtual) tree of Node objects; this tree represents the structure of the input XML. Figures 1 and 2 illustrate this difference between the APIs.
In Figure 1, you see the SAX parser calling programmer-defined handler routines for each tag in the XML document. In Figure 2, the DOM parser returns a Document object, which represents the hierarchical structure of the tags (and such other informational elements as attributes, text blocks, and so on) in the original XML. When the parse has completed, you use the methods in the Document API to access the contents of the XML tree.
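A DOM parse can be sketched in a few lines. This is a minimal example that assumes the standard JAXP classes in javax.xml.parsers; the DomParseSketch class and its parse() helper are hypothetical names, and the XML is supplied as a string purely for convenience.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class DomParseSketch {

    /** Parses the whole input and returns the resulting Document tree. */
    static Document parse(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Unlike SAX, nothing is delivered until the entire input has been read.
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("DOM parse failed", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<Recipe><Title>Lime Jello</Title></Recipe>");
        // The tree can now be visited in any order, as many times as needed.
        System.out.println(doc.getDocumentElement().getTagName());
    }
}
```

There are no callbacks to write: the application receives the finished tree and decides afterward which parts to look at.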
One major benefit of the DOM parser is that it provides random access to the structures inside the XML tree. Imagine, for example, that you are writing a genealogy application that could show any individual’s relatives from that individual’s point of view. The original XML document representing your family would include you as the child of two parents and possibly a parent of one or more children. Now let’s say you want to create a program that could print a personal report for any person in the tree. If you were to write that program using SAX, you’d have two tasks. First, you’d probably need to build a representation of your family tree in memory, so you could access any node in the tree and print that node’s relatives. Your second task, after the parse was complete, would be to print the genealogy report starting at a specified node in the tree.
A DOM parser would relieve you of the first task, building the family tree, by actually building a tree of objects for you, as shown in Figure 2. You could produce an identical report, but you’d do half as much work (or even less).
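As a rough sketch of the genealogy idea, the code below finds one person in a parsed tree and reports that person’s relatives. The XML layout is an assumption made for illustration: each Person element carries a name attribute and is nested inside a single parent, which simplifies the two-parent family described above. The GenealogySketch class and report() method are hypothetical names.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class GenealogySketch {

    /** Reports the parent and children of the named person. */
    static String report(String xml, String name) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("DOM parse failed", e);
        }
        NodeList people = doc.getElementsByTagName("Person");
        for (int i = 0; i < people.getLength(); i++) {
            Element person = (Element) people.item(i);
            if (!name.equals(person.getAttribute("name"))) continue;

            // Walk up: the parent of the root element is the Document, not an Element.
            Node parent = person.getParentNode();
            String parentName = (parent instanceof Element)
                    ? ((Element) parent).getAttribute("name") : "unknown";

            // Walk down: collect only the Element children.
            List<String> children = new ArrayList<>();
            NodeList kids = person.getChildNodes();
            for (int j = 0; j < kids.getLength(); j++) {
                if (kids.item(j) instanceof Element) {
                    children.add(((Element) kids.item(j)).getAttribute("name"));
                }
            }
            return name + ": parent=" + parentName + ", children=" + children;
        }
        return name + ": not found";
    }
}
```

The same parsed tree can answer the question for any person without a second parse, which is the random-access benefit described above.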
The origins of SAX and DOM are different as well. SAX, originally an interface for writing XML parsers, was created by a group of people on the XML-DEV mailing list. DOM was created and is maintained by the members of the W3C (World Wide Web Consortium) DOM working group as a standard API for accessing XML structures. In fact, many DOM parsers use a SAX parser to create the document tree that the parser returns.
It would be incorrect to say that DOM is superior to SAX. DOM provides an information model that is richer and correspondingly more complex than the one provided by SAX. With a SAX parser, the handler object receives a stream of tokens only once. A DOM parser lets you look at any node in the tree as many times as you like, manipulate the tree, write the tree out in different formats, and pass the tree to other pieces of software that understand the DOM interfaces.
So far, I’ve told you that a DOM document is made up of Node objects, but I haven’t told you precisely what a Node object is. Of exactly what kinds of objects is this document tree composed? The answer, it turns out, is that any object can appear in the tree of DOM nodes, as long as that object implements one of the DOM interfaces. I’ll look at the types of DOM interfaces in the next section.
Anatomy of a document
Figure 3 below illustrates the inheritance graph of the DOM Level 1 interfaces. (DOM Level 1 is the first, simplest version of the DOM specification from the W3C. DOM Levels 2 and 3 are currently under development. See Resources for a link to the official documentation.) As you can see, just about everything in a document tree is a Node. Most DOM interfaces are descended from Node.
DOM defines a document as a tree of objects that implement the interfaces in the DOM package. All of the objects in the tree implement Node, because every interface that represents a part of the document is a subinterface of Node. Element, for example, inherits the methods of Node, as well as additional methods necessary to represent a single tag in a structured document (which is its role).
Note that the DOM package does not consist of classes; rather, it contains only interfaces (with one exception). This is because DOM is a specification of interfaces between pieces of software, not a particular implementation of DOM document Nodes. This is powerful partly because the interface specification defines what the objects must do, while different vendors can provide various implementations of the interfaces. In fact, most DOM parsers include implementation classes that implement all of the interfaces in the package. DOM parsers generally return trees of these implementation classes, but all the application programmer knows about these returned objects is that they implement the appropriate interface.
The Node interface represents the general node in a DOM tree. For any particular node, the interface has methods for accessing the node’s child nodes, its parent node, and the Document node at the top of the tree in which the node lives: essentially, all of the methods needed to access and manipulate the tree of nodes. Elements, Comments, Text, and so on are all types of Nodes.
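Because every object in the tree is a Node, a single recursive method can visit the whole document using only Node’s own navigation methods. The following is a small sketch of that idea; the TreeWalkSketch class and its outline() helper are hypothetical names.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class TreeWalkSketch {

    /** Appends one indented line per node, depth first. */
    static void walk(Node node, int depth, StringBuilder out) {
        for (int i = 0; i < depth; i++) out.append("  ");
        out.append(node.getNodeName()).append('\n');
        // getFirstChild() and getNextSibling() are defined on Node itself,
        // so the walk does not need to know the concrete node type.
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            walk(c, depth + 1, out);
        }
    }

    static String outline(String xml) {
        try {
            Node doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            StringBuilder out = new StringBuilder();
            walk(doc, 0, out);
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("DOM parse failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.print(outline("<a><b>hi</b><!--note--></a>"));
    }
}
```

Text and comment nodes have no tag name, so getNodeName() returns the fixed strings #text and #comment for them, and #document for the Document node.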
Here are the subinterfaces of Node that form the document tree:
- Element: The Element interface represents a single tag in an XML document. (There are interfaces for such objects in the DOM for HTML as well, but I’ll limit this discussion to XML.) This interface inherits all of Node’s methods; it also adds methods for manipulating the Element’s attributes and for accessing all sub-Elements with a particular tag name.
- CharacterData: The CharacterData interface represents (what else?) character data. Its subinterfaces are Text, CDATASection, and Comment (see below for descriptions). The CharacterData interface provides methods for adding, deleting, inserting, and otherwise manipulating the text data in the node.
- Text: This subinterface of CharacterData is a representation of character data content within an element or attribute. The text inside a Text node contains no markup. Any entity, comment, or other text that contains markup will appear in separate nodes.
- CDATASection: CDATASection is a subinterface of Text that can contain markup. The markup within a CDATASection is not interpreted by the XML parser. This makes it easier to create text in the document that contains many characters that might be misinterpreted as markup. A CDATASection in an XML document begins with the markup <![CDATA[ and ends with ]]>. So, for example, the CDATASection <![CDATA[Markup &amp; Mayhem]]> represents the text Markup &amp; Mayhem, exactly as written. Outside the context of a CDATASection, the same characters would represent Markup & Mayhem.
- Attr: Attr nodes contain those variable = value pairs that you see within element tags. In a tag that begins <a href=" the attribute is href, and the attribute value is the URL that follows. Attributes are a little strange. Although they inherit the methods of their superinterface Node, they do not exist outside the context of a particular tag; they do not have their own identity. Therefore, the Node methods having to do with accessing ordered nodes in the DOM tree, such as getParentNode(), getPreviousSibling(), and getNextSibling(), simply return null when called. Attrs implement the Node interface, but aren’t truly document nodes.
- Comment: This node simply contains a comment. One difference between SAX and DOM is that DOM preserves comments, while SAX does not (at least not in version 1). If you are writing a system that requires that comments be preserved, or that uses the contents of comments in some way, you’ll have to choose either DOM or SAX 2 (the new version of SAX).
- Entity: Entity represents an entity in the XML document. The &amp; in the example above is an entity reference. In this case, it is very similar to a #define in a C program. Since any entity may be defined outside of the XML file being parsed, entities are read-only.
- ProcessingInstruction: Processing instructions often appear at the top of the XML file, although they may occur elsewhere in the document. The most familiar construct with this syntax is the XML declaration, <?xml version="1.0"?> (strictly speaking, the XML specification does not classify the declaration itself as a processing instruction). You can use processing instructions to control external programs, the XML parser, or other processing steps. Processing instructions are roughly similar to #pragma in C.
- DocumentType: The document type node is a placeholder for the DTD in the document tree. The document type node contains little more than a list of entities and notations defined for the document.
There are other interfaces in the DOM package, which you can explore by reading the documentation (see Resources). The interfaces described here are sufficient for the purposes of this discussion, however.
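Two of the behaviors described in the list above can be checked directly: CDATA content is not interpreted as markup, and an Attr has no parent or siblings in the tree. The sketch below assumes a parser in its default, non-coalescing configuration; NodeKindsSketch and its helper methods are hypothetical names, and getTextContent() comes from a later DOM level than the one discussed in this article.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.CDATASection;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class NodeKindsSketch {

    private static Element root(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("DOM parse failed", e);
        }
    }

    /** Returns the character data under the root element. */
    static String textOf(String xml) {
        return root(xml).getTextContent();
    }

    /** True if the root's first child was kept as a CDATASection node. */
    static boolean firstChildIsCData(String xml) {
        return root(xml).getFirstChild() instanceof CDATASection;
    }

    /** True if the named attribute reports no parent and no sibling. */
    static boolean attrIsDetached(String xml, String attrName) {
        Attr attr = root(xml).getAttributeNode(attrName);
        return attr.getParentNode() == null && attr.getNextSibling() == null;
    }

    public static void main(String[] args) {
        System.out.println(textOf("<t><![CDATA[Markup &amp; Mayhem]]></t>"));
        System.out.println(textOf("<t>Markup &amp; Mayhem</t>"));
        System.out.println(attrIsDetached("<a href=\"x\" id=\"y\"/>", "href"));
    }
}
```

To get from an Attr back to its element, later DOM levels add Attr.getOwnerElement(); the tree-navigation methods inherited from Node still return null.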
Now that you understand what the various DOM node interfaces do, I’ll provide an example of an XML document that has been parsed into a tree of object instances that implement these interfaces. Imagine you had the following simple XML document:
Listing 1. A simple XML document describing a word in a vocabulary
001 <?xml version="1.0" encoding="UTF-8"?>
002 <!-- This document contains a small vocabulary -->
003 <!DOCTYPE VOCABULARY SYSTEM "vocabulary.dtd">
004 <VOCABULARY>
005   <WORD GENDER="m">
006     <ENGLISH>suspicion</ENGLISH>
007     <FRENCH>soup&ccedil;on</FRENCH>
008   </WORD>
009 </VOCABULARY>
If you were to give this document to a DOM parser, you would receive as the result a Document structure that would look something like Figure 4:
As you can see in Figure 4, the DOM tree directly reflects the structure of the document shown in Listing 1. At the top of the tree is a Document node, which the figure shows with two children. The first child of the top Document node corresponds to the DOCTYPE declaration on line 3. The second child of the Document node is the VOCABULARY element, which contains the WORD node. The WORD node, in turn, contains the ENGLISH and FRENCH versions of the vocabulary word. (A parser that preserves comments would also place a Comment node for line 2 under the Document node.) You’ll notice that in the FRENCH version of the word there is an odd construction, &ccedil;, which signifies the French ç. This construction is known as a character entity, and is defined in the DTD file declared in line 3. (See Resources for a link to the DTD file.)
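The shape of this tree can be inspected in code. The sketch below parses a document modeled on Listing 1, with one substitution so that it runs on its own: instead of the external vocabulary.dtd, it assumes an internal DTD subset that declares a ccedil entity. That subset, the Listing1Sketch class, and its helper methods are hypothetical and are not taken from the article’s source archive.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Comment;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentType;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class Listing1Sketch {

    // Stand-in for Listing 1; the internal subset replaces vocabulary.dtd.
    static final String XML =
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
            + "<!-- This document contains a small vocabulary -->\n"
            + "<!DOCTYPE VOCABULARY [ <!ENTITY ccedil \"&#231;\"> ]>\n"
            + "<VOCABULARY><WORD GENDER=\"m\">"
            + "<ENGLISH>suspicion</ENGLISH><FRENCH>soup&ccedil;on</FRENCH>"
            + "</WORD></VOCABULARY>";

    static Document parse() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(XML)));
        } catch (Exception e) {
            throw new IllegalStateException("DOM parse failed", e);
        }
    }

    /** Names the DOM interface of each direct child of the Document node. */
    static List<String> documentChildren() {
        List<String> kinds = new ArrayList<>();
        for (Node c = parse().getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c instanceof Comment) kinds.add("Comment");
            else if (c instanceof DocumentType) kinds.add("DocumentType");
            else if (c instanceof Element) kinds.add("Element");
            else kinds.add("other");
        }
        return kinds;
    }

    /** Returns the FRENCH text with the entity reference expanded. */
    static String frenchWord() {
        return parse().getElementsByTagName("FRENCH").item(0).getTextContent();
    }
}
```

With a default parser configuration the comment is kept, so the Document node has three children here rather than the two drawn in the figure.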
Once parsed into a DOM tree, the entire document can be accessed randomly in memory. This is useful in cases where random access to the DOM nodes is necessary. As a sample program, this article will use a multiple-language flash card program, which loads an entire vocabulary from an XML file into a DOM tree using a DOM parser. The program then quizzes the user on the vocabulary encoded in the DOM tree, selecting words from the language the user has chosen. Let’s have a look at the sample code.
A DOM example
The main class in the example program is called FlashCard. To display the words to the user, a simple JavaBean, called FlashCardCanvas, is embedded in FlashCard’s main panel. The FlashCardCanvas JavaBean is responsible for showing the value of its single property, string, in its drawable area in the font size that fits the canvas width. It also draws red and blue lines that make the FlashCardCanvas area look like an index card.
The input vocabulary is controlled by the input file, vocab.xml. The XML in vocab.xml is described by vocabulary.dtd. Both of these files are included in the source files in the Resources section at the end of this article.
The running program looks like the image in Figure 5:
Almost all of the program logic and XML processing code occurs in the FlashCard class, so I’ll focus on that. (The source code for the entire example can be downloaded from Resources. You can find the source code for the FlashCardCanvas class in the archive.)
FlashCard’s method vParseArgs(), called by FlashCard’s main method, sets the vocabulary for the program and the language to quiz. These two parameters are based on the two command-line arguments, which are the vocabulary filename (or URL) and one of the languages available (Italian, German, English, French, or Spanish). The program displays English words and prompts for the translation in the given language. The setLanguage() method simply sets the internal class field _sLanguage, but the real work occurs in the setVocabulary method, where the XML file is parsed and the DOM tree is built. Listing 2 below is the setVocabulary method for the FlashCard class.
Listing 2. The setVocabulary method parses the XML file
925 /**
926  * Set vocabulary by parsing XML file.
927  * @param sFilename java.lang.String
928  */
929 public void setVocabulary(String sFilename) {
930     // Create an instance of the DocumentBuilderFactory
931     DocumentBuilderFactory docBuilderFactory = DocumentBuilderFactory.newInstance();
932     docBuilderFactory.setValidating(true);
933     try {
934         // Get the DocumentBuilder from the factory that we just got above.
935         DocumentBuilder docBuilder = docBuilderFactory.newDocumentBuilder();
936         docBuilder.setErrorHandler(new org.xml.sax.ErrorHandler() { // ignore fatal errors (an exception is guaranteed)
937             public void fatalError(SAXParseException exception) throws SAXException {
938             }
939
940             // treat validation errors as fatal
941             public void error(SAXParseException e) throws SAXParseException {
942                 throw e;
943             }
944
945             // dump warnings too
946             public void warning(SAXParseException err) throws SAXParseException {
947                 System.out.println("** Warning" + ", line " + err.getLineNumber() + ", uri " + err.getSystemId());
948                 System.out.println(" " + err.getMessage());
949             }
950         });
951
952         // turn it into an in-memory object
953         setVocabulary(docBuilder.parse(new File(sFilename)));
954
955     } catch (Exception ex) {
956         System.err.println("Exception: " + ex);
957         System.exit(2);
958     }
959
960     // Missed assignment of _vocabulary AND size if parse fails.
961     if (_iVocSize < 0) {
962         System.out.println("Parse successful");
963     } else {
964         System.out.println("Parse of file " + sFilename + " failed");
965     }
966 }
967 /**
968  * Set the vocabulary as a DOM Document
969  * @param vocabulary org.w3c.dom.Document
970  */
971 public void setVocabulary(Document vocabulary) {
972     _vocabulary = vocabulary;
973     _iVocSize = -1;
974     showFirstCueWord();
975 }
As you can see, the setVocabulary method parses the XML file into a DOM tree. The sample program uses the Sun XML parser to create the DOM structure. If you want to use a different parser, you may need to take other steps to create it. DocumentBuilderFactory is a factory class: its static newInstance() method returns a factory object that you can use to create parsers. Line 935 in Listing 2 shows a DocumentBuilder created by the DocumentBuilderFactory.
Lines 936 through 950 register an anonymous error handler with the parser. Line 953 then calls the setVocabulary method, which sets the FlashCard class’s vocabulary property to the DOM Document returned by the XML parser. Lines 961 through 965 report whether the parse succeeded by checking the word count field _iVocSize, which the setVocabulary method resets to -1 whenever it receives a new Document. Notice that the parse of the XML file occurs in the expression that is the argument to setVocabulary. Parsing the input file and building the tree are handled entirely by the call to docBuilder.parse().
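The interplay between validation and the error handler can be tried in isolation. The following sketch is not the FlashCard code: it embeds a small, hypothetical DTD in the document’s internal subset so that no external file is needed, and unlike Listing 2 it rethrows fatal errors from the handler instead of relying on the parser to throw. ValidatingParseSketch and check() are illustrative names.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

class ValidatingParseSketch {

    // Hypothetical DTD: a VOCABULARY holds zero or more text-only WORDs.
    static final String DTD =
            "<!DOCTYPE VOCABULARY ["
            + "<!ELEMENT VOCABULARY (WORD*)>"
            + "<!ELEMENT WORD (#PCDATA)>"
            + "]>";

    /** Returns "valid", or a string starting with "invalid" on a DTD violation. */
    static String check(String body) {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(true); // report DTD violations to the handler
        try {
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) {
                    System.out.println("Warning: " + e.getMessage());
                }
                // Validity errors are recoverable by default; throwing makes them fatal.
                public void error(SAXParseException e) throws SAXException {
                    throw e;
                }
                public void fatalError(SAXParseException e) throws SAXException {
                    throw e;
                }
            });
            builder.parse(new InputSource(new StringReader(DTD + body)));
            return "valid";
        } catch (SAXParseException e) {
            return "invalid: " + e.getMessage();
        } catch (Exception e) {
            throw new IllegalStateException("Unexpected parser failure", e);
        }
    }
}
```

Without the throw in error(), a validating parser would report the violation and still return a Document, which is why the listing treats validation errors as fatal.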
The setVocabulary method in lines 971 to 975 sets an internal field _vocabulary to the Document loaded by the parser. The other methods in this class are implemented in terms of this document tree. For an example, see the method getVocabularySize in Listing 3.
Listing 3. The method getVocabularySize() counts the number of WORD nodes in the vocabulary
744 /**
745  * Return the number of words in this vocabulary.
746  * @return int
747  */
748 public int getVocabularySize() {
749     if (_iVocSize >= 0)
750         return _iVocSize;
751
752     Element voc = getVocabulary();
753     if (voc == null)
754         return 0;
755     NodeList nl = voc.getElementsByTagName("WORD");
756     _iVocSize = nl.getLength();
757     return _iVocSize;
758 }
The operation of getVocabularySize() is really quite simple. The method calculates the number of words in the vocabulary by counting the number of WORD nodes that appear under the vocabulary tag. Line 755 calls getElementsByTagName() on the vocabulary Element (Document provides a method of the same name), which returns a NodeList of all of the descendant elements whose tag name matches the argument. Since WORD occurs once for every word in the vocabulary, the length of the NodeList (acquired in line 756) is the number of words in the vocabulary. The NodeList returned by getElementsByTagName() is itself a DOM interface. A NodeList, reasonably enough, represents a list of Nodes. It has two methods: getLength(), which returns the number of nodes in the list, and item(), which returns a list element by index, starting with 0.
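Both NodeList methods can be exercised with a short standalone sketch. The class below is illustrative only: NodeListSketch, countWords(), and englishAt() are hypothetical names, and the XML follows the shape of Listing 1 without the DTD.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class NodeListSketch {

    private static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("DOM parse failed", e);
        }
    }

    /** getLength(): the number of WORD elements anywhere in the document. */
    static int countWords(String xml) {
        return parse(xml).getElementsByTagName("WORD").getLength();
    }

    /** item(i): the ENGLISH text of the i-th WORD, or null if i is out of range. */
    static String englishAt(String xml, int i) {
        NodeList words = parse(xml).getElementsByTagName("WORD");
        if (i < 0 || i >= words.getLength()) {
            return null;
        }
        Element word = (Element) words.item(i);
        return word.getElementsByTagName("ENGLISH").item(0).getTextContent();
    }
}
```

A NodeList is not a java.util.List, so it cannot be used in a for-each loop; an index loop over getLength() and item() is the usual pattern.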
Listing 4. The getVocabularyWord() method returns a specific word
759 /**
760  * Return the ith vocabulary word, where 0 <= i < getVocabularySize()
761  * @return the VocabularyWord at index i, or null if there is none
762  */
763 public VocabularyWord getVocabularyWord(int i) {
764     Element voc = getVocabulary();
765     if (voc == null)
766         return null;
767     NodeList nl = voc.getElementsByTagName("WORD");
768
769     if (i < 0 || i >= nl.getLength())
770         return null;
771
772     Element eWord = (Element)(nl.item(i));
773     VocabularyWord vwResult = new VocabularyWord(eWord);
774     return vwResult;
775 }
getVocabularyWord() in Listing 4 uses the item() method of NodeList. getVocabularyWord() is very similar to the method in Listing 3; however, instead of returning the list length, it returns the item at index i, wrapped in a VocabularyWord. You’ll notice that the application methods are being written in terms of the DOM tree.
VocabularyWord is a convenience class I created for this application. It represents a single word in a vocabulary. It is constructed with a single Element, which must have the WORD tag. Listing 5 below shows the method VocabularyWord.getLanguages(), an excellent example of DOM programming.
Listing 5. Using DOM to implement VocabularyWord.getLanguages()
021 /**
022  * Return the list of strings for translation languages of current word. Return NULL
023  * if none.
024  * @return java.lang.String[]
025  */
026 public String[] getLanguages() {
027     String[] asResult = null;
028     if (_eWord != null) {
029         NodeList nl = _eWord.getChildNodes();
030         int iLanguages = 0;
031         for (int i = 0; i < nl.getLength(); i++) {
032             if (nl.item(i) instanceof Element) {
033                 iLanguages++;
034             }
035         }
036         asResult = new String[iLanguages];
037         int j = 0;
038         for (int i = 0; i < nl.getLength(); i++) {
039             if (nl.item(i) instanceof Element) {
040                 asResult[j++] = ((Element)(nl.item(i))).getTagName();
041             }
042         }
043     }
044     return asResult;
045 }
The getLanguages() method in Listing 5 returns an array of strings, which is the list of available translations for a particular VocabularyWord. Line 29 gets the child nodes for the Element upon which the VocabularyWord is based. Lines 31 through 35 count the child nodes of the word Element that are themselves Elements. This is necessary because some of the child nodes of the word Element may be Comment nodes, Text nodes containing white space, and so on. (You could, of course, check to be sure that each of these child Elements has a tag whose value is a valid language. But this is unnecessary because the DTD enforces the rule that a word’s subelements are limited to valid language names like French, German, and so on. This is a great example of how using a DTD can free you from a lot of unnecessary coding: here, the DTD does the checking for you.)
Line 36 allocates space for the result string list, and lines 38 through 42 collect the tag names of all Elements under the word Element and place them in the result list. Notice that in line 40 the result of the expression nl.item(i) has been cast to Element. Line 39 has just verified that item i is an Element. Why the cast? The item() method of NodeList returns a Node, not an Element, and so the cast is necessary in order to call getTagName(), which appears only in Element. I know the cast will succeed because of line 39.
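The same filtering can also be done in a single pass by collecting into a list instead of counting first. The sketch below is an alternative written for illustration, not the article’s implementation: ChildElementsSketch and its methods are hypothetical names, and it tests the node type with getNodeType() rather than instanceof.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class ChildElementsSketch {

    /** Returns the tag names of the direct Element children of the given element. */
    static List<String> childTagNames(Element parent) {
        List<String> names = new ArrayList<>();
        for (Node c = parent.getFirstChild(); c != null; c = c.getNextSibling()) {
            // Skip whitespace Text nodes, Comments, and anything else that is not a tag.
            if (c.getNodeType() == Node.ELEMENT_NODE) {
                names.add(((Element) c).getTagName());
            }
        }
        return names;
    }

    /** Parses the XML and lists the child tags of its root element. */
    static List<String> languagesOf(String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            return childTagNames(root);
        } catch (Exception e) {
            throw new IllegalStateException("DOM parse failed", e);
        }
    }
}
```

The cast on the inner line is still required for the same reason as in line 40 of Listing 5: the navigation methods return Node, and getTagName() belongs to Element.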
Most of the code in this example is generated by my IDE and associates user events with method calls. The methods I’ve shown here demonstrate how to use the DOM interface calls to manipulate data from an XML document represented as a DOM tree in memory. Additional examples of how to manipulate data in a DOM tree appear in the static methods of the DomUtils convenience class (available in the source archive, which is downloadable in Resources). As you read the source code for DomUtils, try to figure out how the methods work by referring to the DOM documentation (also linked in Resources).
While the DOM is a general way to access XML structures, it also has some limitations. Let’s take a quick look at some of them.
A word about JDOM
If you read JavaWorld regularly, you’ve probably read “Easy Java/XML integration with JDOM, Part 1,” by Jason Hunter and Brett McLaughlin (see Resources). Hunter and McLaughlin found DOM difficult to use, so they created an open source Java API called JDOM that is designed to simplify XML manipulation in Java. DOM is a rather clunky API because it was designed to be usable from several different programming languages. JDOM, on the other hand, is specifically designed to take advantage of Java language features. You should understand DOM, since many systems use it, but you may find that JDOM is a much easier tool to use for manipulating XML in Java. Definitely check it out, and watch JavaWorld for upcoming articles on JDOM.
Limits to the Document Object Model
One of the most common complaints about DOM concerns the fact that it requires an entire document to be in memory. Though this is widely believed, it is not strictly true, because the DOM is only a package of interfaces, not implementations. Keeping all of the data from an XML file in memory is not the only way to implement the DOM tree. For example, a parser could scan the input document and internally save just the tag names, attributes, and other elements that define the document structure, as well as seek positions of the document’s text blocks. This would allow random access to all of the text blocks in the original XML document, albeit with a performance penalty caused by the need to seek and lazily read requested text blocks. Still, the most common implementations of DOM interfaces do indeed slurp the whole file into memory. This approach can cause performance and/or memory problems for large input documents.
A more interesting objection to DOM use is that parsers return objects that implement the DOM interface and do little else. If you look at my sample code, for example, my VocabularyWord class wraps a DOM Element object instance, essentially extending the Element’s functionality. But why should my VocabularyWord object have to contain some other object whose only purpose in life is to be an Element? In the worst case, you will end up with two trees: a tree of DOM-implementing objects, and a parallel tree of application objects. This is indeed a rather tedious way to code.
Fortunately, it is possible to write application objects (such as VocabularyWord) that implement DOM interfaces directly and then configure the parser to return trees of application objects instead of trees of DOM-implementing objects. You can also build DOM trees directly in memory, in code, without an XML source document. I’ll show how to do both of those things in the next article in this series on advanced DOM programming. Until then, enjoy experimenting with programming DOM in Java.
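As a small preview of building a tree in code, the sketch below constructs a document shaped like Listing 1 without parsing any XML. It is only an illustration under the assumption that the standard JAXP DocumentBuilder is available; it is not the approach the next article will take, and BuildTreeSketch and build() are hypothetical names.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class BuildTreeSketch {

    /** Creates a one-word vocabulary entirely in memory. */
    static Document build() {
        Document doc;
        try {
            // newDocument() returns an empty tree; no input is parsed.
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (Exception e) {
            throw new IllegalStateException("Could not create a DocumentBuilder", e);
        }

        // Nodes are created by the Document and then attached to the tree.
        Element vocabulary = doc.createElement("VOCABULARY");
        doc.appendChild(vocabulary);

        Element word = doc.createElement("WORD");
        word.setAttribute("GENDER", "m");
        vocabulary.appendChild(word);

        Element english = doc.createElement("ENGLISH");
        english.appendChild(doc.createTextNode("suspicion"));
        word.appendChild(english);

        return doc;
    }
}
```

Code that reads a tree, such as the getLanguages() method shown earlier, cannot tell whether the tree came from a parser or was assembled this way.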