Use Xbeans to process your XML as DOM Documents
The Extensible Markup Language, or XML, has gained widespread popularity as a way to represent data in a portable, vendor-neutral, readable format. Many software vendors have announced “support for XML,” usually meaning their products will produce or consume XML data.
XML is also being viewed as the lingua franca for data exchange between enterprises. It allows enterprises to agree on XML Document Type Definitions (DTDs) for the data being exchanged. These DTDs are independent of the database schema used by the enterprises.
Standards groups representing almost every human endeavor are agreeing on DTDs for exchanging data. One of many examples is the International Press Telecommunications Council (see Resources), which has defined an XML DTD that allows “news information to be transferred with markup and be easily transformed into an electronically publishable format.” Such vertical market standards will allow diverse applications to exchange data in unforeseen ways.
But what good are portable, vendor-neutral data if you don’t share and process them? The ability to communicate and process XML between distributed computers is desirable. An application that communicates and processes XML between computers is, in fact, a distributed application.
This article explores such distributed applications written in Java. I’ll focus on the communication of XML between Java code running in different virtual machines.
The communication of XML
The specification of XML defined by the World Wide Web Consortium, or W3C (see Resources), defines the syntax and semantics of the language. To process XML, an XML document needs to be parsed. It would be regrettable if every Java class that needed to process XML had to parse an XML document, given the complexity of XML’s syntax and semantics. To solve this problem, the W3C has defined the Document Object Model (DOM) (see Resources). The DOM is an application programmer’s interface to XML data. It is available from many programming languages, including Java. Java programs can access XML data via the DOM API. XML parsers produce a DOM representation of an XML document.
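As a rough illustration of what "accessing XML data via the DOM API" looks like in practice, the sketch below parses a small document and reads two values from it. It assumes the JAXP parser bundled with current JDKs rather than the IBM parser used later in this article; the class name DomAccessSketch and the helper firstText() are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch: parse once, then work only with the DOM API.
class DomAccessSketch {

    /** Parses XML text into a DOM Document using the JDK's JAXP parser (an assumption). */
    static Document parse(String xml) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    /** Returns the text of the first element with the given tag name, or null if absent. */
    static String firstText(Document doc, String tagName) {
        NodeList matches = doc.getElementsByTagName(tagName);
        if (matches.getLength() == 0) {
            return null;
        }
        return matches.item(0).getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String xml =
            "<order><item><product_name>Widget</product_name>"
            + "<quantity>100</quantity></item></order>";
        Document doc = parse(xml);
        Element root = doc.getDocumentElement();
        System.out.println(root.getTagName());           // order
        System.out.println(firstText(doc, "quantity"));  // 100
    }
}
```

Only parse() touches XML syntax; any code that receives the Document afterward never sees the text form.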
Figure 1 illustrates a simplified model of a Java distributed application that processes XML. The model is sufficient for the purpose of this article: to explore the communication of XML. The model assumes that some data are obtained from a data source such as a relational database. Some Java code processes the data and ultimately produces a DOM representation. This code is represented in Figure 1 as the processor.
The processor code passes the DOM representation of the XML data to the sender. The sender is Java code that communicates the XML data to the receiver. The receiver is Java code that receives the XML data, produces a DOM representation of the data, and passes it to another processor. In short, the sender and the receiver abstract the communication of the DOM representation of XML data.
The sender and the receiver are not implemented in the same Java Virtual Machine. They are connected by a distributed system infrastructure. There are several approaches to implementing the sender and the receiver.
Note that in the model in Figure 1, the sender is a client of the receiver. The sender passes the XML to the receiver. In another possible model, the receiver is the client; it requests the document from the sender. I will not explore the second model in this article since the issues of communicating XML are similar.
Xbeans
As you will see, the three implementations of the sender and receiver described here are implemented as Xbeans. An Xbean is a software component that takes XML as input, processes it, and then passes XML on to the next Xbean. Xbeans consume and produce XML as DOM documents. That is, the data passed to Xbeans are not strings that need to be parsed by an XML parser; rather, they make up an already parsed document object accessed via the W3C standard DOM API. Figure 2 illustrates an Xbean.
Xbeans are JavaBeans, which support the packaging, reuse, connection, and customization of Java code. With the appropriate set of Xbeans and a JavaBean design tool, you can build useful distributed applications with little or no programming.
Xbeans are inspired by IBM’s XML Productivity Kit for Java. They are focused on distributed applications and thus have a modified interface. Xbeans are freely available through the Xbeans.org open source project.
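The idea of components that accept a parsed document, process it, and hand it to the next component can be sketched in a few lines. The interface and class names below (DocumentListener, StampingBean) are hypothetical simplifications chosen for illustration; they are not the Xbeans.org interfaces, which pass a DOMEvent rather than a bare Document.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical sketch of a two-stage component chain; not the Xbeans.org API.
class XbeanPipelineSketch {

    /** Hypothetical listener interface: receives an already parsed DOM document. */
    interface DocumentListener {
        void documentReady(Document doc);
    }

    /** Hypothetical component: stamps the root element, then forwards the document. */
    static class StampingBean implements DocumentListener {
        private DocumentListener next;

        void setNext(DocumentListener next) {
            this.next = next;
        }

        @Override
        public void documentReady(Document doc) {
            doc.getDocumentElement().setAttribute("stamped", "true");
            if (next != null) {
                next.documentReady(doc);
            }
        }
    }

    /** Builds a one-element document in memory; no text parsing is involved. */
    static Document newDocument(String rootName) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder().newDocument();
        doc.appendChild(doc.createElement(rootName));
        return doc;
    }

    public static void main(String[] args) throws Exception {
        StampingBean first = new StampingBean();
        // The last stage is a lambda that just reports what it received.
        first.setNext(doc -> System.out.println(
            doc.getDocumentElement().getTagName() + " stamped="
            + doc.getDocumentElement().getAttribute("stamped")));
        first.documentReady(newDocument("orders"));  // orders stamped=true
    }
}
```

Because each stage depends only on the listener interface, a sender that ships the document to another machine can be substituted for the next stage without changing the component before it.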
Approaches to implementing sender and receiver
I’ll now describe three approaches to implementing the sender and the receiver, complete with Java code, and then analyze the performance of each approach.
Approach 1: Use a standard Web server
The first approach is to simply send the XML as text to the Web server on a remote machine. The sender must convert the DOM representation of the XML into text and communicate the text to the receiver. Then, the receiver must convert the text back to the DOM representation. Figure 3 illustrates this approach.
The following code fragment uses HTTP to implement the sender. The code uses the DOMWriter class in IBM’s Productivity Kit for Java to convert the DOM representation to a textual XML representation. See Resources for the complete code.
public void documentReady(DOMEvent evt)
        throws XbeansException {
    try {
        URL receiver = new URL(getRemoteURL());
        URLConnection receiverConnection =
            receiver.openConnection();
        receiverConnection.setDoOutput(true);
        // open an output stream to the receiver and
        // send the xml in text form
        OutputStream out =
            receiverConnection.getOutputStream();
        DOMWriter writer = new DOMWriter();
        writer.setPrintWriter(new PrintWriter(out));
        writer.documentReady(
            new com.ibm.xml.xpk4j.dom.DOMEvent(
                this,evt.getDocument()));
        out.close();
        // open an input stream for result
        BufferedReader in =
            new BufferedReader(new InputStreamReader(
                receiverConnection.getInputStream()));
        // process result: "OK" indicates success,
        // "Exception" indicates input stream contains
        // serialized exception.
        ...
        in.close();
    } catch (Throwable e) {
        e.printStackTrace(System.err);
    }
}
Note that the above documentReady() method uses the remoteURL property to get the URL of the CGI script on the server.
To be compatible with HTTP, the CGI script prefixes the receiver’s output with the header "Content-type: text/html" followed by a blank line. The script then starts a Java Virtual Machine that runs the receiverMain class on the server. Here’s the code for the CGI script on the server:
#!/bin/sh
echo "Content-type: text/html"
echo
/usr/local/bin/java org.xbeans.communication.http.receiverMain 2>&1
The main() method of receiverMain simply instantiates the receiver and invokes receiveDocument() on it.
import org.xbeans.communication.stdio.receiver.*;

public class receiverMain {
    static Bean theReceiver = new Bean();

    public static void main(String[] args) {
        theReceiver.receiveDocument();
    }
}
Finally, here’s a code fragment from receiveDocument() to recreate the DOM representation and pass it on for further processing. This code depends on IBM’s XML parser.
// Create a parser
DOMParser parser = new DOMParser();
// Invoke the parser.
try {
    parser.parse(new InputSource(System.in));
} catch (Throwable e) {
    throw new XbeansException(
        "",
        "receiver",
        "io error parsing incoming document",
        "io error parsing incoming document "+e
    );
}
// Pass the document on to the next bean
DOMListener.documentReady(new DOMEvent(this,parser.getDocument()));
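The essence of this approach, independent of HTTP and CGI, is a round trip from DOM to XML text and back. The following sketch shows that round trip under the assumption that the JDK's standard JAXP classes (an identity Transformer and a DocumentBuilder) stand in for IBM's DOMWriter and DOMParser; the class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

// Hypothetical sketch: JAXP stands in for the IBM DOMWriter and DOMParser.
class TextRoundTripSketch {

    /** Sender side: externalizes a DOM document as XML text, held here as bytes. */
    static byte[] toText(Document doc) throws Exception {
        // A Transformer created without a stylesheet copies its input unchanged.
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        identity.transform(new DOMSource(doc), new StreamResult(out));
        return out.toByteArray();
    }

    /** Receiver side: reparses the XML text into a new DOM document. */
    static Document fromText(byte[] xmlText) throws Exception {
        return DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new ByteArrayInputStream(xmlText));
    }

    public static void main(String[] args) throws Exception {
        Document original = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder().newDocument();
        original.appendChild(original.createElement("orders"))
            .appendChild(original.createElement("order"));
        Document copy = fromText(toText(original));
        System.out.println(copy.getDocumentElement().getTagName());  // orders
    }
}
```

In a real sender the bytes would be written to the connection's output stream instead of an in-memory buffer, and the receiver would parse from its input stream.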
Approach 2: Serialized representation via Java RMI
The second approach uses Java RMI and DOM serialization to transfer the DOM from the sender to the receiver. Figure 4 illustrates this approach.
The following code implements the sender using Java RMI to communicate with the receiver.
public void documentReady(DOMEvent evt)
        throws XbeansException {
    if (DOMListener==null) {
        try {
            DOMListener =
                (DOMListener)Naming.lookup(getReceiverName());
        } catch (Exception e) {
            throw new XbeansException(
                evt.getDocument().getNodeName(),
                "sender",
                "error obtaining remote receiver",
                "The name may be wrong or the network may be down."
            );
        }
    }
    DOMListener.documentReady(evt);
}
The following code implements the receiver using Java RMI. The setReceiverName() method binds the receiver in the RMI registry, and the documentReady() method simply passes the received document on to the next component.
public void setReceiverName(String newName) {
    try {
        if (receiverName!=null) Naming.unbind(receiverName);
        receiverName = newName;
        Naming.rebind(receiverName, this );
    } catch( Exception e ) {
        System.out.println( e );
    }
}

public void documentReady(Document incomingDocument)
        throws RemoteException, XbeansException {
    if (DOMListener==null) {
        throw new XbeansException(
            incomingDocument.getNodeName(),
            "rmiReceiver",
            "next component not established",
            "The component needs to be configured."
        );
    }
    DOMListener.documentReady(new DOMEvent(this,incomingDocument));
}
Approach 3: CORBA-IIOP
The third approach uses CORBA-IIOP (CORBA over Internet Inter-ORB Protocol) as a transport. The Object Management Group (OMG) currently has a request for proposal (see Resources) for extending the Interface Definition Language (IDL) to include an XML data type. Thus, in the future, CORBA products will be able to transmit XML natively. Figure 5 illustrates this approach.
Since today’s CORBA products cannot represent XML natively, the code presented here uses Java serialization of the DOM and communicates it to the receiver as a sequence of octets. It is very similar to the Java RMI approach. The only difference is that the distributed object infrastructure is provided by CORBA systems.
The following OMG IDL gives the interface between the CORBA implementation of the sender and the receiver:
exception RemoteReceiverException {
    string remoteIdentifier;
    string documentName;
    string componentName;
    string message;
    string moreMessage;
};

typedef sequence<octet> byteArray;

interface XMLReceiver {
    void documentReady(in byteArray serializedDocument)
        raises(RemoteReceiverException);
};
The following code implements the sender using Java serialization of the DOM and CORBA:
public void documentReady(DOMEvent evt)
        throws XbeansException {
    Document documentToSend = evt.getDocument();
    try {
        ByteArrayOutputStream bastream =
            new ByteArrayOutputStream();
        ObjectOutputStream p =
            new ObjectOutputStream(bastream);
        p.writeObject(documentToSend);
        p.flush();
        org.omg.CORBA.ORB orb =
            org.omg.CORBA.ORB.init(
                new String[0],
                System.getProperties());
        XMLReceiver receiver =
            urlToObject(orb,getReceiverURL());
        receiver.documentReady(bastream.toByteArray());
    } catch (RemoteReceiverException rre) {
        throw new XbeansException(
            rre.remoteIdentifier,
            rre.documentName,
            rre.componentName,
            rre.message,
            rre.moreMessage
        );
    } catch (Throwable e) {
        throw new XbeansException(
            "",
            "sender",
            "error sending document "+e,
            "error sending document "+e
        );
    }
}
The following code implements the receiver using Java serialization of the DOM and CORBA:
public void documentReady(byte[] serializedDocument)
        throws RemoteReceiverException {
    // Deserialize the byte stream.
    ByteArrayInputStream bais =
        new ByteArrayInputStream(serializedDocument);
    Document theDocument;
    try {
        ObjectInputStream ois = new ObjectInputStream(bais);
        theDocument = (Document)ois.readObject();
    } catch(Throwable e) {
        throw new RemoteReceiverException(
            corbaName,
            "incoming document",
            "receiver",
            "error deserializing document",
            "error deserializing document "+e
        );
    }
    // Pass on the document to the listener
    try {
        local.DOMListener.documentReady(
            new DOMEvent(this,theDocument));
    } catch (XbeansException xbe) {
        throw new RemoteReceiverException(
            xbe.remoteIdentifier(),
            xbe.documentName(),
            xbe.componentName(),
            xbe.message(),
            xbe.moreMessage()
        );
    }
}
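Stripped of the CORBA plumbing, the sender and receiver above reduce to turning an object into a byte array and back with Java serialization. The sketch below isolates that step. It uses an ArrayList as a stand-in payload, because whether a Document can be serialized depends on the DOM implementation in use; the class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: the payload is a stand-in for a serializable DOM document.
class OctetSequenceSketch {

    /** Sender side: serializes a value into the bytes carried as sequence<octet>. */
    static byte[] serialize(Serializable value) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        // Closing the ObjectOutputStream flushes everything into the buffer.
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        }
        return bytes.toByteArray();
    }

    /** Receiver side: rebuilds the object from the received bytes. */
    static Object deserialize(byte[] octets)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(octets))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        ArrayList<String> payload = new ArrayList<>(List.of("order", "header", "item"));
        byte[] octets = serialize(payload);
        Object copy = deserialize(octets);
        System.out.println(payload.equals(copy));  // true
    }
}
```

Deserialization succeeds only if the receiving side can load the same classes that the sender serialized, which is why both ends need the same DOM implementation when the payload is a Document.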
Analysis
Table 1 summarizes the advantages (+) and disadvantages (–) of each approach. The description of each follows.
Criterion | Standard Web server | Serialized DOM, RMI | CORBA |
Space | + | – | – |
Time | + | – | – |
Ubiquity of infrastructure | + | – | – |
Programming and execution model | – | + | + |
Interoperability | + | – | + |
Table 1. Summary of three approaches to communicating XML
Space and time
Surprisingly, the performance results presented below indicate that a textual representation of XML is far more compact than a serialized DOM representation. Also, externalizing a DOM representation to text and reparsing that text takes less time than directly serializing and deserializing the DOM with Java serialization. See the “Performance results” section below for more details.
Ubiquity of infrastructure
The advantage of the standard Web-server approach is that the infrastructure to support it is ubiquitous. The CGI script can be executed from most Web servers. Further, the receiver is easily identified through a URL. The naming infrastructure is already in place. The RMI solution, on the other hand, requires the RMI registry.
The CORBA solution requires the installation of a CORBA Object Request Broker on the server. Further, the CORBA implementation of the sender uses a URL naming scheme whereby the CORBA object reference of the receiver, represented as a string, is associated with a URL and translated at the client.
Programming and execution model
CORBA and RMI support a natural Java-client-to-Java-server solution. There is no CGI script and no need to read encoded exceptions from standard input. Further, there is no need to start a Java Virtual Machine each time a sender wants to communicate some XML. Both CORBA and RMI support automatic activation of the receiver.
Interoperability
The Java RMI solution works only between Java programs, while the Web-server solution and, in theory, the CORBA solution can work across programming languages. (The CORBA solution is interoperable only in theory because adding XML support to the CORBA standard is still in progress; the code shown here uses Java serialization.)
But even if the requirement that both the client and the server must be in Java code were not a problem, another interoperability obstacle exists with Java serialization of the DOM. Java serialization requires that the same implementation of the DOM be running at both the client and the server.
Performance results
Understanding the time and space requirements of the different approaches to sending XML across process boundaries is important when choosing which approach to take. First, I’ll present the results of my experiments measuring the space used for an XML document in text versus an XML document represented by the serialized DOM. Then I’ll present the results of my timing tests for each case.
Space results
The space required to represent an XML document depends on the structure implied by the DTD and the amount of data in the document. Obviously, an XML document with one simple tag and megabytes of textual data is going to have almost identical space requirements in both representations.
I am interested in highly structured data. The performance tests compared the XML textual representation of a purchase order with the serialized DOM representation of it. The performance tests used IBM’s XML parser and DOM implementation, version 2.0.13.
The tests used a purchase order since it is a typical business-to-business application of XML with a fairly rich XML structure. It represents information extracted from several tables in a relational database. Here is the DTD for the purchase order:
<?xml encoding="US-ASCII"?>
<!ELEMENT orders (order)*>
<!ELEMENT order (header,item+,total)>
<!ELEMENT header (billing_info,shipping_info)>
<!ELEMENT billing_info (name,address,credit_card)>
<!ELEMENT shipping_info (name,address)>
<!ELEMENT name (given,family)>
<!ELEMENT address (street,city,state,zipcode,country,phone)>
<!ELEMENT item (product_id,product_name,quantity,price)>
<!ELEMENT credit_card (#PCDATA)>
<!ELEMENT given (#PCDATA)>
<!ELEMENT family (#PCDATA)>
<!ELEMENT street (#PCDATA)>
<!ELEMENT city (#PCDATA)>
<!ELEMENT state (#PCDATA)>
<!ELEMENT zipcode (#PCDATA)>
<!ELEMENT country (#PCDATA)>
<!ELEMENT phone (#PCDATA)>
<!ELEMENT product_id (#PCDATA)>
<!ELEMENT product_name (#PCDATA)>
<!ELEMENT quantity (#PCDATA)>
<!ELEMENT price (#PCDATA)>
<!ELEMENT total (#PCDATA)>
Here’s an XML document with a single purchase order:
<?xml version="1.0"?>
<!DOCTYPE orders SYSTEM "orders.dtd">
<orders>
<order>
<header>
<billing_info>
<name>
<given>John</given>
<family>Doe</family>
</name>
<address>
<street>555 Main Street</street>
<city>Mill Valley</city>
<state>California</state>
<zipcode>94520</zipcode>
<country>USA</country>
<phone>707 555-1000</phone>
</address>
<credit_card>5555 5555 5555 5555</credit_card>
</billing_info>
<shipping_info>
<name>
<given>John</given>
<family>Doe</family>
</name>
<address>
<street>555 Main Street</street>
<city>Mill Valley</city>
<state>California</state>
<zipcode>94520</zipcode>
<country>USA</country>
<phone>707 555-1000</phone>
</address>
</shipping_info>
</header>
<item>
<product_id>5555555</product_id>
<product_name>Widget</product_name>
<quantity>100</quantity>
<price>.25</price>
</item>
<total>25.00</total>
</order>
</orders>
Notice that the DTD allows multiple orders to be given in a single XML document. For my performance tests, I enlarged the XML document by including the same order n times. Table 2 presents the space results.
Number of orders | Bytes to represent as XML text | Bytes to represent as serialized DOM | Ratio (serialized/text) |
1 | 1,048 | 7,278 | 6.9 |
5 | 4,900 | 29,310 | 5.9 |
10 | 9,715 | 56,850 | 5.9 |
100 | 96,385 | 552,570 | 5.7 |
500 | 481,585 | 2,755,770 | 5.7 |
Table 2. Space results
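The figures in Table 2 grow linearly with the number of orders, which makes it possible to separate the fixed overhead of each representation from its cost per order. The sketch below does that arithmetic on the table values, reading the 100-order text size as 96,385 bytes. The class and method names are hypothetical, and the linear model is an observation about these particular numbers, not a claim about XML in general.

```java
import java.util.Locale;

// Hypothetical sketch: fits size = overhead + perOrder * n to the rows of Table 2.
class SpaceModelSketch {

    // Rows of Table 2: {orders, bytes as XML text, bytes as serialized DOM}.
    static final long[][] TABLE_2 = {
        {1, 1_048, 7_278},
        {5, 4_900, 29_310},
        {10, 9_715, 56_850},
        {100, 96_385, 552_570},
        {500, 481_585, 2_755_770},
    };

    /** Bytes added per extra order, from the first and last rows (exact for this data). */
    static long perOrder(int column) {
        long[] first = TABLE_2[0];
        long[] last = TABLE_2[TABLE_2.length - 1];
        return (last[column] - first[column]) / (last[0] - first[0]);
    }

    /** Fixed bytes left once the per-order cost is subtracted from the first row. */
    static long fixedOverhead(int column) {
        return TABLE_2[0][column] - perOrder(column) * TABLE_2[0][0];
    }

    /** True if overhead + perOrder * n reproduces every row of the column exactly. */
    static boolean fitsEveryRow(int column) {
        for (long[] row : TABLE_2) {
            if (fixedOverhead(column) + perOrder(column) * row[0] != row[column]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("text: " + fixedOverhead(1) + " + " + perOrder(1) + " * n");
        // text: 85 + 963 * n
        System.out.println("serialized: " + fixedOverhead(2) + " + " + perOrder(2) + " * n");
        // serialized: 1770 + 5508 * n
        double limit = (double) perOrder(2) / perOrder(1);
        System.out.println(String.format(Locale.ROOT, "%.2f", limit));  // 5.72
    }
}
```

The per-order ratio of about 5.72 explains why the last column of Table 2 levels off near 5.7: as the document grows, the fixed overhead of each representation stops mattering.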
Timing results
I now compare the translation times for the purchase order. In particular, I give the time it takes to externalize a DOM representation to XML text and then reparse it into a DOM, as well as the time it takes to serialize and deserialize the DOM representation. This effectively compares the performance of the senders and receivers, without considering the actual communication of the data.
I performed the tests on a Compaq Pentium III system running at 450 MHz with 128 MB of RAM. The system was running Microsoft Windows NT 4.0, Sun’s Java 1.2 Virtual Machine, and IBM’s XML parser and DOM implementation, version 2.0.13. Each reported time is the average of 50 trials.
Number of orders | Milliseconds to write DOM as XML text and reparse it to DOM | Milliseconds to serialize and deserialize DOM | Ratio (serialize/parse) |
1 | 191 | 478 | 2.5 |
5 | 199 | 534 | 2.7 |
10 | 231 | 603 | 2.6 |
100 | 862 | 2228 | 2.6 |
500 | 4707 | 10107 | 2.1 |
Table 3. Timing results
Other approaches
You can take other approaches when sending XML across process boundaries. To reduce the space required, you can compress both the textual representation and the serialized DOM representation.
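As one possible way to apply compression, the sketch below gzips the textual representation before sending and restores it on the receiving side, using the java.util.zip classes from the standard library. The class and method names are hypothetical, and the sample document is a simplified stand-in for the purchase order; how much space is saved depends entirely on the data.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch: gzip applied to XML text held in memory.
class CompressionSketch {

    /** Sender side: compresses the bytes of the XML text. */
    static byte[] gzip(byte[] raw) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        // Closing the GZIPOutputStream writes the gzip trailer.
        try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
            out.write(raw);
        }
        return bytes.toByteArray();
    }

    /** Receiver side: restores the original bytes. */
    static byte[] gunzip(byte[] compressed) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPInputStream in =
                 new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buffer = new byte[4096];
            int read;
            while ((read = in.read(buffer)) != -1) {
                bytes.write(buffer, 0, read);
            }
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Simplified stand-in for a document that repeats the same order.
        StringBuilder xml = new StringBuilder("<orders>");
        for (int i = 0; i < 100; i++) {
            xml.append("<order><total>25.00</total></order>");
        }
        xml.append("</orders>");
        byte[] raw = xml.toString().getBytes(StandardCharsets.UTF_8);
        byte[] packed = gzip(raw);
        System.out.println(raw.length + " bytes as text, "
            + packed.length + " bytes gzipped");
    }
}
```

The same two methods could wrap the byte array of a serialized DOM, at the cost of extra CPU time on both sides.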
You can use other transports besides HTTP, RMI, and CORBA; for example, you can use email or FTP. In environments with shared network file systems, the sender and receiver can exchange XML data via the file system.
Conclusion
As you’ve seen, there are several approaches to sending XML from here to there in a Java distributed application. Each approach makes different performance and interoperability tradeoffs.
It is important to abstract the communication of XML from the rest of the distributed application logic. That is, the code that sends and receives XML should be separate from the code implementing the application logic. By packaging the code as software components, you can change the implementations of the sender and the receiver without affecting the rest of the application.
You can obtain the complete implementations of all three senders and receivers described here for free at Xbeans.org.