XML documents on the run, Part 1

SAX speeds through XML documents with parse-event streams

One of the oldest approaches to processing XML documents in Java also proves one of the fastest: parse-event streams. That approach became standardized in Java with the SAX (Simple API for XML) interface specification, later revised as SAX2 to include support for XML Namespaces.

Read the whole “XML Documents on the Run” series:

  • Part 1: SAX speeds through XML documents with parse-event streams
  • Part 2: Better SAX2 handling and the pull-parser alternative
  • Part 3: How do SAX2 parsers perform compared to new XMLPull parsers?

Event-stream processing offers other advantages beyond just speed. Because the parser processes the document on the fly, you can handle it as soon as you read its first part. Other approaches generally require you to parse the complete document before you start working with it — fine if the document comes off a local disk drive, but if the document is sent from another system, parsing the complete document can cause significant delays.

Event-stream processing also eliminates any document size limits. In contrast, approaches that store the document’s representation in memory can run out of space with very large documents. Setting a hard limit on a real-world document’s size is often difficult, and potentially a major problem in many applications.

A note on the source code

This article features two example source code files: stock.jar and option.jar, both found in a downloadable zip file in Resources. Each jar file includes full example implementations, along with sample documents and test driver programs. To try an example, create a new directory, then extract the jar’s files to that directory with jar xvf stock.jar or jar xvf option.jar. The readme.txt file gives instructions for setting up and running the test drivers.

The event view

Parsers with event-stream interfaces deliver a document one piece at a time. Think of the document’s text as spread out in time, as it would be if read from a stream. The parser looks for significant document components (start and end tags, character data, and so on) in the text, generating parse events for each.

For example, here’s a simple document:

<author>
  <first-name>Dennis</first-name>
  <last-name>Sosnoski</last-name>
</author>

The table shows the parse-event sequence a SAX2 parser would generate for this document (though the parser can divide up the character data reported by characters events differently than I’ve shown, as I discuss when I get to the actual code).

Parse events for document

Text processed     Parse event
""                 startDocument()
"<author>"         startElement("author")
"\n  "             characters("\n  ")
"<first-name>"     startElement("first-name")
"Dennis"           characters("Dennis")
"</first-name>"    endElement("first-name")
"\n  "             characters("\n  ")
"<last-name>"      startElement("last-name")
"Sosnoski"         characters("Sosnoski")
"</last-name>"     endElement("last-name")
"\n"               characters("\n")
"</author>"        endElement("author")

Notice in the table that the parse events include both start of element and end of element notifications — important information for your program because it lets you track the document’s nested structure. Without the end notifications, you couldn’t know which elements or character data are part of the content of some earlier element. Also note that the parse events include all the character data in the document, even the whitespace sequences most people would consider unimportant.

With the event-driven approach, your application turns control over to the parser, passing it the document (as a stream or URI/URL). The parser reads the document, then breaks it into components, calling a method in a handler class supplied by your program to report each event. That isn’t the only way of working with parse-event streams (as I’ll show in Part 2), but it’s the most widely used approach at present.
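To watch the event sequence for yourself, a small logging handler is enough. The following is a minimal sketch, assuming the parser bundled with the JDK and obtained through JAXP (covered later in this article); the class name EventLogger and its log() helper are made up for illustration. It reports qualified names because JAXP parsers are not namespace-aware by default.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Records each parse event as a readable string (illustrative class name). */
class EventLogger extends DefaultHandler {
    final List<String> events = new ArrayList<>();

    @Override
    public void startDocument() {
        events.add("startDocument()");
    }

    @Override
    public void endDocument() {
        events.add("endDocument()");
    }

    @Override
    public void startElement(String uri, String lname, String qname, Attributes attrs) {
        events.add("startElement(\"" + qname + "\")");
    }

    @Override
    public void endElement(String uri, String lname, String qname) {
        events.add("endElement(\"" + qname + "\")");
    }

    @Override
    public void characters(char[] chars, int start, int length) {
        // Escape newlines so whitespace-only events stay visible in the log
        String text = new String(chars, start, length).replace("\n", "\\n");
        events.add("characters(\"" + text + "\")");
    }

    /** Parses the given XML text and returns the events in the order reported. */
    static List<String> log(String xml) {
        EventLogger logger = new EventLogger();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), logger);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("parse failed", e);
        }
        return logger.events;
    }

    public static void main(String[] args) {
        String xml = "<author>\n  <first-name>Dennis</first-name>\n"
            + "  <last-name>Sosnoski</last-name>\n</author>";
        for (String event : log(xml)) {
            System.out.println(event);
        }
    }
}
```

The exact split of the character data can differ from parser to parser, so treat the characters() lines in the output as one possible division rather than a guaranteed one.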

SAX and SAX2

Most event-stream parsers for XML in Java first used SAX. Unlike most other Internet and Web standards, SAX originally materialized without the official involvement of any sponsoring standards organization. Instead, it developed through a series of discussions, prototypes, and eventual consensus, coordinated by David Megginson on the XML-DEV mailing list.

SAX2 extends the SAX API to include full support for XML Namespaces. It also incorporates fixes to the original SAX interface. Most current parsers implement the SAX2 interface natively, though the original SAX interface is available if desired. New development should probably use the SAX2 interface even if Namespaces are not required, if for no other reason than to avoid deprecated APIs. The example code in this article follows that approach.

Event-driven programming

Enough background material; let’s plunge into programming the interface. You first want to get a parser, in the form of an org.xml.sax.XMLReader instance. These parser instances are serially reusable, meaning you can use one for parsing as many documents as you like, but only one document at a time. Indeed, if you’re writing a simple single-threaded application, you can simply use the same instance over and over.

Usually you get the XMLReader by calling the static org.xml.sax.helpers.XMLReaderFactory.createXMLReader() method (you need to have a SAX2 parser implementation in your classpath for this to work, of course; see Resources for a link to the SAX2 project page where you can find a list of parsers supporting SAX2). createXMLReader() lets you specify a particular implementation class, or you can simply use the default one defined by a system property.

Once you have the XMLReader, you can set and check a variety of options for the parser. You can also hook up various handler types for the parse events. Each handler type must implement a particular interface. For your purposes, you’ll build on the handy handler base class defined by SAX2, org.xml.sax.helpers.DefaultHandler, which supplies default implementations for the full handler set. By using that as a base class, you can override only the methods you’re interested in, while not worrying about the rest.
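As a rough illustration of these steps, the sketch below creates one XMLReader, attaches a handler built on DefaultHandler, and reuses the reader for two documents in a row. The ElementCounter and ReaderDemo names are invented for this example, and it assumes the default parser that ships with the JDK. Note that XMLReaderFactory is marked deprecated in Java 9 and later, although it is still available.

```java
import java.io.IOException;
import java.io.StringReader;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

/** Counts start-element events; overrides only the methods it needs. */
class ElementCounter extends DefaultHandler {
    int count;

    @Override
    public void startDocument() {
        count = 0;   // reset so the handler can be reused across documents
    }

    @Override
    public void startElement(String uri, String lname, String qname, Attributes attrs) {
        count++;
    }
}

class ReaderDemo {
    /** Parses each document in turn with a single reader, returning the counts. */
    @SuppressWarnings("deprecation")   // XMLReaderFactory is deprecated since Java 9
    static int[] countElements(String... documents) {
        int[] counts = new int[documents.length];
        try {
            // No class name given, so the default implementation is used
            XMLReader reader = XMLReaderFactory.createXMLReader();
            ElementCounter counter = new ElementCounter();
            reader.setContentHandler(counter);
            for (int i = 0; i < documents.length; i++) {
                // parse() returns only when the whole document has been read
                reader.parse(new InputSource(new StringReader(documents[i])));
                counts[i] = counter.count;
            }
        } catch (SAXException | IOException e) {
            throw new IllegalStateException("parse failed", e);
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] counts = countElements("<a><b/><c/></a>", "<x/>");
        System.out.println(counts[0] + " and " + counts[1] + " elements");
    }
}
```

The documents are parsed one after another on the same reader, which matches the serially reusable behavior described earlier.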

If you’re working with Sun’s JAXP (Java API for XML Processing) 1.1 or higher, you can get your SAX2 parser instance through the JAXP API. With this approach, you first call the static javax.xml.parsers.SAXParserFactory.newInstance() method to get a SAXParserFactory instance, then use that instance’s newSAXParser() method to get a javax.xml.parsers.SAXParser instance. That gives you an interface for parsing a document using a specified DefaultHandler.

Both approaches support a variety of options for the parser type you want to create, including whether or not you want to validate the parsed documents. Let’s ignore most of those options (and the whole validation issue) for this introduction to SAX2 parsing, but you can find the full details on the official SAX2 and JAXP sites.

One option I won’t ignore is the namespace handling. Directly created SAX2 parsers default to namespace-handling enabled, while those created through JAXP have it disabled by default. This option affects how element names are reported, even if you don’t use namespaces in your documents. For the sample code in this article, I assume that namespace handling is enabled. The easiest way to enable it with JAXP is to call the SAXParserFactory.setNamespaceAware() method with a true value before creating your parser.
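The JAXP route with namespace handling switched on might look like the following sketch. The JaxpSetup class and its method names are illustrative only; the feature URI is the standard SAX2 name for namespace processing.

```java
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

class JaxpSetup {
    // Standard SAX2 feature name for namespace processing
    static final String NAMESPACES = "http://xml.org/sax/features/namespaces";

    /** Creates a parser through JAXP with namespace handling switched on. */
    static SAXParser newNamespaceAwareParser() {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);   // JAXP leaves this false by default
            return factory.newSAXParser();
        } catch (ParserConfigurationException | SAXException e) {
            throw new IllegalStateException("no SAX parser available", e);
        }
    }

    /** Reports whether the parser's underlying XMLReader handles namespaces. */
    static boolean handlesNamespaces(SAXParser parser) {
        try {
            XMLReader reader = parser.getXMLReader();
            return reader.getFeature(NAMESPACES);
        } catch (SAXException e) {
            throw new IllegalStateException("feature not readable", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("factory default: "
            + SAXParserFactory.newInstance().isNamespaceAware());
        System.out.println("after setNamespaceAware(true): "
            + handlesNamespaces(newNamespaceAwareParser()));
    }
}
```

Setting the flag on the factory before calling newSAXParser() matters, since the factory settings are applied when the parser is created.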

So far this doesn’t sound too bad, but the interesting part starts when you call the parser with a document. The parser won’t return from that call until parsing completes, but in the meantime, it’ll call your handler methods for each and every parse event of the types you registered to handle. Your handler code makes sense of the call sequence and interprets it for your application.

Writing event-driven programs, as this handler technique is known, can be difficult. The problem: event streams turn the normal program structure inside out; instead of your program running the operation and requesting what it wants from the document, it hooks to an event stream hose that pushes the document at it, one small piece at a time.

Most applications need more structure than basic event streams provide. If you’re working with an event-based parser, you must provide that structure by keeping state information that tracks your location in the document. Your state-information needs depend on the structure level you’re working with. Using an event-based approach to handling your documents will be easiest when you work with simple structures within the document.
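One simple form of state information is a stack of the currently open elements. The sketch below is only an illustration of that idea, not code from the example jars; PathTracker and track() are hypothetical names. It pushes a name on each start tag, pops it on each end tag, and uses the stack to label character data with its location.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Tracks the current position in the document as a stack of element names. */
class PathTracker extends DefaultHandler {
    private final Deque<String> openElements = new ArrayDeque<>();
    final List<String> textByPath = new ArrayList<>();

    @Override
    public void startElement(String uri, String lname, String qname, Attributes attrs) {
        openElements.addLast(lname);      // entering an element
    }

    @Override
    public void endElement(String uri, String lname, String qname) {
        openElements.removeLast();        // leaving it again
    }

    @Override
    public void characters(char[] chars, int start, int length) {
        // Simplification: treats each characters() call as a complete piece of text
        String text = new String(chars, start, length).trim();
        if (!text.isEmpty()) {
            // The stack tells us which element this text belongs to
            textByPath.add("/" + String.join("/", openElements) + " = " + text);
        }
    }

    /** Parses the XML text and returns each non-blank text run with its path. */
    static List<String> track(String xml) {
        PathTracker tracker = new PathTracker();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);   // so lname carries the element name
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), tracker);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("parse failed", e);
        }
        return tracker.textByPath;
    }

    public static void main(String[] args) {
        String xml = "<author>\n  <first-name>Dennis</first-name>\n"
            + "  <last-name>Sosnoski</last-name>\n</author>";
        track(xml).forEach(System.out::println);
    }
}
```

The deeper the structure you care about, the more of this bookkeeping your handler has to carry, which is why simple structures are the easiest fit for event-based handling.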

Watch the market

As an example, we’ll work with a document that gives the history of stock trades over some span of time:

<?xml version="1.0"?>
<trade-history>
  <stock-trade>
    <symbol>SUNW</symbol>
    <time>08:45:19</time>
    <price>86.24</price>
    <quantity>500</quantity>
  </stock-trade>
  <stock-trade>
    <symbol>MSFT</symbol>
    <time>08:45:20</time>
    <price>22.26</price>
    <quantity>1000</quantity>
  </stock-trade>
</trade-history>

For each trade, the document above includes the symbol for the traded stock, the time the trade occurred, the price, and the number of shares, all as content of specific elements. The above sample shows only two trades (taking place at some unspecified future date), but you could easily extend it to any number of trades over any time period. In particular, it makes sense to use such a format in a ticker stream that provides a feed of all trades on an exchange during a trading day.

Suppose you want to parse such a stream and track all stock information, including high, low, and last trade prices for the day, along with share and dollar volumes, for each stock traded. An event-stream parser approach should give you what you need — you can handle each individual stock-trade element as it’s received, immediately updating your accumulated information so that it’s always kept up to date. For your hypothetical ticker stream, this immediate handling is important, since the document won’t end until the market closes. If you couldn’t access the information until then, it might be too late to do you any good!

The stream may include many trades for each stock, so you’ll need some form of data structure to hold onto the tracking information. Here’s a class to handle that aspect:

import java.util.HashMap;

public class StockTrack
{
    // Map of stock symbols to tracking information
    protected static HashMap s_symbolMap = new HashMap();
    
    // Instance variables for information on a particular stock
    protected String m_stockSymbol;     // Symbol for this stock
    protected String m_lastTime;        // Time of last trade
    protected double m_highPrice;       // High trade price
    protected double m_lowPrice;        // Low trade price
    protected double m_lastPrice;       // Last trade price
    protected int m_totalShares;        // Total number of shares traded
    protected double m_totalDollars;    // Total dollar volume traded
    
    protected StockTrack(String sym) {
        m_stockSymbol = sym;
        m_lowPrice = Double.MAX_VALUE;
    }
    
    public String getSymbol() {
        return m_stockSymbol;
    }
    
    public String getLastTime() {
        return m_lastTime;
    }
    
    public double getHighPrice() {
        return m_highPrice;
    }
    
    public double getLowPrice() {
        return m_lowPrice;
    }
    
    public double getLastPrice() {
        return m_lastPrice;
    }
    
    public int getShareVolume() {
        return m_totalShares;
    }
    
    public double getDollarVolume() {
        return m_totalDollars;
    }
    
    public static StockTrack getTrack(String sym) {
        StockTrack track = (StockTrack)s_symbolMap.get(sym);
        if (track == null) {
            track = new StockTrack(sym);
            s_symbolMap.put(sym, track);
        }
        return track;
    }
    
    public static void recordTrade(String sym, String time, double price, 
        int shares) {
        StockTrack track = getTrack(sym);
        track.m_lastTime = time;
        if (track.m_highPrice < price) {
            track.m_highPrice = price;
        }
        if (track.m_lowPrice > price) {
            track.m_lowPrice = price;
        }
        track.m_lastPrice = price;
        track.m_totalShares += shares;
        track.m_totalDollars += shares*price;
    }
}

In StockTrack, a static HashMap links stock symbols with their respective tracking information. Along with some access methods for the member variables, the class includes a protected constructor and a pair of public client access methods. getTrack() looks up the tracking information for a supplied stock symbol, creating a new instance of the class for that stock symbol if no trades have been recorded yet. recordTrade() records a stock trade, first finding the stock information with getTrack(), then updating the information to reflect the new trade.
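To see the lookup-or-create and update steps in action, here is a compact stand-in that follows the same pattern with a typed map. It is a simplified sketch rather than the article's class: TradeSummary, forSymbol(), and the trimmed-down field list are assumptions made to keep the example short.

```java
import java.util.HashMap;
import java.util.Map;

/** Compact stand-in for StockTrack, keeping only the fields the demo reads. */
class TradeSummary {
    private static final Map<String, TradeSummary> SUMMARIES = new HashMap<>();

    double highPrice;
    double lowPrice = Double.MAX_VALUE;   // so the first trade always sets the low
    double lastPrice;
    int totalShares;
    double totalDollars;

    /** Looks up the summary for a symbol, creating it on first use. */
    static TradeSummary forSymbol(String symbol) {
        return SUMMARIES.computeIfAbsent(symbol, s -> new TradeSummary());
    }

    /** Folds one trade into the running totals for its symbol. */
    static void recordTrade(String symbol, double price, int shares) {
        TradeSummary summary = forSymbol(symbol);
        summary.highPrice = Math.max(summary.highPrice, price);
        summary.lowPrice = Math.min(summary.lowPrice, price);
        summary.lastPrice = price;
        summary.totalShares += shares;
        summary.totalDollars += shares * price;
    }

    public static void main(String[] args) {
        recordTrade("SUNW", 86.24, 500);
        recordTrade("SUNW", 86.50, 200);
        recordTrade("MSFT", 22.26, 1000);
        TradeSummary sunw = forSymbol("SUNW");
        System.out.println("SUNW high=" + sunw.highPrice + " low=" + sunw.lowPrice
            + " shares=" + sunw.totalShares);
    }
}
```

Because every call goes through the static map, the totals are always current after each trade, which is the property the ticker-stream scenario depends on.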

You should now have all you need for tracking. Next, let’s look at how to interface between a parse-event handler and the tracking code.

Do what comes naturally

To most directly handle event streams, use a simple conditional check within the start- and/or end-element handling. If you’re using attribute values, locate the check in the start element, since it contains the attribute values. In contrast, if you’re using character data content, you’ll typically locate the check in the end-element handling so you can properly handle the content text. If you’re working with both attributes and content text, you’ll probably need to check in both places.

SAX2 delivers content event notifications with the org.xml.sax.ContentHandler interface. For your application, you’ll work with just three of the interface’s event handler methods:

  • startElement(): Called at each element’s start
  • endElement(): Called at each element’s end (including empty elements)
  • characters(): Called for character data

The trade history document uses content text rather than attributes. You’ll want to collect the character-data content from each of the lowest-level elements, then save the content temporarily until you see the <stock-trade> element’s end, which signals you’ve seen everything you need to see.

Collecting the character-data content can prove more involved than it sounds, though. The SAX2 parser calls the characters() method for any character-data content not known by the parser to be ignorable (as whitespace can sometimes be, if the document has a DTD, or document type definition). In our document, the parser calls the characters() method for all the document text that is not part of the markup, including whitespace. Here the markup consists of the XML declaration on the first line along with the start and end tags.

Each call to characters() will normally report a character sequence, although that isn’t a SAX2 requirement. The SAX2 parser can call the method once for each character, if it finds such an approach convenient. To avoid potential problems, assume the data is passed in pieces, and assemble it yourself.
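The assembly step can be shown without a parser at all by calling the handler methods directly, the way a parser that splits the text would. This is a minimal sketch; ContentCollector and collect() are hypothetical names used only here.

```java
import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

/** Assembles element content that may arrive in several characters() calls. */
class ContentCollector extends DefaultHandler {
    private final StringBuilder buffer = new StringBuilder();
    String lastContent;   // content of the most recently ended element

    @Override
    public void startElement(String uri, String lname, String qname, Attributes attrs) {
        buffer.setLength(0);               // start fresh for each element
    }

    @Override
    public void characters(char[] chars, int start, int length) {
        buffer.append(chars, start, length);   // never assume one call per text run
    }

    @Override
    public void endElement(String uri, String lname, String qname) {
        lastContent = buffer.toString();
    }

    /** Feeds the handler one element whose text arrives in the given pieces. */
    static String collect(String... pieces) {
        ContentCollector handler = new ContentCollector();
        handler.startElement("", "symbol", "symbol", new AttributesImpl());
        for (String piece : pieces) {
            char[] chars = piece.toCharArray();
            handler.characters(chars, 0, chars.length);
        }
        handler.endElement("", "symbol", "symbol");
        return handler.lastContent;
    }

    public static void main(String[] args) {
        // Same result whether the text arrives whole or in fragments
        System.out.println(collect("SUNW"));
        System.out.println(collect("SU", "N", "W"));
    }
}
```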

For your application, you may also want to allow arbitrary whitespace to be inserted before and after your content text. This would let you properly handle documents formatted along these lines:

...
<stock-trade>
  <symbol>
    SUNW
  </symbol>
  <time>
    08:45:19
  </time>
  <price>
    286.24
  </price>
  <quantity>
    500
  </quantity>
</stock-trade>
...

You can handle extra whitespace by trimming it off content text before processing. Trimming whitespace certainly isn’t an XML requirement, and for many applications, it could prove dangerous. But for applications where you know leading and trailing whitespace is insignificant, trimming it can avoid potential compatibility problems with different programs used to generate and process XML documents. This fits your sample stock tracking application, so go ahead and strip leading and trailing whitespace from your content in the example code.

With all this taken into account, here’s what a simple SAX2 handler for your application might look like:

import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class StockStreamHandler extends DefaultHandler
{
    protected StringBuffer m_contentBuffer =  // Character data buffer
        new StringBuffer();
    protected String m_stockSymbol;         // Symbol for stock traded
    protected String m_tradeTime;           // Timestamp on trade
    protected double m_tradePrice;          // Price at which stock traded
    protected int m_sharesTraded;           // Number of shares traded
    
    public void startElement(String uri, String lname, String qname,
        Attributes attributes) {
        m_contentBuffer.setLength(0);
    }
    
    public void characters(char[] chars, int start, int length) {
        m_contentBuffer.append(chars, start, length);
    }
    
    public void endElement(String uri, String lname, String qname) {
        if (lname.equals("stock-trade")) {
            StockTrack.recordTrade(m_stockSymbol, m_tradeTime, 
                m_tradePrice, m_sharesTraded);
        } else {
            String content = m_contentBuffer.toString().trim();
            if (lname.equals("symbol")) {
                m_stockSymbol = content;
            } else if (lname.equals("time")) {
                m_tradeTime = content;
            } else if (lname.equals("price")) {
                m_tradePrice = Double.parseDouble(content);
            } else if (lname.equals("quantity")) {
                m_sharesTraded = Integer.parseInt(content);
            }
        }
    }
}

In the code above, the startElement() method clears the StringBuffer that accumulates character data. The characters() method appends to the accumulated character data in the StringBuffer, and the endElement() method handles the content. This content handling step works differently for each element type.

The lower-level elements contained within the <stock-trade> element each supply a piece of information you need to properly record the trade. You need the complete set of pieces to actually do the recording, though, so the endElement() handling for each of these saves the information to an instance variable for later use. Since the information is always character-data content, the code converts the accumulated character data to a String, strips leading and trailing whitespace with the trim() method, and, if necessary, converts it to a numeric value before saving it.

When you see the end of the <stock-trade> element, you know that the contained elements have already been processed. The only item remaining: record the trade using the information saved from contained elements.
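To try the same flow end to end, the sketch below pairs a trimmed-down handler with a small driver. It is an approximation for experimenting, not the test driver shipped in stock.jar: TradeListHandler and parseTrades() are invented names, and the handler collects each trade as a string instead of calling the tracking class.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Saves each child value, then emits one line when </stock-trade> arrives. */
class TradeListHandler extends DefaultHandler {
    private final StringBuilder content = new StringBuilder();
    private String symbol;
    private String time;
    private double price;
    private int shares;
    final List<String> trades = new ArrayList<>();

    @Override
    public void startElement(String uri, String lname, String qname, Attributes attrs) {
        content.setLength(0);
    }

    @Override
    public void characters(char[] chars, int start, int length) {
        content.append(chars, start, length);
    }

    @Override
    public void endElement(String uri, String lname, String qname) {
        String text = content.toString().trim();   // drop surrounding whitespace
        switch (lname) {
            case "symbol":   symbol = text; break;
            case "time":     time = text; break;
            case "price":    price = Double.parseDouble(text); break;
            case "quantity": shares = Integer.parseInt(text); break;
            case "stock-trade":
                // All contained elements have ended, so the trade is complete
                trades.add(symbol + " " + time + " " + price + " x " + shares);
                break;
            default:
                break;   // ignore elements we do not track
        }
    }

    /** Parses a trade-history document and returns one line per trade. */
    static List<String> parseTrades(String xml) {
        TradeListHandler handler = new TradeListHandler();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);   // needed for lname to be filled in
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("parse failed", e);
        }
        return handler.trades;
    }

    public static void main(String[] args) {
        String xml = "<trade-history>\n  <stock-trade>\n    <symbol>\n      SUNW\n    </symbol>\n"
            + "    <time>08:45:19</time>\n    <price>86.24</price>\n"
            + "    <quantity>500</quantity>\n  </stock-trade>\n</trade-history>";
        parseTrades(xml).forEach(System.out::println);
    }
}
```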

In the code, I assume the default SAX2 parser setting of the Namespaces property to true. That setting makes a big difference in how you identify the element in startElement() and endElement(). When the property is true, the lname parameter gives the local name for the element (without any Namespace prefix). In the case of the example document, you’re not actually using Namespaces; if you were (with element names of the form “ns:name”), you’d need to use the combination of uri and lname to identify the processed element. With the Namespaces property false, the qname parameter is used instead. In this case, qname gives the literal element name including prefix (“ns:name”, for example). JAXP makes this property false by default, so you can easily become confused and wind up with code that doesn’t recognize your elements.
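The difference is easy to check by parsing the same prefixed element twice, once with namespace handling on and once with it off. The following sketch assumes the JDK's bundled parser; NameReporter and report() are names made up for the illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Records the three name parameters passed to startElement(). */
class NameReporter extends DefaultHandler {
    final List<String> names = new ArrayList<>();

    @Override
    public void startElement(String uri, String lname, String qname, Attributes attrs) {
        names.add("uri=" + uri + " lname=" + lname + " qname=" + qname);
    }

    /** Parses the XML with or without namespace handling and returns the names seen. */
    static List<String> report(String xml, boolean namespaceAware) {
        NameReporter reporter = new NameReporter();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(namespaceAware);
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), reporter);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("parse failed", e);
        }
        return reporter.names;
    }

    public static void main(String[] args) {
        String xml = "<ns:name xmlns:ns='urn:example'>text</ns:name>";
        System.out.println("aware:   " + report(xml, true));
        System.out.println("unaware: " + report(xml, false));
    }
}
```

A handler that compares lname against its element names will silently match nothing when the parser is not namespace-aware, which is the confusion described above.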

Complicate the structure

The direct approach to event-based processing outlined above works well, and I’ve seen it used successfully in many applications. However, it leads to messy code as the handled document structures get increasingly complex. Suppose you want to handle a stream that contains more than one higher-level element type, say, option trades in addition to stock trades. These could be represented by <option-trade> elements:

  <option-trade>
    <symbol>SUNW</symbol>
    <time>08:45:19</time>
    <option-type>call</option-type>
    <strike-price>100</strike-price>
    <expiration-month>9</expiration-month>
    <trade-price>13.47</trade-price>
    <quantity>500</quantity>
  </option-trade>

Next, here’s code modified from the earlier example to handle multiple higher-level element types in the stream:

public class StockStreamHandler extends DefaultHandler
{
    ...
    protected boolean m_isCallOption;       // Call option flag
    protected double m_strikePrice;         // Price for option exercise
    protected int m_expirationMonth;        // Option expiration month
    ...
    public void endElement(String uri, String lname, String qname) {
        if (lname.equals("stock-trade")) {
            StockTrack.recordTrade(m_stockSymbol, m_tradeTime, 
                m_tradePrice, m_sharesTraded);
        } else if (lname.equals("option-trade")) {
            OptionTracker.recordTrade(m_stockSymbol, m_isCallOption, 
                m_strikePrice, m_expirationMonth, m_tradeTime, 
                m_tradePrice, m_sharesTraded);
        } else {
            String content = m_contentBuffer.toString().trim();
            if (lname.equals("symbol")) {
                m_stockSymbol = content;
            } else if (lname.equals("time")) {
                m_tradeTime = content;
            } else if (lname.equals("price")) {
                m_tradePrice = Double.parseDouble(content);
            } else if (lname.equals("quantity")) {
                m_sharesTraded = Integer.parseInt(content);
            } else if (lname.equals("option-type")) {
                m_isCallOption = content.equals("call");
            } else if (lname.equals("strike-price")) {
                m_strikePrice = Double.parseDouble(content);
            } else if (lname.equals("trade-price")) {
                m_tradePrice = Double.parseDouble(content);
            } else if (lname.equals("expiration-month")) {
                m_expirationMonth = Integer.parseInt(content);
            }
        }
    }
}

The example code becomes messy with just two top-level element types, since the endElement() method must handle the different lower-level element types. If your code needs to handle several different element types, perhaps including different nesting levels, it will be difficult to write and maintain with this approach.

More flexible processing approaches

As an alternative, you can build more flexible code for event-based processing by adding a level of indirection. You can add indirection in several ways. For instance, you can define your own simplified interface just for handling the main event of interest to your application:

public interface EndElementHandler
{
    public void endElement(String lname, String content);
}

You can then implement this separately for each higher-level element:

public class StockTradeHandler implements EndElementHandler
{
    protected String m_stockSymbol;         // Symbol for stock traded
    protected String m_tradeTime;           // Timestamp on trade
    protected double m_tradePrice;          // Price at which stock traded
    protected int m_sharesTraded;           // Number of shares traded
    
    public void endElement(String lname, String content) {
        if (lname.equals("stock-trade")) {
            StockTrack.recordTrade(m_stockSymbol, m_tradeTime, 
                m_tradePrice, m_sharesTraded);
        } else {
            if (lname.equals("symbol")) {
                m_stockSymbol = content;
            } else if (lname.equals("time")) {
                m_tradeTime = content;
            } else if (lname.equals("price")) {
                m_tradePrice = Double.parseDouble(content);
            } else if (lname.equals("quantity")) {
                m_sharesTraded = Integer.parseInt(content);
            }
        }
    }
}
public class OptionTradeHandler implements EndElementHandler
{
    protected String m_stockSymbol;         // Symbol for stock traded
    protected String m_tradeTime;           // Timestamp on trade
    protected int m_sharesTraded;           // Number of shares traded
    protected boolean m_isCallOption;       // Call option flag
    protected double m_strikePrice;         // Price for option exercise
    protected double m_tradePrice;          // Price of option sale
    protected int m_expirationMonth;        // Option expiration month
    
    public void endElement(String lname, String content) {
        if (lname.equals("option-trade")) {
            OptionTracker.recordTrade(m_stockSymbol, m_isCallOption, 
                m_strikePrice, m_expirationMonth, m_tradeTime, 
                m_tradePrice, m_sharesTraded);
        } else {
            if (lname.equals("symbol")) {
                m_stockSymbol = content;
            } else if (lname.equals("time")) {
                m_tradeTime = content;
            } else if (lname.equals("quantity")) {
                m_sharesTraded = Integer.parseInt(content);
            } else if (lname.equals("option-type")) {
                m_isCallOption = content.equals("call");
            } else if (lname.equals("strike-price")) {
                m_strikePrice = Double.parseDouble(content);
            } else if (lname.equals("trade-price")) {
                m_tradePrice = Double.parseDouble(content);
            } else if (lname.equals("expiration-month")) {
                m_expirationMonth = Integer.parseInt(content);
            }
        }
    }
}

To finish, change the basic handler class so you can set the appropriate higher-level element handler:

public class StockStreamHandler extends DefaultHandler
{
    protected StockTradeHandler m_stockHandler = new StockTradeHandler();
    protected OptionTradeHandler m_optionHandler = new OptionTradeHandler();
    protected EndElementHandler m_endHandler;
    ...
    public void startElement(String uri, String lname, String qname,
        Attributes attributes) {
        m_contentBuffer.setLength(0);
        if (lname.equals("stock-trade")) {
            m_endHandler = m_stockHandler;
        } else if (lname.equals("option-trade")) {
            m_endHandler = m_optionHandler;
        }
    }
    ...    
    public void endElement(String uri, String lname, String qname) {
        String content = m_contentBuffer.toString().trim();
        // No handler is set until the first trade element starts
        if (m_endHandler != null) {
            m_endHandler.endElement(lname, content);
        }
    }
}

The above approach lets you keep the details of different element structures in separate classes, giving a cleaner, easier-to-maintain structure as you add new element types and modify existing ones through the course of a project.

You can extend the new approach in several ways for even more flexibility. First, you could use a HashMap to associate element names with handler classes in the StockStreamHandler.startElement() method, so that, rather than hard-coding your handlers, you can configure them at runtime. Second, you could enhance the simplified interface to include start-element handling and handler nesting (with one handler turning control to a handler for a contained element, which then returns control to the original handler when the contained element ends), allowing you to nest handlers for multiple levels of child elements. I’ll explore the second approach further in the next article in this series.
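The first extension could be sketched roughly as follows. This is one possible shape under stated assumptions, not the design from the example jars: DispatchingHandler, register(), and DispatchDemo are hypothetical, and the registered handlers simply log what they receive so the routing is visible.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

interface EndElementHandler {
    void endElement(String lname, String content);
}

/** Picks the end-element handler from a map instead of hard-coded checks. */
class DispatchingHandler extends DefaultHandler {
    private final Map<String, EndElementHandler> handlers = new HashMap<>();
    private final StringBuilder content = new StringBuilder();
    private EndElementHandler current;   // null until a registered element starts

    /** Associates a higher-level element name with its handler at runtime. */
    void register(String elementName, EndElementHandler handler) {
        handlers.put(elementName, handler);
    }

    @Override
    public void startElement(String uri, String lname, String qname, Attributes attrs) {
        content.setLength(0);
        EndElementHandler registered = handlers.get(lname);
        if (registered != null) {
            current = registered;        // switch handlers at each registered element
        }
    }

    @Override
    public void characters(char[] chars, int start, int length) {
        content.append(chars, start, length);
    }

    @Override
    public void endElement(String uri, String lname, String qname) {
        if (current != null) {
            current.endElement(lname, content.toString().trim());
        }
        content.setLength(0);            // avoid passing stale text to the next call
    }
}

class DispatchDemo {
    /** Parses the XML and returns a log of which handler saw which element. */
    static List<String> run(String xml) {
        List<String> log = new ArrayList<>();
        DispatchingHandler handler = new DispatchingHandler();
        handler.register("stock-trade", (lname, text) -> log.add("stock:" + lname + "=" + text));
        handler.register("option-trade", (lname, text) -> log.add("option:" + lname + "=" + text));
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("parse failed", e);
        }
        return log;
    }

    public static void main(String[] args) {
        run("<trade-history><stock-trade><symbol>SUNW</symbol></stock-trade>"
            + "<option-trade><symbol>MSFT</symbol></option-trade></trade-history>")
            .forEach(System.out::println);
    }
}
```

Adding support for a new element type then becomes a register() call rather than another branch in startElement().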

For other approaches to handling event-based parsing in your programs, check out the prior JavaWorld SAX articles listed in Resources.

SAX2 saves the day

In this article, you’ve learned how to use SAX2 parsers for handling XML in your applications, including some of the problems you’re likely to encounter and some of their solutions.

In the next article in this series, I’ll take the event-based programming approach further with an enhanced handler design that adds more flexibility to your programs. I’ll also cover the pull-parser approach. Pull parsing resembles SAX/SAX2 event-stream parsing, but it gives your program control over the stream, which gives you all the advantages of event-stream parsing without the complexities of event-driven programming. Be sure to check back then for the rest of the story on parse-event stream processing of XML in Java!

Dennis Sosnoski is an enterprise architecture consultant and developer with more than 30 years’ experience. As the president and lead consultant of Seattle-based consulting company Sosnoski Software Solutions, Inc., he’s spent the last four years designing and building enterprise applications in Java. Dennis started working with XML in Java two-and-a-half years ago, and in addition to originating both commercial and open source projects for XML, he’s chaired the Seattle Java-XML SIG since its founding in 1999.

Source: www.infoworld.com