Carry your state
Standard techniques for contingency planning don’t map well to service-oriented communications
Concepts such as BPM (business process management), federated computing, and Web services would not exist were it not for reliable, low-latency networks. Such service-oriented architectures entail calling out to multiple systems. But what happens when a system or network loses its connection, overloads, or hands back garbled data?
Standard contingency planning is coarse-grained: an entire site, server, or application is failed over. After failover, existing sessions (such as database queries and credit card authorizations that were already underway) are usually dropped.
But the coarse-grained contingency model runs aground in a service-oriented design. One server might run five applications that publish a total of 30 Web services interfaces. If three of those interfaces fail, the availability solution may not know the application is in trouble.
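To make the granularity mismatch concrete, here is a rough sketch of what interface-level monitoring looks like. The URLs and the WSDL probe are inventions; a real monitor would use whatever health check each service actually exposes.

```python
# A rough sketch of probing each published interface rather than pinging the
# host. The URLs and the "?wsdl" probe are made up for illustration.
import urllib.request

INTERFACES = [
    "https://server.example/services/orders",
    "https://server.example/services/billing",
    # ...one entry for each of the 30 published interfaces
]

def failing_interfaces():
    failed = []
    for url in INTERFACES:
        try:
            with urllib.request.urlopen(url + "?wsdl", timeout=5):
                pass  # any successful response counts as alive
        except OSError:  # connection failures and HTTP errors both land here
            failed.append(url)
    return failed  # act on individual interfaces, not the whole server
```

A monitor at this granularity can report that three of 30 interfaces are down even while the server itself looks perfectly healthy.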
Database managers and messaging systems use distributed transactions to guard against data loss. It’s one thing to roll back a five-step transaction that fails at step three; it’s quite another to report failure on a 50-step business process that hangs at step 49. The failure-handling models built into the BPM systems I’ve used are primitive: rollback is not an option, and if the state of the entire process is lost (because the orchestration or directory server goes down, for example), the outcome is unpredictable.
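Here is the easy case in miniature, using Python’s sqlite3 module and an invented table: a short transaction that fails partway through simply rolls back and leaves no trace.

```python
# A minimal sketch of transactional rollback; the table and steps are invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE audit (step INTEGER)")

try:
    with conn:  # the connection commits on success, rolls back on any exception
        for step in range(1, 6):
            conn.execute("INSERT INTO audit (step) VALUES (?)", (step,))
            if step == 3:
                raise RuntimeError("step three failed")
except RuntimeError:
    pass  # the rollback already happened; nothing from the attempt survives

print(conn.execute("SELECT COUNT(*) FROM audit").fetchone()[0])  # prints 0
```

A 50-step business process that spans several services has no equivalent lever to pull.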
Like the Web itself, Web services should be stateless by design. But distributed applications already expect servers to save data from one call to the next. I see plenty of Web services apps that use callbacks to make the application less sensitive to latency. A callback says, “I’m not going to wait; you call me back when you’re done processing my request.” It’s great that the application can do other things instead of looping along, awaiting a response, but a Web service can’t call you back if the server fails over before it gets a chance to answer. If the client hiccups and loses its state before the return call, it won’t know what to do with the callback when it arrives. These are critical issues, but tools, infrastructure, and developers often leave them to chance.
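Here is a rough sketch of that callback pattern with the client’s context held only in memory. The service object, its process call, and the helper names are all hypothetical; the part that matters is what happens when the pending map is gone.

```python
# A rough sketch of the callback pattern described above. The service object,
# its process() call, and the helper names are hypothetical.
import uuid

pending = {}  # correlation ID -> everything the client needs to finish the job

def submit(service, order):
    correlation_id = str(uuid.uuid4())
    pending[correlation_id] = {"order": order, "next_step": "record_invoice"}
    # Don't wait for an answer; ask the service to call back when it's done.
    service.process(order,
                    callback_url="https://client.example/callback",
                    correlation_id=correlation_id)
    return correlation_id

def handle_callback(correlation_id, result):
    context = pending.pop(correlation_id, None)
    if context is None:
        # The client restarted before the return call: its in-memory state is
        # gone, and it no longer knows what this callback refers to.
        raise LookupError("no pending request for " + correlation_id)
    finish(context, result)

def finish(context, result):
    """Hypothetical next step: apply the result using the saved context."""
```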
One solution is to use these technologies with their natural limitations in mind. A lengthy BPM workflow should be broken into subprocesses so it doesn’t always have to restart at step one, and each subprocess’s business logic should provide the equivalent of rollback in the event of failure. All saved state information must be verifiably replicated across redundant servers; a state-dependent Web service shouldn’t report success until replication completes. In other words, applications have to take ownership of contingency planning.
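A rough sketch of those two rules, under invented names: each step in a subprocess registers a compensating action, and saved state isn’t acknowledged until an assumed state store confirms replication.

```python
# The state_store object and its write()/wait_for_replicas() methods are
# assumptions, not a real API; the structure is what matters.

def run_subprocess(steps):
    """steps is a list of (do, undo) pairs of callables."""
    done = []
    try:
        for do, undo in steps:
            do()
            done.append(undo)
    except Exception:
        # The business-logic equivalent of rollback: compensate in reverse order.
        for undo in reversed(done):
            undo()
        raise

def save_state(state_store, process_id, state):
    state_store.write(process_id, state)
    # A state-dependent service shouldn't report success until every redundant
    # server has the data.
    if not state_store.wait_for_replicas(process_id, timeout_seconds=5):
        raise TimeoutError("state not replicated; refusing to report success")
```

A failed subprocess then restarts from its own first step rather than from step one of the whole workflow.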
An alternative, which I actually prefer, is to include complete state data — even the entire process flow — in every call. A callback would have all the data needed to rebuild the state of a restarted client. If it looks like that’s too much data to push around your network, maybe that’s a sign that your application is too complex and too dependent on state.
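A rough sketch of what that looks like on the wire, with invented JSON field names and a hypothetical resume helper: the callback repeats everything a freshly restarted client would need.

```python
# The field names and the resume() helper are inventions for illustration.
import json

def build_request(order, flow, completed_steps, next_step):
    return json.dumps({
        "order": order,
        "flow": flow,                        # the whole process flow, not a pointer to it
        "completed_steps": completed_steps,  # e.g. ["validate", "authorize"]
        "next_step": next_step,
        "callback_url": "https://client.example/callback",
    })

def handle_callback(payload):
    message = json.loads(payload)
    # A client with no memory of the original request can still pick up exactly
    # where the process left off, because the message carries the whole context.
    resume(message["flow"], message["completed_steps"],
           message["next_step"], message["order"])

def resume(flow, completed_steps, next_step, order):
    """Hypothetical: rebuild local state from the message and carry on."""
```

If a message built this way is too big to move comfortably, that bloat is itself a useful warning about the process it describes.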