Persist data with Java Data Objects, Part 1
Grasp the qualities behind an ideal persistence layer
“Everything should be made as simple as possible, but not simpler.”
Albert Einstein
The need to persist data created at runtime is as old as computing. And the need to store object-oriented data cropped up when object-oriented programming became pervasive. Currently, most modern, nontrivial applications use an object-oriented paradigm to model application domains. In contrast, the database market is more divided. Most database systems use the relational model, but object-based data stores prove indispensable in many applications. On top of that, we often need to interface with legacy systems.
This article identifies the issues associated with data persistence in transactional middleware environments, such as J2EE (Java 2 Platform, Enterprise Edition), and shows how Java Data Objects (JDO) solves some of those issues. This article provides an overview, not a detailed tutorial, and is written from the viewpoint of an application developer, not a JDO implementation designer.
Read the whole series on Java Data Objects:
- Part 1. Grasp the qualities behind an ideal persistence layer
- Part 2. Sun JDO vs. Castor JDO
This article is intended for Java developers, designers, and J2EE architects who work on systems that must store data in relational databases, object databases, or other storage media. I assume you have a basic knowledge of Java and some familiarity with object-relational issues and terminology.
Transparent persistence: Why bother?
More than a decade of continuous attempts to bridge the object-oriented runtime and persistent storage has produced several important observations (listed in order of importance):
- Abstracting away any persistence details and having a clean, simple, object-oriented API to perform data storage is paramount. We don’t want to handle persistence details and internal data representation in data stores, be they relational, object-based, or something else. Why should we deal with low-level constructs of the data-store model, such as rows and columns, and constantly translate them back and forth? Instead, we need to concentrate on that complex application we were required to deliver by yesterday.
- We want to use the plug-and-play approach with our data stores: We want to use different providers/implementations without changing a line of the application source code — and perhaps without modifying more than a few lines in the appropriate configuration file(s). In other words, we need an industry standard for accessing data based on Java objects, one that plays a role similar to the one JDBC (Java Database Connectivity) plays as an industry standard for accessing SQL-based data.
- We want to use the plug-and-play approach with different database paradigms; that is, we want to switch from a relational database to an object-oriented one with minimal changes to the application code. Though nice to have, in practice, this capability is often not required.
One comment here: While relational databases enjoy the biggest market presence by far, providing a unified persistence API and allowing data-store providers to compete on implementation strengths makes sense, regardless of the paradigm these providers use. This approach might eventually help level the playing field between the two dominant database vendor groups: the well-entrenched relational camp and the struggling-for-market-share object-oriented camp.
The three discoveries listed above lead us to define a persistence layer, a framework that provides a high-level Java API for objects and relationships to outlive the runtime environment’s (JVM) lifespan. Such a framework must feature the following qualities:
- Simplicity
- Minimal intrusion
- Transparency, meaning the framework hides the data-store implementation
- Consistent, concise APIs for object storage/retrieval/update
- Transaction support, meaning the framework defines transactional semantics associated with persistent objects
- Support for both managed (e.g., application server-based) as well as unmanaged (standalone) environments
- Support for the necessary extras, such as caching, queries, primary key generation, and mapping tools
- Reasonable licensing fees — not a technical requirement, but we all know that poor economics can doom an excellent project
I detail most of the above qualities in the following sections.
Simplicity
Simplicity rates high on my list of required traits for any software framework or library (see this article’s opening quote). Developing distributed applications is already hard enough, and many software projects fail because of poor complexity (and, by extension, risk) management. Simple is not synonymous with simplistic; the software should have all the needed features that allow a developer to do his/her job.
Minimal intrusion
Every persistent storage system introduces a certain amount of intrusion into the application code. The ideal persistence layer should minimize intrusion to achieve better modularity and, thus, plug-and-play functionality.
For the purpose of this article, I define intrusion as:
- The amount of persistence-specific code splattered across the application code
- The need to modify your application object model, either by implementing some persistence interface (such as Persistable or the like) or by postprocessing the generated code
Intrusion also applies to object-oriented database systems and, although usually less of an issue there compared to relational data stores, it can vary significantly among ODBMS (object-oriented database management system) vendors.
Transparency
The persistence layer transparency concept is pretty simple: the application uses the same API regardless of the data-store type (data storage-type transparency) or the data-store vendor (data storage-vendor transparency). Transparency greatly simplifies applications and improves their maintainability by hiding data-store implementation details to the maximum extent possible. In particular, for the prevalent relational data stores, you don’t need to hardcode SQL statements or column names, or remember the column order returned by a query, as you do with JDBC. In fact, you don’t need to know SQL or relational algebra, because they’re too implementation specific. Transparency is perhaps the persistence layer’s most important trait.
Consistent, simple API
The persistence layer API boils down to a relatively small set of operations:
- Elementary CRUD (create, read, update, delete) operations on first-class objects
- Transaction management
- Management of application object identities and persistent object identities
- Cache management (i.e., refreshing and evicting)
- Query creation and execution
An example of a PersistenceLayer API:
public void persist(Object obj); // Save obj to the data store.
public Object load(Class c, Object pK); // Read obj with a given primary key.
public void update(Object obj); // Update the modified object obj.
public void delete(Object obj); // Delete obj from the database.
public Collection find(Query q); // Find objects that satisfy conditions of our query.
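To make these five operations concrete, here is a minimal sketch of how they might fit together. It is not JDO and not any vendor's API: the `Query` and `PersistenceLayer` interfaces, the `InMemoryPersistenceLayer` class, and the caller-supplied `keyOf` function are all hypothetical names introduced for illustration, and a `HashMap` stands in for a real data store.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical query abstraction: decides whether a stored object matches.
interface Query {
    boolean matches(Object candidate);
}

// Hypothetical interface mirroring the five operations listed above.
interface PersistenceLayer {
    void persist(Object obj);
    Object load(Class<?> c, Object pK);
    void update(Object obj);
    void delete(Object obj);
    Collection<Object> find(Query q);
}

// Toy implementation backed by a HashMap instead of a real data store.
final class InMemoryPersistenceLayer implements PersistenceLayer {
    private final Map<List<Object>, Object> store = new HashMap<>();
    // Assumption: the caller tells us how to derive a primary key from an object.
    private final Function<Object, Object> keyOf;

    InMemoryPersistenceLayer(Function<Object, Object> keyOf) {
        this.keyOf = keyOf;
    }

    // An object is identified by its exact class plus its primary key.
    private static List<Object> id(Class<?> c, Object pK) {
        return Arrays.<Object>asList(c, pK);
    }

    @Override
    public void persist(Object obj) {
        List<Object> id = id(obj.getClass(), keyOf.apply(obj));
        if (store.containsKey(id)) {
            throw new IllegalStateException("Already persistent: " + id);
        }
        store.put(id, obj);
    }

    @Override
    public Object load(Class<?> c, Object pK) {
        return store.get(id(c, pK)); // null when nothing is stored under that identity
    }

    @Override
    public void update(Object obj) {
        List<Object> id = id(obj.getClass(), keyOf.apply(obj));
        if (!store.containsKey(id)) {
            throw new IllegalStateException("Not persistent: " + id);
        }
        store.put(id, obj);
    }

    @Override
    public void delete(Object obj) {
        store.remove(id(obj.getClass(), keyOf.apply(obj)));
    }

    @Override
    public Collection<Object> find(Query q) {
        List<Object> result = new ArrayList<>();
        for (Object candidate : store.values()) {
            if (q.matches(candidate)) {
                result.add(candidate);
            }
        }
        return result;
    }
}
```

The application code sees only objects, classes, and keys; nothing in the interface reveals whether rows, files, or an object database sit underneath, which is the point of the transparency requirement.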
Transaction support
A good persistence layer needs several elementary functions to start, commit, or roll back a transaction. Here is an example:
// Transaction (tx) demarcation.
public void startTx();
public void commitTx();
public void rollbackTx();
// Choose to make a persistent object transient after all.
public void makeTransient(Object o);
Note: Transaction demarcation APIs are primarily used in nonmanaged environments. In managed environments, the built-in transaction manager often assumes this functionality.
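In a nonmanaged environment, the calling code typically wraps its work between the start and commit calls and rolls back on failure. The sketch below illustrates that pattern under simplifying assumptions: `TransactionalStore` and `TxExample` are hypothetical classes, the "data store" is an in-memory map, and a transaction is modeled as a private working copy that replaces the committed state only on commit.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical store with explicit transaction demarcation.
final class TransactionalStore {
    private final Map<String, Object> committed = new HashMap<>();
    private Map<String, Object> working; // null while no transaction is active

    void startTx() {
        if (working != null) {
            throw new IllegalStateException("Transaction already active");
        }
        working = new HashMap<>(committed); // private copy for this transaction
    }

    void put(String key, Object value) {
        requireTx();
        working.put(key, value);
    }

    Object get(String key) {
        return (working != null ? working : committed).get(key);
    }

    void commitTx() {
        requireTx();
        committed.clear();
        committed.putAll(working); // make all changes visible at once
        working = null;
    }

    void rollbackTx() {
        requireTx();
        working = null; // discard every change made since startTx()
    }

    private void requireTx() {
        if (working == null) {
            throw new IllegalStateException("No active transaction");
        }
    }
}

final class TxExample {
    // Stores all entries or none of them.
    static void storeAll(TransactionalStore store, Map<String, Object> objects) {
        store.startTx();
        try {
            for (Map.Entry<String, Object> e : objects.entrySet()) {
                if (e.getValue() == null) {
                    throw new IllegalArgumentException("No value for " + e.getKey());
                }
                store.put(e.getKey(), e.getValue());
            }
            store.commitTx();
        } catch (RuntimeException ex) {
            store.rollbackTx();
            throw ex;
        }
    }
}
```

A real persistence layer would delegate this bookkeeping to the data store or to the container's transaction manager; the copy-on-start approach here only demonstrates the all-or-nothing semantics.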
Managed environments support
Managed environments, such as J2EE application servers, have grown popular with developers. Who wants to write middle tiers from scratch these days when we have excellent application servers available? A decent persistence layer should be able to work within any major application server’s EJB (Enterprise JavaBean) container and synchronize with its services, such as JNDI (Java Naming and Directory Interface) and transaction management.
Queries
The API should be able to issue arbitrary queries for data searches. It should include a flexible and powerful, but easy-to-use, language — the API should use Java objects, not SQL tables or other data-store representations as formal query parameters.
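One way to picture a query expressed against Java objects rather than tables is shown below. This is only a sketch: `ObjectQuery` and its `where` and `execute` methods are hypothetical, the `User` class with `name` and `location` fields is an assumed domain type, and the candidates are filtered in memory, whereas a real implementation would translate the conditions into whatever the data store understands.

```java
import java.util.Collection;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Assumed domain class, used only for this illustration.
final class User {
    final String name;
    final String location;

    User(String name, String location) {
        this.name = name;
        this.location = location;
    }
}

// Hypothetical query object: conditions refer to Java fields, not columns.
final class ObjectQuery<T> {
    private final Class<T> candidateClass;
    private Predicate<T> filter = candidate -> true;

    ObjectQuery(Class<T> candidateClass) {
        this.candidateClass = candidateClass;
    }

    // Adds a condition; multiple conditions are combined with logical AND.
    ObjectQuery<T> where(Predicate<T> condition) {
        filter = filter.and(condition);
        return this;
    }

    // Evaluates the query against a collection of candidate objects.
    List<T> execute(Collection<?> extent) {
        return extent.stream()
                .filter(candidateClass::isInstance)
                .map(candidateClass::cast)
                .filter(filter)
                .collect(Collectors.toList());
    }
}
```

For example, `new ObjectQuery<User>(User.class).where(u -> u.name.equals("Tom")).execute(allObjects)` returns the `User` instances named Tom without mentioning a table or column name anywhere.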
Cache management
Cache management can do wonders for application performance. A sound persistence layer should provide full data caching as well as appropriate APIs to set the desired behavior, such as locking levels, eviction policies, lazy loading, and distributed caching support.
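As a small illustration of one eviction policy, the sketch below implements a count-limited cache that discards the least recently used entry once a capacity is exceeded. The class name `CountLimitedCache` is hypothetical; the implementation relies on the standard `java.util.LinkedHashMap` in access order together with its `removeEldestEntry` hook. Locking levels, lazy loading, and distributed caching are outside the scope of this sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical count-limited cache with least-recently-used eviction.
final class CountLimitedCache<K, V> {
    private final Map<K, V> entries;

    CountLimitedCache(final int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        // accessOrder = true keeps the least recently accessed entry first.
        this.entries = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > capacity; // evict once the limit is exceeded
            }
        };
    }

    synchronized V get(K key) {
        return entries.get(key);
    }

    synchronized void put(K key, V value) {
        entries.put(key, value);
    }

    // Explicit eviction, for example after the object changed in the data store.
    synchronized void evict(K key) {
        entries.remove(key);
    }

    synchronized int size() {
        return entries.size();
    }
}
```

With a capacity of 2, putting keys 1 and 2, reading key 1, and then putting key 3 evicts key 2, because key 2 is the entry that was used least recently.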
Primary key generation
Providing automatic identity generation for data is one of the most common persistence services. Every decent persistence layer should provide identity generation, with support for all major primary key-generation algorithms. Primary key generation is a well-researched issue and numerous primary key algorithms exist.
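The HIGH-LOW scheme is one commonly cited key-generation algorithm: the generator reserves a block of keys with a single round trip to the data store and then hands out keys from that block in memory. The sketch below assumes a hypothetical `HighLowKeyGenerator` class and models the data-store round trip as a `LongSupplier` supplied by the caller; in practice that supplier would increment a counter row or sequence in the database.

```java
import java.util.function.LongSupplier;

// Hypothetical HIGH-LOW primary key generator.
final class HighLowKeyGenerator {
    private final LongSupplier nextHigh; // assumption: reserves the next block in the data store
    private final int blockSize;
    private long high;
    private int low;

    HighLowKeyGenerator(LongSupplier nextHigh, int blockSize) {
        if (blockSize <= 0) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        this.nextHigh = nextHigh;
        this.blockSize = blockSize;
        this.low = blockSize; // forces a block reservation on the first call
    }

    // Returns a new key; contacts the data store only once per block.
    synchronized long nextKey() {
        if (low == blockSize) {
            high = nextHigh.getAsLong();
            low = 0;
        }
        return high * blockSize + low++;
    }
}
```

With a block size of 3 and a supplier that returns 0, 1, 2, and so on, the generator produces keys 0, 1, 2 from the first block and 3, 4, 5 from the second, touching the supplier only twice.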
Mapping, for relational databases only
With relational databases, a data mapping issue arises: the need to translate objects into tables, and to translate relationships, such as dependencies and references, into additional columns or tables. This is a nontrivial problem in itself, especially with complex object models. The topic of object-relational model impedance mismatch reaches beyond this article’s scope, but is well publicized. See Resources for more information.
The following extras related to mapping and/or relational data stores are not required in the persistence layer, but they make a developer’s life much easier:
- A GUI (graphical user interface) mapping tool
- Code generators: Autogeneration of DDL (data definition language) to create database tables, or autogeneration of Java code and mapping files from DDL
- Primary key generators: Supporting multiple key-generation algorithms, such as UUID, HIGH-LOW, and SEQUENCE
- Support for binary large objects (BLOBs) and character-based large objects (CLOBs)
- Self-referential relations: An object of type Bar referencing another object of type Bar, for example
- Raw SQL support: Pass-through SQL queries
Example
The following code snippet shows how to use the persistence layer API. Suppose we have the following domain model: A company has one or more locations, and each location has one or more users. The following could be an example application’s code:
PersistenceManager pm = PMFactory.initialize(..);
Company co = new Company("MyCompany");
Location l1 = new Location("Boston");
Location l2 = new Location("New York");
// Create users.
User u1 = new User("Mark");
User u2 = new User("Tom");
User u3 = new User("Mary");
// Add users. A user can only "belong" to one location.
l1.addUser(u1);
l1.addUser(u2);
l2.addUser(u3);
// Add locations to the company.
co.addLocation(l1);
co.addLocation(l2);
// And finally, store the whole tree to the database.
pm.persist(co);
In another session, you can look up companies employing the user Tom:
PersistenceManager pm = PMFactory.initialize(...);
Collection companiesEmployingToms = pm.find("company.location.user.name=\"Tom\"");
For relational data stores, you must create an additional mapping file. It might look like this:
<!DOCTYPE mapping PUBLIC ... >
<mapping>
<class name="com.numatica.example.Company" identity="companyID" key-generator="SEQUENCE">
<cache-type type="count-limited" capacity="5"/>
<description>Company</description>
<map-to table="Companies"/>
<field name="companyID" type="long">
<sql name="companyID" type="numeric"/>
</field>
<field name="name" type="string">
<sql name="name" type="varchar"/>
</field>
<field name="locations" type="com.numatica.example.Location" collection="arraylist">
</field>
</class>
<class name="com.numatica.example.Location" identity="locationID"
key-generator="SEQUENCE">
<cache-type type="unlimited"/>
<description>Locations</description>
<map-to table="Locations"/>
<field name="locationID" type="long">
<sql name="locationID" type="numeric"/>
</field>
<field name="name" type="string">
<sql name="name" type="varchar"/>
</field>
<field name="company" type="com.numatica.example.Company" required="true">
<sql name="companyID"/>
</field>
</class>
<class name="com.numatica.example.User" identity="userID"
depends="com.numatica.example.Location" >
<cache-type type="count-limited" capacity="200"/>
<description>User</description>
<map-to table="Users"/>
<field name="userID" type="integer">
<sql name="userID" type="numeric"/>
</field>
<field name="location" type="com.numatica.example.Location" required="true">
<sql name="locationID"/>
</field>
<field name="name" type="string">
<sql name="username" type="varchar"/>
</field>
</class>
</mapping>
The persistence layer takes care of the rest, which encompasses the following:
- Finding dependent object groups
- Managing application object identity
- Managing persistent object identities (primary keys)
- Persisting each object in the appropriate order
- Providing cache management
- Providing the proper transactional context (we don’t want only a portion of the object tree persisted, do we?)
- Providing user-selectable locking modes
Available solutions
The available persistence layer solutions divide into the following groups:
- Roll your own, using perhaps the JDBC API
- Proprietary object-relational mapping tools or object databases (ODBMS)
- J2EE/entity bean CMP (container-managed persistence) solutions
- Java Data Objects (JDO)-based solutions
None of the solutions currently available satisfies all the criteria set forth for the ideal persistence layer. Below I review the first three solutions, before focusing on the JDO-based approach.
Roll-your-own approach
In the past, most developers turned to the roll-your-own approach for projects that use relational databases for persistent storage. The two technologies used by this approach, JDBC, and, to a lesser degree, SQLJ, require a sound knowledge of the relational SQL technology and, of course, provide no transparency. While this approach works fine for projects with a relatively small object model, it can quickly lead to hard-to-maintain code because the mapping tends to be implicit/hardcoded in multiple locations in the source code.
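To show what hardcoded mapping looks like in practice, here is a hedged sketch of a hand-written JDBC loader. The `CompanyRow` and `CompanyJdbcDao` classes are hypothetical; the table and column names (`Companies`, `companyID`, `name`) are assumptions borrowed from the example mapping file earlier in this article. Only standard `java.sql` calls are used.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Hypothetical value class holding one row of the Companies table.
final class CompanyRow {
    final long id;
    final String name;

    CompanyRow(long id, String name) {
        this.id = id;
        this.name = name;
    }
}

// Hypothetical hand-written data access class.
final class CompanyJdbcDao {
    // The mapping lives in this SQL string: table name, column names, column order.
    static final String SELECT_BY_ID =
            "SELECT companyID, name FROM Companies WHERE companyID = ?";

    // Returns the company with the given key, or null if no such row exists.
    static CompanyRow load(Connection con, long id) throws SQLException {
        try (PreparedStatement ps = con.prepareStatement(SELECT_BY_ID)) {
            ps.setLong(1, id);
            try (ResultSet rs = ps.executeQuery()) {
                if (!rs.next()) {
                    return null;
                }
                // Column positions must match the SELECT list above.
                return new CompanyRow(rs.getLong(1), rs.getString(2));
            }
        }
    }
}
```

Every additional operation (insert, update, delete, each finder) repeats the same table and column knowledge in another SQL string, so a schema change has to be tracked down across all of them.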
Several companies have attempted to develop in-house proprietary persistence layers. Because such a project is a nontrivial, expensive task, only large and deep-pocketed organizations can bank on this approach. But why should even those with deep pockets reinvent the wheel?
Proprietary object-relational mapping tools or object databases
Proprietary persistence layer solutions divide into two groups: object-relational (O/R) mapping tools and object databases.
Object-relational mapping tools
O/R tools have been around for quite some time, are quite mature, and are available from many vendors. The leading products are TopLink from WebGain, CocoBase by Thought, Inc., and the Visual Business Sight Framework (VBSF) by ObjectMatter.
Though most O/R mapping products offer consistent, simple APIs, generally, some vendors could improve on their simplicity. Also, most could offer less intrusion, though VBSF does a good job in this respect. While all these products provide reasonable transparency, albeit narrowed down to relational backends, they all have proprietary APIs. Nothing is wrong with proprietary implementations, as long as they conform to an established multivendor interface. That is not the case for the products mentioned here. In addition, mapping techniques vary greatly from vendor to vendor, so porting between vendors would probably be a significant issue. And finally, some proprietary O/R mapping tools have prohibitive deployment licensing costs in addition to development licenses.
These tools are mature and well supported, so despite the objections above and assuming the trade-offs are understood and accepted, it might still make sense to use these tools in the near future, until existing multivendor and open source standards mature. In the long term, however, these vendors are on a collision course with the upcoming standard persistence layer implementations and, to survive, they might need to provide unique features.
Object databases
The object database situation proves worse than the situation with O/R mapping tools. If you are a middleware application architect using these tools, you and the project are locked into a single mapping tool vendor. Moreover, you cannot replace the whole data storage system easily. However, most major ODBMS vendors have pledged to support the upcoming JDO standard, so this situation should improve.
J2EE/entity bean CMP
Many projects use EJB CMP detailed by the J2EE specification and delivered by EJB container vendors. Limited and simplistic, the EJB 1.1 CMP specification failed to cover more advanced relationships. The new spec, CMP 2.0, is much improved; however, entity beans introduce significant overhead in terms of code maintenance and performance.
JDO: Standardized, open, transparent persistence in/for Java
Two standardization/development efforts fall under the name Java Data Objects (JDO): Sun’s JDO and Castor JDO.
Sun’s JDO
The Java Community Process developed Java Data Objects (JDO), a high-level API specification and a reference implementation. A relatively recent effort, the JDO Java Specification Request (JSR-12) was approved in July 1999 and is a Proposed Final Draft at the time of this publication. In addition to the spec, JDO also includes a reference implementation, although the 1.0 beta version, shipped in second quarter 2001, implements persistence using flat files.
The specification
Sun’s JDO specification defines a simple, transparent interface between application objects and transactional data stores. It primarily interests JDO implementation providers. Several vendors have already lined up, but because the specification is a recent development and still in the draft stage, only one or two have implementations out of the beta stage; and not one vendor offers all the features outlined by the specification.
Although Sun’s JDO provides data storage-type transparency, the ODMG (Object Data Management Group) 3.0 persistent storage interface and several major ODBMS vendors have heavily influenced the specification. For example, the mapping mechanism format (an XML document type definition) is conspicuously brief, leaving implementation providers to define their own formats — which limits portability and therefore defies JDO’s purpose.
The spec covers:
- Persistence semantics with respect to transactions
- Interactions of transactional objects with J2EE
- Data selection (queries) based on Java expressions
Reference implementation
Sun’s JDO includes a functionally limited implementation of the file storage variant and allows storage, retrieval, navigation, and transactional update of persistent instances (no query capabilities).
Third-party implementations
Sun JDO implementation providers fall into two groups:
Nondatabase vendors (independents):
- SolarMetric Kodo JDO
- PrismTech OpenFusion JDO
- TradeCity Cybersoft’s Rexip JDO
Major ODBMS vendors:
- Versant Judo
- GemStone Systems
- IBM’s Informix
- Poet Software
- eXcelon (formerly Object Design)
Castor JDO
An open source project, Castor JDO started in 1999 under the auspices of Exolab. Despite its name, Castor JDO is not compatible with Sun’s spec, although it does not differ much. Castor JDO concentrates on relational data stores exclusively, so it does not support data storage-type transparency; however, among the nonproprietary API (open source or multivendor) solutions available, its feature set is certainly among the best, and the price can’t be beat. Though the development team seems relatively small, it has been developing Castor JDO longer than Sun has been developing its spec, and it shows. However, as several Sun JDO-compliant solutions quickly mature, Castor JDO’s advantage might not last long. Castor JDO meets many of the requirements presented for the ideal persistence layer and therefore deserves a more detailed look.
JDO: A strong persistence layer
This article provided an overview of object-relational issues and presented a list of requirements for the ideal persistence layer. While many proprietary products providing object-relational mapping are available, we need a standardized, open interface with multivendor support or a viable open source implementation. Two standardization and/or development efforts that meet these requirements have emerged: Sun’s JDO and Castor JDO.
Stay tuned for Part 2 of this series, which will describe in more detail Sun’s JDO and Castor JDO. I will discuss how they meet, or do not meet, the criteria for the ideal persistence layer. Part 2 will also compare JDO with other emerging persistence techniques, such as EJB CMP 2.0.