Plant your data in a ternary search tree

Create an English dictionary that checks spelling and matches words as you type

The ternary search tree (TST) is the champion of data structure acrobatics — it finds all keys having a given prefix, suffix, or infix. It even finds those keys that closely match a given pattern. You can easily search the tree for partial matches — auto matches autograph, automatic, and so on. In addition, you can implement near-match functions, which gives you the ability to suggest alternatives for misspelled words — e.g., impliment matches implement.

A TST stores key-value pairs, where keys are strings and values are objects. TST keys are stored and retrieved in sorted order, regardless of the order in which they are inserted into the tree. In addition, TSTs use memory efficiently to store large quantities of data. Best of all, all of this magic is performed lightning fast. The tremendous flexibility of TSTs provides ample opportunity for creative programming.

This article demonstrates three creative applications of the TST:

An English dictionary that matches words as you type and checks spelling
A flexible array that can assume any size or dimension on the fly
A database that stores all information in the same place (regardless of which record or column the information belongs to), thereby decreasing access time and reducing storage requirements

Let’s begin with a demonstration of the dictionary application.

The dictionary applet

The applet below simulates an English dictionary that uses a TST to store dictionary data. To create this applet, I started with a list of roughly 48,000 words that I found on the Internet. (I can’t promise that the list is complete or that all of the words are spelled accurately.) For each word, I created a definition equal to the string "Definition of the word " + currentWord + ".", and stored these key-value pairs (word, definition) in the TST.

Figure 1. An English dictionary built on a ternary search tree

TST structure

From an external perspective, the TST behaves like a HashMap or a HashTable. That is, objects stored in the tree are indexed using string keys. Internally, the TST stores each key character in a separate node.

Figure 2. A ternary search tree Click on thumbnail to view full-size image. (19 KB)

Figure 2 illustrates two types of objects: objects that represent tree nodes (no color) and value objects (yellow or gray) encapsulated by the nodes. Arrows represent references — if a bidirectional arrow connects nodes “s” and “o,” then “s” has a reference to “o,” and “o” has a reference to “s.” Each TST node encapsulates one character from a key (its splitchar), one value object, a reference to its parent node, and a reference to each of three child nodes — lo kid, equal kid, and hi kid. A lo kid is below and to the left of its parent, an equal kid is directly below its parent, and a hi kid is below and to the right of its parent. The highest node in the tree is the root node; its parent is null.

The TST in Figure 2 stores five key-value pairs. The keys are so, tab, to, too, and tot. The string too, for example, points to an instance of an object labeled too data in the diagram.

TST algorithms

Let’s turn to the manipulation of data stored in the TST. You need a means for using a string key to retrieve data from the tree, using a string key to insert data into the tree, retrieving any data whose key matches a given prefix, and for retrieving any data whose key closely matches a given string.

The following algorithm retrieves the object indexed by a string key.

Object get(String key) {
    TSTNode currentNode = rootNode;
    int charIndex = 0;
    while(true) {
        if(currentNode == null) return null;
        if (key.charAt(charIndex) == currentNode.splitchar) {
            charIndex++;
            if(charIndex == key.length()) return currentNode.data;
            currentNode = currentNode.relatives[TSTNode.EQKID];
        } else if(key.charAt(charIndex) < currentNode.splitchar) {
            currentNode = currentNode.relatives[TSTNode.LOKID];
        } else {
            currentNode = currentNode.relatives[TSTNode.HIKID];
        }
    }
}

Within a given iteration, this algorithm focuses on a certain node within the tree — the currentNode — and on a certain character within the key — though not explicitly named above, let’s call it currentChar.

If currentChar comes before the currentNode‘s splitchar in the alphabet, the algorithm shifts its focus to the currentNode‘s lo kid, does not change the value of currentChar, and repeats the process. If currentChar comes after the currentNode‘s splitchar, the algorithm shifts to the currentNode‘s hi kid, does not change currentChar‘s value, and repeats the process. If currentChar is equal to the currentNode‘s splitchar, the algorithm shifts to the currentNode‘s equal kid, sets the value of currentChar to the next char in the key, and repeats the process. When the currentChar‘s value changes, the algorithm checks to see if the end of the key has been reached; if it has, it returns the data object — the value — stored in the currentNode.

Refer back to Figure 2 as I walk you through the process of retrieving the data indexed by tab. First, set currentChar to the first letter in the key — tab — and set currentNode to the root node — the “t” node, or highest node in Figure 2. Compare currentChar with the rootNode‘s splitchar; since both are equal to “t,” shift to the “o” node and set currentChar to the next character in tab. Since “a” comes before “o” in the alphabet, move to the lo kid — the “a” node.

Since both currentChar and splitchar are now “a,” move to the equal kid — the “b” node — and set currentChar to the next character in tab. Now, both currentChar and splitchar equal “b.” You’ve reached the end of the key, so the data object stored in the “b” node is the data you’re after.

Consider the problem of storing the key-value pair: tab, myObject. To store the pair, you can use the same algorithm with one slight modification — once it finds the tab node, instead of returning the data object, the algorithm stores myObject in the node (overwriting any data already stored in that node).

Suppose you wish to store a key-value pair and the tree lacks some or all of the nodes required to represent the key. Again, you use nearly the same algorithm. Upon entering a new iteration of the while loop, if the currentNode is null, simply create the missing node and add it to the tree. The new node’s position should be the position in which you expected to find it.

Say you wish to store tub, myObject in Figure 2’s tree. Set currentChar to the key’s (tub) first character, and set currentNode to the root node. Compare currentChar to the currentNode‘s splitchar; since both equal “t,” set currentChar to the next character in tub and set currentNode to the equal kid — the “o” node. Compare currentChar to “o.” Since “u” comes after “o” in the alphabet, set currentNode to the hi kid, leaving currentChar unchanged. Since currentNode is null, create a new node, setting its splitchar to currentChar, “u.” The new node will become the “o” node’s hi kid. Now compare currentChar to the currentNode‘s splitchar; since both equal “u,” set currentChar to the next character in the key and currentNode to the equal kid. Because currentNode is null, create a new node, setting its splitchar to currentChar, or “b.” The new node will become the “u” node’s equal kid. Compare currentChar to the currentNode‘s splitchar; both are equal to “b.” You’ve reached the end of the key, so store myObject in currentNode — the “b” node. Figure 3 depicts the new tree structure.

Figure 3. Figure 2’s TST after adding tub key. Click on thumb- nail to view full size image. (23 KB)

The example described above highlights an important characteristic of the TST data structure: Though tub has three characters, adding its key resulted in the addition of only two new tree nodes. That results because the TST reuses the “t” node (root node in Figure 2) in the representation of the new key, tub. Similarly, if you add a new key tool, only one more node (splitchar equal to 'l') would be required. Because the TST reuses nodes and because it creates nodes only when needed to represent new keys, the TST occupies little memory relative to many other data structures.

You’ve examined the basic storage and retrieval of data. Now explore how to find data having keys that only partially match a target string.

To find all nodes whose keys have a given prefix, simply find the node indexed by that prefix. Regard that node’s equal child as a subtree’s root node. Every node in the subtree represents a key that begins with the prefix. Since the tree (or subtree) structure maintains the alphabetical order of its keys, a recursive algorithm that retrieves a list of tree data will order the list according to the keys’ alphabetical order.

What does it mean to find all nodes whose keys nearly match a target string? In this case, keys can differ from a target string by a specified number of characters. For example, impliment differs from implement by one character. An algorithm that implements that functionality can form the basis for dictionary spell-checking. The recursive algorithm below finds all keys differing from the target key by “d” characters:

matchAlmost(String key, int i, TSTNode currentNode, int d) {
    if((currentNode == null) || (d < 0) || (i == key.length())) return;
            
    // low branch
    if(key.charAt(i) < currentNode.splitchar) matchAlmost(key, i, currentNode.relatives[TSTNode.LOKID], d);
            
    //equal branch
    int nextD = (key.charAt(i) == currentNode.splitchar) ? d : d - 1;
    if((key.length() == i + 1) && (nextD == 0) && (currentNode.data != null)) {
        // add the key of the current node to the list of nearly matching keys
    }
    matchAlmost(key, i + 1, currentNode.relatives[TSTNode.EQKID], nextD);
    // hi branch
    if(key.charAt(i) > currentNode.splitchar) matchAlmost(key, i, currentNode.relatives[TSTNode.HIKID], d)
}

The algorithm above resembles the algorithm for finding matching keys, except that it regards a key character (currentChar) as matching a currentNode‘s splitchar when the characters actually match or when the current number of mismatching characters has not exceeded the desired number. Unlike the algorithm used in the get method, the algorithm above is recursive and can return multiple values.

Flexible arrays

You may use a TST as an array of type Object. Simply convert the array index to a string and regard it as a TST key. For example, to store myObject into slot int i = 79, call myTST.put(String.valueOf(i), myObject);. You can easily wrap a TST in a class that automatically performs the int-to-string conversion. It isn’t necessary to declare the capacity of such an array in advance — that is, when the array instantiates.

You can use an arbitrary separator (say, 'u0800') to separate indices of multidimensional arrays. For example, to add another index int j = 7 to the prior example, call myTST.put(String.valueOf(i) + 'u0800' + String.valueOf(j), myOtherObject). Just as it isn’t necessary to declare the capacity of such an array in advance, it isn’t necessary to declare its dimensions in advance either. In fact, a one-dimensional array can occupy the same TST as other multidimensional arrays. In that case, an array’s data with a specific dimension will not overwrite those arrays having other dimensions.

A TST database

Let’s use a TST to build the following database:

**A simple database**
Employee ID	Employee eye color	Employee hair color
33	Blue	Blue
37	Brown	Brown

Use special marker characters (e.g., 'u0800', 'u0801', and so on) for a TST node’s splitchar to identify database columns. Figure 4 depicts a database with three columns and two entries. The columns are employee ID ('u0800'), employee eye color (u0801), and employee hair color ('u0802').

Figure 4. A TST database Click on thumbnail to view full-size image. (25 KB)

The column identified by the splitchar 'u0800' contains an integer ID that is unique for each employee, like a social security number. Such values are known as primary keys. The ID entry for person number 33 is added to the database with the statement myTST.put(String.valueOf(33) + 'u0800', emp33DBEntry), where emp33DBEntry is an instance of a DatabaseEntry class (described below). Placing the marker characters at the end of the keys, rather than at the beginning, allows columns and entries to share TST nodes, thus reducing the memory occupied by the database.

As depicted in Figure 4, an instance of the DatabaseEntry class will be stored in one employee ID node, a 'u0800' node. The same DatabaseEntry instance will also be stored in two linked lists. One of these lists will reside in an eye color node — a 'u0801' node — and the other in a hair color node — a 'u0802' node. The database entry object has a handle to the TST node that terminates its primary key column entry, the 'u0800' node. It also has a handle that points to the LinkedList node storing its eye color entry and another handle that points to the LinkedList node storing its hair color entry. Each of these LinkedList nodes has static handles to the TST nodes that encapsulate the LinkedLists. That allows you to decipher the column value by walking the tree backwards until you reach the root node.

To retrieve the hair color of employee 37, obtain the database entry object indexed by the string "37" + 'u0800'. From this database entry object, obtain the hair color LinkedList node; from this node, obtain the encapsulating TSTNode, 'u0802'. From 'u0802', walk backwards up the tree, reconstructing the hair color value as you go.

Suppose you want to set employee 33’s eye color to blue. Before storing a value in a nonprimary key column, you must verify that the value does not already exist. First, retrieve employee 33’s DatabaseEntry object: emp33DBEntry = myTST.get(String.valueOf(33) + 'u0800'). This object encapsulates a handle to a LinkedList node for the eye-color entry, and you can easily delete the node. Now call myTST.put(String.valueOf("blue") + 'u0801', emp33DBEntry). Don’t forget to reset the eye-color handle that lives in emp33DBEntry so that it points to the new LinkedList node.

To attain the employee ID for every employee having blond hair and blue eyes, retrieve the linked list indexed by the string "blond" + 'u0802' and the linked list indexed by the string "blue" + 'u0801'. Check the length of these lists and discard the longer one. Suppose the blond hair list is the shorter list. Obtain the TST node indexed by the string "blue" + 'u0801', the “blue-eye” node. Now for each entry in the blond hair list, retrieve the DatabaseEntry object and check to see if its eye-color handle references a LinkedList node that lives in the blue-eye node. If its eye-color handle does indeed reference the blue-eye node, then add the DatabaseEntry object’s employee ID to the result list, otherwise move on to the next node.

Using the same list of words I used to build the dictionary applet, I built a database similar to the one described above. Like the dictionary applet, this database featured about 48,000 entries. The word list served as the set of primary keys that uniquely identify employees. As each database entry was created, I used Java’s random number generator — Math.random() — to choose one of three values for eye color and one of three values for hair color. This algorithm produces about 16,000 (48,000/3) people having each eye color and about 16,000 people having each hair color. About 5,333 (48,000/(3*3)) people will have a given combination of hair color and eye color. Running on my 600MHz laptop, the algorithms described above take about 100 milliseconds to return a list of roughly 5,333 people having a certain hair and eye color.

While Figure 4’s database has only a single table, it isn’t difficult to devise a TST database structure that supports multiple tables.

Conclusion

The ternary search tree (TST) behaves like a HashTable, in that it uses a key to quickly locate, store, and retrieve data. TST keys are stored (and thus retrieved) in sorted order. A HashTable can use any Object as a key, as long as the object implements a hash function. In contrast, a TST must use a string key. A TST is superior to a HashTable in its ability to find data having keys that partially match a target string. Furthermore, a TST allocates no memory for data that has yet to be placed in the tree and reuses nodes (memory) that multiple data values have in common, thus reducing memory requirements for large data amounts. The flexible way in which the TST searches for and accesses data provides numerous opportunities for creative programming.

Wally Flint is an independent consultant
specializing in object-oriented design for Java development. He has
expertise in both client and server-side Java and has developed a
number of original sensor technologies based on novel algorithms —
including an accelerometer (gravimeter), wind speed and direction
sensors, and wind turbine yaw angle sensors. He earned a BSEE from
the University of Michigan and a master’s degree in international
business from The American Graduate School of International
Management. His hobbies include developing Java software for
linguistic and multimedia applications.

Source: www.infoworld.com