Access the world’s biggest database with Web DataBase Connectivity

Use SQL to query Web-based information

What’s your hobby? Mine is amateur investing and the Web helps me a lot. I can log on to one of my favorite financial information sites and look up the latest share prices, or I can log on to a different site to see the news items for the stocks I hold. I have a secret recipe for choosing shares, which I can’t tell you, but suppose it’s something like: “I will buy shares in a company whose price drops to its 52-week low and has appeared in the news this week.” For the first part of that formula, how do I find all companies at their 52-week low? In the past, I viewed the information for each company individually and did a mental calculation. Now I simply type something like this:

SELECT ... AS name, ... AS price, ... AS yearLow
FROM 
WITHPOST symbol=ALL
WHERE price=yearLow

Besides my hobby, I also have a business — the Java business. Therefore, I like to stay informed of the latest Java news and views, which I can do at this site, which might be familiar to you (Figure 1):

Figure 1. The JavaWorld homepage. Click on thumbnail to view full-size image (69 KB).

You will see from the picture that today I’m interested in Enterprise Java Beans, so I’ve typed “ejb” into the search box. The JavaWorld search engine does its job and rewards me with Figure 2:

Figure 2. Search Results for “ejb.” Click on thumbnail to view full-size image (68 KB).

The results appear in descending date order and — although you can’t see it in the picture — the result set contains entries for prior months as well as the current month. Now I want to see the results in ascending date order rather than descending date order but with results prior to December omitted. If I had some software capable of running it, I could express exactly what I want to see with this piece of SQL:

SELECT ... AS linkText, ... AS description, ... AS date
FROM 
WITHGET qt=ejb
WHERE date MATCHES '*Dec*'
ORDER BY date

Figure 3. SQL results. Click on thumbnail to view full size image (13 KB).

Look at how I’ve placed the Webpage URL in the FROM clause, in the space where a table name would usually go. You will also notice a new keyword, WITHGET, which I’ve introduced to trigger a form submission. The result of running this SQL statement would be as shown in Figure 3. I could export the data to another application, a RDBMS or a spreadsheet program. If I issued the query from within a Java program — using something like JDBC — I could present the data in my own style, manipulate the data, and combine it with results from other sites. Imagine the possibilities.

That’s enough for setting the scene. You should have a fairly good idea of what I’m proposing and why I think it’s such a great idea, so now I’ll tell you a bit more about this new variation on the SQL theme.

From HTML to SQL

Figure 4. Hierarchy of HTML elements. Click on thumbnail to view full-size image (29 KB).

As designed, SQL works with regular two-dimensional tables with data arranged neatly into rows and columns. Web-based information — in the form of HTML — just isn’t like that, as you can see from the hierarchy of elements comprising the JavaWorld search page (Figure 4).

To move from the HTML hierarchy to the regular SQL table paradigm, you must think of SQL in a slightly different way. As a simple example, look at how I’ve transformed the following SQL statement that runs against an employee table with two columns:

 
> SELECT name, salary FROM employee WHERE name="Tony"
> SELECT row[*].column['name'], row[*].column['salary'] FROM employee WHERE row[*].column['name']='Tony'
> SELECT row[*].column[1] AS name, row[*].column[2] AS salary FROM employee WHERE name="Tony"

The final statement means exactly the same, semantically, as the first. It reads, “Select the first column (call it name) of any row and the second column (call it salary) of any row in the employees table where the name is equal to ‘Tony.’”

Using this syntax I could propose a new query language called 3D-SQL that operates on tables with the traditional two dimensions plus an additional third dimension, called history. Each table cell is now addressable as row[x].column[y].history[z], which allows me to construct statements like:

> SELECT row[*].column[1].history[1] AS name, row[*].column[2].history[*] AS salary FROM employee WHERE salary>10000

The statement means, “Select the first historic element of any row’s first column (call it name) and any historic element of any row’s second column (call it salary) where the salary is greater than 10,000.” In plain English it means, “Give me the names of those employees who have ever had a salary greater than 10,000.”

Although I’ve placed the wildcard characters within row specifications (row[*]), I could reverse the sense and use column[*] to do something rather more difficult with traditional SQL: search across the table columns rather than down the table rows.

Now when you look at Figure 4, you might see it in a new light. It’s not quite so frightening now, I hope, and I can return to my earlier query against this page and fill in the blanks (note the placement of the wildcard characters) to give:

SELECT .table[0001].tr[0000].td[0002].table[0002].tr[0000].td[0001].table[0000]
.tr[0001].td[0000].table[*].tr[0000].td[0000].htxt[0000] AS linkText,
.table[0001].tr[0000].td[0002].table[0002].tr[0000].td[0001].table[0000]
.tr[0001].td[0000].table[*].tr[0000].td[0000].text[0001] AS description,
.table[0001].tr[0000].td[0002].table[0002].tr[0000].td[0001].table[0000]
.tr[0001].td[0000].table[*].tr[0000].td[0001].table[0000].tr[0001].td[0000]
.text[0000] AS date
FROM 
WITHGET qt=ejb
WHERE date MATCHES '*Dec*'
ORDER BY date

The proposed 3D-SQL name will have to go because we’re now working in more than three dimensions, and since the lowest common denominator for these queries will be HTML, I’ll now settle on the name HTMSQL (HyperText Markup Structured Query Language). I suspect that this SQL has a lot in common with the versions of SQL for object databases, but that’s not my specialty.

Of course, coming up with these kinds of queries proves difficult if all you’ve got to work from is the Webpage as presented in a browser. Nor are we all fluent in raw HTML, so I propose some new tools.

SQL tools for the Web

For starters, we need an engine to run the SQL statements. I started to implement such an engine and became increasingly excited about the idea as I did more work and ran more test queries. I started with a simple engine that allowed queries against a single-source Webpage. At the time of this writing, the engine has been extended to allow selections from multiple Webpages — as though from separate database tables — with the results being joined together.

Devising the queries to put through the engine was very hard work, so I came up with the idea of an Assistant that would present a source HTML page in the hierarchical form shown above in Figure 4. Not only would this show how sets of elements repeat (to help you decide where to put the wildcard characters), it would also allow an SQL statement to be constructed with a few mouse clicks.

The peculiarities of HTML would be handled by highlighting certain elements such as links, forms, and frames, and by allowing you to drill down into frames and follow links to find the data you desire. Form submissions would be automated in order to support dynamic — and static — pages as data sources.

I have produced a demonstration version of the Assistant — as a Java 2 applet — that lets you run some example SQL queries on live Web data using the SQL Engine. See the link in Resources to try it out.

This kind of assistant would be a useful end-user tool, allowing data to be extracted from the Web and exported to a spreadsheet or database application for further processing. But I’d really like to capture — as would you, I suspect — the results into a Java program, probably a servlet that could be used as a portal. To accomplish that task, we need a Java API.

The WDBC API

My title for the Java API — Web DataBase Connectivity (WDBC) — suggests an API similar to JDBC. Although I might do so in the future, I’ve not followed the JDBC API to the letter because we require only a small subset of the JDBC functionality for this task. Indeed, support for updates is not relevant, nor is support for stored procedures. However, I do think that the programmatic feel of the API should be similar to JDBC so, if you know what JDBC code looks like, I’m sure you’ll spot the similarities in this WDBC code:

int sqlColumns=3;
String sqlStatement="SELECT .table[0001]...htxt[0000] AS linkText, "
+".table[0001]...table[*].tr[0000].td[0000].text[0001] AS description, "
+".table[0001]...table[*].tr[0000].td[0000].href[0000] AS linkn"
+"FROM n"
+"WITHGET col=qt=ejbn"
+"WHERE linkText MATCHES '*December*'";
SqlEngine sqlEngine=new SqlEngine();
try
{
  ResultSet resultSet=sqlEngine.executeQuery(sqlStatement);
  while (resultSet.next())
  {
    for (int c=1; c<=sqlColumns; c++)
    {
      System.out.print(resultSet.getString(c)+"|");
    }
    System.out.println();
  }
}
catch (Exception ex) { ex.printStackTrace(); }

Queries are submitted via an executeQuery() method and the results come back as a ResultSet, the elements of which you can step through one by one using next(). Having extracted the data into your Java program, you can now do whatever you like with it.

Uses for WDBC

I see three main uses for the HTMSQL and the WDBC Java API:

As an end-user tool to extract selected information from any Website for loading into a spreadsheet or relational database for further analysis and reporting.
On portal sites that would use a WDBC servlet to collect and collate data from numerous sources and re-present it as information in new ways and in novel combinations, maybe according to the preferences of the viewing user.
As the enabling technology for a new breed of Web mining and information extraction (IE) tools that would let WDBC handle the syntax-side of things — that is, getting hold of the structured information. The data would then be subjected to a more rigorous textual (possibly even natural language) analysis, allowing the data to be acted on according to its meaning.

When you’re thinking about possible applications, keep in mind that you’re not restricted to the textual content of Webpages. Links, images, and other HTML elements can all be distinguished, meaning that your Java program could follow a link according to the host name of the link. Or what about displaying an image only if the accompanying text contains the word “Madonna?”

The example SQL statements provided for you to try the demonstration applet (see Resources) allow you to:

List the populations of the major cities in your country, ordered by city or population
See the latest share prices for the stocks you hold
Get some solar system facts (to demonstrate a two-page join)
Find out who is the registered owner of the domain you want

Conclusion

For quite some time I puzzled over how to extract information easily from Webpages so I could build new kinds of applications. Now that I’ve become so engrossed in what you might call the enabling technology, I’ve become interested in the technique itself rather than in its applications, so — for the moment — I’ll leave the applications to someone else. Maybe you have some ideas of your own, which I’d be interested to hear.

The proof-of-concept SQL Engine has been rewritten to take the idea forward without much of the original code. The new engine includes support for selections — and joins — across multiple source pages, which I think will increase the usefulness enormously. The priority now focuses on extending the Assistant to a wider toolset that makes it all child’s play.

I’m not the only one to think this is a great idea. A couple of commercial tools — W4F from the University of Pennsylvania and Caesius’s WebQL — address the same problem in a similar, yet different way (see Resources).

Source: www.infoworld.com