Download a Website for offline browsing

Use common Java classes to build an offloading utility

In this article, I guide you through the steps involved in designing a utility to download a Website. This utility downloads only text and image files, but it can easily be extended to download files of any type. At the end of the article I’ll provide tips on how you can extend the utility.

First, a brief introduction to URLs (Uniform Resource Locators) would not be out of place. The general form of a URL is:

protocol://machinename[:port]/filename[#reference]

An absolute URL — such as http://www.somesite.com/docs/index.html — has all the components required to identify the resource on the Web. In relative URLs, the protocol and the machine name are inherited from the base URL embedded in the document (base tag) or from the URL used to retrieve the document. For example, assume that you have downloaded an HTML document using the URL http://www.somesite.com/docs/index.html and that this document has a link home.html. The link actually points to http://www.somesite.com/docs/home.html. For more information, please see Resources.
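You can reproduce this resolution with the two-argument constructor of java.net.URL; here is a minimal sketch with placeholder URLs:

    import java.net.MalformedURLException;
    import java.net.URL;

    // Sketch: how java.net.URL resolves a relative link against a base URL.
    public class RelativeDemo {
        public static void main(String[] args) throws MalformedURLException {
            URL base = new URL("http://www.somesite.com/docs/index.html");
            URL resolved = new URL(base, "home.html");
            System.out.println(resolved);  // http://www.somesite.com/docs/home.html
        }
    }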

The utility I describe in this article uses the URL class in the java.net package. The class provides three methods to obtain data from a URL. In this utility, I use the method public final InputStream openStream() throws IOException to establish a connection with the URL and return an InputStream object for reading the data. Note that the data does not contain any of the HTTP headers. This method hides all the intricacies of setting the appropriate parameters and connecting to the remote resource, and the InputStream it returns lets you read the data as you would any other file stream.
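Here is a minimal sketch of openStream() in action; the URL is only a placeholder:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.net.URL;

    // Sketch: read a page with openStream().
    public class OpenStreamDemo {
        public static void main(String[] args) throws IOException {
            URL url = new URL("http://www.somesite.com/docs/index.html");
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(url.openStream()));
            String line;
            while ((line = in.readLine()) != null)
                System.out.println(line);  // raw page data, no HTTP headers
            in.close();
        }
    }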

Some of the commonly used protocols are HTTP, FTP, Gopher, and News. This article deals only with HTTP (HyperText Transfer Protocol), an application-level protocol commonly used to transfer hypertext documents across the Internet. HTTP has gained importance because of its simplicity and low overhead.

The main idea

Suppose you visit a Webpage containing links to several other pages that, in turn, have links to still other pages. You want to download all those pages onto your hard disk. How would you accomplish this? You could simply visit all the pages and save them on your hard disk, right? However, that is not only a tedious process but also an inconvenient one. The links in the pages may not be pointing correctly (relative to the location of other pages you are downloading), or the links might be absolute URLs pointing to the remote machine (in which case, downloading the page becomes useless). You could manipulate the links manually, but that would also be a painful process.

This utility lets you download all the pages of a Website in a graceful manner. It follows these simple steps:

  1. It downloads a page and stores all the links inside a vector.
  2. It loops over all the elements of the vector, repeating Steps 1 and 2 recursively.

The utility consists of four classes: DownloadSite, Downloader, URLlist, and ExtendedURL. You can download the source code from Resources.

DownloadSite

The DownloadSite class reads the command line arguments and does some initialization. It contains the main() method. This utility takes at least one but no more than two arguments. The first argument is the site name, and the second is the location of the new directory created to hold the downloaded files. If you do not specify the second argument, the files are downloaded into the current directory.

If you need to use this utility behind a firewall, make the changes in DownloadSite. See Resources for information on how to access sites from behind a firewall.
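One common approach (an assumption here, not necessarily how the article's code handles it) is to set the standard HTTP proxy system properties before opening any connection:

    // Hypothetical proxy host and port; adjust for your network.
    System.setProperty("http.proxyHost", "proxy.mycompany.com");
    System.setProperty("http.proxyPort", "8080");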

DownloadSite parses the command line arguments and passes them to the Downloader class, which does the actual downloading.
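A minimal sketch of that argument handling (the Downloader constructor and start() call are assumptions, not the article's exact code):

    public class DownloadSite {
        public static void main(String[] args) {
            if (args.length < 1 || args.length > 2) {
                System.err.println("Usage: java DownloadSite <site-url> [dest-dir]");
                System.exit(1);
            }
            String site = args[0];
            String destDir = (args.length == 2) ? args[1] : ".";  // default: current directory
            new Downloader(site, destDir).start();  // hypothetical entry point
        }
    }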

Downloader

Downloader is the heart of the utility. This class contains the logic used to download the pages and the code to manipulate the links.

You use recursion to download the pages. The logic is simple:

    private void startDownload(URL u)
    {
        ...
        listOfURL = downloadAndFillVector(in, out);
        /*
         * downloadAndFillVector downloads the file (and also
         * manipulates the links) and returns a vector of the URLs
         * in the file specified by URL u.
         * After this statement executes, listOfURL contains
         * the URLs in the current page that need to be downloaded.
         */
        ...
        sizeOfVector = listOfURL.size();
        for (int i = 0; i < sizeOfVector; i++)
            startDownload((URL)listOfURL.elementAt(i));
        /*
         * Loop through all the elements of the vector and
         * call startDownload recursively. The process repeats,
         * downloading all the pages.
         */
    }

I should explain two private members of this class: private String hostName and static Vector URLs:

  • hostName contains the machine name from the first page’s URL (the URL provided at the command line). In any page, you can have two types of links: absolute and relative. If the link is relative, use this hostName to retrieve the document. But if the link is absolute, you must check whether or not the host name in the link is the same as hostName. If it is, include this link in the list of URLs to be downloaded. If it isn’t, ignore this link. For example, if you are downloading a site, say www.somesite.com, and one of its pages contains a link to www.othersite.com, you do not want to download pages from www.othersite.com.
  • URLs is the global vector where you keep adding all the pages you download. When you get a link, check whether or not it is already present in URLs; this prevents you from downloading a page twice. Another common scenario: a page a.html can link to another page b.html, and b.html can in turn link back to a.html. The static Vector URLs also helps you avoid falling into such loops.

To download text files and binary files, you must have separate methods for each. From the file extensions, decide whether the file is a binary (image) file or a text file. The method nonTextFile() returns true if the file is not a text file. For efficiency, call a different method, downloadNonTextFile(), to download binary files. This function does not perform any file parsing.
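A sketch of such an extension-based test (the exact extension list here is an assumption):

    // Treat anything that is not an HTML or plain-text file as binary.
    private boolean nonTextFile(String fileName) {
        String lower = fileName.toLowerCase();
        return !(lower.endsWith(".html") || lower.endsWith(".htm")
                || lower.endsWith(".txt"));
    }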

If the files are text files, you must extract the links and modify them appropriately. If you wanted only to extract the links, you could use the Swing package to do so (see Resources). But for this article, you also want to change the links, so you must parse the files yourself.

The general strategy

First, search for all occurrences of <a, <base, <img, and <frame (irrespective of case) and store the characters up to the next enclosing > in a string. For example, in the case of <a href="enter.html">Click here to enter</a>, the sequence a href="enter.html" is stored in the string. Extract the URL from this string, but first make sure to take care of several things:

  • If the link is absolute, such as http://www.somesite.com/docs/index.html, it begins with a protocol name (http, ftp, news, and so forth — although this article is concerned only with http). In such cases, check whether or not the hostname in the link is the same as hostName:

    • If it isn’t, do not modify the link or download the file.
    • If the hostnames are the same, manipulate the link. Suppose the link is http://www.somesite.com/docs/index.html and the destination directory is /work/tp; you then modify the link to /work/tp/docs/index.html (meaning the destination directory plus the link’s filename) and add the unmodified link to the list of links to be downloaded.
  • If the link begins with a slash (/), as in /images/back.gif, the hostname and protocol are guaranteed to be the same. Modify the link to include the destination directory and add the unmodified link to the list of links to be downloaded. For example, assume you have downloaded a page from http://www.somesite.com that has a link /images/back.gif. If the destination directory is /work/tp, the link should be modified to /work/tp/images/back.gif.
  • If the link does not begin with a slash (/), you need not modify the link. Add the link to the list of links to be downloaded.
  • If the link ends with a slash (/), modify the link to include index.html and add the unmodified link to the list.
  • If the base tag (<base) is present in the document, evaluate the relative URLs using the base URL. In such cases, first evaluate the filename using the base URL and the relative links, then modify the link to the destination directory plus the newly evaluated filename. The unmodified link is added to the list of links to be downloaded. An important thing to note is that the base tag is removed from the downloaded file. Here’s an example of how the relative links are resolved in this case:

    <html>
    <head>
    <title>Just an example</title>
    <base href="http://www.somesite.com/docs/products/">
    </head>
    <body>
    <a href="../someotherproduct/index.html">Let us manipulate this link</a>
    </body>
    </html>

    Now, the relative URL ../someotherproduct/index.html resolves to http://www.somesite.com/docs/someotherproduct/index.html. The filename (meaning the newly evaluated filename) is then docs/someotherproduct/index.html. If the destination directory is /work/tp, the link is modified to /work/tp/docs/someotherproduct/index.html.

A number of methods are written to parse the file and manipulate the links. downloadAndFillVector() does the first-phase parsing. It scans the file being downloaded for <a, <base, <img, and <frame, and it stores the characters up to > in a StringBuffer. These characters are not written to the downloaded file, since you need to modify the links present in this string. This string is passed to another method, modifyLink().

Take a look at the following example:

<BODY>
<P>I just returned from vacation! Here is a photo of my family at the
lake:
<IMG SRC=image/family.gif  alt="A photo of family at the lake">
</BODY>

After the first-phase parsing, the string obtained is:

IMG SRC=image/family.gif  alt="A photo of family at the lake"

modifyLink() does the second-phase parsing. It searches for the occurrence of href or src in the string. Now, the link can appear in any one of the following forms:

  • SRC=image/family.gif
  • SRC =image/family.gif
  • SRC = image/family.gif

Most browsers accept spaces on either side of the equal sign (=) between href or src and the link. modifyLink() evaluates the index of the link’s beginning (in this case, the position of the i in image/family.gif). This index, the string, and the length of the string are passed to another method, processLink(). So, after the second-phase parsing, you have:

image/family.gif  alt="A photo of family at the lake"
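Inside modifyLink(), skipping those optional spaces might look like this sketch (the variable names are assumptions):

    // tag holds the first-phase string; attrEnd indexes just past "src" or "href".
    int eq = tag.indexOf('=', attrEnd);  // locate the equal sign
    int i = eq + 1;
    while (i < tag.length() && tag.charAt(i) == ' ')
        i++;                             // skip any spaces after '='
    // i now indexes the first character of the link itself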

processLink() performs the final phase of parsing. It finds where the link ends; now you have image/family.gif. Depending on the link, processLink() modifies it and writes the modified link to the file. The unmodified link is returned to downloadAndFillVector(), which adds the link to a vector.

downloadAndFillVector() does one more thing. If the base tag (<base) is present, downloadAndFillVector() extracts the URL from the base tag and assigns it to baseTagURL, a private member of Downloader. You use baseTagURL to resolve the actual file paths when the links are relative. downloadAndFillVector() calls the setBaseTagURL() method to extract the base URL from the base string, parsing the same way as in the first and second phases.

Finally, the list of links is passed to the formVectorOfURLs() method. This method creates an object of the URLlist class, whose sole purpose is to generate complete URLs from the base URL and the links so they can be used to download the Web pages.

URLlist

URLlist is a simple class. It receives the base URL, which is either the URL specified in the base tag or the URL used to retrieve the document, plus the list of links in the page to be downloaded. From this, it generates a list of URLs and returns the list to the Downloader class. It also adds the link to the global vector URLs. This class contains the functionality that prevents a link from being downloaded twice.
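A sketch of how such a class might build the complete URLs using the context constructor of java.net.URL (assumes java.net.URL, java.net.MalformedURLException, and java.util.Vector are imported; the method and parameter names are assumptions):

    // Resolve each link against the base URL; skip any URL already
    // recorded in the global vector of visited pages.
    static Vector formURLs(URL baseURL, Vector links, Vector visited)
            throws MalformedURLException {
        Vector result = new Vector();
        for (int i = 0; i < links.size(); i++) {
            URL full = new URL(baseURL, (String) links.elementAt(i));
            if (!visited.contains(full)) {  // prevents downloading a page twice
                visited.addElement(full);
                result.addElement(full);
            }
        }
        return result;
    }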

ExtendedURL

The URL class provides a method getFile() that returns the filename (anything after the machine name up to # or the end of the string). You need a way to get both the directory and the filename, and you need the directory name to maintain the same file structure.

URL, being a final class, cannot be extended, so you use composition instead. (There are basically two approaches in object-oriented programming to achieve code reusability: inheritance and composition. In composition, you use the existing class as a member of the new class, which is composed of the already existing class along with other members. For more information, see Resources.)

ExtendedURL has a member field of type URL. The ExtendedURL class provides methods getDirectory() and getFile(), which return the directory and the filename, respectively.

First, obtain the filename (directory plus the optional filename) using the getFile() method of the URL class. Then search for the question mark (?) in the filename. A question mark indicates that the query string is appended to the filename and that the file is a script. You can’t download scripts, so both directory and file are set to null. This URL is not added to the list of URLs.

If the filename ends with /, the directory is set to the filename and the file is set to null. In other cases, you can extract the directory and file from the filename by performing some simple calculations.
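Those calculations might look something like this sketch (a simplified assumption of ExtendedURL's logic; url, directory, and file are assumed member fields):

    String fileName = url.getFile();        // e.g. "/work/xyz.html"
    if (fileName.indexOf('?') != -1) {
        directory = null;                   // query string: a script,
        file = null;                        // which cannot be downloaded
    } else if (fileName.endsWith("/")) {
        directory = fileName;               // a directory; the file is implied
        file = null;
    } else {
        int slash = fileName.lastIndexOf('/');
        directory = fileName.substring(0, slash + 1);
        file = fileName.substring(slash + 1);
    }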

One more thing determines the setting of the directory and file. Consider these two URLs:

  • http://www.somesite.com/work/xyz.html
  • http://www.somesite.com/work/abc.mp3

In the first example, the directory is set to /work and the file is set to xyz.html. In the second example, the file is abc.mp3; however, since here you are not interested in MP3 files, set the filename and directory to null.

Use the method fileOfInterest(), which returns true for the files that interest you. You can add MP3 files in this method, so this utility can download MP3 files as well.
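As a sketch (the extension list is an assumption), fileOfInterest() might look like this:

    // Return true only for the file types this utility downloads;
    // add ".mp3" to the list to download MP3 files as well.
    private boolean fileOfInterest(String file) {
        String lower = file.toLowerCase();
        return lower.endsWith(".html") || lower.endsWith(".htm")
                || lower.endsWith(".txt") || lower.endsWith(".gif")
                || lower.endsWith(".jpg") || lower.endsWith(".jpeg");
    }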

To understand this utility further, you can browse through the source code in Resources. This utility, however, has some limitations.

Limitations of the utility

The following list reveals the limitations of the utility described in this article. You can easily overcome most of them to enhance your own offloading utility:

  1. It downloads only text and image files. This is a minor limitation you can easily tackle by modifying the fileOfInterest() method of the ExtendedURL class and the nonTextFile() method of the Downloader class.
  2. It cannot download applets. Applets are specified in HTML files as <APPLET CODE=Simple.class CODEBASE="/examples"></APPLET>, where CODE specifies the class file and the CODEBASE specifies the directory in which this class file is located.

    To overcome this limitation, search for <applet in the first phase of parsing, just as for <img, <frame, and the other tags. During the second and third phases of parsing, extract the filename and the directory from the CODE and CODEBASE attributes. The link becomes /examples/Simple.class, which is just like /images/family.gif.

    Note, if both CODE and ARCHIVE are present in the APPLET tag, the filename should be the jar file, not the class file. For example, <APPLET CODE=Simple.class ARCHIVE=examples.jar CODEBASE="/examples"></APPLET> should yield /examples/examples.jar, not /examples/Simple.class.

    fileOfInterest() should return true for class and jar files. You can modify the nonTextFile() method of Downloader to return true for class and jar files.

  3. It does not handle forms correctly. The form tag has the following syntax:

    <FORM ACTION="http://www.somesite.com/cgi-bin/register.cgi" METHOD="POST">

    If the URL specified with the ACTION attribute is absolute, this utility will work well. If it is not, you can handle this limitation easily by replacing the relative URL with the absolute URL. Obtaining the absolute URL from the relative URL is simple when you know the base URL:

     
        URL absolute = new URL(baseURL, relative);
        String absoluteURL = absolute.toString();
  4. It does not download background images. This limitation can become a major one depending on the context. The background attribute can be present in some HTML tags. The value of this attribute is the image to be displayed in the background, which this utility does not download. To overcome this, search for the background attribute in some of the HTML tags and then generate a complete URL to download background images.
  5. It does not handle scripts. At the time you modify the links, the utility has no idea whether the link is a script (executable) or an ordinary file. A simple guess can tackle most situations: if the filename ends with .pl or .cgi, or if it has no extension, you can assume it is a script. In that case, replace the relative link with the absolute link, as the sketch below shows.
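    A minimal sketch of that guess (the variable name is an assumption):

        // Guess whether a filename refers to a script rather than a document.
        boolean looksLikeScript = file.endsWith(".pl")
                || file.endsWith(".cgi")
                || file.indexOf('.') == -1;  // no extension at all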

I avoided overcoming these limitations in this article to make the utility simple. One of the goals of this article was to show you how easily you can write a useful utility with commonly used Java classes. If you provide this utility with a nice GUI and tackle some of its limitations, it can serve as a full-fledged application.

Rakesh Didwania is a software engineer at
Informix, India. Previously, he worked at Fujitsu-ICIM, India,
developing a translator that converts Informix 4GL code to C.

Source: www.infoworld.com