Can the compressed data length be longer than the uncompressed data length?
Q: I am working on a highly secure application and need to compress data such as strings and byte arrays. I am using the java.util.zip.* classes, but I am having some problems.
First, when using the Deflater and Inflater classes, I get a DataFormatException when the string is shorter than 30 characters.
Second, I have a question about the compression itself. I am using ByteArrayOutputStream and DeflaterOutputStream. I noticed that compressdata.length() > OriginalData.length(), where OriginalData is the uncompressed data. It doesn’t seem to make sense that the compressed length is longer than the uncompressed length. Can this be right?
A: To answer the first part of your question, I tested one string shorter than 30 characters and one longer than 30 characters. The only time that I could get a DataFormatException was when the Inflater and Deflater were constructed with different nowrap values. Be sure that the Inflater and Deflater specify nowrap the same way. If the Deflater sets nowrap to false, the Inflater must do the same. Likewise, if the Deflater sets it to true, the Inflater must set it to true.
Whether to set nowrap to true or false depends on your needs. Setting nowrap to true omits the ZLIB header and checksum from the compressed data. Setting it to false keeps them. Either way, the Inflater’s nowrap value must match the compressed input. Otherwise, as we have seen, you will get a DataFormatException.
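The nowrap rule described above can be reproduced with a small sketch. The class and method names (NowrapDemo, compress, decompress, roundTrips) are hypothetical and chosen only for illustration; the sketch compresses a short string and then tries to inflate it with matching and with mismatched nowrap values.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical demo class; all names here are illustrative only.
class NowrapDemo {

    // Compresses the input; nowrap=true omits the ZLIB header and checksum.
    static byte[] compress(byte[] input, boolean nowrap) {
        Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION, nowrap);
        try {
            deflater.setInput(input);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[256];
            while (!deflater.finished()) {
                out.write(buffer, 0, deflater.deflate(buffer));
            }
            return out.toByteArray();
        } finally {
            deflater.end();
        }
    }

    // Decompresses the data; nowrap must match the value used when compressing.
    static byte[] decompress(byte[] compressed, boolean nowrap) throws DataFormatException {
        Inflater inflater = new Inflater(nowrap);
        try {
            inflater.setInput(compressed);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[256];
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && !inflater.finished()
                        && (inflater.needsInput() || inflater.needsDictionary())) {
                    // Avoid looping forever on truncated or unexpected input.
                    throw new DataFormatException("Unexpected end of compressed data");
                }
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        } finally {
            inflater.end();
        }
    }

    // Returns true if the data survives a round trip, false on DataFormatException.
    static boolean roundTrips(byte[] input, boolean deflaterNowrap, boolean inflaterNowrap) {
        try {
            byte[] restored = decompress(compress(input, deflaterNowrap), inflaterNowrap);
            return Arrays.equals(input, restored);
        } catch (DataFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] input = "short secret".getBytes(StandardCharsets.UTF_8);
        System.out.println("Both nowrap=false:              " + roundTrips(input, false, false));
        System.out.println("Deflater true, Inflater false:  " + roundTrips(input, true, false));
    }
}
```

With matching values the round trip should succeed regardless of string length. With the mismatch, the Inflater expects a ZLIB header that the Deflater never wrote, so inflate() throws a DataFormatException and roundTrips reports false.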
Your second question raises an important fact about data compression. As strange as it may seem, the compressed data size can be larger than the uncompressed size. Depending on your Deflater settings, the Deflater may add a header and a checksum to the compressed data. These are used to decode the information and check it for errors. If you deal with very small strings, it is likely that not much real compression has gone on. Cutting a string of 30 characters to 15, while a 50 percent reduction, saves only 15 characters, and very short strings often shrink far less than that, if at all. As a result, the added size of the header can make the compressed string longer than the original. You will not see the benefits of compression until your data reaches a certain larger precompression size. It’s hard to say what this size is, but in general it is where (compressed size + header size) < uncompressed size. If your data is not large enough, you’re wasting time using compression.
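To see the size effect for yourself, the following sketch measures the output of a DeflaterOutputStream wrapped around a ByteArrayOutputStream, the same pairing mentioned in the question. CompressedSizeDemo and compressedSize are hypothetical names, and the sample strings are arbitrary. In the ZLIB format the wrapper consists of a 2-byte header and a 4-byte Adler-32 checksum, so the nowrap=false output should be 6 bytes longer than the nowrap=true output.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;

// Hypothetical demo class; all names here are illustrative only.
class CompressedSizeDemo {

    // Returns the number of bytes produced by compressing the input.
    static int compressedSize(byte[] input, boolean nowrap) {
        Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION, nowrap);
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DeflaterOutputStream out = new DeflaterOutputStream(bytes, deflater)) {
            out.write(input);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            // A Deflater passed in by the caller is not released by the stream.
            deflater.end();
        }
        return bytes.size();
    }

    static void report(String label, byte[] input) {
        System.out.println(label + ": original=" + input.length
                + " bytes, with ZLIB wrapper=" + compressedSize(input, false)
                + " bytes, nowrap=" + compressedSize(input, true) + " bytes");
    }

    public static void main(String[] args) {
        byte[] shortInput = "Hello, world!".getBytes(StandardCharsets.UTF_8);

        // Build a longer, repetitive input that compresses well.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 100; i++) {
            sb.append("Hello, world! ");
        }
        byte[] longInput = sb.toString().getBytes(StandardCharsets.UTF_8);

        report("short", shortInput);
        report("long ", longInput);
    }
}
```

For the short string the compressed output should come out larger than the input, while the long repetitive string should shrink substantially. The exact break-even point depends on how much redundancy the data contains.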
You may also want to consider some of the other compression settings. The Deflater lets you choose a compression level: low levels such as Deflater.BEST_SPEED are optimized for time, while high levels such as Deflater.BEST_COMPRESSION generally achieve better compression but take longer to compress. So the settings that you choose go a long way in determining the final size of your compressed data.
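The trade-off between speed and size can be explored by compressing the same data at several Deflater levels. This is only a rough sketch: CompressionLevelDemo and compressedSize are hypothetical names, the sample data is made up, and a single System.nanoTime measurement is far too crude to serve as a benchmark.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

// Hypothetical demo class; all names here are illustrative only.
class CompressionLevelDemo {

    // Compresses the input at the given level and returns the output size in bytes.
    static int compressedSize(byte[] input, int level) {
        Deflater deflater = new Deflater(level);
        try {
            deflater.setInput(input);
            deflater.finish();
            byte[] buffer = new byte[1024];
            int total = 0;
            while (!deflater.finished()) {
                total += deflater.deflate(buffer);
            }
            return total;
        } finally {
            deflater.end();
        }
    }

    public static void main(String[] args) {
        // Made-up sample data with a fair amount of repetition.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 2000; i++) {
            sb.append("record ").append(i).append(": status=OK; ");
        }
        byte[] data = sb.toString().getBytes(StandardCharsets.UTF_8);

        int[] levels = {
            Deflater.NO_COMPRESSION,
            Deflater.BEST_SPEED,
            Deflater.DEFAULT_COMPRESSION,
            Deflater.BEST_COMPRESSION
        };
        String[] names = {
            "NO_COMPRESSION", "BEST_SPEED", "DEFAULT_COMPRESSION", "BEST_COMPRESSION"
        };

        System.out.println("original: " + data.length + " bytes");
        for (int i = 0; i < levels.length; i++) {
            long start = System.nanoTime();
            int size = compressedSize(data, levels[i]);
            long micros = (System.nanoTime() - start) / 1000;
            System.out.println(names[i] + ": " + size + " bytes in about " + micros + " us");
        }
    }
}
```

Note that Deflater.NO_COMPRESSION stores the data uncompressed and still adds framing bytes, so its output is always slightly larger than the input.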