These are just some random thoughts of mine. Your mileage
may vary.
4.1. Limitations of the compressed file format
bzip2-1.0.X,
0.9.5 and
0.9.0 use exactly the same file
format as the original version,
bzip2-0.1. This decision was
made in the interests of stability. Creating yet another
incompatible compressed file format would create further
confusion and disruption for users.
Nevertheless, this is not a painless decision. Development
work since the release of
bzip2-0.1 in August 1997 has
shown complexities in the file format which slow down
decompression and, in retrospect, are unnecessary. These
are:
The run-length encoder, which is the first of the
compression transformations, is entirely irrelevant. The
original purpose was to protect the sorting algorithm from the
very worst case input: a string of repeated symbols. But
algorithm steps Q6a and Q6b in the original Burrows-Wheeler
technical report (SRC-124) show how repeats can be handled
without difficulty in block sorting.
-
The randomisation mechanism doesn't really need to be
there. Udi Manber and Gene Myers published a suffix array
construction algorithm a few years back, which can be employed
to sort any block, no matter how repetitive, in O(N log N)
time. Subsequent work by Kunihiko Sadakane has produced a
derivative O(N (log N)^2) algorithm which usually outperforms
the Manber-Myers algorithm.
I could have changed to Sadakane's algorithm, but I find
it to be slower than bzip2's
existing algorithm for most inputs, and the randomisation
mechanism protects adequately against bad cases. I didn't
think it was a good tradeoff to make. Partly this is due to
the fact that I was not flooded with email complaints about
bzip2-0.1's performance on
repetitive data, so perhaps it isn't a problem for real
inputs.
Probably the best long-term solution, and the one I have
incorporated into 0.9.5 and above, is to use the existing
sorting algorithm initially, and fall back to a O(N (log N)^2)
algorithm if the standard algorithm gets into
difficulties.
The compressed file format was never designed to be
handled by a library, and I have had to jump though some hoops
to produce an efficient implementation of decompression. It's
a bit hairy. Try passing
decompress.c through the C
preprocessor and you'll see what I mean. Much of this
complexity could have been avoided if the compressed size of
each block of data was recorded in the data stream.
An Adler-32 checksum, rather than a CRC32 checksum,
would be faster to compute.
It would be fair to say that the
bzip2 format was frozen before I
properly and fully understood the performance consequences of
doing so.
Improvements which I was able to incorporate into 0.9.0,
despite using the same file format, are:
Single array implementation of the inverse BWT. This
significantly speeds up decompression, presumably because it
reduces the number of cache misses.
Faster inverse MTF transform for large MTF values.
The new implementation is based on the notion of sliding blocks
of values.
bzip2-0.9.0 now reads
and writes files with fread
and fwrite; version 0.1 used
putc and
getc. Duh! Well, you live
and learn.
Further ahead, it would be nice to be able to do random
access into files. This will require some careful design of
compressed file formats.