bzip2

The bzip2 file compression program was developed by Julian Seward and first released on 18 July 1996. It has remained an open source program, free for anyone to use, for over twenty-two years. The last stable release, version 1.0.6, came out seven years ago on 20 September 2010. bzip2 is based on the Burrows–Wheeler algorithm. The program can compress files but cannot archive them. Julian Seward is still in charge of maintaining the program. It works on all major operating systems and is released under a BSD-like license. The program uses .bz2 as its filename extension, application/x-bzip2 as its internet media type and public.archive.bzip2 as its uniform type identifier.

bzip2 is suitable for power users. The command line program has fifteen options, and every option is well explained. Starting and running the program is a cakewalk, and it can be used in batch files. It cannot recover from syntax errors, but it can force compression and can even attempt to decompress damaged files. There are specific options to overwrite existing files, to suppress error messages and to force compression. The simple operation and quick decompression should suit many heavy users. As per the information published by the developer, bzip2 typically compresses files to within 10% to 15% of the best available techniques (the PPM family of statistical compressors), while being around twice as fast at compression and six times faster at decompression than those techniques. Although it is not an archiving tool, the program can decompress the undamaged parts of a file stored on a disk or tape that has errors.

The file compression program can be downloaded and installed for free. It is compatible with Windows 2000, Windows XP and all later versions. There are no specific requirements or additional specifications that must be satisfied. The total size of the download is 76 KB, and the file used to install it is bzip2-105-x86-win32.exe. bzip2 is used by a few million people around the world. It can compress almost any file you may test it with. The simplicity and reliability of the program have earned it many laurels, but there has been some criticism of its lack of archiving ability. Its ability to recover the undamaged parts of files from otherwise inaccessible tapes or discs is worthwhile. This blog was created as an informational portal on bzip2.

There has been no major update to the file compression program in recent years. This is partly because such updates have been unnecessary. It still works on all popular file types and compresses or decompresses them with ease. The space saved by compressing files with bzip2 is significant when you are dealing with large documents or other heavy material. There is no premium, advanced or full version of the program, so you can bid adieu to recurring messages prompting you to pay up and upgrade. You get everything there is in the free and open source version of bzip2. In fact, there is only one version you can download and use. It has been criticized by some for having only one algorithm powering the compression.

Comprehensive Guide to bzip2

bzip2 became quite popular in the late nineties, and subsequent updates made it more widespread. The program compresses more effectively than programs based on the LZW algorithm (.Z) or the Deflate algorithm (.gz and .zip), but it is also slower. The LZW and Deflate programs operate quickly, but their output takes up more space than what bzip2 can achieve. bzip2 decompresses considerably faster than it compresses.
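The size difference described above is easy to measure. The following is a minimal sketch that uses Python's standard bz2 and zlib modules (zlib implements Deflate). The helper name compressed_sizes and the sample text are made up for illustration, and the numbers will vary with the input, so treat it as a way to run your own comparison rather than as a benchmark.

```python
import bz2
import zlib


def compressed_sizes(data: bytes) -> dict:
    """Return the size in bytes of the input, its Deflate output and its bzip2 output."""
    return {
        "original": len(data),
        "deflate": len(zlib.compress(data, 9)),  # Deflate at its highest level
        "bzip2": len(bz2.compress(data, 9)),     # bzip2 at its highest level
    }


# Hypothetical sample input: highly repetitive text compresses well with both.
sample = b"The quick brown fox jumps over the lazy dog.\n" * 2000
print(compressed_sizes(sample))
```

Swap in the contents of a real file to see how the two formats compare on your own data.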

The bzip2 file compression program works on data in blocks, usually from 100 to 900 kilobytes in size. It relies on the Burrows–Wheeler transform to turn frequently recurring character sequences into strings of identical letters. The program then applies the move-to-front transform and Huffman coding. bzip, the predecessor of bzip2, employed arithmetic coding, whereas its successor uses Huffman coding. The performance of bzip2 is asymmetric, with decompression being relatively fast. For a while in the early years of the century, multi-threading was explored with the aim of near-linear speed improvements on multi-core or multi-CPU computers, but this functionality has not been made available in subsequent versions released by the developer.
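To see how the Burrows–Wheeler transform groups identical letters together, here is a naive sketch in Python. It sorts every rotation of the block and keeps the last column, which is the textbook definition and not the sorting strategy bzip2 itself uses. The function names bwt and inverse_bwt are hypothetical.

```python
from typing import Tuple


def bwt(block: bytes) -> Tuple[bytes, int]:
    """Naive Burrows-Wheeler transform: sort all rotations, keep the last column.

    Returns the last column and the row at which the original block sits.
    """
    n = len(block)
    if n == 0:
        return b"", 0
    # Each rotation is identified by its starting offset in the block.
    rotations = sorted(range(n), key=lambda i: block[i:] + block[:i])
    last_column = bytes(block[(i - 1) % n] for i in rotations)
    return last_column, rotations.index(0)


def inverse_bwt(last_column: bytes, origin: int) -> bytes:
    """Rebuild the original block from the last column and the origin row."""
    n = len(last_column)
    # A stable sort of positions by byte value links the first column to the last.
    order = sorted(range(n), key=lambda i: last_column[i])
    out = bytearray()
    row = origin
    for _ in range(n):
        row = order[row]
        out.append(last_column[row])
    return bytes(out)


print(bwt(b"banana"))  # (b'nnbaaa', 3)
```

The output has the same length as the input, but the letters of "banana" now sit in runs, which is what the later stages take advantage of.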

bzip2 is similar to gzip in that both are data compressors. It is not like zip or tar, which have archiving ability. bzip2 is only meant for single files. It does not handle multiple files, archive splitting or encryption by itself, but it can work with external utilities like tar and GnuPG to facilitate such tasks. In a way, bzip2 stays true to its UNIX tradition.
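Pairing bzip2 with tar for multiple files can be sketched with Python's standard tarfile module, which supports bzip2-compressed archives through the "w:bz2" and "r:bz2" modes. The helper names make_tar_bz2 and read_tar_bz2 and the file names in the example are made up for illustration.

```python
import io
import tarfile


def make_tar_bz2(files: dict) -> bytes:
    """Bundle {name: content} pairs into a tar archive compressed with bzip2."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:bz2") as archive:
        for name, payload in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


def read_tar_bz2(blob: bytes) -> dict:
    """Read every regular file back out of a bzip2-compressed tar archive."""
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r:bz2") as archive:
        return {
            member.name: archive.extractfile(member).read()
            for member in archive.getmembers()
            if member.isfile()
        }


# Hypothetical file names and contents.
blob = make_tar_bz2({"notes.txt": b"first file\n", "data.csv": b"a,b\n1,2\n"})
print(sorted(read_tar_bz2(blob)))
```

tar does the archiving and bzip2 does the compressing, which is the division of labour described above.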

bzip2 stacks several compression techniques on top of one another, and decompression applies them in reverse order. In order, the program run-length encodes the initial data, applies the Burrows–Wheeler transform (also known as block sorting), applies the move-to-front (MTF) transform, run-length encodes the MTF output, and applies Huffman coding. It then selects between multiple Huffman tables, encodes the table selection in unary (base 1), delta-encodes the Huffman code bit lengths, and writes a sparse bit array showing which symbols are used.

There is no formal specification of the file format, but an informal specification has been reverse engineered. Every .bz2 stream consists of a four-byte header, followed by zero or more compressed blocks, followed by an end-of-stream marker that contains a 32-bit CRC. The signature, or magic number, of bzip2 is BZh. Many programs support the bzip2 file compression format, including 7-Zip, micro-bzip2, Pbzip2, bzip2smp, smpbzip2, pyflate, bz2, Arnaud Bouchez’s bzip, lbzip2, mpibzip2, Apache Commons, jbzip2, DotNetZip and DotNetCompression.
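The four-byte header can be inspected with a few lines of Python. This sketch relies on one detail not spelled out above: the fourth byte, right after the BZh signature, is an ASCII digit from 1 to 9 that gives the block size in hundreds of kilobytes. The function name parse_bz2_header is hypothetical, and a kilobyte is assumed to mean 1000 bytes.

```python
import bz2


def parse_bz2_header(stream: bytes) -> int:
    """Return the block size, in bytes, declared in a .bz2 stream header."""
    if len(stream) < 4 or stream[:3] != b"BZh":
        raise ValueError("not a bzip2 stream: missing BZh signature")
    level = stream[3] - ord("0")  # ASCII digit '1'..'9'
    if not 1 <= level <= 9:
        raise ValueError("invalid block size digit in header")
    return level * 100_000  # assumption: 1 kilobyte = 1000 bytes


compressed = bz2.compress(b"hello, bzip2", compresslevel=9)
print(compressed[:4])                # b'BZh9'
print(parse_bz2_header(compressed))  # 900000
```

A check like this is a cheap way to confirm that a file really is bzip2 data before handing it to a decompressor.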

Technical Overview of bzip2

The initial run-length encoding targets any sequence of four to two hundred and fifty-five consecutive duplicate symbols. Such a sequence is replaced by its first four symbols followed by a repeat length ranging from zero to two hundred and fifty-one. For instance, the sequence AAAAAAABBBBCCCD gets replaced by AAAA\3BBBB\0CCCD, where \3 and \0 are the byte values 3 and 0. A run is always transformed after four consecutive symbols, even if the repeat length is zero, so that the transformation stays reversible. In the worst case this step expands the data by a factor of 1.25, and in the best case it reduces the data to less than 0.02 of its original size. This run-length encoding has been criticized, and even Julian Seward has admitted that it was a mistake and serves only to protect against pathological cases.
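The rule above is small enough to sketch directly in Python. The functions rle1_encode and rle1_decode are hypothetical names for an illustration of the scheme as described here, not code taken from bzip2 itself.

```python
def rle1_encode(data: bytes) -> bytes:
    """Replace each run of 4 to 255 equal bytes with 4 copies plus a repeat count."""
    out = bytearray()
    i, n = 0, len(data)
    while i < n:
        symbol = data[i]
        run = 1
        while i + run < n and data[i + run] == symbol and run < 255:
            run += 1
        if run >= 4:
            out += bytes([symbol]) * 4
            out.append(run - 4)  # repeat length, 0 to 251
        else:
            out += bytes([symbol]) * run
        i += run
    return bytes(out)


def rle1_decode(encoded: bytes) -> bytes:
    """Undo rle1_encode: after 4 equal bytes, the next byte is a repeat count."""
    out = bytearray()
    i, n = 0, len(encoded)
    while i < n:
        symbol = encoded[i]
        run = 1
        while run < 4 and i + run < n and encoded[i + run] == symbol:
            run += 1
        out += bytes([symbol]) * run
        i += run
        if run == 4:
            if i >= n:
                raise ValueError("truncated input: missing repeat count")
            out += bytes([symbol]) * encoded[i]
            i += 1
    return bytes(out)


print(rle1_encode(b"AAAAAAABBBBCCCD"))  # b'AAAA\x03BBBB\x00CCCD'
```

A run of exactly four symbols grows to five bytes, which is the 1.25 worst case, while a run of 255 symbols shrinks to five bytes, which is the best case of just under 0.02.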

The Burrows–Wheeler algorithm is at the crux of the file compression program. It is the reversible block sort on which everything else depends. Each block is self-contained, and the input and output buffers stay the same size. The operating limit at this stage is nine hundred kilobytes. The move-to-front transform likewise has no effect on size, so the length of the processed block does not change. The symbols in use are placed in an array. As each symbol is processed, it is replaced by its location (index) in the array, and the symbol is then moved to the front of the array. Identical symbols that recur immediately are therefore replaced with zeros, and symbols that reappear frequently get low values, so the output stays in a low range of integers and this simplifies the encoding. Data transformed in this way can be encoded very efficiently by any legacy method of compression.
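The move-to-front step can be sketched in a few lines of Python. The names mtf_encode and mtf_decode are hypothetical. For simplicity, this sketch starts from an array of all 256 byte values instead of only the symbols in use, so the exact indices differ from what bzip2 would produce.

```python
from typing import Iterable, List


def mtf_encode(data: bytes) -> List[int]:
    """Replace each byte with its current index in the array, then move it to the front."""
    symbols = list(range(256))  # simplification: every byte value, in order
    out = []
    for byte in data:
        index = symbols.index(byte)
        out.append(index)
        symbols.insert(0, symbols.pop(index))
    return out


def mtf_decode(indices: Iterable[int]) -> bytes:
    """Undo mtf_encode by replaying the same array updates."""
    symbols = list(range(256))
    out = bytearray()
    for index in indices:
        byte = symbols.pop(index)
        out.append(byte)
        symbols.insert(0, byte)
    return bytes(out)


# b"nnbaaa" is what a Burrows-Wheeler transform of b"banana" produces.
print(mtf_encode(b"nnbaaa"))  # [110, 0, 99, 99, 0, 0]
```

Every immediate repeat becomes a zero and the output is exactly as long as the input, which matches the description above.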