bzip2 [ Home | Documentation | Downloads ]
2.5. MEMORY MANAGEMENT

2.5. MEMORY MANAGEMENT

bzip2 compresses large files in blocks. The block size affects both the compression ratio achieved, and the amount of memory needed for compression and decompression. The flags -1 through -9 specify the block size to be 100,000 bytes through 900,000 bytes (the default) respectively. At decompression time, the block size used for compression is read from the header of the compressed file, and bunzip2 then allocates itself just enough memory to decompress the file. Since block sizes are stored in compressed files, it follows that the flags -1 to -9 are irrelevant to and so ignored during decompression.

Compression and decompression requirements, in bytes, can be estimated as:

Compression:   400k + ( 8 x block size )

Decompression: 100k + ( 4 x block size ), or
               100k + ( 2.5 x block size )

Larger block sizes give rapidly diminishing marginal returns. Most of the compression comes from the first two or three hundred k of block size, a fact worth bearing in mind when using bzip2 on small machines. It is also important to appreciate that the decompression memory requirement is set at compression time by the choice of block size.

For files compressed with the default 900k block size, bunzip2 will require about 3700 kbytes to decompress. To support decompression of any file on a 4 megabyte machine, bunzip2 has an option to decompress using approximately half this amount of memory, about 2300 kbytes. Decompression speed is also halved, so you should use this option only where necessary. The relevant flag is -s.

In general, try and use the largest block size memory constraints allow, since that maximises the compression achieved. Compression and decompression speed are virtually unaffected by block size.

Another significant point applies to files which fit in a single block -- that means most files you'd encounter using a large block size. The amount of real memory touched is proportional to the size of the file, since the file is smaller than a block. For example, compressing a file 20,000 bytes long with the flag -9 will cause the compressor to allocate around 7600k of memory, but only touch 400k + 20000 * 8 = 560 kbytes of it. Similarly, the decompressor will allocate 3700k but only touch 100k + 20000 * 4 = 180 kbytes.

Here is a table which summarises the maximum memory usage for different block sizes. Also recorded is the total compressed size for 14 files of the Calgary Text Compression Corpus totalling 3,141,622 bytes. This column gives some feel for how compression varies with block size. These figures tend to understate the advantage of larger block sizes for larger files, since the Corpus is dominated by smaller files.

        Compress   Decompress   Decompress   Corpus
Flag     usage      usage       -s usage     Size

 -1      1200k       500k         350k      914704
 -2      2000k       900k         600k      877703
 -3      2800k      1300k         850k      860338
 -4      3600k      1700k        1100k      846899
 -5      4400k      2100k        1350k      845160
 -6      5200k      2500k        1600k      838626
 -7      6100k      2900k        1850k      834096
 -8      6800k      3300k        2100k      828642
 -9      7600k      3700k        2350k      828642

Copyright © 1996 - 2010  julian@bzip.org