NAME

bzip, bunzip - a block-sorting file compressor, v0.21

SYNOPSIS

bzip [ -cdfkvVL123456789 ] [ filenames ... ]
bunzip [ -kvVL ] [ filenames ... ]

DESCRIPTION

Bzip compresses files using the Burrows-Wheeler-Fenwick block-sorting text compression algorithm. Compression is generally considerably better than that achieved by more conventional LZ77/LZ78-based compressors, and competitive with all but the best of the PPM family of statistical compressors.

The command-line options are deliberately very similar to those of GNU Gzip, but they are not identical.

Bzip expects a list of file names to follow the command-line flags. Each file is replaced by a compressed version of itself, with the name "original_name.bz". Each compressed file has the same modification date and permissions as the corresponding original, so that these properties can be correctly restored at decompression time. File name handling is naive in the sense that there is no mechanism for preserving original file names, permissions and dates in filesystems which lack these concepts, or have serious file name length restrictions, such as MS-DOS.

Bzip and bunzip will not overwrite existing files; if you want this to happen, you should delete them first.

If no file names are specified, bzip compresses from standard input to standard output. In this case, bzip will decline to write compressed output to a terminal, as this would be entirely incomprehensible and therefore pointless.

Bunzip (or bzip -d ) decompresses and restores all specified files whose names end in ".bz". Files without this suffix are ignored. Again, supplying no filenames causes decompression from standard input to standard output.

You can also compress or decompress exactly one named file to the standard output by giving the -c flag.

Compression is always performed, even if the compressed file is slightly larger than the original. The worst case expansion is for files of zero length, which expand to seventeen bytes. Random data (including the output of most file compressors) is coded at about 8.1 bits per byte, giving an expansion of around 1%.

As a self-check for your protection, bzip uses 32-bit CRCs to make sure that the decompressed version of a file is identical to the original. This guards against corruption of the compressed data, and against undetected bugs in bzip (hopefully very unlikely). The chances of data corruption going undetected is microscopic, about one chance in four billion for each file processed. Be aware, though, that the check occurs upon decompression, so it can only tell you that that something is wrong. It can't help you recover the original uncompressed data.

Return values: 1 for an abnormal exit, otherwise 0.

MEMORY MANAGEMENT

Bzip compresses large files in blocks. The block size affects both the compression ratio achieved, and the amount of memory needed both for compression and decompression. The flags -1 through -9 specify the block size to be 100,000 bytes through 900,000 bytes (the default) respectively. At decompression-time, the block size used for compression is read from the header of the compressed file, and bunzip then allocates itself just enough memory to decompress the file. Since block sizes are stored in compressed files, it follows that the flags -1 to -9 are irrelevant to and so ignored during decompression. Compression and decompression requirements, in bytes, can be estimated as:

Compression: 300k + ( 8 x block size )

Decompression: 6 x block size

The 300k constant is for a frequency-count table, used in the sorting phase of compression.

Larger block sizes give rapidly diminishing marginal returns; most of the compression comes from the first two or three hundred k of block size, a fact worth bearing in mind when using bzip on small machines. It is also important to appreciate that the decompression memory requirement is set at compression-time by the choice of block size. So, for example, if you are compressing files which you think might possibly be decompressed on a 4-megabyte machine, you might want to select a block size of 200k or 300k, so the decompressor will draw 1200 kbytes or 1800 kbytes respectively, which is probably the limit of what's comfortable on a 4-meg machine. In general, though, you should try and use the largest block size memory constraints allow. Compression and decompression speed is virtually unaffected by block size.

Another significant point applies to files which fit in a single block -- that means most files you'd encounter using a large block size. The amount of real memory touched is proportional to the size of the file, since the file is smaller than a block. For example, compressing a file 20,000 bytes long with the flag -9 will cause the compressor to allocate [by the formula, in practice a little more] 7500k of memory, but only touch 300k + 20000 * 8 = 460 kbytes of it. Similarly, the decompressor will allocate 5400k but only touch 20000 * 6 = 120 kbytes.

Here is a table which summarises the maximum memory usage for different block sizes. Also recorded is the total compressed size for 14 files of the Calgary Text Compression Corpus totalling 3,141,622 bytes. This column gives some feel for how compression varies with block size. These figures tend to understate the advantage of larger block sizes for larger files, since the Corpus is dominated by smaller files.

Compress Decompress Corpus Flag usage usage Size

-1 1100k 500k 905958 -2 1900k 1000k 870646 -3 2700k 1500k 853650 -4 3500k 2000k 840140 -5 4300k 2500k 838355 -6 5100k 3000k 831695 -7 5900k 3500k 827104 -8 6700k 4000k 821652 -9 7500k 4500k 821652

OPTIONS

-c: Compress or decompress to standard output. -c requires you to supply exactly one file name, and this file is compressed or decompressed to standard out.
-d: Force decompression. Bzip and bunzip are really the same program, and the decision about whether to compress or decompress is done on the basis of which name is used. This flag overrides that mechanism, and forces bzip to decompress.
-f: The complement to -d: forces compression, regardless of the invokation name.
-k: Keep (don't delete) input files during compression or decompression.
-v: Verbose mode -- show the compression ratio for each file processed.
-V: Be very verbose. This spews out lots of information during compression which is primarily of interest for debugging purposes.
-L: Display the software license terms and conditions.
-1 to -9: Set the block size to 100 k, 200 k .. 900 k when compressing. Has no effect when decompressing. See MEMORY MANAGEMENT above.

PERFORMANCE NOTES

The sorting phase of compression gathers together similar strings in the file. Because of this, files containing very long runs of repeated symbols, like "aabaabaabaab ..." (repeated several hundred times) may compress extraordinarily slowly. You can use the -V option to monitor progress in great detail, if you want. Decompression speed is unaffected. Such pathological cases seem rare in practice.

Incompressible or virtually-incompressible data may decompress rather more slowly than one would hope. This is due to naive implementation of the move-to-front coder, and of the frequency tables for the arithmetic coder.

Decompression on Sun Sparc 1's (and other low-range Sparcs) can be slow, because of the lack of hardware implementations of integer multiply and divide in the SPARC v7 instruction set. The situation is much exacerbated if bzip is compiled for a full SPARC v8 instruction set, since this causes the machine to trap on each multiply and divide instruction. These traps take control to the relevant software emulation of the offending instruction, but it is much quicker for the compiler simply to plant a call to the emulation routine. Moral: be careful how you compile bzip for a Sparc. If you use GNU C, investigate the effects of the -msupersparc and -mcypress flags.

Wildcard expansion for Windows 95 and NT loses leading directory information. For example, the pathspec "sources\*.c" is searched correctly for matching files, but the "sources\" bit is ignored when the files come to be processed, which means bzip won't be able to find any of them. This is easy to fix; perhaps some enterprising soul will send me a patch?

CAVEATS

I/O error messages are not as helpful as they could be. Bzip tries hard to detect I/O errors and exit cleanly, but the details of what the problem is sometimes seem rather misleading.

There is no -t option to test the integrity of a compressed file. However, Unix folks can do the following:

bzip -dcV file.bz > /dev/null

which causes bzip to do a trial decompression of file.bz, throwing away the result. You'll be shown the computed and stored CRCs. If these are identical, the file is almost certainly OK -- see the discussion above on CRCs for a definition of "almost certainly". If they're not, bzip will complain loudly. Note that file.bz is left unchanged regardless of the outcome. Win95/NT folks can do the same, but /dev/null will have to be replaced with something suitable, perhaps NUL.

This manual page pertains to version 0.21 of bzip. It may well happen that some future version will use a different compressed file format. If you try to decompress, using 0.21, a .bz file created with some future version which uses a different compressed file format, 0.21 will complain that your file "is not a BZIP file". If that happens, you should obtain a more recent version of bzip and use that to decompress the file.

AUTHOR

Julian Seward, sewardj@cs.man.ac.uk.

The ideas embodied in bzip are due to (at least) the following people: Michael Burrows and David Wheeler (for the block sorting transformation), Peter Fenwick (for the structured coding model, and many refinements), and Alistair Moffat, Radford Neal and Ian Witten (for the arithmetic coder). I am much indebted for their help, support and advice. See the file ALGORITHMS in the source distribution for pointers to sources of documentation. Christian von Roques encouraged me to look for faster sorting algorithms, so as to speed up compression. Many people sent patches, helped with portability problems, lent machines, gave advice and were generally helpful.