Manual Reference Pages - BZIP (1)
bzip, bunzip - a block-sorting file compressor, v0.21
[ -cdfkvVL123456789 ] [
filenames ... ]
[ -kvVL ] [
filenames ... ]
Bzip compresses files using the Burrows-Wheeler-Fenwick block-sorting
text compression algorithm. Compression is generally considerably
better than that
achieved by more conventional LZ77/LZ78-based compressors,
and competitive with all but the best of the PPM family of statistical
The command-line options are deliberately very similar to
GNU Gzip, but they are not identical.
Bzip expects a list of file names to follow the command-line flags.
Each file is replaced by a compressed version of itself,
with the name "original_name.bz".
Each compressed file has the same modification date and permissions
as the corresponding original, so that these properties can be
correctly restored at decompression time. File name handling is
naive in the sense that there is no mechanism for preserving
original file names, permissions and dates in filesystems
which lack these concepts, or have serious file name length
restrictions, such as MS-DOS.
bunzip will not overwrite existing files; if you want this to happen,
you should delete them first.
If no file names are specified,
bzip compresses from standard input to standard output.
In this case,
bzip will decline to write compressed output to a terminal, as
this would be entirely incomprehensible and therefore pointless.
bzip -d ) decompresses and restores all specified files whose names
end in ".bz".
Files without this suffix are ignored.
Again, supplying no filenames
causes decompression from standard input to standard output.
You can also compress or decompress exactly one named file to
the standard output by giving the -c flag.
Compression is always performed, even if the compressed file is
slightly larger than the original. The worst case expansion is
for files of zero length, which expand to seventeen bytes.
Random data (including the output of most file compressors)
is coded at about 8.1 bits per byte, giving an expansion of
As a self-check for your protection,
bzip uses 32-bit CRCs to make sure that the decompressed
version of a file is identical to the original.
This guards against corruption of the compressed data,
and against undetected bugs in
bzip (hopefully very unlikely).
The chances of data corruption going undetected is
microscopic, about one chance in four billion
for each file processed. Be aware, though, that the check
occurs upon decompression, so it can only tell you that
that something is wrong. It cant help you recover the
original uncompressed data.
Return values: 1 for an abnormal exit, otherwise 0.
Bzip compresses large files in blocks. The block size affects both the
compression ratio achieved, and the amount of memory needed both for
compression and decompression. The flags -1 through -9
specify the block size to be 100,000 bytes through 900,000 bytes
(the default) respectively. At decompression-time, the block size used for
compression is read from the header of the compressed file, and
bunzip then allocates itself just enough memory to decompress the file.
Since block sizes are stored in compressed files, it follows that the flags
-1 to -9
are irrelevant to and so ignored during decompression.
Compression and decompression requirements, in bytes, can be estimated as:
Compression: 300k + ( 8 x block size )
Decompression: 6 x block size
The 300k constant is for a frequency-count
table, used in the sorting phase of compression.
Larger block sizes give rapidly diminishing marginal returns; most
compression comes from the first two or three hundred k of block size,
a fact worth bearing in mind when using
bzip on small machines. It is also important to appreciate that the
decompression memory requirement is set at compression-time by the
choice of block size. So, for example, if you are compressing
files which you think might possibly be decompressed on a 4-megabyte
machine, you might want to select a block size of 200k or 300k, so
the decompressor will draw 1200 kbytes or 1800 kbytes respectively,
which is probably the limit of whats comfortable on a 4-meg machine.
In general, though, you should try and use the largest block size
memory constraints allow. Compression and decompression
speed is virtually unaffected by block size.
Another significant point applies to files which fit in a single
block -- that means most files youd encounter using a large
block size. The amount of real memory touched is proportional
to the size of the file, since the file is smaller than a block.
For example, compressing a file 20,000 bytes long with the flag
will cause the compressor to allocate [by the formula, in practice a
little more] 7500k of memory, but only touch 300k + 20000 * 8 = 460
kbytes of it. Similarly, the decompressor will allocate 5400k but
only touch 20000 * 6 = 120 kbytes.
Here is a table which summarises the maximum memory usage for
different block sizes. Also recorded is the total compressed
size for 14 files of the Calgary Text Compression Corpus
totalling 3,141,622 bytes. This column gives some feel for how
compression varies with block size. These figures tend to understate
the advantage of larger block sizes for larger files, since the
Corpus is dominated by smaller files.
Compress Decompress Corpus
Flag usage usage Size
-1 1100k 500k 905958
-2 1900k 1000k 870646
-3 2700k 1500k 853650
-4 3500k 2000k 840140
-5 4300k 2500k 838355
-6 5100k 3000k 831695
-7 5900k 3500k 827104
-8 6700k 4000k 821652
-9 7500k 4500k 821652
Compress or decompress to standard output. -c requires you to supply
exactly one file name, and this file is compressed or decompressed
to standard out.
bunzip are really the same program, and the decision about whether to
compress or decompress is done on the basis of which name is
used. This flag overrides that mechanism, and forces
bzip to decompress.
The complement to -d: forces compression, regardless of the invokation
Keep (dont delete) input files during compression or decompression.
Verbose mode -- show the compression ratio for each file processed.
Be very verbose. This spews out lots of information during
compression which is primarily of interest for debugging purposes.
Display the software license terms and conditions.
-1 to -9 ||
Set the block size to 100 k, 200 k .. 900 k when
compressing. Has no effect when decompressing.
See MEMORY MANAGEMENT above.
The sorting phase of compression gathers together similar strings
in the file. Because of this, files containing very long
runs of repeated symbols, like "aabaabaabaab ..." (repeated
several hundred times) may compress extraordinarily slowly.
You can use the
option to monitor progress in great detail, if you want.
Decompression speed is unaffected. Such pathological cases
seem rare in practice.
Incompressible or virtually-incompressible data may decompress
rather more slowly than one would hope. This is due to
naive implementation of the move-to-front coder, and of the
frequency tables for the arithmetic coder.
Decompression on Sun Sparc 1s (and other low-range Sparcs)
can be slow, because of the
lack of hardware implementations of integer multiply and divide
in the SPARC v7 instruction set. The situation is much exacerbated
bzip is compiled for a full SPARC v8 instruction set, since this causes
the machine to trap on each multiply and divide instruction.
These traps take control to the relevant software emulation
of the offending instruction, but it is much quicker for the
compiler simply to plant a call to the emulation routine.
Moral: be careful how you compile
bzip for a Sparc. If you use GNU C, investigate the effects of
the -msupersparc and -mcypress flags.
Wildcard expansion for Windows 95 and NT loses leading directory
information. For example, the pathspec "sources\*.c" is searched
correctly for matching files, but the "sources\" bit is ignored when
the files come to be processed, which means
bzip wont be able to find any of them. This is easy to fix; perhaps
some enterprising soul will send me a patch?
I/O error messages are not as helpful as they could be.
Bzip tries hard to detect I/O errors and exit cleanly, but the
details of what the problem is sometimes seem rather misleading.
There is no -t option to test the integrity of a compressed
file. However, Unix folks can do the following:
bzip -dcV file.bz > /dev/null
bzip to do a trial decompression of file.bz, throwing away
the result. Youll be shown the computed and stored CRCs.
If these are identical, the file is almost certainly OK --
see the discussion above on CRCs for a definition of
If theyre not,
bzip will complain loudly. Note that file.bz is left unchanged
regardless of the outcome. Win95/NT folks can do the same, but
/dev/null will have to be replaced with something suitable,
This manual page pertains to version 0.21 of
bzip. It may well happen that some future version will
use a different compressed file format. If you try to
decompress, using 0.21, a .bz file created with some
future version which uses a different compressed file format,
0.21 will complain that your file "is not a BZIP file".
If that happens, you should obtain a more recent version
bzip and use that to decompress the file.
Julian Seward, email@example.com.
The ideas embodied in
bzip are due to (at least) the following people:
Michael Burrows and David Wheeler (for the block sorting transformation),
Peter Fenwick (for the structured coding model, and many refinements),
Alistair Moffat, Radford Neal and Ian Witten (for the arithmetic
coder). I am much indebted for their help, support and advice.
See the file ALGORITHMS in the source distribution for pointers to
sources of documentation.
Christian von Roques encouraged me to look for faster
sorting algorithms, so as to speed up compression.
Many people sent patches, helped with portability problems,
lent machines, gave advice and were generally helpful
Visit the GSP FreeBSD Man Page Interface.
Output converted with manServer 1.07.