samtools cram-size - list a breakdown of data types in a CRAM file
samtools cram-size [-ve] [-o file] in.cram
Produces a summary of CRAM block Content ID numbers and the Data Series
stored within each block. Optionally, a more detailed breakdown of how each
Data Series is encoded per container may also be listed using the -e or
--encodings option.
CRAM permits mixing multiple Data Series into a single block. In
this case it is not possible to tell what proportion of the block each Data
Series consumes. CRAM also permits the encodings and block Content ID
assignments to differ per container, although this would be highly
unusual. Htslib will always assign the same Data Series to a block with a
consistent Content ID, although the CRAM Encoding may change.
Each CRAM block has a compression method. The method may differ
between successive blocks with the same Content ID. Htslib learns
which compression methods work, so a single Content ID may have multiple
compression methods associated with it. The methods used are listed per
line with a single character code each. The size breakdown per method
and a more verbose description can be shown using the -v option. The
compression codecs used in CRAM may have a variety of parameters, such as
compression levels, inbuilt transformations, and choices of entropy
encoding. An attempt is made to distinguish between these different method
parameterisations.
The compression methods and their short and long (verbose) names
are listed below:
Short | Long               | Description
g     | gzip               | Gzip
_     | gzip-min           | Gzip -1
G     | gzip-max           | Gzip -9
b     | bzip2              | Bzip2
b     | bzip2-1 to bzip2-8 | Explicit bzip2 compression levels
B     | bzip2-9            | Bzip2 -9
l     | lzma               | LZMA
r     | r4x8-o0            | rANS 4x8 Order-0
R     | r4x8-o1            | rANS 4x8 Order-1
0     | r4x16-o0           | rANS 4x16 Order-0
0     | r4x16-o0R          | rANS 4x16 Order-0 with RLE
0     | r4x16-o0P          | rANS 4x16 Order-0 with PACK
0     | r4x16-o0PR         | rANS 4x16 Order-0 with PACK and RLE
1     | r4x16-o1           | rANS 4x16 Order-1
1     | r4x16-o1R          | rANS 4x16 Order-1 with RLE
1     | r4x16-o1P          | rANS 4x16 Order-1 with PACK
1     | r4x16-o1PR         | rANS 4x16 Order-1 with PACK and RLE
4     | r32x16-o0          | rANS 32x16 Order-0
4     | r32x16-o0R         | rANS 32x16 Order-0 with RLE
4     | r32x16-o0P         | rANS 32x16 Order-0 with PACK
4     | r32x16-o0PR        | rANS 32x16 Order-0 with PACK and RLE
5     | r32x16-o1          | rANS 32x16 Order-1
5     | r32x16-o1R         | rANS 32x16 Order-1 with RLE
5     | r32x16-o1P         | rANS 32x16 Order-1 with PACK
5     | r32x16-o1PR        | rANS 32x16 Order-1 with PACK and RLE
8     | rNx16-xo0          | rANS Nx16 STRIPED mode
2     | rNx16-cat          | rANS Nx16 CAT mode
a     | arith-o0           | Arithmetic coding Order-0
a     | arith-o0R          | Arithmetic coding Order-0 with RLE
a     | arith-o0P          | Arithmetic coding Order-0 with PACK
a     | arith-o0PR         | Arithmetic coding Order-0 with PACK and RLE
A     | arith-o1           | Arithmetic coding Order-1
A     | arith-o1R          | Arithmetic coding Order-1 with RLE
A     | arith-o1P          | Arithmetic coding Order-1 with PACK
A     | arith-o1PR         | Arithmetic coding Order-1 with PACK and RLE
a     | arith-xo0          | Arithmetic coding STRIPED mode
a     | arith-cat          | Arithmetic coding CAT mode
f     | fqzcomp            | FQZComp quality codec
n     | tok3-rans          | Name tokeniser with rANS encoding
n     | tok3-arith         | Name tokeniser with Arithmetic encoding
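Because several parameterisations share one short code, a Method column such as "_r.g" can only be expanded to a set of candidate long names per character. The following Python sketch transcribes the table above into a lookup. The names SHORT_TO_LONG and decode_methods are hypothetical helpers, not part of samtools, and the "." entry is an assumption: it is not listed in the table but appears in the example output alongside the long name "raw".

```python
# Short code -> candidate long names, transcribed from the table above.
# One character can stand for several parameterisations, so values are lists.
SHORT_TO_LONG = {
    "g": ["gzip"],
    "_": ["gzip-min"],
    "G": ["gzip-max"],
    "b": ["bzip2", "bzip2-1 to bzip2-8"],
    "B": ["bzip2-9"],
    "l": ["lzma"],
    "r": ["r4x8-o0"],
    "R": ["r4x8-o1"],
    "0": ["r4x16-o0", "r4x16-o0R", "r4x16-o0P", "r4x16-o0PR"],
    "1": ["r4x16-o1", "r4x16-o1R", "r4x16-o1P", "r4x16-o1PR"],
    "4": ["r32x16-o0", "r32x16-o0R", "r32x16-o0P", "r32x16-o0PR"],
    "5": ["r32x16-o1", "r32x16-o1R", "r32x16-o1P", "r32x16-o1PR"],
    "8": ["rNx16-xo0"],
    "2": ["rNx16-cat"],
    "a": ["arith-o0", "arith-o0R", "arith-o0P", "arith-o0PR",
          "arith-xo0", "arith-cat"],
    "A": ["arith-o1", "arith-o1R", "arith-o1P", "arith-o1PR"],
    "f": ["fqzcomp"],
    "n": ["tok3-rans", "tok3-arith"],
    # Assumption: "." is absent from the table; the examples pair it with "raw".
    ".": ["raw"],
}


def decode_methods(codes):
    """Return (code, candidate long names) for each character of a Method column."""
    return [(c, SHORT_TO_LONG.get(c, ["unknown"])) for c in codes]


if __name__ == "__main__":
    for code, names in decode_methods("_r.g"):
        print(code, "->", ", ".join(names))
```

When the exact parameterisation matters, the -v option is the reliable source, since it prints the long name directly.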
- -o FILE
- Output size information to FILE.
- -v
- Verbose mode. This shows one line per combination of Content ID and
compression method.
- -e, --encodings
- CRAM uses an Encoding, which describes how the data is serialised into a
data block. This is distinct from the CRAM compression method, which is
then applied to the block post-encoding. The encoding methods are stored
per CRAM Container.
This option lists the CRAM record encoding map and the tag encoding
map. It shows each data series, the associated CRAM encoding method,
such as HUFFMAN, BETA or EXTERNAL, and any parameters associated with
that encoding. The output may be large, as this information is reported per
container rather than as a single set of summary statistics at the end of
processing.
- -
- The basic summary of block Content ID sizes for a CRAM file:
$ samtools cram-size in.cram
# Content_ID Uncomp.size Comp.size Ratio Method Data_series
BLOCK CORE 0 0 100.00% .
BLOCK 11 394734019 51023626 12.93% g RN
BLOCK 12 1504781763 99158495 6.59% R QS
BLOCK 13 330065 84195 25.51% _r.g IN
BLOCK 14 26625602 6803930 25.55% Rrg SC
...
- -
- Show the same file as above in verbose mode. Here we see the distinct
compression methods which have been used per block Content ID.
$ samtools cram-size -v in.cram
# Content_ID Uncomp.size Comp.size Ratio Method Data_series
BLOCK CORE 0 0 100.00% raw
BLOCK 11 394734019 51023626 12.93% gzip RN
BLOCK 12 1504781763 99158495 6.59% r4x8-o1 QS
BLOCK 13 275033 64343 23.39% gzip-min IN
BLOCK 13 43327 15412 35.57% r4x8-o0 IN
BLOCK 13 2452 2452 100.00% raw IN
BLOCK 13 9253 1988 21.49% gzip IN
BLOCK 14 23106404 5903351 25.55% r4x8-o1 SC
BLOCK 14 1951616 513722 26.32% r4x8-o0 SC
BLOCK 14 1567582 386857 24.68% gzip SC
...
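The verbose rows for a Content ID should add up to the single row shown for it in the basic summary. A minimal Python sketch of that check follows, using the rows for Content IDs 13 and 14 copied from the example above. It assumes the columns are whitespace-separated, which may not hold exactly for real output, and totals_by_content_id is a hypothetical helper name.

```python
from collections import defaultdict

# Verbose rows for Content IDs 13 and 14, copied from the example above.
VERBOSE = """\
BLOCK 13 275033 64343 23.39% gzip-min IN
BLOCK 13 43327 15412 35.57% r4x8-o0 IN
BLOCK 13 2452 2452 100.00% raw IN
BLOCK 13 9253 1988 21.49% gzip IN
BLOCK 14 23106404 5903351 25.55% r4x8-o1 SC
BLOCK 14 1951616 513722 26.32% r4x8-o0 SC
BLOCK 14 1567582 386857 24.68% gzip SC
"""


def totals_by_content_id(text):
    """Sum uncompressed and compressed sizes over all methods per Content ID."""
    totals = defaultdict(lambda: [0, 0])
    for line in text.splitlines():
        fields = line.split()
        # Skip headers, "..." and anything that is not a BLOCK row.
        if len(fields) < 6 or fields[0] != "BLOCK":
            continue
        content_id = fields[1]
        totals[content_id][0] += int(fields[2])
        totals[content_id][1] += int(fields[3])
    return {cid: tuple(sizes) for cid, sizes in totals.items()}


totals = totals_by_content_id(VERBOSE)
for cid, (uncomp, comp) in totals.items():
    # Guard against empty blocks such as CORE in the example.
    ratio = 100.0 * comp / uncomp if uncomp else 100.0
    print(f"BLOCK {cid} {uncomp} {comp} {ratio:.2f}%")
```

For this example the totals are 330065 and 84195 bytes for Content ID 13, and 26625602 and 6803930 bytes for Content ID 14, matching the basic summary.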
- -
- List encoding methods per CRAM Data Series. The two-letter series are the
standard CRAM Data Series and the three-letter ones are the optional
auxiliary tags, with the tag name and type combined.
$ samtools cram-size -e in.cram
Container encodings
RN BYTE_ARRAY_STOP(stop=0,id=11)
QS EXTERNAL(id=12)
IN BYTE_ARRAY_STOP(stop=0,id=13)
SC BYTE_ARRAY_STOP(stop=0,id=14)
BB BYTE_ARRAY_LEN(len_codec={EXTERNAL(id=42)}, \
val_codec={EXTERNAL(id=37)}
...
XAZ BYTE_ARRAY_STOP(stop=9,id=5783898)
MDZ BYTE_ARRAY_STOP(stop=9,id=5063770)
ASC BYTE_ARRAY_LEN(len_codec={HUFFMAN(codes={1},lengths={0})}, \
val_codec={EXTERNAL(id=4281155)}
...
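The large id values on the auxiliary tag lines above are consistent with the three characters of the combined tag name and type packed into one integer, one byte each. This is an observation about the example output rather than something the text guarantees, and the helper names tag_content_id and split_content_id below are hypothetical. A short Python sketch of the arithmetic:

```python
def tag_content_id(tag, type_char):
    """Pack a two-character tag name and a one-character type into an integer."""
    first, second = tag.encode("ascii")  # two byte values
    return (first << 16) | (second << 8) | ord(type_char)


def split_content_id(content_id):
    """Inverse of tag_content_id: recover (tag name, type character)."""
    text = content_id.to_bytes(3, "big").decode("ascii")
    return text[:2], text[2]


if __name__ == "__main__":
    # Values taken from the example output above.
    for series, content_id in [("XAZ", 5783898), ("MDZ", 5063770), ("ASC", 4281155)]:
        tag, type_char = split_content_id(content_id)
        print(series, content_id, "->", tag, type_char)
```

Under this reading, id=5783898 is 0x58415A, the ASCII bytes of "XAZ", which makes it possible to match a Content ID in the size summary back to its tag.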
Written by James Bonfield from the Sanger Institute.
samtools(1),
Samtools website: <http://www.htslib.org/>