samtools-checksum(1) Bioinformatics tools
samtools checksum - produces checksums of SAM / BAM / CRAM
content
samtools checksum [options]
in.sam|in.bam|in.cram|in.fastq [ ... ]
samtools checksum -m [options] in.checksum [ ... ]
With no options, this produces an order agnostic checksum of
sequence, quality, read-name and barcode related aux data in a SAM, BAM,
CRAM or FASTQ file. A CRC32 checksum is computed per record and the CRCs
are combined by multiplication in a prime field of size (1<<31)-1, that
is 2^31-1.
The purpose of this mode is to validate that no data has been lost
during processing steps such as alignment and sorting. Only primary
alignments are recorded and the checksums computed are order agnostic, so
the same checksums are produced for name collated and position sorted
output files.
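The order agnostic combination can be illustrated with a minimal Python sketch. It is not the samtools implementation: it assumes the modulus is the Mersenne prime 2^31-1, concatenates the fields in an arbitrary order, and remaps a zero CRC to 1 so that one record cannot zero the product. The function names are hypothetical.

```python
import zlib

PRIME = (1 << 31) - 1  # 2**31 - 1, a Mersenne prime (assumed modulus)


def record_crc(name: bytes, seq: bytes, qual: bytes) -> int:
    """CRC32 of one record, reduced to a non-zero element of the field."""
    crc = zlib.crc32(name + seq + qual) % PRIME
    # Assumption: zero is remapped to 1 so the running product never collapses.
    return crc or 1


def combine(crcs) -> int:
    """Multiply per-record CRCs modulo PRIME."""
    total = 1
    for crc in crcs:
        total = (total * crc) % PRIME
    return total


records = [
    (b"read1", b"ACGT", b"IIII"),
    (b"read2", b"GGCA", b"IIHH"),
    (b"read3", b"TTAG", b"HHII"),
]
forward = combine(record_crc(*r) for r in records)
shuffled = combine(record_crc(*r) for r in reversed(records))
print(forward == shuffled)  # the product does not depend on record order
```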
One set of checksums is produced per read-group, together with a
combined set for the whole file and a set for records that have no
read-group assigned. This allows for validation of merging multiple runs and
splitting pools by their read-group. The checksums are also reported for
QC-pass only and QC-fail only (indicated by the QCFAIL BAM flag), so
checksums of data identified and removed as contamination can also be
tracked.
All of the above are compatible with Biobambam2's bamseqchksum
tool, which was the inspiration for this samtools command. The -B
option further enhances compatibility by using the same output format,
although it limits the functionality to the order agnostic checksums and
fewer types validated.
The -m or --merge option can be used to merge
previously generated checksums. The input filenames are checksum outputs
from this tool (via shell redirection or the -o option). The intended
use of this is to validate that no data is lost or corrupted during file
merging of read-group specific files, by algorithmically computing the
expected checksum output.
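Why merged checksums can be computed algorithmically follows from the multiplicative combination: the product over a merged file equals the product of the per-file products. The Python sketch below illustrates only that arithmetic, under the assumption of a 2^31-1 modulus. It does not read or write the samtools report format, and the helper names and record payloads are hypothetical.

```python
import zlib

PRIME = (1 << 31) - 1  # assumed modulus, 2**31 - 1


def checksum(records) -> int:
    """Order agnostic product of per-record CRC32 values modulo PRIME."""
    total = 1
    for rec in records:
        total = (total * ((zlib.crc32(rec) % PRIME) or 1)) % PRIME
    return total


def merge(*group_checksums: int) -> int:
    """Predict the checksum of a merged file from per-group checksums."""
    merged = 1
    for value in group_checksums:
        merged = (merged * value) % PRIME
    return merged


rg1 = [b"r1/ACGT", b"r2/GGCA"]  # placeholder records for read-group 1
rg2 = [b"r3/TTAG"]              # placeholder records for read-group 2

expected = merge(checksum(rg1), checksum(rg2))
observed = checksum(rg2 + rg1)  # checksum of the "merged file"
print(expected == observed)
```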
Additionally checksum can track other columns including BAM
flags, mapping information (MAPQ and CIGAR), pair information (RNEXT, PNEXT
and TLEN), as well as a wider list of tags.
With the -O option the checksums become record order
specific. Combined together with the -a option this can be used to
validate SAM, BAM and CRAM format conversions. The CRCs per record are XORed
with a record counter for the Nth record per read group. See the detailed
description below for single -O vs double and the implications on
reordering between read-groups.
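The effect of a single -O can be sketched in Python as follows. This is an illustration under stated assumptions rather than the samtools code: the CRC is XORed with a per-read-group counter starting at 1, then reduced modulo an assumed prime of 2^31-1, and the function name and payloads are hypothetical.

```python
import zlib
from collections import defaultdict

PRIME = (1 << 31) - 1  # assumed modulus, 2**31 - 1


def in_order_checksums(records):
    """records: iterable of (read_group, payload_bytes) in file order.

    Returns per-read-group checksums plus an "all" entry whose counter is
    taken from the record's own read-group, as with a single -O.
    """
    counters = defaultdict(int)
    sums = defaultdict(lambda: 1)
    for rg, payload in records:
        counters[rg] += 1  # Nth record within this read-group
        crc = ((zlib.crc32(payload) ^ counters[rg]) % PRIME) or 1
        for key in (rg, "all"):
            sums[key] = (sums[key] * crc) % PRIME
    return dict(sums)


original = [("rg1", b"a"), ("rg1", b"b"), ("rg2", b"c")]
interleaved = [("rg1", b"a"), ("rg2", b"c"), ("rg1", b"b")]  # order kept within each group
swapped = [("rg1", b"b"), ("rg1", b"a"), ("rg2", b"c")]      # order changed within rg1

print(in_order_checksums(original) == in_order_checksums(interleaved))
print(in_order_checksums(original)["all"] != in_order_checksums(swapped)["all"])
```

Reordering records between read-groups leaves every checksum unchanged in this scheme, while reordering within a read-group changes them, which matches the trade-off described for single versus double -O.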
When performing such validation, it is also useful to enable data
sanitisation first, as CRAM can fix up certain types of inconsistencies
including common issues such as MAPQ and CIGAR strings for unaligned
data.
The output format consists of a machine readable table of
checksums and human readable text starting with a "#"
character.
For compatibility with bamseqchksum the data is CRCed in specific
orders before combining together to form a checksum column. The last column
reported is then the combination of all checksums in that row, permitting
easy comparison by looking at a single value.
The columns reported are as follows.
- Group
- The read group name. There is always an "all" group which
represents all records. This is followed by one checksum set per
read-group found in the file.
- QC
- This is either "all" or "pass". "Pass"
refers to records that do not have the QCFAIL BAM flag specified.
- flag+seq
- The checksum of SAM FLAG + SEQ fields
- +name
- The checksum of SAM QNAME + FLAG + SEQ fields
- +qual
- The checksum of SAM FLAG + SEQ + QUAL fields
- +aux
- The checksum of SAM FLAG + SEQ + selected auxiliary fields
- +chr/pos
- The checksum of SAM FLAG + SEQ + RNAME (chromosome) + POSition fields
- +cigar
- The checksum of SAM FLAG + SEQ + MAPQ + CIGAR fields
- +mate
- The checksum of SAM FLAG + SEQ + RNEXT + PNEXT + TLEN fields.
- combined
- The combined checksum of all columns prior to this column. The first row
will be for all alignments, so the combined checksum on the first row may
be used as a whole file combined checksum.
An example output can be seen below.
# Checksum for file: NA12892.chrom20.ILLUMINA.bwa.CEU.high_coverage.bam
# Aux tags: BC,FI,QT,RT,TC
# BAM flags: PAIRED,READ1,READ2
# Group    QC        count  flag+seq  +name     +qual     +aux      combined
all        all    42890086  71169bbb  633fd9f7  2a2e693f  71169bbb  09d03ed4
SRR010946  all      262249  2957df86  3b6dcbc9  66be71f7  2957df86  58e89c25
SRR002165  all       97846  47ff17e0  6ff8fc7b  58366bf5  47ff17e0  796eecb0
[...cut...]
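Because the table is whitespace separated and the column names appear on the "# Group" line, a report can be read with a few lines of Python. The sketch below is one possible parser, not part of samtools. The function name parse_report is hypothetical, and it assumes read-group names contain no whitespace.

```python
def parse_report(text: str):
    """Parse a checksum report into a list of {column: value} dicts."""
    columns, rows = [], []
    for line in text.splitlines():
        if line.startswith("# Group"):
            # Column names come from the header comment, minus the "#".
            columns = line[1:].split()
        elif line.startswith("#") or not line.strip():
            continue  # other human readable comments and blank lines
        else:
            rows.append(dict(zip(columns, line.split())))
    return rows


report = """\
# Checksum for file: NA12892.chrom20.ILLUMINA.bwa.CEU.high_coverage.bam
# Aux tags: BC,FI,QT,RT,TC
# BAM flags: PAIRED,READ1,READ2
# Group QC count flag+seq +name +qual +aux combined
all all 42890086 71169bbb 633fd9f7 2a2e693f 71169bbb 09d03ed4
SRR010946 all 262249 2957df86 3b6dcbc9 66be71f7 2957df86 58e89c25
"""

rows = parse_report(report)
whole_file = next(r for r in rows if r["Group"] == "all" and r["QC"] == "all")
print(whole_file["combined"])  # 09d03ed4
```

Splitting on arbitrary whitespace means the same parser should handle both the default aligned output and the tab separated output.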
- -@ COUNT
- Uses COUNT compute threads in decoding the file. Typically this
does not gain much speed beyond 2 or 3. The default is to use a single
thread.
- -B,
--bamseqchksum
- Produces a report compatible with biobambam2's bamseqchksum default
output. Note this is only expected to work if no other format options have
been enabled. Specifically the header line is not updated to reflect
additional columns if requested.
Bamseqchksum has more output modes and many alternative
checksums. We only support the default CRC32 method.
- -F FLAG,
--exclude-flags FLAG
- Specifies which alignment FLAGs to filter out. This defaults to
secondary and supplementary alignments (0x900) as these can be duplicates
of the primary alignment. This ensures the same number of records are
checksummed in unaligned and aligned files.
- -f FLAG,
--require-flags FLAG
- A list of FLAGs that are required. Defaults to zero. An example use
of this may be to checksum QCFAIL only.
- -b FLAG,
--flag-mask FLAG
- The BAM FLAG is masked first before checksumming. The unaligned
flags will contain data about the sequencing run - whether it is paired in
sequencing and if so whether this is READ1 or READ2. These flags will not
change post-alignment and so everything except these three are masked out.
FLAG defaults to PAIRED,READ1,READ2 (0xc1).
- -c,
--no-rev-comp
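- By default the sequence and quality strings of reverse-strand alignments
are reverse complemented (the quality string is reversed) before
checksumming, restoring the original read orientation so that alignment
does not affect the checksums. This option disables this and checksums the
data as-is.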
- -t STR, --tags STR
- Specifies a comma-separated list of aux tags to checksum. These are
concatenated together in their canonical BAM encoding in the order listed
in STR, prior to computing the checksums.
If STR begins with "*" then all tags are
used. This can then be followed by a comma separated list of tags to
exclude. For example "*,MD,NM" is all tags except MD and NM.
In this mode, the tags are combined in alphanumeric order.
The default value is "BC,FI,QT,RT,TC".
- -O,
--in-order
-
By default the CRCs are combined in a multiplicative field
that is order agnostic, as multiplication is a commutative operation.
This option XORs the CRC with a number indicating the Nth record
number for this grouping prior to the multiply step, making the final
multiplicative checksum dependent on the order of the input data.
For the "all" row the count is taken from the Nth
record in the read-group associated with this record (or the
"-" row for read-group-less data). This ensures that the
checksums can be subsequently merged together algorithmically using the
-m option, but it does mean there is no validation of record
swaps between read-groups. Note however due to the way ties are
resolved, when running samtools merge out.bam rg1.bam rg2.bam we
may get different orderings if we merged the two files in the opposite
order. This can happen when two read-groups have alignments at the same
position with the same BAM flags. Hence if we wish to check a
samtools split followed by samtools merge round trip works
then this counter per readgroup is a benefit.
However, if absolute ordering needs to be validated regardless
of read-groups, specifying the -O option twice will compute the
"all" row by combining the CRC with the Nth record in the file
rather than the Nth record in its read-group. This output can no longer
be merged using checksum -m.
- -P,
--check-pos
- Adds a column to the output with combined chromosome and position
checksums. This also incorporates the flag/sequence CRC.
- -C,
--check-cigar
- Adds a column to the output with combined mapping quality and CIGAR
checksums. This also incorporates the flag/sequence CRC.
- -M,
--check-mate
- Adds a column to the output with combined mate reference, mate position
and template length checksums. This also incorporates the flag/sequence
CRC.
- -z FLAGS, --sanitize FLAGS
- Perform data sanitization prior to checksumming. This is off by default.
See samtools view for the FLAG terms accepted.
- -N COUNT, --count COUNT
- Limits the checksumming to the first COUNT records from the file.
- -a, --all
- Checksum all data. This is equivalent to -PCMOc -b 0xfff -f0 -F0
-z all,cigarx -t *,cF,MD,NM. It is useful for validating
round-trips between file formats, such as BAM to CRAM.
- -T, --tabs
- Use tabs for separating columns instead of aligned spaces.
- -q, --show-qc
- Also show QC pass and fail rows per read-group. These are based on the
QCFAIL BAM flag.
- -o FILE, --output FILE
- Output checksum report to FILE instead of stdout.
- -m FILE, --merge FILE...
- Merge checksum outputs produced by the -o option. This can be used
to simulate or validate the effect of computing checksum on the output of
a samtools merge command.
The columns to report are read from the "# Group"
line. The rows to report are still governed by the -q, -v
and -T options so this can also be used for reformatting of a
single file.
Note that merging of the "all" row cannot be done when the
two levels of order-specific checksums (-OO) have been used.
- -v, --verbose
- Increase verbosity. At level 1 or higher this also shows rows that have
zero count values, which can aid machine parsing.
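The default filtering, flag masking and orientation handling described in the options above can be summarised in a short Python sketch. It shows the idea only and is not how samtools implements it. The flag values are the standard SAM FLAG bits, while the function name normalise and the example reads are hypothetical.

```python
# Standard SAM FLAG bits used here.
PAIRED, REVERSE, READ1, READ2 = 0x1, 0x10, 0x40, 0x80
SECONDARY, SUPPLEMENTARY = 0x100, 0x800

EXCLUDE = SECONDARY | SUPPLEMENTARY  # default --exclude-flags, 0x900
FLAG_MASK = PAIRED | READ1 | READ2   # default --flag-mask, 0xc1

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def normalise(flag: int, seq: str, qual: str):
    """Return (masked_flag, seq, qual) as checksummed, or None if filtered out."""
    if flag & EXCLUDE:
        return None  # secondary and supplementary records are skipped
    if flag & REVERSE:
        # Undo the reverse complement applied when aligning to the reverse strand.
        seq = seq.translate(_COMPLEMENT)[::-1]
        qual = qual[::-1]
    return flag & FLAG_MASK, seq, qual


unaligned = normalise(0x4D, "AACG", "IIHG")  # PAIRED,UNMAP,MUNMAP,READ1
aligned = normalise(0x53, "CGTT", "GHII")    # PAIRED,PROPER_PAIR,REVERSE,READ1
print(unaligned == aligned)                  # True
print(normalise(0x153, "CGTT", "GHII"))      # None (secondary alignment)
```

After normalisation the unaligned read and its reverse-strand alignment present identical data, which is why the same checksums are expected before and after alignment.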
- To check that an aligned and position sorted file contains the same data
as the pre-alignment FASTQ:
samtools checksum -q pos-aln.bam
samtools import -u -1 rg1.fastq.gz -2 rg2.fastq.gz | samtools checksum -q
The output for this consists of some human readable comments
starting with "#" and a series of checksum lines per
read-group and QC status.
# Checksum for file: SRR554369.P_aeruginosa.cram
# Aux tags: BC,FI,QT,RT,TC
# BAM flags: PAIRED,READ1,READ2
# Group  QC      count  flag+seq  +name     +qual     +aux      combined
all      all   3315742  4a812bf2  22d15cfe  507f0f57  4a812bf2  035e2f5b
all      pass  3315742  4a812bf2  22d15cfe  507f0f57  4a812bf2  035e2f5b
Note as no barcode tags exist, the "+aux" column is
the same as the "flag+seq" column it is based upon.
- To check round-tripping from BAM to CRAM and back again we can convert the
BAM to CRAM and then run the checksum on the CRAM file. There is no need
to explicitly convert back to BAM, as htslib will decode the CRAM into
the same in-memory representation that is used for BAM.
samtools checksum -a 9827_2#49.1m.bam
[...cut...]
samtools view -@8 -C -T $HREF 9827_2#49.1m.bam | samtools checksum -a
# Checksum for file: -
# Aux tags: *,cF,MD,NM
# BAM flags: PAIRED,PROPER_PAIR,UNMAP,MUNMAP,REVERSE,MREVERSE,READ1,READ2,SECONDARY,QCFAIL,DUP,SUPPLEMENTARY
# Group  QC   count  flag+seq  +name     +qual     +aux      +chr/pos  +cigar    +mate     combined
all      all  99890  066a0706  0805371d  5506e19f  6b0eec58  60e2347c  09a2c3ba  347a3214  66c5e2de
1#49     all  99890  066a0706  0805371d  5506e19f  6b0eec58  60e2347c  09a2c3ba  347a3214  66c5e2de
- To validate that splitting a file by read-group retains all the data, we
can compute checksums on the split BAMs and merge the checksum reports
together to compare against the original unsplit file. (Note in the
example below diff will report the filename changing, which is expected.)
samtools split -u /tmp/split/noRG.bam -f '/tmp/split/%!.%.' in.cram
samtools checksum -a in.cram -o in.chksum
s=$(for i in /tmp/split/*.bam;do echo "<(samtools checksum -a $i)";done)
eval samtools checksum -m $s -o split.chksum
diff in.chksum split.chksum
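Since the filename comment is expected to differ, the comparison can also be scripted so that this line is ignored. The Python sketch below is one way to do that. It assumes the filename is reported only on the "# Checksum for file:" line, the function names are hypothetical, and the second report is made-up sample text reusing values from the earlier example.

```python
def comparable_lines(report: str):
    """Drop the filename comment, which is expected to differ between reports."""
    return [line for line in report.splitlines()
            if not line.startswith("# Checksum for file:")]


def same_checksums(a: str, b: str) -> bool:
    """True if two reports match on everything except the filename line."""
    return comparable_lines(a) == comparable_lines(b)


# Sample report text; the filenames are placeholders.
original = """\
# Checksum for file: in.cram
# Group QC count flag+seq +name +qual +aux combined
all all 3315742 4a812bf2 22d15cfe 507f0f57 4a812bf2 035e2f5b
"""
merged = original.replace("in.cram", "-")
corrupted = original.replace("035e2f5b", "035e2f5c")

print(same_checksums(original, merged))     # True
print(same_checksums(original, corrupted))  # False
```

With real data the two strings would be the contents of in.chksum and split.chksum.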
Written by James Bonfield from the Sanger Institute.
Inspired by bamseqchksum, written by David Jackson of Sanger Institute and
amended by German Tischler.
samtools(1), samtools-view(1),
Samtools website: <http://www.htslib.org/>