 |
|
| |
samtools-consensus(1) |
Bioinformatics tools |
samtools-consensus(1) |
samtools consensus - produces a consensus FASTA/FASTQ/PILEUP
samtools consensus [-saAMq] [-r region]
[-f format] [-l line-len] [-d
min-depth] [-C cutoff] [-c call-fract]
[-H het-fract] in.bam
Generate consensus from a SAM, BAM or CRAM file based on the
contents of the alignment records. The consensus is written either as FASTA,
FASTQ, or a pileup oriented format. This is selected using the -f
FORMAT option.
The default output for FASTA and FASTQ formats include one base
per non-gap consensus. Hence insertions with respect to the aligned
reference will be included and deletions removed. This behaviour can be
controlled with the --show-ins and --show-del options. This
could be used to compute a new reference from sequences assemblies to
realign against.
The pileup-style format strictly adheres to one row per consensus
location, differing from the one row per reference based used in the related
"samtools mpileup" command. This means the base quality values for
inserted columns are reported. The base quality value of gaps (either within
an insertion or otherwise) are determined as the average of the surrounding
non-gap bases. The columns shown are the reference name, position, nth base
at that position (zero if not an insertion), consensus call, consensus
confidence, sequences and quality values. Note even when a reference is
supplied, the consensus base is always report (if non-zero depth) in the 5th
column.
Two consensus calling algorithms are offered. The default computes
a heterozygous consensus in a Bayesian manner, derived from the
"Gap5" consensus algorithm. Quality values are also tweaked to
take into account other nearby low quality values. This can also be
disabled, using the --no-adj-qual option.
This method also utilises the mapping qualities, unless the
--no-use-MQ option is used. Mapping qualities are also auto-scaled to
take into account the local reference variation by processing the MD:Z tag,
unless --no-adj-MQ is used. Mapping qualities can be capped between a
minimum (--low-MQ) and maximum (--high-MQ), although the
defaults are liberal and trust the data to be true. Finally an overall scale
on the resulting mapping quality can be supplied (--scale-MQ,
defaulting to 1.0). This has the effect of favouring more calls with a
higher false positive rate (values greater than 1.0) or being more cautious
with higher false negative rates and lower false positive (values less than
1.0).
The second method is a simple frequency counting algorithm,
summing either +1 for each base type or +qual if the
--use-qual option is specified. This is enabled with the --mode
simple option.
The summed share of a specific base type is then compared against
the total possible and if this is above the --call-fract
fraction parameter then the most likely base type is called, or
"N" otherwise (or absent if it is a gap). The --ambig
option permits generation of ambiguity codes instead of "N",
provided the minimum fraction of the second most common base type to the
most common is above the --het-fract fraction.
General options that apply to both algorithms:
- -r REG, --region
REG
- Limit the query to region REG. This requires an index.
- -f FMT, --format
FMT
- Produce format FMT, with "fastq", "fasta" and
"pileup" as permitted options.
- -l N, --line-len
N
- Sets the maximum line length of line-wrapped fasta and fastq formats to
N.
- -o FILE, --output
FILE
- Output consensus to FILE instead of stdout.
- -m STR, --mode
STR
- Select the consensus algorithm. Valid modes are "simple"
frequency counting and the "bayesian" (Gap5) methods, with
Bayesian being the default. (Note case does not matter, so
"Bayesian" is accepted too.) There are a variety of bayesian
methods. Straight "bayesian" is the best set suitable for the
other parameters selected. The choice of internal parameters may change
depending on the "--P-indel" score. This method distinguishes
between substitution and indel error rates. The old Samtools consensus in
version 1.16 did not distinguish types of errors, but for compatibility
the "bayesian_116" mode may be selected to replicate this.
- -a
- Outputs all bases, from start to end of reference, even when the aligned
data does not extend to the ends. This is most useful for construction of
a full length reference sequence.
- -a -a, -aa
- Output absolutely all positions, including references with no data aligned
against them.
- --rf, --incl-flags
STR|INT
- Only include reads with at least one FLAG bit set. Defaults to zero, which
filters no reads.
- --ff, --excl-flags
STR|INT
- Exclude reads with any FLAG bit set. Defaults to
"UNMAP,SECONDARY,QCFAIL,DUP".
- --min-MQ
INT
- Filters out reads with a mapping quality below INT. This defaults
to zero.
- --min-BQ
INT
- Filters out bases with a base quality below INT. This defaults to
zero.
- --show-del
yes/no
- Whether to show deletions as "*" (yes) or to omit from the
output (no). Defaults to no.
- --show-ins
yes/no
- Whether to show insertions in the consensus. Defaults to yes.
- --mark-ins
- Insertions, when shown, are normally recorded in the consensus with plain
7-bit ASCII (ACGT, or acgt if heterozygous). However this makes it
impossible to identify the mapping between consensus coordinates and the
original reference coordinates. If fasta output is selected then the
option adds an underscore before every inserted base, plus a corresponding
character in the quality for fastq format. When used in conjunction with
-a --show-del yes, this permits an easy derivation of the consensus
to reference coordinate mapping.
- -A, --ambig
- Enables IUPAC ambiguity codes in the consensus output. Without this the
output will be limited to A, C, G, T, N and *.
- -d D, --min-depth
D
- The minimum depth required to make a call. Defaults to 1. Failing this
depth check will produce consensus "N", or absent if it is an
insertion. Note this check is performed after filtering by flags and
mapping/base quality.
- -T ref.fa,
--reference ref.fa
- For base positions with zero coverage, use the supplied reference instead
of "N". Note this does not replace minimum depth or minimum
quality filters as the base is known but considiered low quality so the
ambiguity is retained.
- --ref-qual
INT
- When --reference is given this specifies the quality value to use
for reference-derived bases. This defaults to zero.
- The following options apply only
to the simple consensus mode:
-
- -q,
--use-qual
- For the simple consensus algorithm, this enables use of base quality
values. Instead of summing 1 per base called, it sums the base quality
instead. These sums are also used in the --call-fract and
--het-fract parameters too. Quality values are always used for the
"Gap5" consensus method and this option has no effect. Note
currently quality values only affect SNPs and not inserted sequences,
which still get scores with a fixed +1 per base type occurrence.
- -H H, --het-fract
H
- For consensus columns containing multiple base types, if the second most
frequent type is at least H fraction of the most common type then a
heterozygous base type will be reported in the consensus. Otherwise the
most common base is used, provided it meets the --call-fract
parameter (otherwise "N"). The fractions computed may be
modified by the use of quality values if the -q option is enabled.
Note although IUPAC has ambiguity codes for A,C,G,T vs any other A,C,G,T
it does not have codes for A,C,G,T vs gap (such as in a heterozygous
deletion). Given the lack of any official code, we use lower-case letter
to symbolise a half-present base type.
- -c C, --call-fract
C
- Only used for the simple consensus algorithm. Require at least C
fraction of bases agreeing with the most likely consensus call to emit
that base type. This defaults to 0.75. Failing this check will output
"N".
- -@ NTHREADS
- Specify the number of additional threads to use for computing the
consensus. Note if no index is present threads will only be used for
parallel decompression meaning asking for more than 2 threads is unlikely
to speed up processing. With an index the consensus is computed for
multiple regions simultaneously, offering near linear speed ups.
- -Z BASE_COUNT
- When using multiple threads this specifies the number of bases per
threading job. The default is 500,000 bp for fasta/fastq output and
100,000 for pileup output. Larger blocks may yield improved threading
performance at a cost of more memory.
- The following options apply
only to Bayesian consensus mode enabled
- (default on).
- -C C, --cutoff
C
- Only used for the Gap5 consensus mode, which produces a Phred style score
for the final consensus quality. If this is below C then the
consensus is called as "N".
- --use-MQ,
--no-use-MQ
- Enable or disable the use of mapping qualities. Defaults to on.
- --adj-MQ,
--no-adj-MQ
- If mapping qualities are used, this controls whether they are scaled by
the local number of mismatches to the reference. The reference is unknown
by this tool, so this data is obtained from the MD:Z auxiliary tag (or
ignored if not present). Defaults to on.
- --NM-halo
INT
- Specifies the distance either side of the base call being considered for
computing the number of local mismatches.
- --low-MQ
MIN, --high-MQ MAX
- Specifies a minimum and maximum value of the mapping quality. These are
not filters and instead simply put upper and lower caps on the values. The
defaults are 0 and 60.
- --scale-MQ
FLOAT
- This is a general multiplicative mapping quality scaling factor. The
effect is to globally raise or lower the quality values used in the
consensus algorithm. Defaults to 1.0, which leaves the values unchanged.
- --P-het
FLOAT
- Controls the likelihood of any position being a heterozygous site. This is
used in the priors for the Bayesian calculations, and has little
difference on deep data. Defaults to 1e-3. Smaller numbers makes the
algorithm more likely to call a pure base type. Note the algorithm will
always compute the probability of the base being homozygous vs
heterozygous, irrespective of whether the output is reported as ambiguous
(it will be "N" if deemed to be heterozygous without
--ambig mode enabled).
- --P-indel
FLOAT
- Controls the likelihood of small indels. This is used in the priors for
the Bayesian calculations, and has little difference on deep data.
Defaults to 2e-4.
- --het-scale
FLOAT
- This is a multiplicative correction applied per base quality before adding
to the heterozygous hypotheses. Reducing it means fewer heterozygous calls
are made. This oftens leads a significant reduction in false positive het
calls, for some increase in false negatives (mislabelling real
heterozygous sites as homozygous). It is usually beneficial to reduce this
on instruments where a significant proportion of bases may be aligned in
the wrong column due to insertions and deletions leading to alignment
errors and reference bias. It can be considered as a het sensitivity
tuning parameter. Defaults to 1.0 (nop).
- -p,
--homopoly-fix
- Some technologies that call runs of the same base type together always put
the lowest quality calls at one end. This can cause problems when reverse
complementing and comparing alignments with indels. This option averages
the qualities at both ends to avoid orientation biases. Recommended for
old 454 or PacBio HiFi data sets.
- --homopoly-score
FLOAT
- The -p option also reduces confidence values within homopolymers
due to an additional likelihood of sequence specific errors. The quality
values are multiplied by FLOAT. This defaults to 0.5, but is not
used if -p was not specified. Adjusting this score also
automatically enables -p.
- -t, --qual-calibration
FILE
- Loads a quality calibration table from FILE. The format of this is
a series of lines with the following fields, each starting with the
literal text "QUAL":
QUAL value substitution undercall
overcall
Lines starting with a "#" are ignored. Each line
maps a recorded quality value to the Phred equivalent score for
substitution, undercall and overcall errors. Quality values are
expected to be sorted in increasing numerical order, but may skip
values. This allows the consensus algorithm to know the most likely
cause of an error, and whether the instrument is more likely to have
indel errors (more common in some long read technologies) or
substitution errors (more common in clocked short-read instruments).
Some pre-defined calibration tables are built in. These are
specified with a fake filename starting with a colon. See the -X
option for more details.
Note due to the additional heuristics applied by the consensus
algorithm, these recalibration tables are not a true reflection of the
instrument error rates and are a work in progress.
- -X, --config
STR
- Specifies predefined sets of configuration parameters. Acceptable values
for STR are defined below, along with the list of parameters they
are equivalent to.
- hiseq
- --qual-calibration :hiseq
- hifi
- --qual-calibration :hifi --homopoly-fix 0.3 --low-MQ 5 --scale-MQ 1.5
--het-scale 0.37
- r10.4_sup
- --qual-calibration :r10.4_sup --homopoly-fix 0.3 --low-MQ 5 --scale-MQ 1.5
--het-scale 0.37
- r10.4_dup
- --qual-calibration :r10.4_dup --homopoly-fix 0.3 --low-MQ 5 --scale-MQ 1.5
--het-scale 0.37
- ultima
- --qual-calibration :ultima --homopoly-fix 0.3 --low-MQ 10 --scale-MQ 2
--het-scale 0.37
- -
- Create a modified FASTA reference that has a 1:1 coordinate correspondence
with the original reference used in alignment.
samtools consensus -a --show-ins no --show-del yes in.bam -o ref.fa
- -
- Create a FASTQ file for the contigs with aligned data, including
insertions.
samtools consensus -f fastq in.bam -o cons.fq
Written by James Bonfield from the Sanger Institute.
samtools(1), samtools-mpileup(1),
Samtools website: <http://www.htslib.org/>
Visit the GSP FreeBSD Man Page Interface. Output converted with ManDoc.
|