With no options or regions specified, prints all alignments in the specified input alignment file (in SAM, BAM, or CRAM format) to standard output in SAM format (with no header).
You may specify one or more space-separated region specifications after the input filename to restrict output to only those alignments which overlap the specified region(s). Use of region specifications requires a coordinate-sorted and indexed input file (in BAM or CRAM format).
The -b, -C, -1, -u, -h, -H, and -c options change the output format from the default of headerless SAM, and the -o and -U options set the output file name(s).
The -t and -T options provide additional reference data. One of these two options is required when SAM input does not contain @SQ headers, and the -T option is required whenever writing CRAM output.
The -L, -r, -R, -q, -l, -m, -f, and -F options filter the alignments that will be included in the output to only those alignments that match certain criteria.
The -x, -B, and -s options modify the data which is contained in each alignment.
Finally, the -@ option can be used to allocate additional threads to be used for compression, and the -? option requests a long help message.
Regions can be specified as RNAME[:STARTPOS[-ENDPOS]], and all position coordinates are 1-based.
Important note: when multiple regions are given, some alignments may be output multiple times if they overlap more than one of the specified regions.
Examples of region specifications:
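As a rough illustration of how a specification of the form RNAME[:STARTPOS[-ENDPOS]] breaks down, the Python sketch below splits one into its parts. The parse_region helper and the Region type are hypothetical names invented for this example and are not part of samtools. It assumes that commas in positions are thousands separators and that the last colon separates the reference name from the coordinates; samtools itself may handle edge cases differently.

```python
from typing import NamedTuple, Optional


class Region(NamedTuple):
    rname: str
    start: Optional[int]  # 1-based, inclusive; None means "from the beginning"
    end: Optional[int]    # 1-based, inclusive; None means "to the end"


def parse_region(spec: str) -> Region:
    """Split RNAME[:STARTPOS[-ENDPOS]] into its parts (illustrative only)."""
    rname, sep, coords = spec.rpartition(":")
    if not sep:
        # No colon at all: the whole string is the reference name.
        return Region(spec, None, None)
    # Assumption: commas are thousands separators, as in chr10:10,000,000.
    coords = coords.replace(",", "")
    start_text, dash, end_text = coords.partition("-")
    start = int(start_text)
    end = int(end_text) if dash else None
    if start < 1 or (end is not None and end < start):
        raise ValueError(f"invalid region: {spec!r}")
    return Region(rname, start, end)


whole_reference = parse_region("chr1")
open_ended = parse_region("chr2:1,000,000")
bounded = parse_region("chr3:1000-2000")
```

Here whole_reference has no coordinates, open_ended has only a start position, and bounded has both ends set.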
Sort alignments by leftmost coordinates, or by read name when -n is used. An appropriate @HD-SO sort order header tag will be added or an existing one updated if necessary.
The sorted output is written to standard output by default, or to the specified file (out.bam) when -o is used. This command will also create temporary files tmpprefix.%d.bam as needed when the entire alignment data cannot fit into memory (as controlled via the -m option).
Index a coordinate-sorted BAM or CRAM file for fast random access. (Note that this does not work with SAM files even if they are bgzip compressed; to index such files, use tabix(1) instead.)
This index is needed when region arguments are used to limit samtools view and similar commands to particular regions of interest.
If an output filename is given, the index file will be written to out.index. Otherwise, for a CRAM file aln.cram, index file aln.cram.crai will be created; for a BAM file aln.bam, either aln.bam.bai or aln.bam.csi will be created, depending on the index format selected.
Retrieve and print stats in the index file corresponding to the input file. Before calling idxstats, the input BAM file must be indexed by samtools index.
The output is TAB-delimited with each line consisting of reference sequence name, sequence length, # mapped reads and # unmapped reads. It is written to stdout.
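A minimal sketch of consuming this output from Python follows, assuming exactly the four TAB-delimited columns described above. The parse_idxstats helper, the IdxstatsRow type, and the sample numbers are all made up for illustration.

```python
from typing import Iterable, List, NamedTuple


class IdxstatsRow(NamedTuple):
    rname: str
    length: int
    mapped: int
    unmapped: int


def parse_idxstats(lines: Iterable[str]) -> List[IdxstatsRow]:
    """Parse TAB-delimited idxstats lines into typed rows."""
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        name, length, mapped, unmapped = line.split("\t")
        rows.append(IdxstatsRow(name, int(length), int(mapped), int(unmapped)))
    return rows


# Hypothetical sample output; the reference names and counts are invented.
sample = "chr1\t1000\t90\t2\nchr2\t500\t10\t0\n"
rows = parse_idxstats(sample.splitlines())
total_mapped = sum(row.mapped for row in rows)
```

In practice the lines would come from the standard output of samtools idxstats rather than from a literal string.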
Does a full pass through the input file to calculate and print statistics to stdout.
Provides counts for each of 13 categories based primarily on bit flags in the FLAG field. Each category in the output is broken down into QC pass and QC fail, which is presented as "#PASS + #FAIL" followed by a description of the category.
The first row of output gives the total number of reads that are QC pass and fail (according to flag bit 0x200). For example:
122 + 28 in total (QC-passed reads + QC-failed reads)
This would indicate that there are a total of 150 reads in the input file, 122 of which are marked as QC pass and 28 of which are marked as "not passing quality controls".
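The "#PASS + #FAIL description" layout can be picked apart with a short regular expression. The sketch below is one possible way to do it; parse_flagstat_line is a hypothetical helper, not something samtools provides.

```python
import re
from typing import Tuple

# "#PASS + #FAIL" followed by a free-text category description.
_FLAGSTAT_LINE = re.compile(r"^(\d+) \+ (\d+) (.*)$")


def parse_flagstat_line(line: str) -> Tuple[int, int, str]:
    """Return (qc_passed, qc_failed, description) for one flagstat row."""
    match = _FLAGSTAT_LINE.match(line.strip())
    if match is None:
        raise ValueError(f"not a flagstat line: {line!r}")
    passed, failed, description = match.groups()
    return int(passed), int(failed), description


passed, failed, description = parse_flagstat_line(
    "122 + 28 in total (QC-passed reads + QC-failed reads)"
)
total = passed + failed
```

For the example row above, passed is 122, failed is 28, and total is 150.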
Following this, additional categories are given for reads which are:
And finally, two rows are given that additionally filter on the reference name (RNAME), mate reference name (MRNM), and mapping quality (MAPQ) fields:
samtools stats collects statistics from BAM files and outputs in a text format. The output can be visualized graphically using plot-bamstats.
region.bed in1.sam|in1.bam|in1.cram[...]
Reports read depth per genomic region, as specified in the supplied BED file.
[in1.sam|in1.bam|in1.cram [in2.sam|in2.bam|in2.cram] [...]]
Computes the depth at each position or region.
samtools merge [-nur1f] [-h inh.sam] [-R reg] [-b <list>] <out.bam> <in1.bam> [<in2.bam> <in3.bam> ... <inN.bam>]
Merge multiple sorted alignment files, producing a single sorted output file that contains all the input records and maintains the existing sort order.
If -h is specified the @SQ headers of input files will be merged into the specified header, otherwise they will be merged into a composite header created from the input headers. If in the process of merging @SQ lines for coordinate sorted input files, a conflict arises as to the order (for example input1.bam has @SQ for a,b,c and input2.bam has b,a,c) then the resulting output file will need to be re-sorted back into coordinate order.
Unless the -c or -p flags are specified, when merging @RG and @PG records into the output header, any IDs found to be duplicates of existing IDs in the output header will have a suffix appended to differentiate them from similar header records from other files, and the read records will be updated to reflect this.
samtools faidx <ref.fasta> [region1 [...]]
Index reference sequence in the FASTA format or extract subsequence from indexed reference sequence. If no region is specified, faidx will index the file and create <ref.fasta>.fai on the disk. If regions are specified, the subsequences will be retrieved and printed to stdout in the FASTA format.
The input file can be compressed in the BGZF format.
The sequences in the input file should all have different names. If they do not, indexing will emit a warning about duplicate sequences and retrieval will only produce subsequences from the first sequence with the duplicated name.
Text alignment viewer (based on the ncurses library). In the viewer, press ? for help and press g to check the alignment start from a region in the format like chr10:10,000,000 or =10,000,000 when viewing the same reference sequence.
Splits a file by read group.
in.sam|in.bam|in.cram [ ... ]
Quickly check that input files appear to be intact. Checks that the beginning of the file contains a valid header (all formats) containing at least one target sequence, then seeks to the end of the file and checks that an end-of-file (EOF) block is present and intact (BAM only).
Data in the middle of the file is not read since that would be much more time consuming, so please note that this command will not detect internal corruption, but is useful for testing that files are not truncated before performing more intensive tasks on them.
This command will exit with a non-zero exit code if any input files don't have a valid header or are missing an EOF block. Otherwise it will exit successfully (with a zero exit code).
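The EOF part of this check can be approximated in a few lines of Python. This is only a sketch of the idea, not the quickcheck implementation: it compares the last 28 bytes of a file against the empty BGZF block that the SAM/BAM specification defines as the end-of-file marker, and it does not validate the header. The has_bam_eof helper and the temporary file names are hypothetical.

```python
import os
import tempfile

# The 28-byte empty BGZF block used as the BAM end-of-file marker,
# as listed in the SAM/BAM specification.
BAM_EOF = bytes.fromhex(
    "1f 8b 08 04 00 00 00 00 00 ff 06 00 42 43 "
    "02 00 1b 00 03 00 00 00 00 00 00 00 00 00"
)


def has_bam_eof(path: str) -> bool:
    """Return True if the file ends with the BGZF EOF marker block."""
    if os.path.getsize(path) < len(BAM_EOF):
        return False
    with open(path, "rb") as handle:
        handle.seek(-len(BAM_EOF), os.SEEK_END)
        return handle.read() == BAM_EOF


# Demonstration on two stand-in files (not real BAM data).
with tempfile.TemporaryDirectory() as tmp:
    intact_path = os.path.join(tmp, "intact.bin")
    truncated_path = os.path.join(tmp, "truncated.bin")
    with open(intact_path, "wb") as out:
        out.write(b"placeholder payload" + BAM_EOF)
    with open(truncated_path, "wb") as out:
        out.write(b"placeholder payload" + BAM_EOF[:-5])
    intact_ok = has_bam_eof(intact_path)
    truncated_ok = has_bam_eof(truncated_path)
```

As with quickcheck itself, a passing result says nothing about corruption in the middle of the file.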
samtools dict <ref.fasta|ref.fasta.gz>
Create a sequence dictionary file from a fasta file.
Fill in mate coordinates, ISIZE and mate related flags from a name-sorted alignment.
Generate VCF, BCF or pileup for one or multiple BAM files. Alignment records are grouped by sample (SM) identifiers in @RG header lines. If sample identifiers are absent, each input file is regarded as one sample.
In the pileup format (without -u or -g), each line represents a genomic position, consisting of chromosome name, 1-based coordinate, reference base, the number of reads covering the site, read bases, base qualities and alignment mapping qualities. Information on match, mismatch, indel, strand, mapping quality and start and end of a read are all encoded at the read base column. At this column, a dot stands for a match to the reference base on the forward strand, a comma for a match on the reverse strand, a > or < for a reference skip, ACGTN for a mismatch on the forward strand and acgtn for a mismatch on the reverse strand. A pattern \+[0-9]+[ACGTNacgtn]+ indicates there is an insertion between this reference position and the next reference position. The length of the insertion is given by the integer in the pattern, followed by the inserted sequence. Similarly, a pattern -[0-9]+[ACGTNacgtn]+ represents a deletion from the reference. The deleted bases will be presented as * in the following lines. Also at the read base column, a symbol ^ marks the start of a read. The ASCII of the character following ^ minus 33 gives the mapping quality. A symbol $ marks the end of a read segment.
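To make the read base column encoding more concrete, the sketch below walks one such column and tallies the symbols described above. It is a simplified illustration under the stated rules only; summarize_read_bases and the category names it reports are invented for this example, and the sample string is hypothetical.

```python
import re
from collections import Counter
from typing import List, Tuple

_SIMPLE = {".": "match_fwd", ",": "match_rev", ">": "ref_skip",
           "<": "ref_skip", "*": "deleted_base", "$": "read_end"}


def summarize_read_bases(bases: str) -> Tuple[Counter, List[int]]:
    """Tally pileup read-base symbols; also return read-start mapping qualities."""
    counts: Counter = Counter()
    start_mapqs: List[int] = []
    i = 0
    while i < len(bases):
        ch = bases[i]
        if ch == "^":
            if i + 1 >= len(bases):
                raise ValueError("'^' must be followed by a quality character")
            # ASCII code of the next character minus 33 is the mapping quality.
            start_mapqs.append(ord(bases[i + 1]) - 33)
            counts["read_start"] += 1
            i += 2
        elif ch in ("+", "-"):
            # +N<seq> is an insertion and -N<seq> a deletion of N bases.
            number = re.match(r"[0-9]+", bases[i + 1:])
            if number is None:
                raise ValueError(f"missing indel length at offset {i}")
            counts["insertion" if ch == "+" else "deletion"] += 1
            i += 1 + len(number.group()) + int(number.group())
        elif ch in _SIMPLE:
            counts[_SIMPLE[ch]] += 1
            i += 1
        elif ch in "ACGTN":
            counts["mismatch_fwd"] += 1
            i += 1
        elif ch in "acgtn":
            counts["mismatch_rev"] += 1
            i += 1
        else:
            raise ValueError(f"unexpected symbol {ch!r} at offset {i}")
    return counts, start_mapqs


# Hypothetical read base column, not taken from real data.
counts, start_mapqs = summarize_read_bases("^I.,$A+2AGc-1t*")
```

In this sample the character I after ^ has ASCII code 73, so the read starting there has mapping quality 40.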
samtools flags INT|STR[,...]
Convert between textual and numeric flag representation.
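The conversion amounts to mapping bit values to names and back. The following Python sketch shows the idea; the flag names are the ones I understand the SAM specification and samtools to use, but treat the table as an assumption and check it against the output of samtools flags. The flags_to_str and str_to_flags helpers are hypothetical.

```python
# Bit values from the SAM FLAG field; the names are assumed to match samtools.
FLAG_NAMES = [
    (0x1, "PAIRED"), (0x2, "PROPER_PAIR"), (0x4, "UNMAP"), (0x8, "MUNMAP"),
    (0x10, "REVERSE"), (0x20, "MREVERSE"), (0x40, "READ1"), (0x80, "READ2"),
    (0x100, "SECONDARY"), (0x200, "QCFAIL"), (0x400, "DUP"),
    (0x800, "SUPPLEMENTARY"),
]
_BY_NAME = {name: bit for bit, name in FLAG_NAMES}


def flags_to_str(value: int) -> str:
    """Numeric FLAG value to a comma-separated list of flag names."""
    return ",".join(name for bit, name in FLAG_NAMES if value & bit)


def str_to_flags(text: str) -> int:
    """Comma-separated flag names to the combined numeric FLAG value."""
    value = 0
    for name in text.split(","):
        value |= _BY_NAME[name.strip().upper()]  # KeyError on unknown names
    return value


as_text = flags_to_str(99)               # 99 == 0x1 | 0x2 | 0x20 | 0x40
as_number = str_to_flags("PAIRED,UNMAP")  # 0x1 | 0x4
```

Round-tripping a value through both helpers should return the original number for any combination of the twelve bits listed.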
samtools fasta [options] in.bam
Converts a BAM or CRAM into either FASTQ or FASTA format depending on the command invoked.
in.sam|in.bam|in.cram [out.prefix]
Shuffles and groups reads together by their names. A faster alternative to a full query name sort, collate ensures that reads of the same name are grouped together in contiguous groups, but doesn't make any guarantees about the order of read names between groups.
The output from this command should be suitable for any operation that requires all reads from the same template to be grouped together.
in.header.sam in.bam
Replace the header in in.bam with the header in in.header.sam. This command is much faster than replacing the header with a BAM->SAM->BAM conversion.
By default this command outputs the BAM or CRAM file to standard output (stdout), but for CRAM format files it has the option to perform an in-place edit, both reading and writing to the same file. No validity checking is performed on the header, nor that it is suitable to use with the sequence data itself.
samtools cat [-h header.sam] [-o out.bam] <in1.bam> <in2.bam> [ ... ]
Concatenate BAMs. The sequence dictionary of each input BAM must be identical, although this command does not check this. This command uses a similar trick to reheader which enables fast BAM concatenation.
samtools rmdup [-sS] <input.srt.bam> <out.bam>
Remove potential PCR duplicates: if multiple read pairs have identical external coordinates, only retain the pair with highest mapping quality. In the paired-end mode, this command ONLY works with FR orientation and requires that ISIZE be correctly set. It does not work for unpaired reads (e.g. two ends mapped to different chromosomes or orphan reads).
samtools addreplacerg [-r rg line | -R rg ID] [-m mode] [-l level] [-o out.bam]
Adds or replaces read group tags in a file.
samtools calmd [-Eeubr] [-C capQcoef] <aln.bam> <ref.fasta>
Generate the MD tag. If the MD tag is already present, this command will give a warning if the MD tag generated is different from the existing tag. Output SAM by default.
Calmd can also read and write CRAM files although in most cases it is pointless as CRAM recalculates MD and NM tags on the fly. The one exception to this case is where both input and output CRAM files have been / are being created with the no_ref option.
samtools targetcut [-Q minBaseQ] [-i inPenalty] [-0 em0] [-1 em1] [-2 em2] [-f ref] <in.bam>
This command identifies target regions by examining the continuity of read depth, computes haploid consensus sequences of targets and outputs a SAM with each sequence corresponding to a target. When option -f is in use, BAQ will be applied. This command is only designed for cutting fosmid clones from fosmid pool sequencing [Ref. Kitzman et al. (2010)].
samtools phase [-AF] [-k len] [-b prefix] [-q minLOD] [-Q minBaseQ] <in.bam>
Call and phase heterozygous SNPs.
samtools depad [-SsCu1] [-T ref.fa] [-o output] <in.bam>
Converts a BAM aligned against a padded reference to a BAM aligned against the depadded reference. The padded reference may contain verbatim "*" bases in it, but "*" bases are also counted in the reference numbering. This means that a sequence base-call aligned against a reference "*" is considered to be a cigar match ("M" or "X") operator (if the base-call is "A", "C", "G" or "T"). After depadding, the reference "*" bases are deleted and such aligned sequence base-calls become insertions. Similar transformations apply for deletions and padding cigar operations.
Display a brief usage message listing the samtools commands available.
If the name of a command is also given, e.g., samtools help view, the detailed usage message for that particular command is displayed.
Display the version numbers and copyright information for samtools and the important libraries used by samtools.
Display the full samtools version number in a machine-readable format.
Several long-options are shared between multiple samtools subcommands: --input-fmt, --input-fmt-options, --output-fmt, --output-fmt-options, and --reference. The input format is typically auto-detected so specifying the format is usually unnecessary and the option is included for completeness. Note that not all subcommands have all options. Consult the subcommand help for more details.
Format strings recognised are "sam", "bam" and "cram". They may be followed by a comma separated list of options as key or key=value. See below for examples.
The fmt-options arguments accept either a single option or option=value. Note that some options only work on some file formats and only on read or write streams. If value is unspecified for a boolean option, the value is assumed to be 1. The valid options are as follows.
nthreads=INT Specifies the number of threads to use during encoding and/or decoding. For BAM this will be encoding only. In CRAM the threads are dynamically shared between encoder and decoder.
reference=fasta_file Specifies a FASTA reference file for use in CRAM encoding or decoding. It usually is not required for decoding except in the situation of the MD5 not being obtainable via the REF_PATH or REF_CACHE environment variables.
decode_md=0|1 CRAM decode only; defaults to 1 (on). CRAM does not typically store MD and NM tags, preferring to generate them on the fly. This option controls this behaviour.
ignore_md5=0|1 CRAM decode only; defaults to 0 (off). When enabled, md5 checksum errors on the reference sequence and block checksum errors within CRAM are ignored. Use of this option is strongly discouraged.
required_fields=bit-field CRAM decode only; specifies which SAM columns need to be populated. By default all fields are used. Limiting the decode to specific columns can have significant performance gains. The bit-field is a numerical value constructed from the following table.
0x1 SAM_QNAME
0x2 SAM_FLAG
0x4 SAM_RNAME
0x8 SAM_POS
0x10 SAM_MAPQ
0x20 SAM_CIGAR
0x40 SAM_RNEXT
0x80 SAM_PNEXT
0x100 SAM_TLEN
0x200 SAM_SEQ
0x400 SAM_QUAL
0x800 SAM_AUX
0x1000 SAM_RGAUX
multi_seq_per_slice=0|1 CRAM encode only; defaults to 0 (off). By default CRAM generates one container per reference sequence, except in the case of many small references (such as a fragmented assembly).
version=major.minor CRAM encode only. Specifies the CRAM version number. Acceptable values are "2.1" and "3.0".
seqs_per_slice=INT CRAM encode only; defaults to 10000.
slices_per_container=INT CRAM encode only; defaults to 1. The effect of having multiple slices per container is to share the compression header block between multiple slices. This is unlikely to have any significant impact unless the number of sequences per slice is reduced. (Together these two options control the granularity of random access.)
embed_ref=0|1 CRAM encode only; defaults to 0 (off). If 1, this will store portions of the reference sequence in each slice, permitting decode without requiring an external copy of the reference sequence.
no_ref=0|1 CRAM encode only; defaults to 0 (off). If 1, sequences will be stored verbatim with no reference encoding. This can be useful if no reference is available for the file.
samtools view --input-fmt-option decode_md=0 --output-fmt cram,version=3.0 --output-fmt-option embed_ref --output-fmt-option seqs_per_slice=2000 -o foo.cram foo.bam
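The required_fields value is a bitwise OR of the entries in the table listed with that option. As a small worked example, the Python sketch below builds the value for decoding only the FLAG, RNAME and POS columns. The required_fields_value helper is hypothetical, and writing the result in decimal is an assumption about what the option accepts.

```python
# Bit values copied from the required_fields table.
SAM_FIELDS = {
    "SAM_QNAME": 0x1, "SAM_FLAG": 0x2, "SAM_RNAME": 0x4, "SAM_POS": 0x8,
    "SAM_MAPQ": 0x10, "SAM_CIGAR": 0x20, "SAM_RNEXT": 0x40,
    "SAM_PNEXT": 0x80, "SAM_TLEN": 0x100, "SAM_SEQ": 0x200,
    "SAM_QUAL": 0x400, "SAM_AUX": 0x800, "SAM_RGAUX": 0x1000,
}


def required_fields_value(*names: str) -> int:
    """OR together the bits for the named SAM columns."""
    value = 0
    for name in names:
        value |= SAM_FIELDS[name]
    return value


mask = required_fields_value("SAM_FLAG", "SAM_RNAME", "SAM_POS")
# 0x2 | 0x4 | 0x8 == 0xe == 14
option = f"required_fields={mask}"
all_fields = required_fields_value(*SAM_FIELDS)
```

The resulting option string could then be passed via --input-fmt-option when reading a CRAM file.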
The CRAM format requires use of a reference sequence for both reading and writing.
When reading a CRAM the @SQ headers are interrogated to identify the reference sequence MD5sum (M5: tag) and the local reference sequence filename (UR: tag). Note that http:// and ftp:// based URLs in the UR: field are not used, but local fasta filenames (with or without file://) can be used.
To create a CRAM the @SQ headers will also be read to identify the reference sequences, but M5: and UR: tags may not be present. In this case the -T and -t options of samtools view may be used to specify the fasta or fasta.fai filenames respectively (provided the .fasta.fai file is also backed up by a .fasta file).
The search order to obtain a reference is:
1. Use any local file specified by the command line options (e.g. -T).
2. Look for MD5 via REF_CACHE environment variable.
3. Look for MD5 in each element of the REF_PATH environment variable.
4. Look for a local file listed in the UR: header tag.
HTS_PATH A colon-separated list of directories in which to search for HTSlib plugins. If $HTS_PATH starts or ends with a colon or contains a double colon (::), the built-in list of directories is searched at that point in the search.
If no HTS_PATH variable is defined, the built-in list of directories specified when HTSlib was built is used, which typically includes /usr/local/libexec/htslib and similar directories.
REF_PATH A colon separated (semi-colon on Windows) list of locations in which to look for sequences identified by their MD5sums. This can be either a list of directories or URLs. Note that if a URL is included then the colon in http:// and ftp:// and the optional port number will be treated as part of the URL and not a PATH field separator. For URLs, the text %s will be replaced by the MD5sum being read.
If no REF_PATH has been specified it will default to http://www.ebi.ac.uk/ena/cram/md5/%s and if REF_CACHE is also unset, it will be set to $XDG_CACHE_HOME/hts-ref/%2s/%2s/%s. If $XDG_CACHE_HOME is unset, $HOME/.cache (or a local system temporary directory if no home directory is found) will be used similarly.
REF_CACHE This can be defined to a single directory housing a local cache of references. Upon downloading a reference it will be stored in the location pointed to by REF_CACHE. When reading a reference it will be looked for in this directory before searching REF_PATH. To avoid many files being stored in the same directory, a pathname may be constructed using %nums and %s notation, consuming num characters of the MD5sum. For example /local/ref_cache/%2s/%2s/%s will create 2 nested subdirectories with the filenames in the deepest directory being the last 28 characters of the md5sum.
The REF_CACHE directory will be searched before attempting to load via the REF_PATH search list. If no REF_PATH is defined, both REF_PATH and REF_CACHE will be automatically set (see above), but if REF_PATH is defined and REF_CACHE is not, then no local cache is used.
To aid population of the REF_CACHE directory a script misc/seq_cache_populate.pl is provided in the Samtools distribution. This takes a fasta file or a directory of fasta files and generates the MD5sum named files.
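The %nums and %s notation used by REF_CACHE, and the plain %s used in REF_PATH URLs, can be illustrated with a short Python sketch. This mirrors the behaviour as described in the text rather than the HTSlib source; expand_ref_template is a hypothetical helper and the MD5 string is a made-up placeholder, not the checksum of any real sequence.

```python
import re


def expand_ref_template(template: str, md5: str) -> str:
    """Expand %Ns (consume N characters) and %s (consume the rest) from md5."""
    remaining = md5

    def substitute(match: "re.Match[str]") -> str:
        nonlocal remaining
        width = match.group(1)
        if width:
            count = int(width)
            piece, remaining = remaining[:count], remaining[count:]
        else:
            piece, remaining = remaining, ""
        return piece

    return re.sub(r"%(\d*)s", substitute, template)


# Placeholder 32-character checksum, for illustration only.
md5 = "0123456789abcdef0123456789abcdef"
cache_path = expand_ref_template("/local/ref_cache/%2s/%2s/%s", md5)
url = expand_ref_template("http://www.ebi.ac.uk/ena/cram/md5/%s", md5)
```

With the /local/ref_cache/%2s/%2s/%s template, the first two pairs of characters become nested directory names and the remaining 28 characters become the file name.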
o Import SAM to BAM when @SQ lines are present in the header:
samtools view -bS aln.sam > aln.bam
If @SQ lines are absent:
samtools faidx ref.fa
samtools view -bt ref.fa.fai aln.sam > aln.bam
where ref.fa.fai is generated automatically by the faidx command.
o Convert a BAM file to a CRAM file using a local reference sequence.
samtools view -C -T ref.fa aln.bam > aln.cram
o Attach the RG tag while merging sorted alignments:
perl -e 'print "\@RG\tID:ga\tSM:hs\tLB:ga\tPL:Illumina\n\@RG\tID:454\tSM:hs\tLB:454\tPL:454\n"' > rg.txt
samtools merge -rh rg.txt merged.bam ga.bam 454.bam
The value in an RG tag is determined by the name of the file the read comes from. In this example, in merged.bam, reads from ga.bam will be tagged RG:Z:ga, while reads from 454.bam will be tagged RG:Z:454.
o Call SNPs and short INDELs:
samtools mpileup -uf ref.fa aln.bam | bcftools call -mv > var.raw.vcf
bcftools filter -s LowQual -e '%QUAL<20 || DP>100' var.raw.vcf > var.flt.vcf
The bcftools filter command marks low quality sites and sites with the read depth exceeding a limit, which should be adjusted to about twice the average read depth (bigger read depths usually indicate problematic regions which are often enriched for artefacts). Consider adding -C50 to mpileup if mapping quality is overestimated for reads containing excessive mismatches. Applying this option usually helps BWA-short but may not help other mappers.
Individuals are identified from the SM tags in the @RG header lines. Individuals can be pooled in one alignment file; one individual can also be separated into multiple files. The -P option specifies that indel candidates should be collected only from read groups with the @RG-PL tag set to ILLUMINA. Collecting indel candidates from reads sequenced by an indel-prone technology may affect the performance of indel calling.
o Generate the consensus sequence for one diploid individual:
samtools mpileup -uf ref.fa aln.bam | bcftools call -c | vcfutils.pl vcf2fq > cns.fq
o Phase one individual:
samtools calmd -AEur aln.bam ref.fa | samtools phase -b prefix - > phase.out
The calmd command is used to reduce false heterozygotes around INDELs.
o Dump BAQ applied alignment for other SNP callers:
samtools calmd -bAr aln.bam ref.fa > aln.baq.bam
It adds and corrects the NM and MD tags at the same time. The calmd command also comes with the -C option, the same as the one in pileup and mpileup. Apply if it helps.
o Unaligned words used in bam_import.c, bam_endian.h, bam.c and bam_aux.c.
o Samtools paired-end rmdup does not work for unpaired reads (e.g. orphan reads or ends mapped to different chromosomes). If this is a concern, please use Picard's MarkDuplicates, which correctly handles these cases, although it is a little slower.
Heng Li from the Sanger Institute wrote the original C version of samtools. Bob Handsaker from the Broad Institute implemented the BGZF library. James Bonfield from the Sanger Institute developed the CRAM implementation. John Marshall and Petr Danecek contribute to the source code and various people from the 1000 Genomes Project have contributed to the SAM format specification.
bcftools(1), sam(5), tabix(1)
Samtools website: <http://www.htslib.org/>
File format specification of SAM/BAM, CRAM, VCF/BCF: <http://samtools.github.io/hts-specs>
Samtools latest source: <https://github.com/samtools/samtools>
HTSlib latest source: <https://github.com/samtools/htslib>
Bcftools website: <http://samtools.github.io/bcftools>
|samtools-1.3||SAMTOOLS (1)||15 December 2015|