cmscan(1)                     Infernal Manual                     cmscan(1)

cmscan - search sequence(s) against a covariance model database

  cmscan [options] <cmdb> <seqfile>

    cmscan is used to search sequences against collections of
    covariance models. For each sequence in <seqfile>, use that
    query sequence to search the target database of CMs in <cmdb>,
    and output ranked lists of the CMs with the most significant matches to the
    sequence.

    The <seqfile> may contain more than one query sequence. It can be in
    FASTA format, or several other common sequence file formats (genbank,
    embl, among others), or in alignment file formats (stockholm, aligned
    fasta, and others). See the --qformat option for a complete list.

    The <cmdb> needs to be press'ed using cmpress before it can be
    searched with cmscan. This creates four binary files, suffixed
    .i1{fimp}. Additionally, <cmdb> must have been calibrated for
    E-values with cmcalibrate before being press'ed with cmpress.

    The query <seqfile> may be '-' (a dash character), in which case
    the query sequences are read from a <stdin> pipe instead of from a
    file. The <cmdb> cannot be read from a <stdin> stream, because it
    needs to have those four auxiliary binary files generated by cmpress.

    The output format is designed to be human-readable, but is often so
    voluminous that reading it is impractical, and parsing it is a pain. The
    --tblout option saves output in a simple tabular format that is
    concise and easier to parse. The --fmt 2 option modifies the
    format of the tabular output by adding several fields, including markup of
    overlapping hits, as described in section 6 of the Infernal user guide. The
    -o option allows redirecting the main output, including throwing it
    away in /dev/null.

    cmscan reexamines the 5' and 3' termini of target sequences
    using specialized algorithms for detection of truncated hits, in
    which part of the 5' and/or 3' end of the actual full length homologous
    sequence is missing in the target sequence file. These types of hits will be
    most common in sequence files consisting of unassembled sequencing reads. By
    default, any 5' truncated hit is required to include the first residue of
    the target sequence it derives from in <seqfile>, and any 3'
    truncated hit is required to include the final residue of the target
    sequence it derives from. Any hit that is truncated at both the 5' and 3'
    ends must include both the first and the final residue of the target
    sequence it derives from.

    The --anytrunc option relaxes the requirement that hits include
    sequence endpoints, so that truncated hits are allowed to start and stop at
    any positions of target sequences. Importantly though, with --anytrunc,
    hit E-values will be less accurate because model calibration does not
    consider the possibility of truncated hits, so use it with caution.

    The --notrunc option can be used to turn off truncated hit detection.
    --notrunc will reduce the running time of cmscan, most
    significantly for target <seqfile> files that include many
    short sequences. Truncated hit detection is automatically turned off when
    the --max, --nohmm, --qdb, or --nonbanded
    options are used because it relies on the use of an accelerated HMM banded
    alignment strategy that is turned off by any of those options.
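
    As an illustration of how the options described above combine, the
    following minimal Python sketch assembles a cmscan argument list. The
    helper name build_cmscan_command and the file names models.cm and
    hits.tbl are hypothetical; only options documented on this page are
    used, and the sketch assumes <cmdb> has already been calibrated with
    cmcalibrate and press'ed with cmpress.

```python
import shlex


def build_cmscan_command(cmdb, seqfile, tblout=None, fmt2=False,
                         main_output=None, notrunc=False):
    """Assemble a cmscan argument list (hypothetical helper, not part of Infernal)."""
    argv = ["cmscan"]
    if main_output is not None:
        # -o redirects the main human-readable output.
        argv += ["-o", main_output]
    if tblout is not None:
        # --tblout saves the concise tabular summary of hits.
        argv += ["--tblout", tblout]
        if fmt2:
            # --fmt 2 only makes sense together with --tblout.
            argv += ["--fmt", "2"]
    if notrunc:
        # --notrunc turns off truncated hit detection.
        argv.append("--notrunc")
    # Positional arguments follow the synopsis order: <cmdb> then <seqfile>.
    argv += [cmdb, seqfile]
    return argv


# Read queries from stdin ('-') and discard the main output.
cmd = build_cmscan_command("models.cm", "-", tblout="hits.tbl",
                           fmt2=True, main_output="/dev/null")
print(shlex.join(cmd))
```

    The resulting list can be handed to subprocess.run() with the query
    sequences supplied on standard input.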
  -h
      Help; print a brief reminder of command line usage and all available
      options.
    
  -g
      Turn on the glocal alignment algorithm, global with respect to the
      query model and local with respect to the target database. By default, the
      local alignment algorithm is used, which is local with respect to both the
      target sequence and the model. In local mode, the alignment is allowed to
      span two or more subsequences if necessary (e.g. if the structures of the
      query model and target sequence are only partially shared), allowing
      certain large insertions and deletions in the structure to be penalized
      differently than normal indels. Local mode performs better on empirical
      benchmarks and is significantly more sensitive for remote homology
      detection. Empirically, glocal searches return many fewer hits than local
      searches, so glocal may be desired for some applications.
    
  -Z <x>
      Calculate E-values as if the search space size was <x>
      megabases (Mb). Without this option, the search space size
      changes for each query sequence: it is defined as the length of the
      current query sequence times 2 (because both strands of the sequence will
      be searched) times the number of CMs in <cmdb>.
    
  --devhelp
      Print help, as with -h, but also include expert options that are
      not displayed with -h. These expert options are not expected to be
      relevant for the vast majority of users and so are not described in the
      manual page. The only resources for understanding what they actually do
      are the brief one-line descriptions output when --devhelp is
      enabled, and the source code.
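
    The default search space size described under -Z above can be worked
    through with a small Python sketch. The function name and the example
    numbers (a 1500 nt query against 4000 models) are hypothetical, and the
    sketch assumes one megabase means 1,000,000 nucleotides.

```python
def default_search_space_mb(query_length_nt, num_models, strands=2):
    """Search space Z in megabases: query length x strands x number of CMs.

    strands=2 is the default (both strands searched); strands=1 mimics the
    halving described for --toponly and --bottomonly.
    """
    return query_length_nt * strands * num_models / 1_000_000


z = default_search_space_mb(1500, 4000)
print(z)
```

    Passing the same value with -Z would keep E-values comparable across
    query sequences of different lengths.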
    
   
  -o <f>
      Direct the main human-readable output to a file <f> instead
      of the default stdout.
    
  --tblout <f>
      Save a simple tabular (space-delimited) file summarizing the hits found,
      with one data line per hit. The format of this file is described in
      section 6 of the Infernal user guide.
    
  --fmt <n>
      Write the tabular output file specified with --tblout <f>
      in format <n>. Possible values for <n> are 1 or 2. By default
      <n> is 1 when --tblout is used without --fmt. With --fmt 2,
      nine additional fields are added to the tabular output file, most of
      which pertain to the annotation of overlapping hits. See section 6 of the
      Infernal user guide for a description of both formats.
    
  --acc
      Use accessions instead of names in the main output, where available for
      profiles and/or sequences.
    
  --noali
      Omit the alignment section from the main output. This can greatly reduce
      the output volume.
    
  --notextw
      Unlimit the length of each line in the main output. The default is a limit
      of 120 characters per line, which helps in displaying the output cleanly
      on terminals and in editors, but can truncate target profile description
      lines.
    
  --textw <n>
      Set the main output's line length limit to <n> characters per
      line. The default is 120.
    
  --verbose
      Include extra search pipeline statistics in the main output, including
      filter survival statistics for truncated hit detection and number of
      envelopes discarded due to matrix size overflows.
    
   Reporting thresholds control which hits are reported in output
    files (the main output and --tblout). Hits are ranked by statistical
    significance (E-value). By default, all hits with an E-value <= 10 are
    reported. The following options allow you to change the default E-value
    reporting thresholds, or to use bit score thresholds instead. 
  -E <x>
      In the per-target output, report target sequences with an E-value of <=
      <x>. The default is 10.0, meaning that on average, about 10
      false positives will be reported per query, so you can see the top of the
      noise and decide for yourself if it's really noise.
    
  -T <x>
      Instead of thresholding per-CM output on E-value, report target sequences
      with a bit score of >= <x>.
    
   Inclusion thresholds are stricter than reporting thresholds.
    Inclusion thresholds control which hits are considered to be reliable enough
    to be included in a possible subsequent search round, or marked as
    significant ("!") as opposed to questionable ("?") in
    hit output. 
  --incE <x>
      Use an E-value of <= <x> as the hit inclusion threshold.
      The default is 0.01, meaning that on average, about 1 false positive would
      be expected in every 100 searches with different query sequences.
    
  --incT <x>
      Instead of using E-values for setting the inclusion threshold, use
      a bit score of >= <x> as the hit inclusion threshold. By
      default this option is unset.
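
    The relationship between the reporting and inclusion thresholds above
    can be sketched in Python. The function classify_hit is hypothetical and
    models only the E-value thresholds (-E and --incE) with their documented
    defaults; bit score thresholds (-T, --incT) are not modelled, and the
    sketch assumes the inclusion threshold is no looser than the reporting
    threshold.

```python
def classify_hit(evalue, report_e=10.0, inc_e=0.01):
    """Return '!' for an included hit, '?' for a reported but questionable
    hit, or None if the hit would not be reported at all."""
    if evalue > report_e:
        # Fails the reporting threshold (-E): not shown in any output.
        return None
    # Passes reporting; check the stricter inclusion threshold (--incE).
    return "!" if evalue <= inc_e else "?"


for e in (1e-5, 0.5, 50.0):
    print(e, classify_hit(e))
```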
    
   Curated CM databases may define specific bit score thresholds for
    each CM, superseding any thresholding based on statistical significance
    alone. To use these options, the profile must contain the appropriate
    (GA, TC, and/or NC) optional score threshold annotation; this is picked up
    by cmbuild from Stockholm format alignment files. Each thresholding
    option has a score of <x> bits, and acts as if -T
    <x> --incT <x> has been applied specifically
    using each model's curated thresholds. 
  --cut_ga
      Use the GA (gathering) bit scores in the model to set hit reporting and
      inclusion thresholds. GA thresholds are generally considered to be the
      reliable curated thresholds defining family membership; for example, in
      Rfam, these thresholds define what gets included in Rfam Full alignments
      based on searches with Rfam Seed models.
    
  --cut_nc
      Use the NC (noise cutoff) bit score thresholds in the model to set hit
      reporting and inclusion thresholds. NC thresholds are generally considered
      to be the score of the highest-scoring known false positive.
    
  --cut_tc
      Use the TC (trusted cutoff) bit score thresholds in the model to set hit
      reporting and inclusion thresholds. TC thresholds are generally considered
      to be the score of the lowest-scoring known true positive that is above
      all known false positives.
    
    Infernal searches are accelerated in a six-stage filter pipeline.
    The first five stages use a profile HMM to define envelopes that are passed
    to the stage six CM CYK filter. Any envelopes that survive all filters are
    assigned final scores using the CM Inside algorithm.

    The profile HMM filter is built by the cmbuild program and is stored in
    <cmdb>. Each successive filter is slower than the previous one, but better
    than it at discriminating between subsequences that may contain
    high-scoring CM hits and those that do not.

    The first three HMM filter stages are the same as those used in HMMER3.
    Stage 1 (F1) is the local HMM SSV filter modified for long sequences.
    Stage 2 (F2) is the local HMM Viterbi filter. Stage 3 (F3) is the local
    HMM Forward filter. Each of the first three stages uses the profile HMM in
    local mode, which allows a target subsequence to align to any region of
    the HMM. Stage 4 (F4) is a glocal HMM filter, which requires a target
    subsequence to align to the full-length profile HMM. Stage 5 (F5) is the
    glocal HMM envelope definition filter, which uses HMMER3's domain
    identification heuristics to define envelope boundaries. After each stage
    from 2 to 5 a bias filter step (F2b, F3b, F4b, and F5b) is used to remove
    sequences that appear to have passed the filter due to biased composition
    alone. Any envelopes that survive stages F1 through F5b are then passed to
    the local CM CYK filter. The CYK filter uses constraints (bands) derived
    from an HMM alignment of the envelope to reduce the number of required
    calculations and save time. Any envelopes that pass CYK are scored with
    the local CM Inside algorithm, again using HMM bands for acceleration.

    The default filter thresholds that define the minimum score required for
    a subsequence to survive each stage are defined based on the size of the
    search space (Z), which is defined as the length of the current query
    sequence times 2 (because both strands will be searched) times the number
    of profiles in <cmdb>. However, if either the -Z <x> or --FZ <x> options
    are used then the search space will be considered to be <x> for purposes
    of defining the filter thresholds.

    For larger databases, the filters are more strict, leading to more
    acceleration but potentially a greater loss of sensitivity. The rationale
    is that for larger databases, hits must have higher scores to achieve
    statistical significance, so stricter filtering that removes lower scoring
    insignificant hits is acceptable.

    The P-value thresholds for all possible search space sizes and all filter
    stages are listed next. (A P-value threshold of 0.01 means that roughly 1%
    of the highest scoring nonhomologous subsequences are expected to pass the
    filter.) Z is the search space size as defined above.

      If Z is less than 2 Mb: F1 is 0.35; F2 and F2b are off; F3, F3b, F4,
      F4b and F5 are 0.02; F6 is 0.0001.

      If Z is between 2 Mb and 20 Mb: F1 is 0.35; F2 and F2b are off; F3,
      F3b, F4, F4b and F5 are 0.005; F6 is 0.0001.

      If Z is between 20 Mb and 200 Mb: F1 is 0.35; F2 and F2b are 0.15; F3,
      F3b, F4, F4b and F5 are 0.003; F6 is 0.0001.

      If Z is between 200 Mb and 2 Gb: F1 is 0.15; F2 and F2b are 0.15; F3,
      F3b, F4, F4b, F5, and F5b are 0.0008; and F6 is 0.0001.

      If Z is between 2 Gb and 20 Gb: F1 is 0.15; F2 and F2b are 0.15; F3,
      F3b, F4, F4b, F5, and F5b are 0.0002; and F6 is 0.0001.

      If Z is more than 20 Gb: F1 is 0.06; F2 and F2b are 0.02; F3, F3b, F4,
      F4b, F5, and F5b are 0.0002; and F6 is 0.0001.

    These thresholds were chosen based on performance on an internal
    benchmark testing many different possible settings.

    There are six options for controlling the general filtering level. These
    options are, in order from least strict (slowest but most sensitive) to
    most strict (fastest but least sensitive): --max, --nohmm, --mid,
    --default (this is the default setting), --rfam, and --hmmonly. With
    --default the filter thresholds will be database-size dependent. See the
    explanation of each of these individual options below for more
    information.

    Additionally, an expert user can precisely control each filter stage
    score threshold with the --F1, --F1b, --F2, --F2b, --F3, --F3b, --F4,
    --F4b, --F5, --F5b, and --F6 options, and can turn each stage on or off
    with the --noF1, --doF1b, --noF2, --noF2b, --noF3, --noF3b, --noF4,
    --noF4b, --noF5, and --noF6 options. These options are only displayed if
    the --devhelp option is used, to keep the number of options displayed
    with -h reasonable, and because they are only expected to be useful to a
    small minority of users.

    As a special case, for any models in <cmdb> which have zero basepairs,
    profile HMM searches are run instead of CM searches. HMM algorithms are
    more efficient than CM algorithms, and the benefit of CM algorithms is
    lost for models with no secondary structure (zero basepairs). These
    profile HMM searches will run significantly faster than the CM searches.
    You can force HMM-only searches with the --hmmonly option. For more
    information on HMM-only searches see the user guide.
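
    The database-size dependent default thresholds listed above can be
    expressed as a lookup table. The following Python sketch is illustrative
    only and is not Infernal's implementation: the function name is
    hypothetical, Z is taken in megabases with 2 Gb = 2000 Mb, a value
    exactly on a boundary is assumed to fall in the larger tier, and F5b is
    included only for tiers where the text lists it.

```python
import math

OFF = None  # marker for a filter stage that is switched off

# (exclusive upper bound on Z in Mb, F1, F2 and F2b, F3..F5, F5b listed?)
_TIERS = [
    (2.0,      0.35, OFF,  0.02,   False),
    (20.0,     0.35, OFF,  0.005,  False),
    (200.0,    0.35, 0.15, 0.003,  False),
    (2000.0,   0.15, 0.15, 0.0008, True),
    (20000.0,  0.15, 0.15, 0.0002, True),
    (math.inf, 0.06, 0.02, 0.0002, True),
]
_F6 = 0.0001  # the CYK filter threshold is the same in every tier


def default_filter_thresholds(z_mb):
    """Return a dict of stage name -> P-value threshold for search space z_mb."""
    if z_mb < 0:
        raise ValueError("search space size must be non-negative")
    for upper, f1, f2, f3_to_f5, f5b_listed in _TIERS:
        if z_mb < upper:
            thresholds = {"F1": f1, "F2": f2, "F2b": f2}
            for stage in ("F3", "F3b", "F4", "F4b", "F5"):
                thresholds[stage] = f3_to_f5
            if f5b_listed:
                thresholds["F5b"] = f3_to_f5
            thresholds["F6"] = _F6
            return thresholds
    raise ValueError("search space size is not a number")


print(default_filter_thresholds(12.0))
```

    A lookup with a Z above 20000 Mb returns the strictest tier, which
    matches the description of --FZ with <x> greater than 20000 having the
    same effect as --rfam.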
  --max
      Turn off all filters, and run non-banded Inside on every full-length
      target sequence. This increases sensitivity somewhat, at an extremely
      large cost in speed.
    
  --nohmm
      Turn off all HMM filter stages (F1 through F5b). The CYK filter, using
      QDBs, will be run on every full-length target sequence and will enforce a
      P-value threshold of 0.0001. Each subsequence that survives CYK will be
      passed to Inside, which will also use QDBs (but a looser set). This
      increases sensitivity somewhat, at a very large cost in speed.
    
  --mid
      Turn off the HMM SSV and Viterbi filter stages (F1 through F2b). Set
      remaining HMM filter thresholds (F3 through F5b) to 0.02 by default, but
      changeable to <x> with --Fmid <x>. This may increase sensitivity, at a
      significant cost in speed.
    
  --default
      Use the default filtering strategy. This option is on by default. The
      filter thresholds are determined based on the database size.
    
  --rfam
      Use a strict filtering strategy devised for large databases (more than 20
      Gb). This will accelerate the search at a potential cost to sensitivity.
    
  --hmmonly
      Only use the filter profile HMM for searches; do not use the CM. Only
      filter stages F1 through F3 will be executed, using strict P-value
      thresholds (0.02 for F1, 0.001 for F2 and 0.00001 for F3). Additionally a
      bias composition filter is used after the F1 stage (with P=0.02 survival
      threshold). Any hit that survives all stages and has an HMM E-value or bit
      score that satisfies the reporting threshold will be output. The user can change
      the HMM-only filter thresholds and options with --hmmF1,
      --hmmF2, --hmmF3, --hmmnobias, --hmmnonull2,
      and --hmmmax. By default, searches for any model with zero
      basepairs will be run in HMM-only mode. This can be turned off, forcing CM
      searches for these models with the --nohmmonly option.
    
  --FZ <x>
      Set filter thresholds as the defaults used if the database were
      <x> megabases (Mb). If used with <x> greater
      than 20000 (20 Gb) this option has the same effect as --rfam.
    
  --Fmid <x>
      With the --mid option, set the HMM filter thresholds (F3 through
      F5b) to <x>. By default, <x> is 0.02.
    
   
  --notrunc
      Turn off truncated hit detection.
    
  --anytrunc
      Allow truncated hits to begin and end at any position in a target
      sequence. By default, 5' truncated hits must include the first residue of
      their target sequence and 3' truncated hits must include the final residue
      of their target sequence. With this option you may observe fewer full
      length hits that extend to the beginning and end of the query CM. As of
      version 1.1.5, truncated hits that end at sequence termini with a lower
      score penalty than internally truncated hits are also considered (these
      were not considered in 1.1.x versions prior to 1.1.5). To reproduce the
      behavior of this option from v1.1.4, use the --inttrunc option
      instead.
    
  --nonull3
      Turn off the null3 CM score corrections for biased composition. This
      correction is not used during the HMM filter stages.
    
  --mxsize <x>
      Set the maximum allowable CM DP matrix size to <x> megabytes.
      By default this size is 128 Mb. This should be large enough for the vast
      majority of searches, especially with smaller models. If cmscan
      encounters an envelope in the CYK or Inside stage that requires a larger
      matrix, the envelope will be discounted from consideration. This behavior
      is like an additional filter that prevents expensive (slow) CM DP
      calculations, but at a potential cost to sensitivity. Note that if
      cmscan is being run in <n> multiple threads on a
      multicore machine then each thread may have an allocated matrix of up to
      size <x> Mb at any given time.
    
  --smxsize <x>
      Set the maximum allowable CM search DP matrix size to <x>
      megabytes. By default this size is 128 Mb. This option is only relevant if
      the CM will not use HMM banded matrices, i.e. if the --max,
      --nohmm, --qdb, --fqdb, --nonbanded, or
      --fnonbanded options are also used. Note that if cmscan is
      being run in <n> multiple threads on a multicore machine then
      each thread may have an allocated matrix of up to size <x> Mb
      at any given time.
    
  --cyk
      Use the CYK algorithm, not Inside, to determine the final score of all
      hits.
    
  --acyk
      Use the CYK algorithm to align hits. By default, the Durbin/Holmes optimal
      accuracy algorithm is used, which finds the alignment that maximizes the
      expected accuracy of all aligned residues.
    
  --wcx <x>
      For each CM, set the W parameter, the expected maximum length of a hit, to
      <x> times the consensus length of the model. By default, the
      W parameter is read from the CM file and was calculated based on the
      transition probabilities of the model by cmbuild. You can find out
      what the default W is for a model using cmstat. This option should
      be used with caution as it impacts the filtering pipeline at several
      different stages in nonobvious ways. It is only recommended for expert
      users searching for hits that are much longer than any of the homologs
      used to build the model in cmbuild, e.g. ones with large introns or
      other large insertions. It cannot be used in combination with the
      --nohmm, --fqdb or --qdb options because in those
      cases W is limited by query-dependent bands.
    
  --toponly
      Only search the top (Watson) strand of target sequences in
      <seqfile>. By default, both strands are searched. This will
      halve the search space size (Z).
    
  --bottomonly
      Only search the bottom (Crick) strand of target sequences in
      <seqfile>. By default, both strands are searched. This will
      halve the search space size (Z).
    
  --qformat <s>
      Assert that the query sequence database file is in format
      <s>. Accepted formats include fasta, embl,
      genbank, ddbj, stockholm, pfam, a2m,
      afa, clustal, and phylip. The default is to autodetect
      the format of the file.
    
  --glist <f>
      Configure a subset of models from <cmdb> in glocal
      alignment mode, instead of local mode, namely the models listed in file
      <f>. Configure all other models (those not listed in
      <f>) in local mode. This option is incompatible with
      -g. File <f> must list valid names of models from
      <cmdb>, each separated by any whitespace character (e.g. a
      newline character).
    
  --clanin <f>
      Read clan information on the models in <cmdb> from file
      <f>. Not all models in <cmdb> need to be a
      member of a clan. This option must be used in combination with
      --fmt 2 and --tblout because clan annotation is only
      output in format 2 of the tabular output file. See section 9 of the
      Infernal user guide for specifications on the format of the clan input
      file <f>.
    
  --oclan
      Only mark overlaps between models in the same clan. This option must be
      used in combination with --fmt 2, --tblout and
      --clanin because clan annotation is only output in format 2 of the
      tabular output file, and clan information can only be input using the
      --clanin option.
    
  --oskip
      Omit any hit h from the tabular output file that satisfies the following:
      another hit h2 overlaps with h and the E-value of h2 is lower than that of
      h, and h2 is itself not omitted. Hit h will not appear in the tabular
      output file, although it will still exist in the standard output. This
      option must be used in combination with --fmt 2 and
      --tblout because overlap annotation is only output in format 2 of
      the tabular output file. When used in combination with --oclan only
      hits h that satisfy the following are omitted: another hit h2 overlaps
      with h, the E-value of h2 is lower than that of h, and both h and h2 are
      hits to models that are in the same clan.
    
  --cpu <n>
      Set the number of parallel worker threads to <n>. On
      multicore machines, the default is 4. You can also control this number by
      setting an environment variable, INFERNAL_NCPU. There is also a
      master thread, so the actual number of threads that Infernal spawns is
      <n>+1. This option is not available if Infernal was compiled
      with POSIX threads support turned off.
    
  --stall
      For debugging the MPI master/worker version: pause after start, to enable
      the developer to attach debuggers to the running master and worker(s)
      processes. Send SIGCONT signal to release the pause. (Under gdb: (gdb)
      signal SIGCONT) (Only available if optional MPI support was enabled at
      compile-time.)
    
  --mpi
      Run in MPI master/worker mode, using mpirun. (Only available if
      optional MPI support was enabled at compile-time.)
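
    The omission rule described under --oskip above (and its --oclan
    variant) can be sketched in Python. This is an illustration of the rule
    as worded on this page, not Infernal's implementation. The Hit class,
    the function names, and the example hits are hypothetical, and the
    definition of overlap used here (same sequence, same strand,
    intersecting coordinate ranges) is an assumption; see the user guide for
    the definition Infernal actually uses.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Hit:
    model: str
    seq: str
    start: int            # assumed 1-based with start <= end
    end: int
    strand: str
    evalue: float
    clan: Optional[str] = None


def overlaps(a: Hit, b: Hit) -> bool:
    # Assumed definition: same sequence and strand, ranges intersect.
    return (a.seq == b.seq and a.strand == b.strand
            and a.start <= b.end and b.start <= a.end)


def apply_oskip(hits: List[Hit], oclan: bool = False) -> List[Hit]:
    """Return the hits that would remain in the tabular output."""
    kept: List[Hit] = []
    # Visit hits from best to worst E-value so that every hit with a lower
    # E-value has already been kept or omitted when h is examined.
    for h in sorted(hits, key=lambda x: x.evalue):
        omitted = False
        for h2 in kept:  # only hits that were not themselves omitted
            if h2.evalue < h.evalue and overlaps(h, h2):
                same_clan = h.clan is not None and h.clan == h2.clan
                if not oclan or same_clan:
                    omitted = True
                    break
        if not omitted:
            kept.append(h)
    return kept


hits = [
    Hit("B", "seq1", 50, 150, "+", 1e-5, clan="CL2"),
    Hit("A", "seq1", 1, 100, "+", 1e-10, clan="CL1"),
    Hit("C", "seq1", 120, 200, "+", 1e-3),
]
print([h.model for h in apply_oskip(hits)])
print([h.model for h in apply_oskip(hits, oclan=True)])
```

    In this example hit C survives without --oclan even though it overlaps
    B, because B is itself omitted in favor of A. With oclan=True all three
    hits survive, because no two overlapping hits share a clan.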
    
    See infernal(1) for a master man page with a list of all
    the individual man pages for programs in the Infernal package.

    For complete documentation, see the user guide that came with your
    Infernal distribution (Userguide.pdf); or see the Infernal web page
    (http://eddylab.org/infernal/).

    Copyright (C) 2023 Howard Hughes Medical Institute.
    Freely distributed under the BSD open source license.

    For additional information on copyright and licensing, see the
    file called COPYRIGHT in your Infernal source distribution, or see the
    Infernal web page (http://eddylab.org/infernal/).