fasta36/ssearch36/[t]fast[x,y]36/lalign36 1(local)
fasta36 - scan a protein or DNA sequence library for similar
sequences
fastx36 - compare a DNA sequence to a protein sequence
database, comparing the translated DNA sequence in forward and reverse
frames.
tfastx36 - compare a protein sequence to a DNA sequence
database, calculating similarities with frameshifts to the forward and
reverse orientations.
fasty36 - compare a DNA sequence to a protein sequence
database, comparing the translated DNA sequence in forward and reverse
frames.
tfasty36 - compare a protein sequence to a DNA sequence
database, calculating similarities with frameshifts to the forward and
reverse orientations.
fasts36 - compare unordered peptides to a protein sequence
database
fastm36 - compare ordered peptides (or short DNA sequences) to a
protein (DNA) sequence database
tfasts36 - compare unordered peptides to a translated DNA sequence
database
fastf36 - compare mixed peptides to a protein sequence
database
tfastf36 - compare mixed peptides to a translated DNA sequence
database
ssearch36 - compare a protein or DNA sequence to a sequence
database using the Smith-Waterman algorithm.
ggsearch36 - compare a protein or DNA sequence to a sequence
database using a global alignment (Needleman-Wunsch)
glsearch36 - compare a protein or DNA sequence to a sequence
database with alignments that are global in the query and local in the
database sequence (global-local).
lalign36 - produce multiple non-overlapping alignments for protein
and DNA sequences using the Huang and Miller sim implementation of the
Waterman-Eggert algorithm.
prss36, prfx36 - discontinued; all the FASTA programs will
estimate statistical significance using 500 shuffled sequence scores if two
sequences are compared.
Release 3.6 of the FASTA package provides a modular set of
sequence comparison programs that can run on conventional single processor
computers or in parallel on multiprocessor computers. More than a dozen
programs - fasta36, fastx36/tfastx36, fasty36/tfasty36, fasts36/tfasts36,
fastm36, fastf36/tfastf36, ssearch36, ggsearch36, and glsearch36 - are
currently available.
All the comparison programs share a set of basic command line
options; additional options are available for individual comparison
functions.
Threaded versions of the FASTA programs (built by default under
Unix/Linux/MacOS X) run in parallel on modern Linux and Unix multi-core or
multi-processor computers. Accelerated versions of the Smith-Waterman
algorithm are available for the Intel SSE2 and Altivec PowerPC
architectures, which can speed up Smith-Waterman calculations 10- to
20-fold.
In addition to the serial and threaded versions of the FASTA
programs, MPI parallel versions are available as fasta36_mpi, ssearch36_mpi,
fastx36_mpi, etc. The MPI parallel versions use the same command line
options as the serial and threaded versions.
By default, the FASTA programs are no longer interactive; they are
run from the command line by specifying the program, query.file, and
library.file. Program options must precede the query.file and
library.file arguments:
fasta36 -option1 -option2 -option3 query.file library.file > fasta.output
The "classic" interactive mode, which prompts for a
query.file and library.file, is available with the -I option. Typing a
program name without any arguments (ssearch36) provides a short
help message; program_name -help provides a complete set of program
options.
Program options MUST precede the query.file and
library.file arguments.
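As a minimal sketch of the argument ordering described above, a wrapper script might assemble the command line so that options always come before the query and library names. The helper names build_fasta_command and run_search and the file names are hypothetical; nothing here is part of the FASTA package, and the search is only defined, not run.

```python
import subprocess


def build_fasta_command(program, options, query_file, library_file):
    """Return an argv list with all options placed before query and library."""
    # Options must precede the two positional arguments.
    return [program, *options, query_file, library_file]


def run_search(argv, output_file):
    """Run the search, writing stdout to output_file (like '> fasta.output')."""
    with open(output_file, "w") as out:
        return subprocess.run(argv, stdout=out, check=True)


if __name__ == "__main__":
    # Hypothetical file names; -E and -m are options described in this page.
    argv = build_fasta_command(
        "fasta36", ["-E", "0.001", "-m", "8"], "query.file", "library.file"
    )
    print(" ".join(argv))
```

Calling run_search(argv, "fasta.output") would then correspond to the shell redirection shown above, assuming fasta36 is installed and on the PATH.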
The default scoring matrix and gap penalties used by each of the
programs have been selected for high sensitivity searches with the various
algorithms. The default program behavior can be modified by providing
command line options before the query.file and library.file
arguments. Command line options can also be used in interactive mode.
Command line arguments come in several classes.
(1) Commands that specify the comparison type. FASTA, FASTS,
FASTM, SSEARCH, GGSEARCH, and GLSEARCH can compare either protein or DNA
sequences, and attempt to recognize the comparison type by looking at the
residue composition. -n, -p specify DNA (nucleotide) or protein
comparison, respectively. -U specifies RNA comparison.
(2) Commands that limit the set of sequences compared: -1,
-3, -M.
(3) Commands that modify the scoring parameters: -f gap-open
penalty, -g gap-extend penalty, -j inter-codon frameshift, within-codon
frameshift, -s scoring-matrix, -r match/mismatch score, -x
X:X score.
(4) Commands that modify the algorithm (mostly FASTA and
[T]FASTX/Y): -c, -w, -y, -o. The -S option can be used to
ignore lower-case (low complexity) residues during the initial score
calculation.
(5) Commands that modify the output: -A, -b number, -C
width, -d number, -L, -m 0-11,B, -w line-width, -W
context-width, -o offset1,offset2
(6) Commands that affect statistical estimates: -Z, -k.
- -1
- Sort by "init1" score (obsolete)
- -3
- ([t]fast[x,y] only) use only forward frame translations
- -a
- Displays the full length (including unaligned regions) of both sequences
with fasta36, ssearch36, glsearch36, and fasts36.
- -A
- (fasta36 only) For DNA:DNA, force Smith-Waterman alignment for output.
Smith-Waterman is the default for FASTA protein alignment and
[t]fast[x,y], but not for DNA comparisons with FASTA. For protein:protein,
use band-alignment algorithm.
- -b #
- number of best scores/descriptions to show (must be < expectation
cutoff if -E is given). By default, this option is no longer used; all
scores better than the expectation (E()) cutoff are listed. To guarantee
the display of # descriptions/scores, use -b =#, i.e. -b =100
ensures that 100 descriptions/scores will be displayed. To guarantee at
least 1 description, but possibly many more (limited by -E e_cut), use
-b >1.
- -c "E-opt E-join"
- threshold for gap joining (E-join) and band optimization (E-opt) in FASTA
and [T]FASTX/Y. FASTA36 now uses BLAST-like statistical thresholds for
joining and band optimization. The default statistical thresholds for
protein and translated comparisons are E-opt=0.2, E-join=0.5; for DNA,
E-join=0.1 and E-opt=0.02. The actual number of joins and optimizations
is reported after the E-join and E-opt scoring parameters. Statistical
thresholds improve search speed 2 - 3X, and provide much more accurate
statistical estimates for matrices other than BLOSUM50. The
"classic" joining/optimization thresholds that were the default
in fasta35 and earlier programs are available using -c O (upper case O),
possibly followed by a value > 1.0 to set the optcut optimization
threshold.
- -C #
- length of name abbreviation in alignments, default = 6. Must be less than
20.
- -d #
- number of best alignments to show (must be < expectation (-E) cutoff
and <= the -b description limit).
- -D
- turn on debugging mode. Enables checks on sequence alphabet that cause
problems with tfastx36, tfasty36 (only available after compile time
option). Also preserves temp files with -e expand_script.sh option.
- -e expand_script.sh
- Run a script to expand the set of sequences displayed/aligned based on the
results of the initial search. When the -e expand_script.sh option is
used, after the initial scan and statistics calculation, but before the
"Best scores" are shown, expand_script.sh is run with a single
argument, the name of a file that contains the accession information (the
text on the fasta description line between > and the first space) and
the E()-value for the sequence. expand_script.sh then uses this
information to send a library of additional sequences to stdout. These
additional sequences are included in the list of high-scoring sequences
(if their scores are significant) and aligned. The additional sequences do
not change the statistics or database size.
- -E e_cut e_cut_r
- expectation value upper limit for score and alignment display. Defaults
are 10.0 for FASTA36 and SSEARCH36 protein searches, 5.0 for translated
DNA/protein comparisons, and 2.0 for DNA/DNA searches. FASTA version 36
now reports additional alignments between the query and the library
sequence; the second value sets the threshold for the subsequent
alignments. If not given, the threshold is e_cut/10.0. If given and value
> 1.0, e_cut_r = e_cut / value; for value < 1.0, e_cut_r = value. If
e_cut_r < 0, then the additional alignment option is disabled.
- -f #
- penalty for opening a gap.
- -F #
- expectation value lower limit for score and alignment display. -F 1e-6
prevents library sequences with E()-values lower than 1e-6 from being
displayed. This allows the user to focus on more distant
relationships.
- -g #
- penalty for additional residues in a gap
- -h
- Show short help message.
- -help
- Show long help message, with all options.
- -H
- show histogram (with fasta-36.3.4, the histogram is not shown by
default).
- -i
- (fasta DNA, [t]fast[x,y]) compare against only the reverse complement of
the library sequence.
- -I
- interactive mode; prompt for query filename, library.
- -j # #
- ([t]fast[x,y] only) penalty for a frameshift between two codons, ([t]fasty
only) penalty for a frameshift within a codon.
- -J
- (lalign36 only) show identity alignment.
- -k
- specify number of shuffles for statistical parameter estimation
(default=500).
- -l str
- specify FASTLIBS file
- -L
- report long sequence description in alignments (up to 200
characters).
- -m 0,1,2,3,4,5,6,8,9,10,11,B,BB,"F# out.file"
- alignment display options. -m 0, 1, 2, 3 display different types of
alignments. -m 4 provides an alignment "map" on the query. -m 5
combines the alignment map and a -m 0 alignment. -m 6 provides an HTML
output.
- -m 8 seeks to mimic BLAST -m 8 tabular output. Only query and library
sequence names, and identity, mismatch, starts/stops, E()-values, and bit
scores are displayed. -m 8C mimics BLAST tabular format with comment
lines. -m 8 formats do not show alignments.
- -m 9 does not change the alignment output, but provides alignment
coordinate and percent identity information with the best scores report.
-m 9c adds encoded alignment information to the -m 9; -m 9C adds encoded
alignment information as a CIGAR formatted string. To accommodate
frameshifts, the CIGAR format has been supplemented with F (forward) and
R (reverse). -m 9i provides only percent identity and alignment length
information with the best scores. With current versions of the FASTA
programs, independent -m options can be combined; e.g. -m 1 -m 9c -m 6.
- -m 11 provides lav format output from lalign36. It does not currently
affect other alignment algorithms. The lav2ps and lav2svg programs can be
used to convert lav format output to postscript/SVG alignment
"dot-plots".
- -m B provides BLAST-like alignments. Alignments are labeled as
"Query" and "Sbjct", with coordinates on the same line as the sequences,
and BLAST-like symbols for matches and mismatches. -m BB extends BLAST
similarity to all the output, providing an output that closely mimics
BLAST output.
- -m "F# out.file" allows one search to write different alignment formats
to different files. The 'F' indicates separate file output; the '#' is
the output format (1-6,8,9,10,11,B,BB). Multiple compatible formats can be
combined, separated by commas (',').
- -M #-#
- molecular weight (residue) cutoffs. -M "101-200" examines only
library sequences that are 101-200 residues long.
- -n
- force query to nucleotide sequence
- -N #
- break long library sequences into blocks of # residues. Useful for
bacterial genomes, which have only one sequence entry. -N 2000 works well
for bacterial genomes. (This option was required when FASTA only
provided one alignment between the query and library sequence. It is not
as useful, now that multiple alignments are available.)
- -o "#,#"
- offsets query, library sequence for numbering alignments
- -O file
- send output to file.
- -p
- force query to protein alphabet.
- -P pssm_file
- (ssearch36, ggsearch36, glsearch36 only). Provide blastpgp checkpoint file
as the PSSM for searching. Two PSSM file formats are available, which must
be provided with the filename. 'pssm_file 0' uses a binary format that is
machine specific; 'pssm_file 1' uses the "blastpgp -u 1 -C
pssm_file" ASN.1 binary format (preferred).
- -q/-Q
- quiet option; do not prompt for input (on by default)
- -r "+n/-m"
- (DNA only) values for match/mismatch for DNA comparisons. +n is
used for the maximum positive value and -m is used for the maximum
negative value. Values between max and min are rescaled, but residue
pairs having the value -1 continue to be -1.
- -R file
- save all scores to statistics file (previously -r file)
- -s name
- specify substitution matrix. BLOSUM50 is used by default; PAM250, PAM120,
and BLOSUM62 can be specified by setting -s P120, P250, or BL62.
Additional scoring matrices include: BLOSUM80 (BL80), and MDM10, MDM20,
MDM40 (Jones, Taylor, and Thornton, 1992 CABIOS 8:275-282; specified as -s
MD10, -s MD20, -s MD40), OPTIMA5 (-s OPT5, Kann and Goldstein, (2002)
Proteins 48:367-376), and VTML160 (-s VT160, Mueller and Vingron (2002) J.
Comp. Biol. 19:8-13). Each scoring matrix has associated default gap
penalties. The BLOSUM62 scoring matrix and -11/-1 gap penalties can be
specified with -s BP62.
- Alternatively, a BLASTP format scoring matrix file can be specified, e.g.
-s matrix.filename. DNA scoring matrices can also be specified with the
"-r" option.
- With fasta36.3, variable scoring matrices can be specified by preceding
the scoring matrix abbreviation with '?', e.g. -s '?BP62'. Variable
scoring matrices allow the FASTA programs to choose an alternative scoring
matrix with higher information content (bit score/position) when short
queries are used. For example, a 90 nucleotide FASTX query can produce
only a 30 amino-acid alignment, so a scoring matrix with 1.33
bits/position is required to produce a 40 bit score. The FASTA programs
include BLOSUM50 (0.49 bits/pos) and BLOSUM62 (0.58 bits/pos) but can
range to MD10 (3.44 bits/position). The variable scoring matrix option
searches down the list of scoring matrices to find one with information
content high enough to produce a 40 bit alignment score.
- -S
- treat lower case letters in the query or database as low complexity
regions that are equivalent to 'X' during the initial database scan, but
are treated as normal residues for the final alignment display.
Statistical estimates are based on the 'X'ed out sequence used during the
initial search. Protein databases (and query sequences) can be generated
in the appropriate format using John Wootton's "pseg" program,
available from ftp://ftp.ncbi.nih.gov/pub/seg/pseg. Once you have compiled
the "pseg" program, use the command:
- pseg database.fasta -z 1 -q > database.lc_seg
- -t #
- Translation table - [t]fastx36 and [t]fasty36 support the BLAST translation
tables. See
http://www.ncbi.nih.gov/htbin-post/Taxonomy/wprintgc?mode=c/.
- -T #
- (threaded, parallel only) number of threads or workers to use (on
Linux/MacOS/Unix, the default is to use as many processors as are
available; on Windows systems, 2 processors are used).
- -U
- Do RNA sequence comparisons: treat 'T' as 'U', allow G:U base pairs (by
scoring "G-A" and "T-C" as score(G:G)-3). Search only
one strand.
- -V "?$%*"
- Allow special annotation characters in query sequence. These characters
will be displayed in the alignments on the coordinate number line.
- -w #
- line width for similarity score, sequence alignment, output.
- -W #
- context length (default is 1/2 of line width -w) for alignment programs,
like fasta and ssearch, that provide additional sequence context.
- -X
- extended options. Less used options. Other options include -XB, -XM4G,
-Xo, -Xx, and -Xy; see fasta_guide.pdf.
- -z 1, 2, 3, 4, 5, 6
- Specify the statistical calculation. Default is -z 1 for local similarity
searches, which uses regression against the length of the library
sequence. -z -1 disables statistics. -z 0 estimates significance without
normalizing for sequence length. -z 2 provides maximum likelihood
estimates for lambda and K, censoring the 250 lowest and 250 highest
scores. -z 3 uses Altschul and Gish's statistical estimates for specific
protein BLOSUM scoring matrices and gap penalties. -z 4,5: an alternate
regression method. -z 6 uses a composition based maximum likelihood
estimate based on the method of Mott (1992) Bull. Math. Biol.
54:59-75.
- -z 11,12,14,15,16
- compute the regression against scores of randomly shuffled copies of the
library sequences. Twice as many comparisons are performed, but accurate
estimates can be generated from databases of related sequences. -z 11 uses
the -z 1 regression strategy, etc.
- -z 21, 22, 24, 25, 26
- compute two E()-values. The standard (library-based) E()-value is
calculated in the standard way (-z 1, 2, etc), but a second E2() value is
calculated by shuffling the high-scoring sequences (those with E()-values
less than the threshold). For "average" composition proteins,
these two estimates will be similar (though the best-shuffle estimates are
always more conservative). For biased composition proteins, the two
estimates may differ by 100-fold or more. A second -z option, e.g. -z
"21 2", specifies the estimation method for the best-shuffle
E2()-values. Best-shuffle E2()-values approximate the estimates given by
PRSS (or in a pairwise SSEARCH).
- -Z db_size
- Set the apparent database size used for expectation value calculations
(used for protein/protein FASTA and SSEARCH, and for [T]FASTX/Y).
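Two of the numeric rules in the option descriptions above can be sketched in a few lines: the secondary threshold e_cut_r derived from -E, and the bits-per-position requirement behind the variable ('?') scoring matrices given with -s. This illustrates the rules as worded on this page and is not the package's own code. The function names are hypothetical, the handling of a second -E value of exactly 1.0 is not specified here and is an assumption, and the matrix table holds only the three information contents quoted above, so the contents and order of the real list are also assumed.

```python
def secondary_e_cutoff(e_cut, value=None):
    """Return e_cut_r for '-E e_cut value', or None if extra alignments are off."""
    if value is None:
        e_cut_r = e_cut / 10.0      # default when no second value is given
    elif value >= 1.0:              # '== 1.0' is an assumption; the page says '> 1.0'
        e_cut_r = e_cut / value
    else:
        e_cut_r = value             # values < 1.0 are used directly
    return None if e_cut_r < 0 else e_cut_r


# Information content (bits/position) quoted on this page; the real list
# contains more matrices than these three.
MATRIX_BITS = [("BLOSUM50", 0.49), ("BLOSUM62", 0.58), ("MD10", 3.44)]


def bits_per_position_needed(query_nt, target_bits=40.0):
    """Bits/position needed by a translated query of query_nt nucleotides."""
    aligned_residues = query_nt // 3    # 90 nt gives at most 30 amino acids
    return target_bits / aligned_residues


def pick_matrix(query_nt, target_bits=40.0):
    """Return the first listed matrix with enough information content, or None."""
    needed = bits_per_position_needed(query_nt, target_bits)
    for name, bits in MATRIX_BITS:
        if bits >= needed:
            return name
    return None
```

For the 90-nucleotide FASTX example above, bits_per_position_needed(90) is about 1.33, so of the three matrices listed in this sketch only MD10 qualifies.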
The FASTA programs can accept a query sequence from the Unix
"stdin" data stream. This makes it much easier to use fasta36 and
its relatives as part of a WWW page. To indicate that stdin is to be used,
use "@" as the query sequence file name. "@" can also be
used to specify a subset of the query sequence to be used, e.g.:
cat query.aa | fasta36 @:50-150 s
would search the 's' database with residues 50-150 of query.aa.
FASTA cannot automatically detect the sequence type (protein vs DNA) when
"stdin" is used and assumes protein comparisons by default; the
'-n' option is required for DNA queries read from stdin.
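A wrapper that feeds the query through stdin, as described above, might look like the following sketch. The helper names are hypothetical; only the "@" query name, the "@:start-end" subset form, and the -n option come from this page. The search itself is wrapped in a function and is not run unless called.

```python
import subprocess


def stdin_query_name(start=None, end=None):
    """Return '@' or '@:start-end' to select a subset of the stdin query."""
    if start is None and end is None:
        return "@"
    # This sketch only supports the complete start-end form shown on this page.
    if start is None or end is None:
        raise ValueError("give both start and end, or neither")
    return f"@:{start}-{end}"


def stdin_search_argv(program, library, start=None, end=None, dna=False):
    """Build argv for a search whose query is read from stdin."""
    options = ["-n"] if dna else []     # sequence type is not auto-detected on stdin
    return [program, *options, stdin_query_name(start, end), library]


def search_from_stdin(argv, query_text):
    """Pipe query_text to the program's stdin and return its stdout."""
    result = subprocess.run(
        argv, input=query_text, capture_output=True, text=True, check=True
    )
    return result.stdout


if __name__ == "__main__":
    # Equivalent of: cat query.aa | fasta36 @:50-150 s
    print(stdin_search_argv("fasta36", "s", 50, 150))
```

Keeping the -n decision in one place makes it harder to forget for DNA queries, where the default protein assumption would otherwise apply.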
- FASTLIBS
- location of library choice file (-l FASTLIBS)
- SRCH_URL1, SRCH_URL2
- format strings used to define options to re-search the database.
- REF_URL
- the format string used to define the option to lookup the library sequence
in Entrez, or some other database.
Bill Pearson
wrp@virginia.EDU
Version: $ Id: $ Revision: $Revision: 210 $