MLOCARNA(1)    User Contributed Perl Documentation    MLOCARNA(1)
MLocARNA - multiple alignment of RNA
mlocarna [options] <fasta file>
MLocarna computes a multiple sequence-structure alignment
of RNA sequences. The structure of these sequences does not have to be known
but is inferred based on simultaneous alignment and folding.
Generally, mlocarna takes multiple sequences as input, given in a
fasta file. The fasta file can be extended to specify structure and anchor
constraints that respectively control the possible foldings and possible
alignments. The main outcome is a multiple alignment together with a
consensus structure.
Technically, mlocarna works as a front end to the pairwise alignment
tools locarna, locarna_p, and sparse (and even carna), which are employed to
construct the multiple alignment progressively.
Going beyond the basic progressive alignment scheme, Mlocarna
implements probabilistic consistency transformation and iterative alignment,
which are available in probabilistic mode. Moreover, the LocARNA package
provides an alternative multiple alignment tool "locarnate", which
generates alignments based on T-Coffee using (non-probabilistic) consistency
transformation.
- --configure=file
- Load a parameter set from a configuration file of options and option value
pairs. This enables specifying (sets of) default parameters for mlocarna,
which can still be modified by other options to mlocarna. Command line
arguments always take precedence over this configuration. Options are
specified one per line; option value pairs are written as option:
value. Whitespace and '#'-prefixed comments are ignored.
By default, mlocarna performs progressive alignment, where the
progressive alignment steps are computed by the pairwise aligner locarna
based on sequences and dot plots (RNAfold -p) and, in subsequent steps, on
partial alignments and their consensus dot plots.
- --probabilistic
- In probabilistic mode, mlocarna scores alignments using match
probabilities that are computed by a partition function approach [tech.
details: the probability computation is implemented in locarna_p; the
probability-based scoring is performed by locarna in mea mode]. This
enables mlocarna to consistency-transform the probabilities (option
--consistency-transform) and to compute reliabilities. The tool
reliability-profile.pl is provided to visualize reliability profiles.
Reliabilities can also be used for iterating the alignment with reliably
aligned base pairs as structural constraints (option
--it-reliable-structure).
- --sparse
- Apply the sparsified alignment algorithm SPARSE for all pairwise
alignments (instead of the default pairwise aligner locarna). SPARSE
supports stronger sparsification for faster alignment computation and
increases the structure prediction capabilities over locarna.
- --tgtdir
- Target directory. All output files are written to this directory. By
default, the target directory is generated from the input filename by
replacing suffix fa by (or appending) out.
- -v, --verbose
- Turn on verbose output. Shows progress of computation of all-2-all pairwise
alignments for guide tree computation; shows intermediary alignments
during the progressive alignment computation.
- --moreverbose
- Be even more verbose: additionally shows parameters for the pairwise
aligner; moreover, the calls and output of the RNA base pair probability
computations as well as the pairwise aligner during progressive
alignment.
- -q, --quiet
- Be quiet.
- --keep-sequence-order
- Preserve sequence order of the input in the final alignment. Affects
output to stdout and results/result.aln.
- --stockholm
- Write STOCKHOLM files of all final and intermediate alignments (in
addition to CLUSTALW files).
- --consensus-structure
- Type of consensus structures written to stockholm output (and screen in
verbose modes) [alifold|mea|none] (default: none). This includes
intermediate alignments of the progressive multiple alignment. If not
explicitly specified otherwise, the option alifold-consensus-dp implicitly
sets this to alifold. Note that the alifold consensus of the final
alignment is computed and printed, regardless of this option.
- -w, --width=columns
(120)
- Output width for sequences in clustal-like and stockholm output; note that
the clustalw standard format requires 60 or less.
- --write-structure
- Write guidance structure in output to stdout. This provides some insight
into the influence of structure on the generated pairwise alignments.
The guidance structure shows the base pairs 'predicted' by each pairwise
locarna (or sparse) alignment. These structures should not be mistaken as
predicted consensus structures of multiple alignments. Consensus
structures can be more adequately derived from the multiple alignment. For
this reason, mlocarna reports the consensus structure by RNAalifold.
- --free-endgaps
- Allow free endgaps. (Corresponds to pairwise locarna option --free-endgaps
"++++".)
- --free-endgaps-3
- Allow free endgaps 3'.
- --free-endgaps-5
- Allow free endgaps 5'.
- --sequ-local=bool
(false)
- Turns on/off sequence locality [def=off]. Sequence locality refers to the
usual form of local alignment. If on, mlocarna bases all calculations on
local pairwise alignments, which determine the best alignments of
subsequences (disregarding dissimilar starts and ends). Note that truly
local structure alignments as well as local multiple alignments are still
a matter of research; so don't expect perfect results in all
instances.
- --struct-local=bool
(false)
- Turns on/off structure locality [def=off]. Structural locality enables
skipping entire substructures in alignments. In pairwise alignments, this
allows one exclusion of some subsequence in each loop, thus guaranteeing
that the (structure locally) aligned parts of the sequences are always
connected w.r.t. the predicted structure but not necessarily consecutive
in the sequence. Structure locality does not imply sequence locality, but
rather the two concepts are orthogonal.
- --penalized=score
- Variant of sequence local alignment (cf. --sequ-local), where the
specified penalty score is subtracted for each base in the local
alignment. [Experimental]
- --indel=score
(-150)
- Score of each single base insertion or deletion.
- --indel-opening=score
(-750)
- Score of opening an insertion or deletion, i.e. score for a consecutive
run of deletions or insertions. Indel opening score and indel score define
the affine scoring of gaps.
- -m, --match=score
(50)
- Score of a base match (unless ribosum-based scoring)
- -M, --mismatch=score
(0)
- Score of a base mismatch (unless ribosum-based scoring)
- --use-ribosum=bool
(true)
- Use ribosum scores for scoring base matches and base pair matches; note
that tau=0 suppresses any effect on the latter.
- --ribosum-file=file
- File specifying the Ribosum base and base-pair similarities. [default: use
RIBOSUM85_60 without requiring a Ribosum file.]
- -s, --struct-weight=score
(200)
- Maximum weight of one predicted arc, aka base pair. Note that this means
that the maximum weight of an arc match is twice as high. The maximum
weight is assigned to base pairs with (almost) probability 1 in the dot
plot; less probable base pairs receive gradually degrading scores. The
struct-weight factor balances the score contribution from structure to the
score contribution from base similarity scores (e.g. ribosum scores).
- -e, --exp-prob=prob
- Expected probability of a base pair.
- -t, --tau=factor
(0)
- Tau factor in percent. The tau factor controls the contribution of
sequence-dependent scores to the score of arc matches.
- -E, --exclusion=<score>
(0)
- Weight of an exclusion, i.e. an omitted subsequence in a loop, which
applies only to structural local alignment.
- --stacking
- Use stacking terms. In this case, stacked arcs are scored based on
conditional probabilities (conditioned by their stacked inner arc) rather
than unconditioned base pair probabilities. [Experimental]
- --new-stacking
- Use new stacking terms; cf. --stacking. These terms directly award bonuses
to stacking. [Experimental]
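As an illustration of the affine gap scoring defined by --indel and
--indel-opening, the following Python sketch computes the score of a single
gap. It assumes the usual affine form, where the opening score is charged once
per run and the indel score once per gapped base; the function name gap_score
is hypothetical and not part of LocARNA.

```python
def gap_score(length, indel=-150, indel_opening=-750):
    """Score of one run of `length` consecutive insertions or deletions.

    Assumption: the opening score is charged once per run and the indel
    score once per gapped base (the usual affine gap model). The defaults
    are the documented defaults of --indel and --indel-opening.
    """
    if length <= 0:
        return 0
    return indel_opening + length * indel


# Compare one gap of four bases with four separate single-base gaps.
print(gap_score(4))
print(4 * gap_score(1))
```

Under this model, a single long gap is penalized less than the same number of
gapped bases spread over several short gaps.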
Several parameters are available to speed up the pairwise
alignment computations heuristically. Choosing these parameters reasonably
is necessary to achieve a good trade-off between speed and accuracy,
especially for large alignment instances.
- -p, --min-prob=probability
(0.001)
- Minimum base pair / arc probability. Arcs with lower probability in the
input RNA structure ensembles are ignored.
- -P, --tree-min-prob=probability
- Minimal probability for constructing the guide tree. This probability can be set
separately for the all-2-all comparison for constructing the guide tree
and the progressive/iterative alignment steps.
- --max-bps-length-ratio=factor
(0.0)
- Maximal ratio of the number of base pairs divided by sequence length
(default: no effect)
- -D, --max-diff-am=difference
- Maximal difference for lengths of matched arcs. Two arcs that have a
higher difference of their lengths are ignored. This speeds up the
alignment, since fewer arc comparisons (i.e. fewer DP matrices) have to be
computed. [def: off/-1]
- -d, --max-diff=difference
- Maximal difference of the positions of any two bases that are considered
to be aligned. Bases with higher difference are generally not aligned.
This allows banding of the DP matrices and thus can result in large
speedups. Note that the semantics change in the context of a reference
alignment specified with max-diff-aln. Then, the difference to the
reference alignment is restricted. [def: off/-1]
- --max-diff-at-am=difference
- Same restriction as max-diff but only at the ends of arcs in arc matches.
[def: off/-1]
- --min-trace-probability=probability
- Minimal sequence alignment probability of potential traces
(probability-based sequence alignment envelope) [default=1e-4, moderate
filter].
- --max-diff-aln=file
- Computes "realignment" in the environment of the given reference
alignment (file in clustalw format) by constraining the maximum difference
to this reference (controlled by --max-diff). The input sequences (and
their names) have to be identical to these alignment sequences; however
the alignment is allowed to contain extra sequences, which are ignored. In
combination with option --realign, the reference alignment is taken from
the (main) input file. In this case, the 'file' argument should be '.',
but is ignored (with warning) otherwise.
- --max-diff-relax
- Relax deviation constraints (cf. --max-diff-aln) in multiple alignment.
This option is useful if the default strategy for realignment fails.
- -a, --min-am-prob=probability
(0.001)
- Minimum arc-match probability (filters output of locarna_p)
- -b, --min-bm-prob=probability
(0.001)
- Minimum base-match probability (filters output of locarna_p)
- --pw-aligner
- Utilize the given tool for computing pairwise alignments
(def=locarna).
- --pw-aligner-p=tool
- Utilize the given tool for computing partition function pairwise
alignments (def=locarna_p).
- --pw-aligner-options
- Additional option string for the pairwise alignment tool
(def="").
- --pw-aligner-p-options
- Additional option string for the partition function pairwise alignment
tool (def="").
- --treefile=file
- File with guide tree in NEWICK format. The given tree is used as guide
tree for the progressive alignment. This saves the calculation of pairwise
all-vs-all similarities and construction of the guide tree.
- --similarity-matrix=file
- File with similarity matrix. The similarities in the matrix are used to
construct the guide tree for the progressive alignment. This saves the
calculation of pairwise all-vs-all similarities.
- --score-lists
- Construct the guide tree from pairwise scores in files scores* in the
subdirectory scores of the target directory. The scores are typically
precomputed, possibly in a distributed way, using
--compute-pairwise-scores.
- --compute-pairwise-scores=k/N
- Compute only the pairwise alignments for the guide tree construction.
Write scores to the file $tgtdir/scores/scores-$k
and terminate. By computing only the k-th fraction of N parts, the option
supports distributing the computation of the alignments. Before computing
the pairwise scores, the dot plot files should be precomputed using
--only-dps. (see also: --score-lists)
- --graphkernel
- Use the graphkernel for constructing the guide tree.
- --svmsgdnspdk=program
- Specify the svmsgdnspdk program (potentially including path). Default: use
"svmsgdnspdk" in path.
- --fasta2shrep=program
- Program "fasta2shrep" for generating graphs from the input
sequences for use with the graph kernel guide tree generation (potentially
including path). Default: use "fasta2shrep_gspan.pl" in
path.
- --fasta2shrep-options=argument-string
- Command line arguments for fasta2shrep. Default: "-wins 200 -shift 50
-stack -t 3 -M 3".
- --alifold-consensus-dp
- Employs RNAalifold -p for generating consensus dotplot after each
progressive alignment step. This replaces the default consensus dotplot
computation, which averages over the input dot plots. This method should
be used with care in combination with structural constraints, since it
ignores them for all but the pairwise alignments of single sequences.
Furthermore, note that it does not support --stacking or
--new-stacking.
- --max-alignment-size=size
- Limit the maximum number of sequences that are aligned together by
progressive alignment. This can be used to save unnecessary computations,
when producing a clustering of the input RNAs rather than constructing a
single multiple alignment. [default: no limit].
- --local-progressive
- Align only the subalignment of locally aligned subsequences in subsequent
steps of the progressive multiple alignment. Note: this is only effective
if local alignment is turned on. (Default for sequence local alignment;
turn off by --global-progressive)
- --global-progressive
- Use alignments including "locality gaps" in subsequent steps of
the progressive multiple alignment. Note: this is only effective if local
alignment is turned on. (Opposite of --local-progressive)
- --consistency-transformation
- Apply probabilistic consistency transformation (only possible in
probabilistic mode).
- --iterate
- Refine iteratively after progressive alignment. Currently, iterative
refinement optimizes the SCI or RELIABILITY (not the locarna score)!
Iterative refinement realigns all binary splits along the guide tree.
- --iterations=number
- Refine iteratively for given number of iterations (or stop at
convergence).
- --it-reliable-structure=number
- Iterate alignment <num> times with reliable structure. This works
only in probabilistic mode, when reliabilities can be computed.
- --pf-only-basematch-probs
- Use only base match probabilities (no base pair match probabilities).
- --extended-pf
- Use extended precision for partition function values. This increases
run-time and space (less than 2x), however enables handling
significantly larger instances.
- --quad-pf
- Use quad precision for partition function values. Even more precision
than extended pf, but usually much slower (overrides extended-pf).
- --pf-scale=<scale>
- Scale of partition function; use for avoiding overflow in larger
instances.
- --fast-mea
- Compute base match probabilities using Gotoh PF-algorithm.
- --mea-alpha
- Weight of unpaired probabilities in fast mea mode.
- --mea-beta
- Weight of base pair match contribution in probabilistic mode.
- --mea-gamma
- Reserved parameter for fast-mea mode.
- --mea-gapcost
- Turn on gap penalties in probabilistic/mea mode (default: off).
- --write-probs / --no-write-probs
- Write / don't write probabilities (of base matches and arc matches) to the
target directory. This can be overridden by the more specific options
--(no-)write-bm-probs and --(no-)write-am-probs. Use this to make the
probability files available for post-processing. (default: don't write).
- --write-bm-probs / --no-write-bm-probs
- Write / don't write base match probabilities to files in target dir
(default: don't write).
- --write-am-probs / --no-write-am-probs
- Write / don't write arc match probabilities to files in target dir
(default: don't write).
- --realign
- Realignment mode. In this mode, the input must be in clustal format and is
interpreted as alignment of the input sequences; the sequences are
obtained by removing all gap symbols. Moreover, the given alignment is set
as reference alignment for --max-diff-aln. Structure and anchor
constraints can be specified as consensus constraints in the input;
constraints are specified as 'alignment strings' with names '#A1', '#S',
or '#FS' for anchor, structure, or fixed structure constraints,
respectively. Characters in the '#A1' anchor specification other than '-'
and '.' constrain the aligned residues in the respective column to remain
aligned (blanks are disallowed; annotations '#A2', '#A3', ... are
ignored). The consensus structure constraint is equivalent to constraining
each single sequence by the projection of the consensus constraint to the
sequence (removing all base pairs with at least one gapped end).
- --dp-cache=directory
- Use directory <dir> as cache for dot plot or pp files (useful for
avoiding multiple computation).
- --only-dps
- Compute only the pair probability files / dot plots, don't align (useful
for filling the dp-cache).
- --evaluate=file
- Evaluate the given multiple alignment (clustalw aln format, or use
--eval-fasta). This requires that probabilities are already computed
(mlocarna --probabilistic) and present in the target directory
(--tgtdir).
- --eval-fasta
- Assume that alignment for evaluation (cf. --evaluate) is in fasta
format.
- --anchor-constraints=<file>
- Read anchor constraints from bed format specification.
Anchor constraints in four-column bed format specify positions
of named anchor regions per sequence. The 'contig' names have to
correspond to the fasta input sequence names. Anchor names must be
unique per sequence and regions of the same name for different sequences
must have the same length. This constrains the alignment to align all
regions of the same name.
The specification of anchors via this option removes all
anchor definitions that may be given directly in the fasta input
file!
- --ignore-constraints
- Ignore all constraints (anchor and structure constraints) even if
given.
- --noLP / --LP
- Disallow/Allow lonely pairs (default: Disallow).
- --maxBPspan
- Limit maximum span of base pairs (default off).
- --relaxed-anchors
- Relax semantics of anchor constraints (default off, meaning 'strict'
semantics). For lexicographically ordered anchors, where each sequence is
annotated with exactly the same names, both semantics are equivalent;
thus, in this common case, the subtle differences can be ignored. In
strict semantics, anchor names must be ordered lexicographically and can
only be aligned in this order. In relaxed semantics, the only requirement
is that equal anchor names are matched. Consequently, anchor names that
don't occur in all sequences could be overwritten (if two names are
assigned to the same position) or even introduce inconsistencies.
- --plfold-span=span
- Use RNAplfold with span.
- --plfold-winsize=ws
- Use RNAplfold with window of size ws (default=2*span).
- --rnafold-parameter=<file>
- Parameter file for RNAfold (RNAfold's -P option)
- --rnafold-temperature=<temp>
- Temperature for RNAfold (RNAfold's -T option)
- --skip-pp
- Skip computation of pair probs if the probabilities already exist.
Non-existing ones are still computed.
- --no-bpp-precomputation
- Switch off precomputation of base pair probabilities. Overwrite potentially
existing input files. (compare skip-pp). For use with special pairwise
aligners (e.g. locarna_n) that recompute the base pair probabilities at
each invocation.
- --in-loop-probabilities
- Turn on precomputation of in loop probabilities. For use with special
pairwise aligners (e.g. locarna_n) that use such probabilities.
- --threads, --cpus=number
- Use the given number of threads for computing pair probabilities and
all-2-all alignments in parallel (multicore/processor support).
Be aware: mlocarna seems not to scale well for more than a few
threads (often only 2 or 3). Using more threads is often detrimental,
since it strongly increases memory consumption due to the current perl
threading implementation. This unfortunate behavior seems hard to
improve without a major rewrite of the software.
- --help
- Brief help message
- --man
- Full documentation
The sequences are given in input file <file> in mfasta
format. All results are written to a target directory <dir>. If a
tree file is given (--treefile), the contained tree (in NEWICK format) is
used as guide tree for the progressive alignment. The final results are
collected in <tgtdir>/results. The final multiple alignment is
<tgtdir>/results/result.aln.
[Note that the LocARNA distribution provides files of the
following and other examples in Data/Examples.]
Sequences are typically given in plain fasta format like
example.fa
----------------------------------------
>fruA
CCUCGAGGGGAACCCGAAAGGGACCCGAGAGG
>fdhA
CGCCACCCUGCGAACCCAAUAUAAAAUAAUACAAGGGAGCAGGUGGCG
>vhuU
AGCUCACAACCGAACCCAUUUGGGAGGUUGUGAGCU
----------------------------------------
To align these sequences, simply call
mlocarna example.fa
Usually, it makes sense to set additional options; this is either
done on the command line or via configuration files. A reasonably small
configuration for global alignment of large instances would be
short-example.cfg
----------------------------------------
max-diff-am: 25
max-diff: 60
min-prob: 0.01
plfold-span: 100
indel: -50
indel-open: -750
threads: 8 # <- adapt to your hardware
alifold-consensus-dp
----------------------------------------
To use it, call
mlocarna --config short-example.cfg example.fa
which is equivalent to
mlocarna --max-diff-am 25 --max-diff 60 --min-prob 0.01 \
--indel -50 --indel-open -750 \
--plfold-span 100 --threads 8 --alifold-consensus-dp \
example.fa
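The correspondence between configuration entries and command line options can
be sketched in Python as follows. This is only an illustration of the file
format described for --configure (one entry per line, option value pairs
written as option: value, '#' comments ignored), not mlocarna's actual parser;
the function name config_to_args is hypothetical, and the sketch assumes that
option values contain neither '#' nor ':'.

```python
def config_to_args(text):
    """Translate configuration file text into a list of command line arguments.

    Assumptions: everything after '#' is a comment, and values contain
    neither '#' nor ':'.
    """
    args = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if ":" in line:
            # option value pair, e.g. "max-diff: 60"
            name, value = (part.strip() for part in line.split(":", 1))
            args += ["--" + name, value]
        else:
            # single entry, e.g. "alifold-consensus-dp"
            args.append("--" + line)
    return args


cfg = """\
max-diff: 60
threads: 8 # <- adapt to your hardware
alifold-consensus-dp
"""
print(" ".join(["mlocarna"] + config_to_args(cfg) + ["example.fa"]))
```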
For probabilistic alignment with consistency transformation,
call
mlocarna --probabilistic --consistency-transform example.fa
In both cases, mlocarna writes the main results to stdout and more
detailed results to the target directory example.out. The results directory
is overwritten if it exists already. To avoid this, one can specify the
target directory (--tgtdir).
Mlocarna supports structure constraints for folding and anchor
constraints for alignment. Both types of constraints can be specified in
extension of the standard fasta format via 'constraint lines'. Fasta-ish
input with constraints looks like this
example-w-constraints.fa
----------------------------------------
>A
GACCCUGGGAACAUUAACUACUCUCGUUGGUGAUAAGGAACA
..((.(....xxxxxx...................))).xxx #S
..........000000.......................111 #1
..........123456.......................123 #2
>B
ACGGAGGGAAAGCAAGCCUUCUGCGACA
.(((....xxxxxx.......))).xxx #S
........000000...........111 #1
........123456...........123 #2
----------------------------------------
The same anchor constraints (as given by the lines tagged #1 and #2) can
alternatively be specified in bed format by the entries
example-anchors.bed
----------------------------------------
A 10 16 first_box
B 8 14 first_box
A 39 42 ACA-box
B 25 28 ACA-box
----------------------------------------
where anchor regions (boxes) have arbitrary but matching names and
contig/sequence names correspond to the sequence names of the fasta(-like)
input.
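The requirements stated for --anchor-constraints (anchor names unique per
sequence, regions of the same name of equal length) can be checked before
calling mlocarna. The following Python sketch is a hypothetical helper, not
part of LocARNA; it assumes the usual bed convention of 0-based start and
exclusive end, which agrees with the '#1' and '#2' lines of the example above.

```python
from collections import defaultdict


def check_anchor_bed(text):
    """Parse four-column bed text and return {anchor name: {sequence: (start, end)}}.

    Raises ValueError if an anchor name occurs twice in one sequence or if
    regions of the same name differ in length.
    Assumption: coordinates are 0-based with exclusive end.
    """
    regions = defaultdict(dict)
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if len(fields) != 4:
            raise ValueError("expected four columns: " + line)
        seq, start, end, name = fields[0], int(fields[1]), int(fields[2]), fields[3]
        if seq in regions[name]:
            raise ValueError("anchor %r occurs twice in sequence %r" % (name, seq))
        regions[name][seq] = (start, end)
    for name, per_seq in regions.items():
        lengths = {end - start for start, end in per_seq.values()}
        if len(lengths) > 1:
            raise ValueError("anchor %r has regions of different lengths" % name)
    return dict(regions)


bed = """\
A 10 16 first_box
B 8 14 first_box
A 39 42 ACA-box
B 25 28 ACA-box
"""
for name, per_seq in check_anchor_bed(bed).items():
    print(name, per_seq)
```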
Given, e.g.
example-wo-anchors.fa
----------------------------------------
>A
GACCCUGGGAACAUUAACUACUCUCGUUGGUGAUAAGGAACA
..((.(....xxxxxx...................))).xxx #S
>B
ACGGAGGGAAAGCAAGCCUUCUGCGACA
.(((....xxxxxx.......))).xxx #S
----------------------------------------
one calls
mlocarna --anchor-constraints example-anchors.bed example-wo-anchors.fa
In realignment mode (option --realign), mlocarna is called with an
input alignment in clustal format, e.g.
mlocarna --realign example-realign.aln
This allows defining constraints as 'consensus constraints' in
the input, e.g.
example-realign.aln
----------------------------------------
CLUSTAL W
fruA --CCUCGAGGGGAACCCGAA-------------AGGGACCCGAGAGG--
vhuU AGCUCACAACCGAACCCAUU-------------UGGGAGGUUGUGAGCU
fdhA CGCCACCCUGCGAACCCAAUAUAAAAUAAUACAAGGGAGCAG-GUGGCG
#A1 ..*...........CCC.............................5..
#S ((((((.((((...(((.................))).)))).))))))
----------------------------------------
Note that anchor names are arbitrary and the consensus structure
is 'projected' to the single sequences. Moreover, the input alignment can be
used as reference for fast limited realignment, e.g. to realign within
distance 5 of the reference alignment, call:
mlocarna --realign example-realign.aln --max-diff 5 --max-diff-aln .
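The projection of a consensus structure constraint to a single sequence, as
described for --realign (removing all base pairs with at least one gapped
end), can be illustrated by the following Python sketch. It is not mlocarna's
implementation; the function name project_structure is hypothetical, and the
sketch assumes '-' as the only gap symbol and balanced '(' and ')' in the
consensus string. All other constraint characters are passed through unchanged.

```python
def project_structure(aligned_seq, consensus):
    """Project a consensus dot-bracket string onto one aligned sequence.

    Base pairs with at least one end in a gap column of `aligned_seq` are
    replaced by '.'; afterwards all gap columns are removed.
    Assumptions: '-' is the gap symbol; brackets are balanced.
    """
    if len(aligned_seq) != len(consensus):
        raise ValueError("sequence and structure must have equal length")
    keep = list(consensus)
    stack = []
    for j, c in enumerate(consensus):
        if c == "(":
            stack.append(j)
        elif c == ")":
            i = stack.pop()
            if aligned_seq[i] == "-" or aligned_seq[j] == "-":
                keep[i] = keep[j] = "."  # drop the pair, keep the columns
    return "".join(k for k, s in zip(keep, aligned_seq) if s != "-")


# The outer pair has a gapped left end and is therefore removed.
print(project_structure("-CAAAGC", "((...))"))
```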
Sebastian Will
Christina Otto (ExpaRNA-P, sparsification classes for ExpaRNA-P and SPARSE)
Milad Miladi (SPARSE)
For download and online information, see
<https://github.com/s-will/LocARNA> and
<http://www.bioinf.uni-freiburg.de/Software/LocARNA>.
Latest releases are available as source code on Github at
<https://github.com/s-will/LocARNA/releases>.
Sebastian Will, Kristin Reiche, Ivo L. Hofacker, Peter F. Stadler,
and Rolf Backofen. Inferring non-coding RNA families and classes by means of
genome-scale structure-based clustering. PLOS Computational Biology, 3 no. 4
pp. e65, 2007. doi:10.1371/journal.pcbi.0030065
Sebastian Will, Tejal Joshi, Ivo L. Hofacker, Peter F. Stadler,
and Rolf Backofen. LocARNA-P: Accurate boundary prediction and improved
detection of structural RNAs. RNA, 18 no. 5 pp. 900-914, 2012.
doi:10.1261/rna.029041.111
Sebastian Will, Michael Yu, and Bonnie Berger. Structure-based
Whole Genome Realignment Reveals Many Novel Non-coding RNAs. Genome
Research, no. 23 pp. 1018-1027, 2013. doi:10.1101/gr.137091.111