NAME

MLocARNA - multiple alignment of RNA

SYNOPSIS

mlocarna [options] <fasta file>

DESCRIPTION

MLocarna computes a multiple sequence-structure alignment of RNA sequences. The structure of these sequences does not have to be known but is inferred based on simultaneous alignment and folding.

Generally, mlocarna takes multiple sequences as input, given in a fasta file. The fasta file can be extended to specify structure and anchor constraints that respectively control the possible foldings and possible alignments. The main outcome is a multiple alignment together with a consensus structure.

Technically, mlocarna works as front end to the pairwise alignment tools locarna, locarna_p, and sparse (and even carna), which are employed to construct the multiple alignment progressively.

Going beyond the basic progressive alignment scheme, Mlocarna implements probabilistic consistency transformation and iterative alignment, which are available in probabilistic mode. Moreover, the LocARNA package provides an alternative multiple alignment tool "locarnate", which generates alignments based on T-Coffee using (non-probabilistic) consistency transformation.

OPTIONS

Load configurations from file

--configure=file: Load a parameter set from a configuration file of options and option value pairs. This enables specifying (sets of) default parameters for mlocarna, which can still be modified by other options to mlocarna. Command line arguments always take precedence over this configuration. Each line of the file holds a single entry: either a bare option name or an option value pair written as option: value. Whitespace and '#'-prefixed comments are ignored.
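
The following Python sketch illustrates how such a configuration file maps to command line arguments. It is not mlocarna's actual parser; the function name config_to_args is hypothetical, and the sketch assumes that values contain neither '#' nor ':'.

```python
def config_to_args(text):
    """Translate configuration-file text into a list of command line arguments."""
    args = []
    for line in text.splitlines():
        # Drop a '#'-prefixed comment and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        if ":" in line:
            # Option value pair, written as "option: value".
            name, value = line.split(":", 1)
            args += ["--" + name.strip(), value.strip()]
        else:
            # Bare option without a value.
            args.append("--" + line)
    return args


example = """\
max-diff: 60
min-prob: 0.01
threads:  8   # adapt to your hardware
alifold-consensus-dp
"""
print(config_to_args(example))
```

Since command line arguments take precedence, a wrapper built on this idea would have to let explicitly given arguments override the translated ones.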

Major alignment modes

By default, mlocarna performs progressive alignment, where the progressive alignment steps are computed by the pairwise aligner locarna. The first steps are based on the sequences and their dot plots (RNAfold -p); subsequent steps are based on partial alignments and their consensus dot plots.

--probabilistic: In probabilistic mode, mlocarna scores alignments using match probabilities that are computed by a partition function approach [tech. details: the probability computation is implemented in locarna_p; the probability-based scoring is performed by locarna in mea mode]. This enables mlocarna to consistency-transform the probabilities (option --consistency-transform) and to compute reliabilities. The tool reliability-profile.pl is provided to visualize reliability profiles. Reliabilities can also be used for iterating the alignment with reliably aligned base pairs as structural constraints (option --it-reliable-structure).
--sparse: Apply the sparsified alignment algorithm SPARSE for all pairwise alignments (instead of the default pairwise aligner locarna). SPARSE supports stronger sparsification for faster alignment computation and increases the structure prediction capabilities over locarna.

Controlling Output

--tgtdir: Target directory. All output files are written to this directory. By default, the target directory name is generated from the input filename by replacing the suffix fa with out (or by appending out).
-v, --verbose: Turn on verbose output. Shows progress of computation of all-2-all pairwise alignments for guide tree computation; shows intermediary alignments during the progressive alignment computation.
--moreverbose: Be even more verbose: additionally shows parameters for the pairwise aligner; moreover, the calls and output of the RNA base pair probability computations as well as the pairwise aligner during progressive alignment.
-q, --quiet: Be quiet.
--keep-sequence-order: Preserve sequence order of the input in the final alignment. Affects output to stdout and results/result.aln.
--stockholm: Write STOCKHOLM files of all final and intermediate alignments (in addition to CLUSTALW files).
--consensus-structure: Type of consensus structures written to stockholm output (and screen in verbose modes) [alifold|mea|none] (default: none). This includes intermediate alignments of the progressive multiple alignment. If not explicitly specified otherwise, the option --alifold-consensus-dp implicitly sets this to alifold. Note that the alifold consensus of the final alignment is computed and printed, regardless of this option.
-w, --width=columns (120): Output width for sequences in clustal-like and stockholm output; note that the clustalw standard format requires 60 or less.
--write-structure: Write guidance structure in output to stdout. This provides some insight into the influence of structure on the generated pairwise alignments. The guidance structure shows the base pairs 'predicted' by each pairwise locarna (or sparse) alignment. These structures should not be mistaken for predicted consensus structures of multiple alignments. Consensus structures can be more adequately derived from the multiple alignment. For this reason, mlocarna reports the consensus structure by RNAalifold.

Locality

--free-endgaps: Allow free endgaps. (Corresponds to pairwise locarna option --free-endgaps "++++".)
--free-endgaps-3: Allow free endgaps 3'.
--free-endgaps-5: Allow free endgaps 5'.
--sequ-local=bool (false): Turns on/off sequence locality [def=off]. Sequence locality refers to the usual form of local alignment. If on, mlocarna bases all calculations on local pairwise alignments, which determine the best alignments of subsequences (disregarding dissimilar starts and ends). Note that truly local structure alignments as well as local multiple alignments are still a matter of research; so don't expect perfect results in all instances.
--struct-local=bool (false): Turns on/off structure locality [def=off]. Structural locality enables skipping entire substructures in alignments. In pairwise alignments, this allows one exclusion of some subsequence in each loop; thus, guaranteeing that the (structure locally) aligned parts of the sequences are always connected w.r.t. the predicted structure but not necessarily consecutive in the sequence. Structure locality does not imply sequence locality, but rather the two concepts are orthogonal.
--penalized=score: Variant of sequence local alignment (cf. --sequ-local), where the specified penalty score is subtracted for each base in the local alignment. [Experimental]

Pairwise alignment and scoring

--indel=score (-150): Score of each single base insertion or deletion.
--indel-opening=score (-750): Score of opening an insertion or deletion, i.e. a score that is charged once for each consecutive run of deletions or insertions. Indel opening score and indel score together define the affine scoring of gaps.
-m, --match=score (50): Score of a base match (unless ribosum-based scoring)
-M, --mismatch=score (0): Score of a base mismatch (unless ribosum-based scoring)
--use-ribosum=bool (true): Use ribosum scores for scoring base matches and base pair matches; note that tau=0 suppresses any effect on the latter.
--ribosum-file=file: File specifying the Ribosum base and base-pair similarities. [default: use RIBOSUM85_60 without requiring a Ribosum file.]
-s, --struct-weight=score (200): Maximum weight of one predicted arc, aka base pair. Note that this means that the maximum weight of an arc match is twice as high. The maximum weight is assigned to base pairs with (almost) probability 1 in the dot plot; less probable base pairs receive gradually degrading scores. The struct-weight factor balances the score contribution from structure to the score contribution from base similarity scores (e.g. ribosum scores).
-e, --exp-prob=prob: Expected probability of a base pair.
-t, --tau=factor (0): Tau factor in percent. The tau factor controls the contribution of sequence-dependent scores to the score of arc matches.
-E, --exclusion=<score> (0): Weight of an exclusion, i.e. an omitted subsequence in a loop, which applies only to structural local alignment.
--stacking: Use stacking terms. In this case, stacked arcs are scored based on conditional probabilities (conditioned by their stacked inner arc) rather than unconditioned base pair probabilities. [Experimental]
--new-stacking: Use new stacking terms; cf. --stacking. These terms directly award bonuses to stacking. [Experimental]
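
As a small worked example of the affine gap scoring defined by --indel and --indel-opening, the following Python sketch computes the score of one consecutive run of insertions or deletions. It assumes the common convention that a run of n gapped bases scores the opening score once plus n times the per-base score; the function name gap_run_score is hypothetical and the exact bookkeeping inside locarna may differ.

```python
def gap_run_score(length, indel=-150, indel_opening=-750):
    """Score of one consecutive run of `length` inserted or deleted bases.

    Assumed convention: opening score once, per-base score for every base.
    The default values are the documented defaults of --indel and --indel-opening.
    """
    if length <= 0:
        return 0
    return indel_opening + length * indel


# One run of three bases versus the same three bases split into two runs.
print(gap_run_score(3))                      # -1200
print(gap_run_score(1) + gap_run_score(2))   # -1950
```

Under this convention, a single long gap is penalized less than several short gaps covering the same number of bases, which is the purpose of affine gap costs.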

Alignment heuristics

Several parameters are available to speed up the pairwise alignment computations heuristically. Choosing these parameters reasonably is necessary to achieve good trade-off between speed and accuracy, especially for large alignment instances.

-p, --min-prob=probability (0.001): Minimum base pair / arc probability. Arcs with lower probability in the input RNA structure ensembles are ignored.
-P, --tree-min-prob=probability: Minimal probability for constructing the guide tree. This probability can be set separately for the all-2-all comparison for constructing the guide tree and the progressive/iterative alignment steps.
--max-bps-length-ratio=factor (0.0): Maximal ratio of the number of base pairs divided by sequence length (default: no effect)
-D, --max-diff-am=difference: Maximal difference for lengths of matched arcs. Two arcs that have a higher difference of their lengths are ignored. This speeds up the alignment, since fewer arc comparisons (i.e. fewer DP matrices) have to be computed. [def: off/-1]
-d, --max-diff=difference: Maximal difference of the positions of any two bases that are considered to be aligned. Bases with higher difference are generally not aligned. This allows banding of the DP matrices and thus can result in high speed ups. Note that the semantics change in the context of a reference alignment specified with --max-diff-aln. Then, the difference to the reference alignment is restricted. [def: off/-1]
--max-diff-at-am=difference: Same restriction as max-diff but only at the ends of arcs in arc matches. [def: off/-1]
--min-trace-probability=probability: Minimal sequence alignment probability of potential traces (probability-based sequence alignment envelope) [default=1e-4, moderate filter].
--max-diff-aln=file: Computes "realignment" in the environment of the given reference alignment (file in clustalw format) by constraining the maximum difference to this reference (controlled by --max-diff). The input sequences (and their names) have to be identical to these alignment sequences; however the alignment is allowed to contain extra sequences, which are ignored. In combination with option --realign, the reference alignment is taken from the (main) input file. In this case, the 'file' argument should be '.', but is ignored (with warning) otherwise.
--max-diff-relax: Relax deviation constraints (cf. --max-diff-aln) in multiple alignment. This option is useful if the default strategy for realignment fails.
-a, --min-am-prob=probability (0.001): Minimum arc-match probability (filters output of locarna_p)
-b, --min-bm-prob=probability (0.001): Minimum base-match probability (filters output of locarna_p)

Low-level selection of pairwise alignment tools and options

--pw-aligner: Utilize the given tool for computing pairwise alignments (def=locarna).
--pw-aligner-p=tool: Utilize the given tool for computing partition function pairwise alignments (def=locarna_p).
--pw-aligner-options: Additional option string for the pairwise alignment tool (def="").
--pw-aligner-p-options: Additional option string for the partition function pairwise alignment tool (def="").

Controlling the guide tree construction

--treefile=file: File with guide tree in NEWICK format. The given tree is used as guide tree for the progressive alignment. This saves the calculation of pairwise all-vs-all similarities and construction of the guide tree.
--similarity-matrix=file: File with similarity matrix. The similarities in the matrix are used to construct the guide tree for the progressive alignment. This saves the calculation of pairwise all-vs-all similarities.
--score-lists: Construct the guide tree from pairwise scores in files scores* in the subdirectory scores of the target directory. The scores are typically precomputed, possibly in a distributed way, using --compute-pairwise-scores.
--compute-pairwise-scores=k/N: Compute only the pairwise alignments for the guide tree construction. Write scores to the file $tgtdir/scores/scores-$k and terminate. By computing only the k-th fraction of N parts, the option supports distributing the computation of the alignments. Before computing the pairwise scores, the dot plot files should be precomputed using --only-dps. (see also: --score-lists)
--graphkernel: Use the graphkernel for constructing the guide tree.
--svmsgdnspdk=program: Specify the svmsgdnspdk program (potentially including path). Default: use "svmsgdnspdk" in path.
--fasta2shrep=program: Program "fasta2shrep" for generating graphs from the input sequences for use with the graph kernel guide tree generation (potentially including path). Default: use "fasta2shrep_gspan.pl" in path.
--fasta2shrep-options=argument-string: Command line arguments for fasta2shrep. Default: "-wins 200 -shift 50 -stack -t 3 -M 3".
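
The distributed guide tree computation described for --only-dps, --compute-pairwise-scores, and --score-lists can be sketched as a sequence of mlocarna calls. The following Python sketch only builds and prints the command lines; it does not run anything. It assumes that k ranges from 1 to N and that all calls share the same target directory. The function name distributed_guide_tree_commands is hypothetical.

```python
import shlex


def distributed_guide_tree_commands(fasta, tgtdir, parts):
    """Build the mlocarna calls for a guide tree computation split into `parts` jobs."""
    base = ["mlocarna", fasta, "--tgtdir", tgtdir]
    # Step 1: precompute the dot plots once.
    commands = [base + ["--only-dps"]]
    # Step 2: one job per fraction k/N (assumption: k = 1..N); these jobs
    # could be submitted to different machines that share the target directory.
    for k in range(1, parts + 1):
        commands.append(base + [f"--compute-pairwise-scores={k}/{parts}"])
    # Step 3: build the guide tree from the collected score files and align.
    commands.append(base + ["--score-lists"])
    return commands


for cmd in distributed_guide_tree_commands("example.fa", "example.out", 3):
    print(shlex.join(cmd))
```

Whether further options (for example --skip-pp) are needed so that later calls reuse the precomputed dot plots depends on the setup and should be checked before relying on this workflow.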

Controlling multiple alignment construction

--alifold-consensus-dp: Employs RNAalifold -p for generating consensus dotplot after each progressive alignment step. This replaces the default consensus dotplot computation, which averages over the input dot plots. This method should be used with care in combination with structural constraints, since it ignores them for all but the pairwise alignments of single sequences. Furthermore, note that it does not support --stacking or --new-stacking.
--max-alignment-size=size: Limit the maximum number of sequences that are aligned together by progressive alignment. This can be used to save unnecessary computations, when producing a clustering of the input RNAs rather than constructing a single multiple alignment. [default: no limit].
--local-progressive: Align only the subalignment of locally aligned subsequences in subsequent steps of the progressive multiple alignment. Note: this is only effective if local alignment is turned on. (Default for sequence local alignment; turn off by --global-progressive)
--global-progressive: Use alignments including "locality gaps" in subsequent steps of the progressive multiple alignment. Note: this is only effective if local alignment is turned on. (Opposite of --local-progressive)
--consistency-transformation: Apply probabilistic consistency transformation (only possible in probabilistic mode).
--iterate: Refine iteratively after progressive alignment. Currently, iterative refinement optimizes the SCI or RELIABILITY (not the locarna score)! Iterative refinement realigns all binary splits along the guide tree.
--iterations=number: Refine iteratively for given number of iterations (or stop at convergence).
--it-reliable-structure=number: Iterate the alignment the given number of times with reliable structure. This works only in probabilistic mode, when reliabilities can be computed.

Further options for probabilistic mode

--pf-only-basematch-probs

Use only base match probabilities (no base pair match probabilities).

--extended-pf

Use extended precision for partition function values. This increases run-time and space (less than 2x), but enables handling significantly larger instances.

--quad-pf

Use quad precision for partition function values. Even more precision than extended pf, but usually much slower (overrides extended-pf).

--pf-scale=<scale>

Scale of partition function; use for avoiding overflow in larger instances.

--fast-mea

Compute base match probabilities using Gotoh PF-algorithm.

--mea-alpha

Weight of unpaired probabilities in fast mea mode.

--mea-beta

Weight of base pair match contribution in probabilistic mode.

--mea-gamma

Reserved parameter for fast-mea mode.

--mea-gapcost

Turn on gap penalties in probabilistic/mea mode (default: off).

--write-probs / --no-write-probs

Write / don't write probabilities (of base matches and arc matches) to the target directory. The single options --(no-)write-bm-probs and --(no-)write-am-probs can override this setting. Use this to make the probability files available for post-processing. (default: don't write).

--write-bm-probs / --no-write-bm-probs

Write / don't write base match probabilities to files in target dir (default: don't write).

--write-am-probs / --no-write-am-probs

Write / don't write arc match probabilities to files in target dir (default: don't write).

Miscellaneous modes of operation

--realign: Realignment mode. In this mode, the input must be in clustal format and is interpreted as alignment of the input sequences; the sequences are obtained by removing all gap symbols. Moreover, the given alignment is set as reference alignment for --max-diff-aln. Structure and anchor constraints can be specified as consensus constraints in the input; constraints are specified as 'alignment strings' with names '#A1', '#S', or '#FS' for anchor, structure, or fixed structure constraints, respectively. Characters in the '#A1' anchor specification other than '-' and '.' constrain the aligned residues in the respective column to remain aligned (blanks are disallowed; annotations '#A2', '#A3', ... are ignored). The consensus structure constraint is equivalent to constraining each single sequence by the projection of the consensus constraint to the sequence (removing all base pairs with at least one gapped end).
--dp-cache=directory: Use directory <dir> as cache for dot plot or pp files (useful for avoiding multiple computation).
--only-dps: Compute only the pair probability files / dot plots, don't align (useful for filling the dp-cache).
--evaluate=file: Evaluate the given multiple alignment (clustalw aln format, or use --eval-fasta). This requires that probabilities are already computed (mlocarna --probabilistic) and present in the target directory (--tgtdir).
--eval-fasta: Assume that alignment for evaluation (cf. --evaluate) is in fasta format.

Constraints

--anchor-constraints=<file>

Read anchor constraints from bed format specification.

Anchor constraints in four-column bed format specify positions of named anchor regions per sequence. The 'contig' names have to correspond to the fasta input sequence names. Anchor names must be unique per sequence and regions of the same name for different sequences must have the same length. This constrains the alignment to align all regions of the same name.

The specification of anchors via this option removes all anchor definitions that may be given directly in the fasta input file!

--ignore-constraints

Ignore all constraints (anchor and structure constraints) even if given.
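
The rules stated for --anchor-constraints (four columns, contig names matching the fasta names, unique anchor names per sequence, equal region lengths per anchor name) can be checked before calling mlocarna. The following Python sketch is one possible pre-check, not part of mlocarna; the function name check_anchor_bed is hypothetical, region length is taken as end minus start, and start and end are assumed to be integers.

```python
from collections import defaultdict


def check_anchor_bed(text, sequence_names):
    """Return a list of problems found in four-column bed anchor text."""
    problems = []
    seen = set()                  # (sequence name, anchor name) pairs seen so far
    lengths = defaultdict(set)    # anchor name -> set of region lengths
    for lineno, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue
        if len(fields) != 4:
            problems.append(f"line {lineno}: expected 4 columns")
            continue
        seq, start, end, name = fields[0], int(fields[1]), int(fields[2]), fields[3]
        if seq not in sequence_names:
            problems.append(f"line {lineno}: unknown sequence {seq}")
        if (seq, name) in seen:
            problems.append(f"line {lineno}: anchor {name} repeated for {seq}")
        seen.add((seq, name))
        lengths[name].add(end - start)
    for name, region_lengths in lengths.items():
        if len(region_lengths) > 1:
            problems.append(f"anchor {name}: regions differ in length")
    return problems


bed = """\
A   10      16      first_box
B   8       14      first_box
A   39      42      ACA-box
B   25      28      ACA-box
"""
print(check_anchor_bed(bed, {"A", "B"}))  # []
```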

RNA folding (RNAfold/RNAplfold)

--noLP / --LP: Disallow/Allow lonely pairs (default: Disallow).
--maxBPspan: Limit maximum span of base pairs (default off).
--relaxed-anchors: Relax semantics of anchor constraints (default off, meaning 'strict' semantics). For lexicographically ordered anchors, where each sequence is annotated with exactly the same names, both semantics are equivalent; thus, in this common case, the subtle differences can be ignored. In strict semantics, anchor names must be ordered lexicographically and can only be aligned in this order. In relaxed semantics, the only requirement is that equal anchor names are matched. Consequently, anchor names that don't occur in all sequences could be overwritten (if two names are assigned to the same position) or even introduce inconsistencies.
--plfold-span=span: Use RNAplfold with span.
--plfold-winsize=ws: Use RNAplfold with window of size ws (default=2*span).
--rnafold-parameter=<file>: Parameter file for RNAfold (RNAfold's -P option)
--rnafold-temperature=<temp>: Temperature for RNAfold (RNAfold's -T option)
--skip-pp: Skip computation of pair probs if the probabilities are already existing. Non-existing ones are still computed.
--no-bpp-precomputation: Switch off precomputation of base pair probabilities. Overwrite potentially existing input files. (compare skip-pp). For use with special pairwise aligners (e.g. locarna_n) that recompute the base pair probabilities at each invocation.
--in-loop-probabilities: Turn on precomputation of in loop probabilities. For use with special pairwise aligners (e.g. locarna_n) that use such probabilities.

Multithreading

--threads, --cpus=number: Use the given number of threads for computing pair probabilities and all-2-all alignments in parallel (multicore/processor support).
Be aware: mlocarna seems not to scale well for more than a few threads (often only 2 or 3). Using more threads is often detrimental, since it strongly increases memory consumption due to the current perl threading implementation. This unfortunate behavior seems hard to improve without a major rewrite of the software.

Getting Help

--help: Brief help message
--man: Full documentation

The sequences are given in the input file <file> in mfasta format. All results are written to a target directory <tgtdir>. If a tree file is given (--treefile), the contained tree (in NEWICK format) is used as guide tree for the progressive alignment. The final results are collected in <tgtdir>/results. The final multiple alignment is <tgtdir>/results/result.aln.

EXAMPLES

Calling mlocarna

[Note that the LocARNA distribution provides files of the following and other examples in Data/Examples.]

Sequences are typically given in plain fasta format like



    example.fa
    ----------------------------------------
    >fruA
    CCUCGAGGGGAACCCGAAAGGGACCCGAGAGG
    >fdhA
    CGCCACCCUGCGAACCCAAUAUAAAAUAAUACAAGGGAGCAGGUGGCG
    >vhuU
    AGCUCACAACCGAACCCAUUUGGGAGGUUGUGAGCU
    ----------------------------------------

To align these sequences, simply call

  mlocarna example.fa

Usually, it makes sense to set additional options; this is either done on the command line or via configuration files. A reasonable small configuration for global alignment of large instances would be



    short-example.cfg
    ----------------------------------------
    max-diff-am: 25
    max-diff:    60
    min-prob:    0.01
    plfold-span: 100
    indel:       -50
    indel-open:  -750
    threads:     8   # <- adapt to your hardware
    alifold-consensus-dp
    ----------------------------------------

To use it, call

    mlocarna --config short-example.cfg example.fa

which is equivalent to



    mlocarna --max-diff-am 25 --max-diff 60 --min-prob 0.01 \
             --indel -50 --indel-open -750 \
             --plfold-span 100 --threads 8 --alifold-consensus-dp \
             example.fa

For probabilistic alignment with consistency transformation, call

  mlocarna --probabilistic --consistency-transform example.fa

In both cases, mlocarna writes the main results to stdout and more detailed results to the target directory example.out. The results directory is overwritten if it exists already. To avoid this, one can specify the target directory (--tgtdir).
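
To see which directory would be overwritten, the default naming rule for the target directory can be mimicked as follows. This Python sketch takes the rule literally (a trailing ".fa" is replaced by ".out", otherwise ".out" is appended); the function name default_target_dir is hypothetical, and the treatment of other suffixes such as ".fasta" is an assumption that may not match mlocarna exactly.

```python
def default_target_dir(input_filename):
    """Guess the default target directory for a given input file name."""
    suffix = ".fa"
    if input_filename.endswith(suffix):
        # Replace the suffix "fa" by "out".
        return input_filename[: -len(suffix)] + ".out"
    # Otherwise append "out" (assumed behavior for all other names).
    return input_filename + ".out"


print(default_target_dir("example.fa"))   # example.out
```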

Use of constraints

Mlocarna supports structure constraints for folding and anchor constraints for alignment. Both types of constraints can be specified in extension of the standard fasta format via 'constraint lines'. Fasta-ish input with constraints looks like this



    example-w-constraints.fa
    ----------------------------------------
    >A
    GACCCUGGGAACAUUAACUACUCUCGUUGGUGAUAAGGAACA
    ..((.(....xxxxxx...................))).xxx #S
    ..........000000.......................111 #1
    ..........123456.......................123 #2
    >B
    ACGGAGGGAAAGCAAGCCUUCUGCGACA
    .(((....xxxxxx.......))).xxx #S
    ........000000...........111 #1
    ........123456...........123 #2
    ----------------------------------------

The same anchor constraints (as given by the lines tagged #1 and #2) can alternatively be specified in bed format by the entries



    example-anchors.bed
    ----------------------------------------
    A   10      16      first_box
    B   8       14      first_box
    A   39      42      ACA-box
    B   25      28      ACA-box
    ----------------------------------------

where anchor regions (boxes) have arbitrary but matching names and contig/sequence names correspond to the sequence names of the fasta(-like) input.
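
The correspondence between the constraint lines and the bed entries above can be made explicit with a short Python sketch. It extracts the runs of anchored columns from one annotation line as intervals that start at 0 and exclude the end position, which is the convention the bed entries of this example follow. The function name anchor_runs is hypothetical, and the sketch assumes that distinct boxes are separated by at least one unanchored column ('.' or '-').

```python
def anchor_runs(annotation):
    """Return (start, end) intervals of anchored columns, 0-based, end excluded."""
    runs = []
    start = None
    for pos, ch in enumerate(annotation):
        anchored = ch not in ".-"
        if anchored and start is None:
            start = pos                    # a new run begins here
        elif not anchored and start is not None:
            runs.append((start, pos))      # the current run ended before pos
            start = None
    if start is not None:
        runs.append((start, len(annotation)))
    return runs


# The '#1' lines of sequences A and B from example-w-constraints.fa
line_a = "..........000000.......................111"
line_b = "........000000...........111"
print(anchor_runs(line_a))
print(anchor_runs(line_b))
```

For sequence A this corresponds to the bed regions 10 to 16 and 39 to 42; for sequence B, to 8 to 14 and 25 to 28.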

Given, e.g.



    example-wo-anchors.fa
    ----------------------------------------
    >A
    GACCCUGGGAACAUUAACUACUCUCGUUGGUGAUAAGGAACA
    ..((.(....xxxxxx...................))).xxx #S
    >B
    ACGGAGGGAAAGCAAGCCUUCUGCGACA
    .(((....xxxxxx.......))).xxx #S
    ----------------------------------------

one calls

  mlocarna --anchor-constraints example-anchors.bed  example-wo-anchors.fa

Realignment

In realignment mode (option --realign), mlocarna is called with an input alignment in clustal format, e.g.

  mlocarna --realign example-realign.aln

This allows defining constraints as 'consensus constraints' in the input, e.g.



    example-realign.aln
    ----------------------------------------
    CLUSTAL W
    fruA               --CCUCGAGGGGAACCCGAA-------------AGGGACCCGAGAGG--
    vhuU               AGCUCACAACCGAACCCAUU-------------UGGGAGGUUGUGAGCU
    fdhA               CGCCACCCUGCGAACCCAAUAUAAAAUAAUACAAGGGAGCAG-GUGGCG
    #A1                ..*...........CCC.............................5..
    #S                 ((((((.((((...(((.................))).)))).))))))
    ----------------------------------------

Note that anchor names are arbitrary and the consensus structure is 'projected' to the single sequences. Moreover, the input alignment can be used as reference for fast limited realignment; e.g. to realign within distance 5 of the reference alignment, call

  mlocarna --realign example-realign.aln --max-diff 5 --max-diff-aln .
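
The 'projection' of a consensus structure constraint to a single sequence, as described for realignment mode, can be sketched in Python as follows. The sketch removes every base pair with at least one gapped end and then drops the gap columns. It illustrates the idea and is not mlocarna's code; the function name project_structure is hypothetical, and it assumes '-' as the only gap symbol and balanced '(' and ')' in the constraint string.

```python
def project_structure(aligned_row, consensus):
    """Project a consensus structure string onto one row of the alignment.

    Returns the ungapped sequence and its projected structure string.
    """
    if len(aligned_row) != len(consensus):
        raise ValueError("row and consensus structure differ in length")
    projected = list(consensus)
    stack = []
    for col, ch in enumerate(consensus):
        if ch == "(":
            stack.append(col)
        elif ch == ")":
            left = stack.pop()
            # Remove the base pair if at least one of its ends is a gap.
            if aligned_row[left] == "-" or aligned_row[col] == "-":
                projected[left] = projected[col] = "."
    sequence = aligned_row.replace("-", "")
    structure = "".join(s for s, base in zip(projected, aligned_row) if base != "-")
    return sequence, structure


print(project_structure("A-AGU", "((.))"))   # ('AAGU', '(..)')
```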

AUTHORS

Sebastian Will
Christina Otto (ExpaRNA-P, sparsification classes for ExpaRNA-P and SPARSE)
Milad Miladi (SPARSE)

ONLINE INFORMATION

For download and online information, see <https://github.com/s-will/LocARNA> and <http://www.bioinf.uni-freiburg.de/Software/LocARNA>.

Latest releases are available as source code on Github at <https://github.com/s-will/LocARNA/releases>.

REFERENCES

Sebastian Will, Kristin Reiche, Ivo L. Hofacker, Peter F. Stadler, and Rolf Backofen. Inferring non-coding RNA families and classes by means of genome-scale structure-based clustering. PLOS Computational Biology, 3 no. 4 pp. e65, 2007. doi:10.1371/journal.pcbi.0030065

Sebastian Will, Tejal Joshi, Ivo L. Hofacker, Peter F. Stadler, and Rolf Backofen. LocARNA-P: Accurate boundary prediction and improved detection of structural RNAs. RNA, 18 no. 5 pp. 900-914, 2012. doi:10.1261/rna.029041.111

Sebastian Will, Michael Yu, and Bonnie Berger. Structure-based Whole Genome Realignment Reveals Many Novel Non-coding RNAs. Genome Research, 23 no. 6 pp. 1018-1027, 2013. doi:10.1101/gr.137091.111