NAME

RNAforester - compare RNA secondary structures via forest alignment

SYNOPSIS

RNAforester [options]
Options are:
--help shows this help info
--version shows version information
-d calculate distance instead of similarity
-r calculate relative score
-l local similarity
-so=int local suboptimal alignments within int%
-s small-in-large similarity
-m multiple alignment mode
-mt=double clustering threshold
-mc=double clustering cutoff
-p predict structures from sequences
-pmin=num minimum basepair frequency for prediction
-pm=int basepair(bond) match score
-pd=int basepair bond indel score
-bm=int base match score
-br=int base mismatch score
-bd=int base indel score
--RIBOSUM RIBOSUM85-60 scoring matrix
-cmin=double minimum basepair frequency for consensus structure
-2d generate alignment 2D plots in postscript format
--2d_hidebasenum hide base numbers in 2D plot
--2d_basenuminterval=n show every n-th base number
--2d_grey use only grey colors in 2D plots
--2d_scale=double scale factor for the 2d plots
--score compute only scores, no alignment
--fasta generate fasta output of alignments
-f=file read input from file
--noscale suppress output of scale

DESCRIPTION

RNAforester calculates RNA secondary structure alignments, both pairwise and multiple. The comparison is based on the tree alignment model [1,2].

Model

The model for pairwise and multiple alignment differs slightly. The pairwise model is based on the following edit operations on sequence and structure:

basepair replacement/match: A basepair, INCLUDING the paired bases, is substituted by another basepair. The scoring contribution is p_m.
basepair bond deletion: A basepair bond WITHOUT the paired bases is removed. The scoring contribution is p_d.
Sequence edit operations: Base match/mismatch and base deletion give the scoring contributions b_m and b_d, respectively.

In the multiple alignment mode (-m), parameter p_m is the score for matching a basepair bond WITHOUT the paired bases. Thus, the score for a whole basepair replacement is p_m+2*b_m. For more information about multiple alignment refer to the description of parameter -m.

Input

RNAforester reads RNA secondary structures from stdin by default. It accepts sequences and structures in Fasta format, where matching brackets symbolize base pairs and unpaired bases are represented by a dot. A line containing the primary sequence can precede the RNA secondary structure(s). An example is given below:

> test accaguuacccauucgggaaccggu primary structure .((..(((...)))..((..)))). secondary structure

All characters after a "blank" are ignored and all '-' characters are removed. The program will continue to read new structures until a line consisting of the single character @ or an end of file is encountered. Input lines starting with > can contain a structure name.

Option -f=filename let RNAforester read the input from file. Results files are then written to files prefixed by filename.

Output

Alignments in ASCII format are written to stdout. Option -2d generates postscript drawings of structure alignments.

Options

-d

Calculate distance instead of similarity. In contrast to similarity, scoring contributions are minimized. The scoring parameters must not be negative and equal structures achieve a distance of zero. This parameter can not be used in conjunction with multiple alignment, where relative similarity is computed.

-r

Calculate relative score, defined by sr(a,b)=2*s(a,b)/(s(a,a)+s(b,b). Relative scores are upper bounded by 1 which is the score for equal structures.

-l

Calculate local similar structures. The term local refers to subwords of the input sequences and structures. If parameter -so is used suboptimal solutions are calculated. This does not mean suboptimal solutions of the same local structures, but different substructures which do not include each other.

-so=int

Calculates suboptimal local alignments within int% of the optimum. This option requires option -l.

-s

Calculates small-in-large similarity, i.e. the best alignment of the first structure against all substructures of the second structure is computed.

-m, -mc=double, -mt=double, -cmin=double

Multiple alignment mode. Multiple alignments of structures are calculated in a progressive fashion. First, an all-against-all comparison of structures is performed (relative scores) and afterwards structural alignments are joined along a guide tree (the guide tree is constructed dynamically). If the best score which a single structure or structure alignment can achieve by aligning to all others is below cutoff value -mc, it is not joined and put into the results list. Thus, a multiple structure alignment can produce a list of alignments. The main purpose of parameter -mc is to identify alternative and wrong structures produced by structure predictions. The default value for -mc is zero, as this separates similar from dissimilar in a similarity scoring model.

In each step in the multiple alignment calculation, the best scoring pair is joined and then the guide tree is adjusted. To speed up computation, parameter -mt defines a threshold whereas, if this is exceeded, multiple pairs are joined and then the guide tree is adjusted.

Besides sequence and structure alignment, a consensus sequence and structure is computed. The minimum pair frequency probability for a basepair in the consensus sequence is controlled by parameter -cmin.

The console output could look like (just a part):

* * **** * * **** ** * **** ** * **** * ** * **** ******** **** ** * **** ******** **** ** * **** ******** **** **************** ** * **************** ****** **************** ** **************************** **************** ** **************************** ggggcuauagcucagcugggggagcuauagcucagcugggagcgggga .((((....))))....((.(.(((((..((((........))))... ************************************************ **************** ** **************************** **************** ** ** ************************* **************** ** * *************** ******* ** * **** ******** ***** ** * **** ******** ***** ** * **** ******* *** * ** * **** * * * **** * * ****

The number of * above the primary sequence shows the frequency of the base. Each * stands for 10% frequency. Accordingly, the number of * below the secondary structure show the frequency of the occurrence of a paired or unpaired base.

The guide tree is written to a file "cluster.dot" in dot format. If a filename was specified by parameter -f the filename is "filename_cluster.dot". Refer to http://www.research.att.com/sw/tools/graphviz for more details about the dot format and tools.

-p, -pmin=double

Structures (in fact, a consensus of compatible structures) are predicted from the partition function which is calculated using the Vienna RNA library [3]. Structure lines in the input are ignored. -pmin is the minimum frequency of a basepair which must be exceeded to be considered for the prediction of structures.

-pm=int,-pd=int,-bm=int,-br=int,-bd=int

Scoring parameters. Refer to Section DESCRIPTION.

--RIBOSUM

Uses the base and basepair substitution matrix RIBOSUM85-60 matrix as proposed in [4]. Requires pairwise alignment model.

-2d

RNAforester provides different types of visualizations for pairwise and multiple alignment.

pairwise alignment Since bases paired in a structure S1 can be aligned to bases unpaired in a structure S2, the presentation of a common secondary structure leaves some choice. For an alignment of those structures, an RNA secondary structure "$S2-at-S1" is drawn that highlights the differences as deviations of S2 from S1, or vice versa, "S1-at-S2". Both are alternative visualizations of the same alignment. Bases printed in black show structure elements that occur in both structures with the same sequence. Sequence variations are displayed by using red letters. Bases or base pairs that can only be found in S1 are printed in blue, while bases that only occur in S2 are printed in green.

The drawings are written to files "x_n.ps" and "y_n.ps" where n is the number of the alignment. n enumerates the suboptimal solutions if option -so is used. The region of local similarity are highlighted in the original structures in the drawings "x_str.ps" and "y_str.ps".

multiple alignment Each cluster of the result list of a multiple alignment is visualized in two alternative drawings, written to the files "filename_cons_n.ps" and "filename_n_.ps" if option -f is used. In both plots, the consensus structure is shown. The lighter a basepair bond is drawn, the less frequent does it exist in the structures. Bases or basepair bonds that have a frequency of one hundred percent are drawn in red color. In "filename_cons_n.ps", the most frequent base at each residue is printed, with the base frequency indicated by grey-scale. In "filename_n.ps", the frequencies of the bases a,c,g,u are proportional to the radius of circles that are arranged clockwise on the corners of a square, starting at the upper left corner. Additionally, these circles are colored red, green, blue, magenta for the bases a,c,g,u, respectively. The frequency of a gap is proportional to a black circle growing at the center of the square.

Parameters --2d_hidebasenum,--2d_basenuminterval=n,--2d_grey,--2d_scale=double effect the drawings of alignments and consensus structures as implied by their names.

--score

Only the optimal score of an alignment is printed. This option is useful when RNA-forester is called by another program that only needs a similarity or distance value.

--fasta

Alignments are printed in Fasta format

REFERENCES

[1] Jiang T, Wang J T L and Zhang K, (1995) Alignment of Trees - An Alternative to Tree Edit, Theoretical Computer Science 143(1), 137-148

[2] Hoechsmann M, Toeller T, Giegerich R and Kurtz S, (2003) Local Similarity of RNA Secondary Structures, Proc. of the IEEE Bioinformatics Conference (CSB 2003), 159-168

[3] Ivo L. Hofacker, Walter Fontana, Peter F. Stadler, L. Sebastian Bonhoeffer, Manfred Tacker, and Peter Schuster, (1994) Fast Folding and Comparison of RNA Secondary Structures, Monatsh.Chem. 125: 167-188.

[4] Klein R.J. and Eddy S.R., (2003) RSEARCH: finding homologs of single structured RNA sequences, BMC Bioinformatics. 2003 Sep 22;4(1):44

VERSION

This man page documents version 1.4 of RNAforester.

AUTHORS

Matthias Hoechsmann

BUGS

I hope you wouldn't find them. Comments should be sent to mhoechsm@techfak.uni-bielefeld.de