

 
Manual Reference Pages - PRDF (1)
NAME
prdf - test a protein sequence similarity for significance
CONTENTS
Synopsis
Description
Examples
Options
Author
SYNOPSIS
prdf [-f # -g # -h -k # -O filename -s SMATRIX -w window-size]
sequence-file-1 sequence-file-2 [ktup] [#-of-shuffles]
prdf [-fghks]
 - interactive mode
DESCRIPTION
prdf evaluates the significance of a protein sequence similarity
score. It compares two sequences and calculates initial and optimized
similarity scores, then repeatedly shuffles the second sequence and
recalculates the initial and optimized scores for each shuffled copy.
Extreme value distributions are then fit to each of the three
distributions of scores. The characteristic parameters of each extreme
value distribution are then used to estimate the probability that each of
the unshuffled sequence scores would be obtained by chance in one
sequence, or in a number of sequences equal to the number of shuffles.
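The probability estimate described above can be illustrated with a minimal Python sketch. It is not the fitting code used by prdf (those routines live in rweibull.c and are not reproduced here); it assumes a Gumbel-type extreme value distribution fitted by the method of moments, and the function names and the shuffled scores are hypothetical, chosen only for illustration.

```python
import math
import statistics

EULER_GAMMA = 0.5772156649015329

def fit_gumbel(scores):
    """Method-of-moments estimate of Gumbel location (mu) and scale (beta).

    Assumption: a simple stand-in for prdf's own curve fitting.
    """
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)
    beta = sd * math.sqrt(6) / math.pi
    mu = mean - EULER_GAMMA * beta
    return mu, beta

def p_value(score, mu, beta):
    """Probability of a score >= `score` by chance in one sequence."""
    z = (score - mu) / beta
    # 1 - exp(-exp(-z)), written with expm1 to keep precision for small tails
    return -math.expm1(-math.exp(-z))

def p_in_n(score, mu, beta, n):
    """Probability of at least one score >= `score` in n sequences."""
    return 1.0 - (1.0 - p_value(score, mu, beta)) ** n

# Hypothetical scores from ten shuffles (illustrative numbers, not real output)
shuffled_scores = [31, 28, 35, 30, 33, 29, 36, 32, 27, 34]
mu, beta = fit_gumbel(shuffled_scores)
p_one = p_value(60, mu, beta)
p_all = p_in_n(60, mu, beta, len(shuffled_scores))
```

With a real run, the list of shuffled scores would hold one entry per shuffle, and the same fit would be repeated for each type of score.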
This program is derived from rdf2, which was described by Pearson and
Lipman, PNAS (1988) 85:2444-2448, and Pearson (Meth. Enz. 183:63-98).
Use of the extreme value distribution for estimating the probabilities
of similarity scores was described by Altschul and Karlin, PNAS (1990)
87:2264-2268. The 'z-values' calculated by rdf2 are not as
informative as the P-values and expectations calculated by prdf.
prdf also allows a more sophisticated shuffling method: residues can be
shuffled within a local window, so that the order of residues 1-10, 11-20,
etc., is destroyed, but a residue in the first 10 is never swapped with a
residue outside the first 10, and so on for each local window.
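The local shuffle can be sketched in a few lines of Python. This is an illustration of the idea only, not prdf's implementation; the function name window_shuffle and the example sequence are hypothetical.

```python
import random

def window_shuffle(seq, window, rng=None):
    """Shuffle residues within consecutive windows of length `window`.

    Residues 1-10, 11-20, ... are permuted among themselves, so a residue
    never leaves its own window. A final shorter window is shuffled as is.
    """
    rng = rng or random.Random()
    out = []
    for start in range(0, len(seq), window):
        block = list(seq[start:start + window])
        rng.shuffle(block)
        out.extend(block)
    return "".join(out)

# Hypothetical protein fragment, shuffled with a window of 10
seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
shuffled = window_shuffle(seq, 10, random.Random(0))
```

Because each window keeps its own residues, local composition is preserved, which a uniform shuffle of the whole sequence does not guarantee.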
EXAMPLES
(1)

prdf -w 10 musplfm.aa lcbo.aa 1 250

Compare the amino acid sequence in the file musplfm.aa with that
in lcbo.aa, then shuffle lcbo.aa 250 times using a local shuffle with
a window of 10, and calculate initial and optimized similarity scores
using ktup = 1. Report the significance of the unshuffled musplfm/lcbo
comparison scores with respect to the shuffled scores.

(2)

prdf musplfm.aa lcbo.aa 2

Compare the amino acid sequence in the file musplfm.aa with the sequence
in the file lcbo.aa using ktup = 2.

(3)

prdf


Run prdf in interactive mode. The program will prompt for
the file names of the two sequence files, the ktup, and the number
of shuffles to be used. 100 shuffles are calculated by default;
250 - 500 shuffles should provide more accurate probability
estimates.
OPTIONS
prdf can be directed to change the scoring matrix, gap penalties, and
shuffle parameters by entering options on the command line (preceded
by a '-'). All of the options should precede the file names, ktup, and
number of shuffles.

-f #

Penalty for the first residue in a gap (-12 by default).

-g #

Penalty for additional residues in a gap (-2 by default).

-h

Do not display histogram of similarity scores.

-k #

(GAPCUT) Sets the threshold for joining the initial regions for calculating the
initn score.

-Q -q

"quiet" - do not prompt for filename.

-O filename

Send a copy of the results to "filename".

-s str

(SMATRIX) The filename of an alternative scoring matrix file. For protein
sequences, BLOSUM50 is used by default; PAM250 can be used with the
command line option -s 250 (or with -s pam250.mat).


SEE ALSO
fasta(1), lfasta(1), prss(1), protcodes(5)
AUTHOR
Bill Pearson
wrp@virginia.EDU
The curve fitting routines in rweibull.c were provided by Phil Green,
Washington U., St. Louis.