NAME

Games::Dissociate - a Dissociated Press algorithm and filter

SYNOPSIS

    use Games::Dissociate;
    ...
    $brilliant_prose = dissociate($normal_prose);

    perl -MGames::Dissociate -e dissociate_filter meno.txt

ABSTRACT

This module provides the function "dissociate", which implements a Dissociated Press algorithm, well known to Emacs users as "meta-x dissociate". The algorithm here is by no means a straight port of Emacs's "dissociate.el", but is instead merely inspired by it.

(I actually intended to make it a straight port, but couldn't manage it -- the code in "dissociate.el" is totally uncommented, and is especially obscure Lisp.)

This module also provides a procedure "dissociate_filter", for use in the one-liner context:

  perl -MGames::Dissociate -e 'dissociate_filter(2)' \
    < thesis.txt  > snip.txt

  perl -MGames::Dissociate -e 'dissociate_filter(-2)' \
    < thesis.txt  > snip.txt

or in a script consisting of

  #!/usr/local/bin/perl
  use Games::Dissociate;  dissociate_filter;

Sample Dissociation

I got this text from feeding the UNIX man page for "regexp" (in plaintext) to "dissociate" with a $group_size parameter of 3:

nd of then the full list of the more branch is zero or "*", "." (matching thand regexp(n) right initional argumented by a pieces of the left to match that (ab|a) general other worDS match to the first, followed by "?". It matcheS In of the next start was been could exp. The characters in expreSSIons belowed in the full matching the in starticular EXpression in "[0-9]" include a list of sequence of the are may before the regexp even therwise. REgexp(n) Tcl regular expression to regexp(n) regexp(n) right. Input string), "\",

About Dissociated Press algorithms

"Dissociated Press" algorithms produce text with token-patterns (patterns of words, or patterns of characters) similar to those found in an input text.

This may be implemented in terms of Markov chains (basically, statistical modeling of frequency of token-groups), altho both this module and Emacs's "dissociate.el" take shortcuts to avoid having to construct and manipulate a real statistical model of the input text.

Basically, the way Dissociated Press algorithms (at least mine -- I can't speak for the exact details of all others) work is:

1. Start at a random point in the text, and read a group of tokens (characters or words -- where group size is a parameter you choose) from there. Call this the last-matched group.

2. Output the last-matched group.

3. Look for the other times the last-matched group occurs in the text, and randomly select one of them. (Or: select the next time that group occurs -- a shortcut I've made in the code, which seems to still produce random-looking results). Look at the group of tokens that occurs right after that. Make that the last-matched group. Loop back to Step 2 until we think we've outputted enough.

4. But if the last-matched group from step 2 occurred just that once in the text, go back to step 1.
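
As a rough illustration of steps 1 to 4, here is a minimal by-character sketch. It is not the code in Games::Dissociate: the name "sketch_dissociate", the wrap-around search, and the rule of restarting when too little text follows a match are all assumptions made for this example.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Games::Dissociate).
# $text: input string; $n: group size in characters;
# $max: number of groups to output.
sub sketch_dissociate {
    my ($text, $n, $max) = @_;
    return '' if $n < 1 || length($text) < 2 * $n;

    my $out = '';
    my ($pos, $group);   # offset and content of the last-matched group

    for (1 .. $max) {
        if (!defined $pos) {
            # Step 1: start at a random point and read one group.
            $pos   = int(rand(length($text) - $n + 1));
            $group = substr($text, $pos, $n);
        }

        # Step 2: output the last-matched group.
        $out .= $group;

        # Step 3 (shortcut): find the next occurrence of the group,
        # wrapping around to the start of the text if needed.
        my $next = index($text, $group, $pos + 1);
        $next = index($text, $group) if $next < 0;

        # Step 4: the group occurs only once (or nothing usable
        # follows the match), so restart from a random point.
        if ($next == $pos || $next + 2 * $n > length $text) {
            undef $pos;
            next;
        }

        # The group right after the match becomes the last-matched group.
        $pos   = $next + $n;
        $group = substr($text, $pos, $n);
    }
    return $out;
}

print sketch_dissociate("the cat sat on the mat and the cat ate the rat", 3, 20), "\n";
```

Every group this sketch outputs is copied from the input, which is the property the next paragraph relies on.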

Since the groups of characters or words (at least, when you look at them as bits of text only group-size tokens long) are all taken from the input text, you get somewhat natural-looking text -- as opposed to what you'd get if you just randomly outputted single characters or single words from the input text.

The process of applying a Dissociated Press algorithm to a bit of text is called "dissociation".

PARAMETERS AND USAGE

To use this module after you've installed it, say "use Games::Dissociate". This imports the function "dissociate" and the procedure "dissociate_filter".

dissociate($input, $group_size, $max)

The function "dissociate" takes three parameters:

  $output = dissociate($input, $group_size, $max);

$input is the input string, hopefully containing a stretch of (plaintext) text in a human language, encoded either in just plain US-ASCII, or in a character-encoding your locale settings know about. $output will be "dissociated text" (charmingly generated gibberish) based on that input text. (Note that output will contain no line-breaks or tabs. You may wish, as "dissociate_filter" does, to pass the output thru Text::Wrap's "wrap".)
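
A minimal sketch of wrapping such single-line output with Text::Wrap follows. The helper name "wrap_output" and the column width are assumptions for illustration, and a stand-in string is used where the return value of "dissociate" would normally go.

```perl
use strict;
use warnings;
use Text::Wrap qw(wrap);

# Hypothetical helper: wrap one long line of text to $columns.
# Text::Wrap keeps each line shorter than $Text::Wrap::columns.
sub wrap_output {
    my ($text, $columns) = @_;
    local $Text::Wrap::columns = $columns;
    return wrap('', '', $text);   # no indent on first or later lines
}

# Stand-in for text returned by dissociate($input, ...).
my $output = join ' ', ('gibberish') x 30;
print wrap_output($output, 60), "\n";
```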

You'll get strange output if $input contains markup (HTML, LaTeX, etc.), or is very short, or is not in a human language.

$group_size is the number of tokens (words or characters) that must be in common between bits of text the dissociation algorithm skips between. A positive value means you want to dissociate by character, with a group-size of that many characters (4 = 4 characters); a negative value means you want to dissociate by word, with a group size of that many words (-2 = 2 words). I suggest values between -3 and 5; I'm a fan of -2. A $group_size value of 0 or 1 is invalid, and currently causes "dissociate" to use the default value of 2 (2 characters) instead. A value of -1 is invalid, and currently causes "dissociate" to use the value of -2 (2 words) instead. The behavior/validity of $group_size values of 0, 1, or -1 may change in future versions.
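
The sign convention and the current fallbacks described above can be summarized in a few lines. This is only a restatement of the documented rules, not the module's own code; "describe_group_size" is a hypothetical name, and the handling of 0, 1, and -1 may change as noted.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (mode, size) for a $group_size value,
# following the rules documented above.
sub describe_group_size {
    my ($g) = @_;
    $g =  2 unless defined $g;        # default: 2 characters
    $g =  2 if $g == 0 || $g == 1;    # invalid, currently treated as 2
    $g = -2 if $g == -1;              # invalid, currently treated as -2
    return $g > 0 ? ('character', $g) : ('word', -$g);
}

for my $g (4, -2, 0, -1) {
    my ($mode, $size) = describe_group_size($g);
    print "$g: by $mode, group size $size\n";
}
```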

$max is a parameter used to control the maximum number of iterations of "dissociate"'s central loop -- it corresponds roughly to the number of "chunks" of text you get back, where a chunk is N * -$group_size words for negative values of $group_size, and N * $group_size characters for positive values of $group_size. $max must be greater than 1.

If you need (!) more precise control over the size of the output text, try setting $max high and trimming the output to size, and/or try calling "dissociate" multiple times until you get the amount of output you want. (But be sure to give up if "dissociate" keeps returning the null string, as it will in some strange cases.)
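
One way to combine both suggestions is sketched below. The helper "collect_text" and its parameters are hypothetical; it takes any code ref that returns a chunk of text, so in real use you might pass something like sub { dissociate($input, -2, 50) }.

```perl
use strict;
use warnings;

# Hypothetical helper: call $generate until $want characters have
# been collected, giving up after $max_tries calls, then trim.
sub collect_text {
    my ($generate, $want, $max_tries) = @_;
    my $out = '';
    for (1 .. $max_tries) {
        my $chunk = $generate->();
        next unless defined $chunk && length $chunk;   # skip empty results
        $out .= ' ' if length $out;
        $out .= $chunk;
        last if length($out) >= $want;
    }
    return substr($out, 0, $want);   # trim to the requested size
}

# Stand-in generator; replace with a call to dissociate().
print collect_text(sub { 'abc' }, 5, 10), "\n";   # prints "abc a"
```

Because the number of tries is bounded, a generator that keeps returning the null string simply yields an empty result instead of looping forever.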

"dissociate" can also be called with the following syntaxes:

  dissociate($input, $group_size);
   # acts like max of 100

  dissociate($input);
   # acts like group size of 2 (characters) and max of 100

dissociate_filter()

dissociate_filter($group_size)

dissociate_filter($group_size, $max)

This library also provides the procedure "dissociate_filter", which pulls input from "<>" (files specified on the command line, or STDIN), and sends dissociated output to STDOUT. It can be called with these syntaxes:

  dissociate_filter($group_size, $max);

  dissociate_filter($group_size);
   # uses a default value for $max

  dissociate_filter();
   # uses a default value for $group_size and $max

These above-mentioned default values can come from command line switches, if you make a script consisting of:

  #!/usr/local/bin/perl
  use Games::Dissociate;
  dissociate_filter;

and call that script, say, "dissociate", and call it as:

  dissociate -c5 -m200 < foo.txt

  dissociate -w2 -m70 foo.txt bar.txt | less

and so on.

To explain the switches:

"-w[number]" specifies a by-word dissociation with that number of words as the group size, "-c[number]" specifies a by-character dissociation with that number of characters as the group size, "-m[number]" specifies a default for $max.

If you don't specify a default for $group_size or $max, $group_size defaults to 2 (characters) and $max defaults to 100.

Efficiency Notes

This module has to search the input string by performing regexp searches on it. In the current version of this module, control over compilation of regular expressions may not be optimally efficient. Perl 5.005 provides options to better control regexp compilation; once Perl 5.005 is in wider use, I may come out with a new version of Games::Dissociate requiring Perl 5.005 or later, using these new regexp compilation control features.

If you feed this module a lot of text (over 50K, say), it will indeed get very slow (notably with by-word dissociation), since that whole chunk of text has to be searched over and over and over.

If you have an idea for making this module more efficient, feel free to email it to me.

Internationalization Notes

When dealing with text in heavily inflected languages (like Finnish -- lots of unique word endings, frequently used), this module will require longer input text to produce interesting results for by-word dissociation, compared to relatively inflection-poor languages like English.

For text written with no inter-word spacing (often the case with Thai, for example), there's no way for this module to tell where the word breaks are -- in such cases, use only the by-character mode.

The current version of this library assumes "/./" matches a single character, for by-character dissociation; and, for by-word dissociation, that "/\w+/" matches whole words and "/\W+/" matches non-word strings. These are locale-dependent functions, and Games::Dissociate has a "use locale" in it, hopefully triggering correct behavior for your favorite locale, language, and character-encoding. Consult perllocale and locale for more information on locales.
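
To see what those two patterns do to a piece of text, here is a small sketch that splits a string into alternating word and non-word tokens. The helper "split_tokens" is hypothetical and is not how the module itself is written; it omits "use locale", so "\w" here follows Perl's default rules rather than your locale's.

```perl
use strict;
use warnings;

# Hypothetical helper: split text into runs matching \w+ or \W+.
sub split_tokens {
    my ($text) = @_;
    my @tokens = $text =~ /(\w+|\W+)/g;
    return @tokens;
}

my @tokens = split_tokens("Hello, world!");
print join('|', @tokens), "\n";   # prints "Hello|, |world|!"
```

For text with no inter-word spacing, "\w+" would match long unbroken runs, which is why by-character mode is the better choice there.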

I have found "use locale" to do unwelcome things (like unceremoniously dumping core) on a few very strange, very old (and otherwise barely-working) machines. If this is a problem for you, or if you don't plan to use locales, comment out the "use locale" in the Games::Dissociate source code.

The treatment of locales and support for them may change in future versions of this module, depending on how future Perl versions shape up, particularly in their support of Unicode.

Randomness Notes

This library uses "rand" extensively, but never calls "srand". If you're getting the same dissociated output all the time, then you're using an old (pre-5.004) version of Perl that doesn't do implicit randomness seeding -- just call "srand();", maybe right after you say "use Games::Dissociate";
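
The effect of seeding can be seen without the module at all. In this sketch (the helper "rolls" is made up for the example), reusing a fixed seed reproduces the same sequence, which is what happens on a Perl that never seeds; a bare "srand();" picks a fresh seed instead.

```perl
use strict;
use warnings;

# Hypothetical helper: seed the generator, then draw five numbers.
sub rolls {
    my ($seed) = @_;
    srand($seed);
    return join ',', map { int(rand(100)) } 1 .. 5;
}

# Same seed, same "random" sequence.
print rolls(42) eq rolls(42) ? "same\n" : "different\n";   # prints "same"

# Reseed from a system-chosen value, as suggested above.
srand();
```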

COPYRIGHT

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

This program is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose.

REMINDER

It's just a toy.

AUTHOR

Current maintainer Avi Finkel <avi@finkel.org>; original author Sean M. Burke <sburke@cpan.org>