Math::String::Charset - A simple charset for Math::String objects.
use Math::String::Charset;
$a = new Math::String::Charset; # default a-z
$b = new Math::String::Charset ['a'..'z']; # same
$c = new Math::String::Charset
{ start => ['a'..'z'], sep => ' ' }; # with ' ' between chars
print $b->length(); # a-z => 26
# construct a charset from a bigram table, and an initial set (containing
# valid start characters)
# Note: After an 'a', either a 'b', 'c' or 'a' can follow, in this order
# After a 'd' only an 'a' can follow
$bi = new Math::String::Charset ( {
start => ['a'..'d'],
bi => {
'a' => [ 'b', 'c', 'a' ],
'b' => [ 'c', 'b' ],
'c' => [ 'a', 'c' ],
'd' => [ 'a', ],
'q' => [ ], # 'q' will be automatically in end
},
end => [ 'a', 'b', ],
} );
print $bi->length(); # 'a','b' => 2 (cross of end and start)
print scalar $bi->class(2); # count of combinations with 2 letters
# will be 3+2+2+1 => 8
$d = new Math::String::Charset ( { start => ['a'..'z'],
minlen => 2, maxlen => 4, } );
print $d->first(0),"\n"; # undef, too short
print $d->first(1),"\n"; # undef, too short
print $d->first(2),"\n"; # 'aa'
$d = new Math::String::Charset ( { start => ['a'..'z'] } );
print $d->first(0),"\n"; # ''
print $d->first(1),"\n"; # 'a'
print $d->last(1),"\n"; # 'z'
print $d->first(2),"\n"; # 'aa'
perl5.005, Exporter, Math::BigInt
Exports nothing by default, but can export "analyze".
This module lets you create a charset object, which is used to construct
Math::String objects. This object knows how to handle simple charsets as well
as complex ones consisting of bigrams (later trigrams and more).
In case of more complex charsets, a reference to a Math::String::Charset::Nested
or Math::String::Charset::Grouped object will be returned.
 Default charset
 The default charset is the set containing
"abcdefghijklmnopqrstuvwxyz" (thus always producing lower-case
output).
Upon error, the field "_error" stores the error message, then
die() is called with this message. If you do not want the program to
die (for instance, to catch the errors yourself), then use the following:
use Math::String::Charset;
$Math::String::Charset::die_on_error = 0;
$a = new Math::String::Charset (); # error, empty set!
print $a->error(),"\n";
This object caches certain calculation results (for instance, the number of
possible combinations for a certain string length), thus greatly speeding up
sequential Math::String conversions from string to number, and vice versa.
All characters used to construct the charset must have the same length, but need
not necessarily be one byte/char long.
The complexity for converting from number to string, and vice versa, is O(N),
with N being the number of characters in the string.
Actually, it is a bit higher, since the underlying Math::BigInt needs more time
for longer numbers than for short ones. But usually the practical string length
limit is reached before this effect shows up.
See BENCHMARKS in Math::String for runtime details.
With a simple charset, converting between the number and string is relatively
simple and straightforward, albeit slow.
With bigrams, this becomes even more complex. But since all the information on
how to convert between number and string is inside the charset definition,
Math::String::Charset will produce (and sometimes cache) this information.
Thus Math::String is simply a hull around Math::String::Charset and
Math::BigInt.
Depending on the charset, the order in which Math::String 'sees' the strings is
different. Example with charset 'A'..'D':
A 1
B 2
C 3
D 4
AA 5
AB 6
AC 7
AD 8
BA 9
BB 10
BC 11
..
AAA 21
AAB 22 etc
The order of the characters in the charset is arbitrary: 'B','D','C','A' will
produce similar results, though in a different order inside Math::String:
B 1
D 2
C 3
A 4
BB 5
BD 6
BC 7
..
BBB 21
BBD 22 etc
Here is an example with characters of length 3:
foo 1
bar 2
baz 3
foofoo 4
foobar 5
foobaz 6
barfoo 7
barbar 8
barbaz 9
bazfoo 10
bazbar 11
bazbaz 12
foofoofoo 13 etc
All charset items must have the same length, unless you use a separator string:
use Math::String;
$a = Math::String->new('',
{ start => [ qw/ the green car a/ ], sep => ' ' } );
while ("$a" ne 'the green car')
{
$a ++;
print "$a\t"; # print "a green car" etc
}
The separator is a string, not a regexp and it must not be present in any of the
characters of the charset.
The old way was using a fill character, which is more complicated:
use Math::String;
$a = Math::String->new('', [ qw/ the::: green: car::: a:::::/ ]);
while ($b ne 'the green car')
{
$a ++;
print "$a\t"; # print "a:::::green:car:::" etc
$b = "$a"; $b =~ s/:+/ /g; $b =~ s/\s+$//;
print "$b\n"; # print "a green car" etc
}
This produces:
the::: the
green: green
car::: car
a::::: a
the:::the::: the the
the:::green: the green
the:::car::: the car
the:::a::::: the a
green:the::: green the
green:green: green green
green:car::: green car
green:a::::: green a
car:::the::: car the
car:::green: car green
car:::car::: car car
car:::a::::: car a
a:::::the::: a the
a:::::green: a green
a:::::car::: a car
a:::::a::::: a a
the:::the:::the::: the the the
the:::the:::green: the the green
the:::the:::car::: the the car
the:::the:::a::::: the the a
the:::green:the::: the green the
the:::green:green: the green green
the:::green:car::: the green car
Now imagine a charset that is defined as follows:
Starting characters for each string can be 'a','c','b' and 'd' (in that order).
Each 'a' can be followed by either 'b', 'c' or 'a' (again in that order), each
'c' can be followed by either 'c' or 'd' (again in that order), and each 'b' or
'd' can be followed by an 'a' (and nothing else).
The definition is thus:
use Math::String::Charset;
$cs = Math::String::Charset->new( {
start => [ 'a', 'c', 'b', 'd' ],
bi => {
'a' => [ 'b','c','a' ],
'b' => [ 'a', ],
'd' => [ 'a', ],
'c' => [ 'c','d' ],
}
} );
This means that each character in a string depends on the previous character.
Please note that the probabilities of how often one character follows another
do not concern us here. We simply enumerate them all. Or put
differently: each probability is 1.
With the charset above, the string sequence runs as follows:
string number count of strings
with length
a 1
c 2
b 3
d 4 1=4
ab 5
ac 6
aa 7
cc 8
cd 9
ba 10
da 11 2=7
aba 12
acc 13
acd 14
aab 15
aac 16
aaa 17
ccc 18
ccd 19
cda 20
bab 21
bac 22
baa 23
dab 24
dac 25
daa 26 3=15
abab 27
abac 28
abaa 29
accc 30
accd 31
acda 32
aaba 33
aacc 34
aacd 35 etc
There are 4 strings with length 1, 7 with length 2, 15 with length 3 etc. Here
is an example for first() and last():
$charset->first(3); # gives aba
$charset->last(3); # gives daa
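The sequence above can be reproduced with a short, self-contained sketch. It
does not use Math::String::Charset itself: the helper name
strings_of_length() is made up for this illustration, and it assumes
characters of length 1 and no "end" restriction.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Math::String::Charset: list all strings
# of a given length, in sequence order, for a bigram charset.
sub strings_of_length {
    my ($len, $start, $bi) = @_;
    my @strings = @$start;                 # class 1: the start characters
    for (2 .. $len) {
        my @next;
        for my $str (@strings) {
            my $last = substr($str, -1);   # the last character decides the followers
            push @next, map { $str . $_ } @{ $bi->{$last} || [] };
        }
        @strings = @next;                  # extending in order keeps the order
    }
    return @strings;
}

my @start = ('a', 'c', 'b', 'd');
my %bi = (
    a => [ 'b', 'c', 'a' ],
    b => [ 'a' ],
    d => [ 'a' ],
    c => [ 'c', 'd' ],
);

for my $len (1 .. 3) {
    my @class = strings_of_length($len, \@start, \%bi);
    printf "length %d: %d strings, first %s, last %s\n",
        $len, scalar @class, $class[0], $class[-1];
}
# length 1: 4 strings, first a, last d
# length 2: 7 strings, first ab, last da
# length 3: 15 strings, first aba, last daa
```

Enumerating every string is only practical for short lengths; it is meant as a
cross-check of the tables, not as a replacement for the counting methods
described later.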
Sometimes, you want to specify that a string can end only in certain characters.
There are two ways:
use Math::String::Charset;
$cs = Math::String::Charset->new( {
start => [ 'a', 'c', 'b', 'd' ],
bi => {
'a' => [ 'b','c','a' ],
'b' => [ 'a', ],
'd' => [ 'a', ],
'c' => [ 'c','d' ],
},
end => [ 'a','b' ],
} );
This defines any string ending not in 'a' or 'b' as invalid. The sequence runs
thus:
string number count of strings
with length
a 1
b 2 2
ab 3
aa 4
ba 5
da 6 4
aba 7
aab 8
aaa 9
cda 10
bab 11
baa 12
dab 13
daa 14 8
abab 15
abaa 16 etc
There are now only 2 strings with length 1, 4 with length 2, 8 with length 3
etc.
The other way is to specify the (additional) ending restrictions implicitly by
using characters that are not followed by other characters:
use Math::String::Charset;
$cs = Math::String::Charset->new( {
start => [ 'a', 'c', 'b', 'd' ],
bi => {
'a' => [ 'b','c','a' ],
'b' => [ 'a', ],
'd' => [ 'a', ],
'c' => [ ],
}
} );
Since 'c' is not followed by any characters, there are no strings with a 'c' in
the middle (which means 'c' can only appear at the end of a string):
string number count of strings
with length
a 1
c 2
b 3
d 4 4
ab 5
ac 6
aa 7
ba 8
da 9 5
aba 10
aab 11
aac 12
aaa 13
bab 14
bac 15
baa 16
dab 17
dac 18
daa 19 10
abab 20
abac 21 etc
There are now 4 strings with length 1, 5 with length 2, 10 with length 3 etc.
Any character that is not followed by another character is automatically added
to "end". This is because otherwise you would have created a
redundant character which could never appear in any string:
Let's assume 'q' is not in the "end" set, and not followed by any
other character:
 1.
 There can be no string "q", since strings of length 1 start
and end with their only character. Since 'q' is not in
"end", the string "q" is invalid (no matter whether 'q'
appears in "start" or not).
 2.
 No string longer than 1 could start with 'q' or have a 'q' in the middle,
since 'q' is not followed by anything. This leaves only strings with
length 1 and these are invalid according to rule 1.
From now on, a 'class' refers to all strings with the same length. The order or
length of a class is the length of all strings in it.
With a simple charset, each class has exactly M times more strings than the
previous class (i.e. the class with length - 1). M is in this case the
length of the charset.
To convert between string and number, we must simply know which string has which
number and which number is which string. Although this sounds very difficult,
it is not. With 'simple' charsets, it only involves a bit of math.
First we need to know how many strings are in each class. From this information
we can determine the length of a string given its number, and get the range
inside which the number of a string lies:
Let's stick to the example with 4 characters above, 'A'..'D':
String length   Strings with that length   First number in range
1               4                          1
2               16 (4*4)                   5
3               64 (4*4*4)                 21
4               4**4                       85
5               4**5 etc                   341
You see that this is easy to calculate. Now, given the number 66, we can
determine how long the string must be:
66 is greater than 21, but lower than 85, so the string must be 3 characters
long. This information is determined in O(N) steps, where N is the length of
the string, by successively comparing the number to the first number of each
class.
If we then subtract 21 from 66, we get 45 and thus know it must be the string
at offset 45 (counting from 0) among the 3 character long ones.
The math involved in determining which 3-character string it actually is equals
that of converting between decimal and hexadecimal numbers. Please see the
source for the gory, but boring details.
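The two steps (find the class, then convert the offset like a base-M number)
can be sketched in a few lines. This is an illustration only, not the module's
implementation: num_to_str() is a hypothetical helper, and it uses plain Perl
numbers where the module uses Math::BigInt.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Math::String::Charset: convert a number
# to its string for a simple charset (any character may follow any other).
sub num_to_str {
    my ($num, @chars) = @_;
    return '' if $num == 0;              # 0 maps to the empty string
    my $m = scalar @chars;               # M, the length of the charset
    my ($len, $first, $size) = (1, 1, $m);
    # Step 1: find the class by skipping whole classes.
    while ($num >= $first + $size) {
        $first += $size;                 # first number of the next class
        $size  *= $m;                    # each class is M times larger
        $len++;
    }
    # Step 2: convert the 0-based offset into $len base-M "digits".
    my $offset = $num - $first;
    my @out;
    for (1 .. $len) {
        unshift @out, $chars[$offset % $m];   # last character first
        $offset = int($offset / $m);
    }
    return join '', @out;
}

print num_to_str(66, 'A' .. 'D'), "\n";        # CDB
print num_to_str(13, qw/foo bar baz/), "\n";   # foofoofoo
```

For 66 the loop stops at class 3 (first number 21), the offset is 45, and 45
written in base 4 is 2-3-1, which maps to 'C', 'D', 'B'.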
For charsets of higher order, even determining the number of all strings in a
class becomes more difficult. Fortunately, there is a way to do it in N steps
just like with a simple charset.
The first way is based on the observation that the number of strings in class
n+1 only depends on the number of ending chars in class n, and nothing else.
This is, however, not used in the current implementation, since there is a
slightly faster/simpler way based on the count of strings that start with a
given character in class n, n-1, n-2 etc. See below for a description.
Here is, for reference, the example with ending char counts:
use Math::String::Charset;
$cs = Math::String::Charset->new( {
start => [ 'a', 'c', 'b', 'd' ],
bi => {
'a' => [ 'b','c','a' ],
'c' => [ 'c','d' ],
'b' => [ 'a', ],
'd' => [ 'a', ],
}
} );
Class 1:
a 1
c 2
b 3
d 4 4
As you can see, there is one 'a', one 'c', one 'b' and one 'd'. To determine how
many strings are in class 2, we must multiply the occurrences of each character
by the number of characters that can follow it:
a * 3 + c * 2 + d * 1 + b * 1
which equals
1 * 3 + 1 * 2 + 1 * 1 + 1 * 1
If we sum this all up, we get 3+2+1+1 = 7, which is exactly the number of
strings in class 2. But to determine the number of strings in class 3, we
must know how many strings in class 2 end in 'a', how many in 'b' etc.
We can do this in the same loop, by not only keeping a sum, but by counting all
the different endings. For instance, exactly one string ended in 'a' in class 1.
Since 'a' can be followed by 3 characters, we know that each of these characters
will end at least 1 string in the next class. So we add the 1 to the count of
the character in question:
$new_count->{'b'} += $count->{'a'};
This yields the number of strings that end in 'b' in the next class.
We have to do this for every different character, and for each of the
characters that can follow it. In the worst case this means
M*M steps, where M is the length of the charset. We must repeat this for each
of the classes, so that the complexity becomes O(N*M*M) in the worst case. For
strings of higher order this gets worse, adding a *M for each higher order.
For our example, after processing 'a', we will have the following counts for
ending chars in class 2:
b => 1
c => 1
a => 1
After processing 'c', it is:
b => 1
c => 2 (+1)
a => 1
d => 1 (+1)
because 'c' is followed by 'd' or 'c'. When we are done with all characters, the
following counts are in our $new_count hash:
b => 1
c => 2
a => 3
d => 1
When we sum them up, we get the count of strings in class 2. For class 3, we
start with an empty count hash again, and then again for each character
process the ones that follow it. Example for a:
b => 0
c => 0
a => 0
d => 0
3 times ending in 'a' followed by 'b','c' or 'a':
b => 3 (+3)
c => 3 (+3)
a => 3 (+3)
d => 0
2 times ending in 'c' followed by 'c' or 'd':
b => 3
c => 5 (+2)
a => 3
d => 2 (+2)
After processing 'b' and 'd' in a similar manner we get:
b => 3
c => 5
a => 5
d => 2
The sum is 15, and we know now that we have 15 different strings in class 3. The
process for higher classes is the same again, reusing the counts from the
lower class.
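The counting of ending characters described above can be sketched as follows.
As the text notes, this is not what the module implements; the helper name
class_sizes_by_ending() is hypothetical and the sketch ignores "end"
restrictions.

```perl
use strict;
use warnings;

# Hypothetical helper: size of each class up to $max_len, computed from the
# counts of ending characters of the previous class.
sub class_sizes_by_ending {
    my ($max_len, $start, $bi) = @_;
    my %count = map { ($_ => 1) } @$start;   # class 1: one string per character
    my @sizes;
    for my $len (1 .. $max_len) {
        my $sum = 0;
        $sum += $_ for values %count;        # size of the current class
        push @sizes, $sum;
        my %new_count;                       # ending counts of the next class
        for my $char (keys %count) {
            # every string ending in $char is extended by each follower
            $new_count{$_} += $count{$char} for @{ $bi->{$char} || [] };
        }
        %count = %new_count;
    }
    return @sizes;
}

my @start = ('a', 'c', 'b', 'd');
my %bi = (
    a => [ 'b', 'c', 'a' ],
    c => [ 'c', 'd' ],
    b => [ 'a' ],
    d => [ 'a' ],
);

print join(', ', class_sizes_by_ending(4, \@start, \%bi)), "\n";   # 4, 7, 15, 30
```

Each pass over the classes costs at most M*M additions, which matches the
O(N*M*M) estimate given earlier.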
The second, and implemented, method counts for each class how many strings
start with a given character. This gives us two pieces of information at once:
 •
 For a string of length N with starting character X, the minimum number it
must have (by summing up the counts of all strings that come before X),
and how many strings start with X (although this is not used
for X itself, but only for all strings that come after X).
 •
 How many strings there are with a given length, by summing up all the
counts for the different starting chars.
This method also has the advantage that it doesn't need to recalculate the
count for each level. If we have cached the information for class 7, we can
calculate class 8 right away. The old method would either need to start at
class 1, working up to 8 again, or cache additional information of the order N
(where N is the number of different characters in the charset).
Here is how the second method works, based on the example above:
start => [ 'a', 'c', 'b', 'd' ],
bi => {
'a' => [ 'b','c','a' ],
'c' => [ 'c','d' ],
'b' => [ 'a', ],
'd' => [ 'a', ],
}
The sequence runs as follows:
String Strings starting with
this character in this level
a 1
c 1
b 1
d 1
ab
ac
aa 3 (1+1+1)
cc
cd 2 (1+1)
ba 1
da 1
aba
acc
acd
aab
aac
aaa 6 1 (b) + 2 (c) + 3 (a)
ccc
ccd
cda 3 2 (c) + 1 (d)
bab
bac
baa 3
dab
dac
daa 3
abab
abac
abaa
accc etc
As you can see, for length one, there is exactly one string for each starting
character.
For the next class, we can find out how many strings start with a given char by
adding together the counts of the strings in the previous class that start with
one of its followers.
For instance, in class 3, there are 6 strings starting with 'a'. We find this
out by adding together 1 (there is 1 string starting with 'b' in class 2), 2
(there are two strings starting with 'c' in class 2) and 3 (three strings
starting with 'a' in class 2).
As a special case we must throw away all strings in class 2 that have invalid
ending characters. By doing this, we automatically have restricted all
strings to only valid ending characters. Therefore, class 1 and 2 are set up
upon creating the charset object, the others are calculated on demand and then
cached.
Since we are calculating the strings in the order of the starting characters, we
can sum up all strings up to this character.
String First string in that class
a 0
c 1
b 2
d 3
ab 0
ac
aa
cc 3
cd
ba 5
da 6
aba 0
acc
acd
aab
aac
aaa
ccc 6
ccd
cda
bab 9
bac
baa
dab 12
dac
daa
abab 0
abac
abaa
accc etc
When we add to the number of the last character (for instance, 12 in case of
'd' in class 3) the amount of strings with that character (here 3), we end up
with the number of all strings in that class.
Thus in the same loop we calculate:
 how many strings start with a given character in this class
 what is the first number of a string starting with 'x' in that class
 how many strings are in this class at all
That should be all we need to know to convert a string to its number.
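A minimal sketch of this bookkeeping is shown below. The helper names
start_counts() and class_offsets() are invented for the illustration, and the
sketch assumes that every character of the charset is also a start character
and that there are no "end" restrictions, as in the running example.

```perl
use strict;
use warnings;

# Hypothetical helper: $count->[$n]{$x} is the number of strings of length
# $n that start with character $x.
sub start_counts {
    my ($max_len, $start, $bi) = @_;
    my %first = map { ($_ => 1) } @$start;   # class 1: one string each
    my @count = (undef, \%first);
    for my $len (2 .. $max_len) {
        for my $char (@$start) {
            my $n = 0;
            # sum the counts of all followers in the previous class
            $n += ($count[$len - 1]{$_} || 0) for @{ $bi->{$char} || [] };
            $count[$len]{$char} = $n;
        }
    }
    return \@count;
}

# Hypothetical helper: first offset per starting character, and class size.
sub class_offsets {
    my ($counts, $start) = @_;
    my %offset;
    my $sum = 0;
    for my $char (@$start) {
        $offset{$char} = $sum;               # strings sorted before $char
        $sum += $counts->{$char};
    }
    return (\%offset, $sum);
}

my @start = ('a', 'c', 'b', 'd');
my %bi = (
    a => [ 'b', 'c', 'a' ],
    c => [ 'c', 'd' ],
    b => [ 'a' ],
    d => [ 'a' ],
);

my $count = start_counts(3, \@start, \%bi);
my ($offset, $total) = class_offsets($count->[3], \@start);
printf "%s: count %d, offset %d\n", $_, $count->[3]{$_}, $offset->{$_}
    for @start;
print "total: $total\n";
# a: count 6, offset 0
# c: count 3, offset 6
# b: count 3, offset 9
# d: count 3, offset 12
# total: 15
```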
From the section above we know that we can find out which number a string of a
certain class has at minimum and at maximum. But which number does the string
actually have in that range?
Well, given the information it is easy. First, find out which minimum number a
string has with the given starting character in the class. Add this to its
base number. Then reduce the class by one, look at the next character and
repeat this. In pseudo code:
$class = length ($string); $base = $base_number->[$class];
foreach ($character)
{
$base += $sum->[$class]->{$character};
$class--;
}
So, after N simple steps (where N is the number of characters in the string), we
have found the number of the string.
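The pseudo code leaves open which strings count as "coming before" a character
at positions after the first. One possible reading, used in the sketch below,
is that at each position only the followers of the previous character are
candidates. The helpers count_for() and str_to_num() are hypothetical, the
sketch assumes valid input strings and no "end" restrictions, and it is not
taken from the module's source.

```perl
use strict;
use warnings;

my @start = ('a', 'c', 'b', 'd');
my %bi = (
    a => [ 'b', 'c', 'a' ],
    c => [ 'c', 'd' ],
    b => [ 'a' ],
    d => [ 'a' ],
);

my @count;   # cache: $count[$n]{$x} = strings of length $n starting with $x

# Hypothetical helper: number of strings of length $n that start with $x.
sub count_for {
    my ($n, $x) = @_;
    return 1 if $n == 1;
    return $count[$n]{$x} if defined $count[$n]{$x};
    my $sum = 0;
    $sum += count_for($n - 1, $_) for @{ $bi{$x} || [] };
    return $count[$n]{$x} = $sum;
}

# Hypothetical helper: number of a (valid) string in the sequence.
sub str_to_num {
    my ($string) = @_;
    my @chars = split //, $string;
    my $len = scalar @chars;
    return 0 if $len == 0;                   # the empty string is number 0
    my $num = 1;                             # numbering starts at 1
    for my $n (1 .. $len - 1) {              # skip all shorter classes
        $num += count_for($n, $_) for @start;
    }
    my @candidates = @start;                 # choices for the first character
    my $class = $len;
    for my $char (@chars) {
        for my $cand (@candidates) {
            last if $cand eq $char;
            $num += count_for($class, $cand);   # strings sorted before $char
        }
        @candidates = @{ $bi{$char} || [] };    # choices for the next character
        $class--;
    }
    return $num;
}

print str_to_num($_), "\n" for qw/acd daa abab/;   # 14, 26, 27
```

The results agree with the sequence table given earlier for this charset
(acd is 14, daa is 26, abab is 27).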
Section not ready yet.
It helps to imagine the strings like a couple of trees (ASCII art is crude):
class:   1  2   3    etc
number
 1       a
 5       +-ab
12       | +-aba
 6       +-ac
13       | +-acc
14       | +-acd
 7       +-aa
15         +-aab
16         +-aac
17         +-aaa
 2       c
 8       +-cc
18       | +-ccc
19       | +-ccd
 9       +-cd
20         +-cda
 3       b
10       +-ba
21         +-bab
22         +-bac
23         +-baa
 4       d
11       +-da
24         +-dab
25         +-dac
26         +-daa
As you can see, there is an (independent) tree for each of the starting
characters, which in turn contains independent subtrees for each string in the
next class etc. It is interesting to note that each string deeper in the tree
starts with the same common starting string, i.e. 'd', 'da', 'dab' etc.
With a simple charset, all these trees contain the same number of nodes. With
higher order charsets, this is no longer true.
 new()

new();
Create a new Math::String::Charset object.
The constructor takes either an ARRAY or a HASH reference. In case of the
array, all elements in that array will be used as characters in the
charset, and the charset will be of order 1, type 0.
If given a HASH reference, the following keys can be used for all charsets:
minlen Minimum string length, -inf if not defined
maxlen Maximum string length, +inf if not defined
The following keys can only be used in certain combinations, which will be
explained below:
bi hash, table with bigrams
sets hash, table with charsets for the different places
start array ref to list of all valid (starting) characters
end array ref to list of all valid ending characters
sep separator character, none if undef (only for order 1)
If you use neither bi nor sets, the charset will be of order
1, type 0. If you use a hash key named bi, the charset will be of
order 2, type 0. If you use a hash key named sets, the charset will
be of order 1, type 1.
For a charset of type 0, order 1 (simple set) the following keys are valid:
start required
end optional (to restrict number of 1-character strings)
sep optional
For a charset of type 0, order 2 (bigram set) the following keys are valid:
start optional
end optional
bi required
For a charset of type 1, order 1 (grouped set) the following keys are valid:
sets required
 start
 "start" contains an array reference to all valid starting
characters, e.g. no valid string can start with a character not listed
here.
 bi
 "bi" contains a hash reference, each key of the hash points to
an array, which in turn contains all the valid combinations of two
letters.
 sets
 "sets" contains a hash reference, each key of the hash indicates
an index. Each of the hash entries points either to an ARRAY reference or
a Math::String::Charset of order 1, type 0.
Positive indices count from the left side, negative from the right. 0
denotes the default.
At each of the position indexed by a key, the appropriate charset will be
used.
Example for specifying that strings must start with upper case letters,
followed by lower case letters and can end in either a lower case letter
or a number:
sets => {
0 => ['a'..'z'], # the default
1 => ['A'..'Z'], # first character is always A..Z
-1 => ['a'..'z','0'..'9'], # last is a..z,0..9
}
 end
 "end" contains an array reference to all valid ending
characters, i.e. no valid string can end with a character not listed here.
Note that strings of length 1 start and end with their only
character, so the character must be listed in "end" and
"start" to produce a string with one character. Also all
characters that are not followed by any other character are added silently
to the "end" set.
 minlen
 Optional minimum string length. Any string shorter than this will be
invalid. Must be shorter than a (possibly defined) maxlen. If not given, it is
set to -inf. Note that the minlen might be adjusted to a greater number,
if it is set to 1 or greater, but there are no valid strings with length 2, 3
etc. In this case the minlen will be set to the first non-empty class of
the charset.
 maxlen
 Optional maximum string length. Any string longer than this will be
invalid. Must be longer than a (possibly defined) minlen. If not given, it is
set to +inf.
 scale
 Optional input/output scale. See scale().
 copy()

$copy = $charset->copy();
Create a new charset as a copy from an existing one.
 scale()

$scale = $charset->scale();
$charset->scale(120);
Get/set the (optional) scale for all strings. A scale is an integer factor
that will be applied to each as_number() output. Also, all
from_number() calls will use the scale to modularize the input, i.e.
dividing by the scale, then taking the integer result, and then multiplying
with the scale again.
E.g. for a scale of 3, the string to number mapping would be changed from
the left to the right column:
string form normal number scaled number
'' 0 0
'a' 1 3
'b' 2 6
'c' 3 9
And so on. Input like 8 will be divided by 3, which results in 2 due to
rounding down to the nearest integer, this multiplied by 3 again gives 6.
So:
my $cs = Math::String::Charset->new(['a'..'z']); # a..z
$string = Math::String->new( 'a',$cs ); # a..z
print $string->as_number(); # 1
$cs->scale(3);
print $string->as_number(); # 3
$string = Math::String->from_number(10,$cs); # [10/3] => 3 *3 == 9
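The arithmetic behind the scale can be written down independently of the
module. The two helpers below are hypothetical and only mirror the description
above (multiply on output, round down to a multiple of the scale on input);
they are not the module's code.

```perl
use strict;
use warnings;

# Hypothetical helper: what a scaled as_number() would report.
sub scaled_output {
    my ($number, $scale) = @_;
    return $number * $scale;
}

# Hypothetical helper: how an input number is rounded down to a multiple
# of the scale before it is converted back to a string.
sub round_to_scale {
    my ($input, $scale) = @_;
    return int($input / $scale) * $scale;
}

print scaled_output(2, 3), "\n";     # 6  ('b' with a scale of 3)
print round_to_scale(8, 3), "\n";    # 6
print round_to_scale(10, 3), "\n";   # 9
```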
 minlen()

$charset->minlen();
Return minimum string length.
 maxlen()

$charset->maxlen();
Return maximum string length.
 length()

$charset->length();
Return the number of items in the charset, for higher order charsets the
number of valid 1-character long strings. Shortcut for
"$charset->class(1)".
 count()
 Returns the count of all possible strings described by the charset as a
positive BigInt. Returns 'inf' if no maxlen is defined, because there
should be no upper bound on how many strings are possible. (This might
change if we can calculate an upper bound - not sure if this is possible
with bigrams).
If maxlen is defined, this forces a calculation of all possible class()
values and may therefore be very slow on the first call; it also possibly
caches lots of values.
 class()

$charset->class($order);
Return the number of items in a class.
print $charset->class(5); # how many strings with length 5?
 map()

$charset->map($char);
Map a character to its number, counting from 0 .. N-1 where N is the length
of the charset:
$charset = Math::String::Charset->new(['A'..'Z']);
print $charset->map('A'),"\n"; # prints 0
print $charset->map('Z'),"\n"; # prints 25
 char()

$charset->char($nr);
Returns the character number $nr from the set, or undef.
print $charset->char(0); # first char
print $charset->char(1); # second char
print $charset->char(-1); # last one
 lowest()

$charset->lowest($length);
Return the number of the first string of length $length. This is equivalent
to (but much faster than):
$str = $charset->first($length);
$number = $charset->str2num($str);
 highest()

$charset->highest($length);
Return the number of the last string of length $length. This is equivalent
to (but much faster than):
$str = $charset->first($length+1);
$number = $charset->str2num($str);
$number--;
 order()

$order = $charset->order();
Return the order of the charset: 1 for simple charsets, 2 (bigrams), 3 etc
for higher orders. See also type().
 type()

$type = $charset->type();
Return the type of the charset: 0 for simple charsets, 1 for grouped ones.
If the type is 0, the order can be 1, 2, 3 etc; with type 1 the order is
always 1. See also order().
 charlen()

$character_length = $charset->charlen();
Return the length of one character in the set. 1 or greater.
 chars()

$chars = $charset->chars( $bigint );
Returns the number of characters that the string would have if you
converted $bigint (Math::BigInt or Math::String object) back to a string.
This is much faster than doing
$chars = length ("$math_string");
since it does not need to actually construct the string.
 first()

$charset->first( $length );
Return the first string with a length of $length, according to the charset.
See "lowest()" for the corresponding number.
 last()

$charset->last( $length );
Return the last string with a length of $length, according to the charset.
See "highest()" for the corresponding number.
 is_valid()

$charset->is_valid();
Check whether a string conforms to the charset or not. Returns 1 for
okay, 0 for invalid strings.
 norm()

$charset->norm();
Normalize a string by removing separator char at front/end. Does nothing if
no separator is defined.
 error()

$charset->error();
Returns "" for no error, or the error message if
construction of the charset failed. Set
$Math::String::Charset::die_on_error to 0 to get the error message,
otherwise the program will die.
 start()

$charset->start();
In list context, returns a list of all characters in the start set; for
simple charsets (i.e. no bigrams, trigrams etc) simply returns the charset. In
scalar context returns the length of the start set.
Note that the returned start set can be different from what you specified upon
constructing the charset, because characters that are not followed by any
other character will be excluded from the start set (they can't possibly
start a string longer than one character).
Think of the start set as the set of all characters that can start a string
with more than one character. The set for one character strings is called
ones and you can access it via "ones()".
 end()

$charset->end();
In list context, returns a list of all characters in the end set, i.e. all
characters a string can end with. For simple charsets (i.e. no bigrams,
trigrams etc) simply returns the charset. In scalar context returns the
length of the end set.
Note that the returned end set can be different from what you specified upon
constructing the charset, because characters that are not followed by any
other character will be included in the end set, too.
 ones()

$charset->ones();
In list context, returns a list of all strings consisting of one character;
for simple charsets (i.e. no bigrams, trigrams etc) simply returns the
charset. In scalar context returns the length of the ones set.
This list is the cross of start and end that is calculated
after adding characters with no followers to end, but before
removing the characters with no followers from start.
Think of a string of only one character as if it starts with and ends in
this character at the same time. For instance, if you have the following
definition:
cs = {
start => [ 'a', 'b', 'c', 'q' ],
end => [ 'b', 'c', 'x' ],
bi => {
q => [ ],
a => [ 'b', 'c' ],
b => [ 'a' ],
}
}
The 'q' is not followed by any other character, so it can only end strings.
And since it is not in the end set, it is first added to this set:
cs = {
start => [ 'a', 'b', 'c', 'q' ],
end => [ 'b', 'c', 'x', 'q' ],
bi => {
q => [ ],
a => [ 'b', 'c' ],
b => [ 'a' ],
}
}
Now the cross of "start" and "end" is built. Since only
'b', 'c' and 'q' appear in both "end" and "start",
"ones" consists of:
_ones => [ 'b', 'c', 'q' ]
The order of the chars in "ones" is the same ordering as in
"start".
After this, any character that is not followed by another character is
removed from "start":
start => [ 'a', 'b', ],
Thus a string with only one character can be 'b', 'c', or 'q', and any
string with more than one character must start with either 'a' or
'b'.
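The three steps (extend "end", build "ones", shrink "start") can be sketched in
plain Perl. The helper derive_sets() is made up for this illustration and is
not the module's code; in particular, it only looks at the start characters
and the keys of the bigram table when searching for characters without
followers, which is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the adjusted start set, the adjusted end set
# and the ones set as array references.
sub derive_sets {
    my ($start, $end, $bi) = @_;
    my %has_followers =
        map { ($_ => 1) } grep { scalar @{ $bi->{$_} } } keys %$bi;
    my @end    = @$end;
    my %in_end = map { ($_ => 1) } @end;
    # 1. characters without followers are added to the end set
    for my $char (@$start, sort keys %$bi) {
        next if $has_followers{$char} || $in_end{$char};
        push @end, $char;
        $in_end{$char} = 1;
    }
    # 2. ones is the cross of start and end, in the order of start
    my @ones = grep { $in_end{$_} } @$start;
    # 3. characters without followers are removed from the start set
    my @new_start = grep { $has_followers{$_} } @$start;
    return (\@new_start, \@end, \@ones);
}

my ($start, $end, $ones) = derive_sets(
    [ 'a', 'b', 'c', 'q' ],
    [ 'b', 'c', 'x' ],
    { q => [ ], a => [ 'b', 'c' ], b => [ 'a' ] },
);
print "start: @$start\n";   # start: a b
print "end:   @$end\n";     # end:   b c x q
print "ones:  @$ones\n";    # ones:  b c q
```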
 prev()

$string = Math::String->new( );
$charset->prev($string);
Given the charset and a string, calculates the previous string in the
sequence. This is faster than decrementing the number of the string and
converting the new number to a string. This routine is mainly used
internally by Math::String and updates the cache of the given
Math::String.
 next()

$string = Math::String->new( );
$charset->next($string);
Given the charset and a string, calculates the next string in the sequence.
This is faster than incrementing the number of the string and converting
the new number to a string. This routine is mainly used internally by
Math::String and updates the cache of the given Math::String.
 study()

$hash = Math::String::Charset::study( {
order => $order, words => \@words, sep => 'separator',
charlen => 1, hist => 1 } );
Studies the given list of strings/words and builds a hash that you can use
to construct a charset from. The "order" is 1 for simple charsets,
2 for bigrams and so on. The key "depth" is a synonym for
"order".
"separator" (can be undef) is the string that separates characters.
"charlen" is the length of a character, and defaults to 1. Use
this if you have characters longer than one and no separator string.
If you set the parameter "hist" to a value different from zero,
the returned hash will contain a key "hist", too. This will be a
reference to a hash containing the histogram of letters or n-grams,
depending on the depth of the analysis.
An example:
use Math::String::Charset;
use Data::Dumper;
$hash = Math::String::Charset::study( {
depth => 1, words => [ 'hocuspocus'], hist => 1 } );
print Dumper ($hash),"\n";
This will produce (slightly contracted here):
$VAR1 = {
'end' => [ 's' ],
'hist' => { 'u' => '2', 'o' => '2', 'p' => '1', 'h' => '1',
's' => '2', 'c' => '2' },
'chars' => [ 'u', 'o', 's', 'c', 'p', 'h' ],
'start' => [ 'h' ]
};
Using "depth => 2", you would get (slightly contracted
again):
$VAR1 = {
'end' => [ 's' ],
'hist' => { 'u' => { 's' => '2' },
'o' => { 'c' => '2' },
'p' => { 'o' => '1' },
'h' => { 'o' => '1' },
's' => { 'p' => '1' },
'c' => { 'u' => '2' }
},
'bi' => {
'u' => [ 's' ],
'o' => [ 'c' ],
'h' => [ 'o' ],
'p' => [ 'o' ],
'c' => [ 'u' ],
's' => [ 'p' ]
},
'start' => [ 'h' ]
};
Instead of passing an ARRAY ref as words, you can as well pass a HASH ref. The
keys in the hash will be used as words then. This is so that you can clean
out duplicates by using a hash and pass it to study() without converting it
back to an array first.
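To make the shape of the depth 2 result easier to follow, here is a
stand-alone sketch that builds a comparable structure from a word list. It is
not study() itself: the helper name bigram_table() is invented, it only
handles characters of length 1 without a separator, and the sorted order of
the lists is a choice made for this sketch.

```perl
use strict;
use warnings;

# Hypothetical helper: collect start/end characters, a bigram histogram and
# a bigram table from a list of words.
sub bigram_table {
    my (@words) = @_;
    my (%hist, %start, %end);
    for my $word (@words) {
        my @chars = split //, $word;
        next unless @chars;                      # skip empty words
        $start{ $chars[0] }++;
        $end{ $chars[-1] }++;
        for my $i (0 .. $#chars - 1) {
            $hist{ $chars[$i] }{ $chars[$i + 1] }++;   # count each pair
        }
    }
    # the bigram table lists the observed followers of each character
    my %bi = map { ($_ => [ sort keys %{ $hist{$_} } ]) } keys %hist;
    return {
        start => [ sort keys %start ],
        end   => [ sort keys %end ],
        hist  => \%hist,
        bi    => \%bi,
    };
}

my $result = bigram_table('hocuspocus');
print "start: @{ $result->{start} }\n";          # start: h
print "end:   @{ $result->{end} }\n";            # end:   s
print "o => @{ $result->{bi}{o} } ($result->{hist}{o}{c} times)\n";
# o => c (2 times)
```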
 analyze()
 Is an exportable alias for study().
use Math::String::Charset qw/analyze/;
$hash = Math::String::Charset::analyze(
words => ['Perl','Hacker','Just','Another'], depth => 2,
);
use Math::String::Charset;
# construct a charset from a bigram table, and an initial set (containing
# valid start characters)
# Note: After an 'a', either a 'b', 'c' or 'a' can follow, in this order
# After a 'd' only an 'a' can follow
# There is no 'q' as start character, but 'q' can follow 'd'!
# You need to define followers for 'q'!
$bi = new Math::String::Charset ( {
start => ['a'..'d'],
bi => {
'a' => [ 'b', ],
'b' => [ 'c', 'b' ],
'c' => [ 'a', 'c' ],
'd' => [ 'a', 'q' ],
'q' => [ 'a', 'b' ],
}
} );
print $bi->length(),"\n"; # 4
print scalar $bi->combinations(2),"\n"; # count of combos with 2 chars
# will be 1+2+2+2 => 7
my @comb = $bi->combinations(3);
foreach (@comb)
{
print "$_\n";
}
This will print:
4
7
abc
abb
bca
bcc
bbc
bbb
cab
cca
ccc
dab
dqa
dqb
Another example using characters of different lengths to find all combinations
of words in a list:
#!/usr/bin/perl -w
# test for Math::String and Math::String::Charset
BEGIN { unshift @INC, '../lib'; }
use Math::String;
use Math::String::Charset;
use strict;
my $count = shift || 4000;
my $words = {};
open FILE, 'wordlist.txt' or die "Can't read wordlist.txt: $!\n";
while (<FILE>)
{
chomp; $words->{lc($_)} ++; # clean out doubles
}
close FILE;
my $cs = new Math::String::Charset ( { sep => ' ',
words => $words,
} );
my $string = Math::String->new('',$cs);
print "# Generating first $count strings:\n";
for (my $i = 0; $i < $count; $i++)
{
print ++$string,"\n";
}
print "# Done.\n";
 •
 Currently only bigrams are supported. This should be generic and
arbitrarily deeply nested.
 •
 "study()" does not yet work with separator chars and chars
longer than 1.
 •
 str2num and num2str do not work fully for bigrams yet.
None discovered yet.
If you use this module in one of your projects, then please email me. I want to
hear about how my code helps you ;)
This module is (C) Copyright by Tels http://bloodgate.com 2000-2008.