Multibyte(3)    User Contributed Perl Documentation    Multibyte(3)
String::Multibyte - manipulation of multibyte character strings
use String::Multibyte;
$utf8 = String::Multibyte->new('UTF8');
$utf8_len = $utf8->length($utf8_str);
This module provides some functions which emulate the corresponding
"CORE" functions for locale-independent
manipulation of multiple-byte character strings.
Why is this module locale-independent? Because it considers only the
byte-sequence structure of each charset and knows nothing about locales.
Locale-dependent methods such as "uc()" and "lc()" are not supported at
all.
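As a minimal sketch of the difference between counting bytes and counting characters, the following assumes that the 'UTF8' charset operates on byte strings holding UTF-8 data; the sample string is made up for illustration.

```perl
use strict;
use warnings;
use String::Multibyte;

# Sample text spelled out as UTF-8 bytes: U+3042, U+3044, then "!".
# (Assumption: the 'UTF8' charset expects such byte strings.)
my $utf8_str = "\xE3\x81\x82\xE3\x81\x84!";

my $utf8 = String::Multibyte->new('UTF8');

my $bytes = CORE::length($utf8_str);   # built-in length counts bytes
my $chars = $utf8->length($utf8_str);  # module length counts characters

print "bytes=$bytes chars=$chars\n";   # prints "bytes=7 chars=3"
```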
The definition files are located under the directory where
String::Multibyte itself is installed. E.g. if String::Multibyte is
"perl/site/lib/String/Multibyte.pm", install
String::Multibyte::Foo as
"perl/site/lib/String/Multibyte/Foo.pm".
The definition file must return a hashref with the following key(s).
- "charset"
- The value for the key 'charset' is a string giving the charset name.
In almost all cases omitting 'charset' matters very little, but when you
set it, choose a name that does not conflict with another charset.
- "regexp"
- The value for the key 'regexp', REQUIRED, is a
regular expression that matches a single character of the charset in
question. (You may use "qr//" if available.)
If 'regexp' is omitted, any method call croaks.
- "nextchar"
- The value for the key 'nextchar' must be a coderef
that returns the character following the specified character. If the
'nextchar' coderef is omitted, the "mkrange()" and "strtr()" methods do
not treat the hyphen as a metacharacter for character ranges.
- "cmpchar"
- The value for the key 'cmpchar' must be a coderef
that compares the specified two characters. If the
'cmpchar' coderef is omitted, the "mkrange()" and "strtr()" methods do
not understand reverse character ranges.
- "hyphen"
- The value for the key 'hyphen' is a character to
stand for a character range. The default is
'-'.
- "escape"
- The value for the key 'escape' is an escape
character for a "hyphen" character. The
default is '\\'. The
'escape' character is valid only before a
"hyphen" or another
'escape' (e.g. '\\\\-]'
means '\\' to ']';
'\\\\\-]' means '\\',
'-', and ']'). If an
'escape' character is followed by any character
other than 'escape' or
'hyphen', it is parsed literally.
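The keys above can be combined into an inline definition, as in the sketch below. The charset name 'my_ascii' is hypothetical, and the calling conventions are assumptions not spelled out in the text: 'nextchar' is taken to receive one character and return its successor, and 'cmpchar' to receive two characters and return a value in the manner of "cmp".

```perl
use strict;
use warnings;
use String::Multibyte;

# Hypothetical 7-bit charset defined inline instead of in a .pm file.
my $ascii = String::Multibyte->new({
    charset  => 'my_ascii',              # illustrative name only
    regexp   => '[\x00-\x7F]',           # one character is one 7-bit byte
    nextchar => sub {                    # assumed: successor in byte order
        my $ch = shift;
        return $ch eq "\x7F" ? undef : chr(ord($ch) + 1);
    },
    cmpchar  => sub { $_[0] cmp $_[1] }, # assumed: cmp-style comparison
});

# With both coderefs present, the hyphen is understood as a range.
my $range = $ascii->mkrange('A-E');      # scalar context: one string
print "$range\n";
```

Leaving out 'nextchar' should make the same call hand back 'A-E' unchanged, in line with the description of that key.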
- "$mbcs = String::Multibyte->new(CHARSET)"
- "$mbcs = String::Multibyte->new(CHARSET, VERBOSE)"
- "CHARSET" is the charset name; strictly
speaking, the file name of the definition file (without the suffix
.pm). The constructor returns an instance that tells the methods in which
charset the specified strings should be handled.
"CHARSET" may also be a hashref;
this is how to define a charset without any .pm file.
# see perlfaq6 :-)
my $martian = String::Multibyte->new({
charset => "martian",
regexp => '[A-Z][A-Z]|[^A-Z]',
});
If a true value is specified as "VERBOSE", each called method
(except "islegal") checks its arguments and carps if any of them is not
legally encoded. Otherwise no such check is carried out; this saves a bit
of time but is unsafe, though you can call the "islegal" method yourself
when necessary.
- "$mbcs->islegal(LIST)"
- Returns a boolean indicating whether all the strings in the arguments are
legally encoded in the charset concerned. Returns false if even one
element of "LIST" is illegal.
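A short sketch of validating input before further processing, assuming the 'UTF8' charset and made-up sample strings:

```perl
use strict;
use warnings;
use String::Multibyte;

my $utf8 = String::Multibyte->new('UTF8');

my $good      = "caf\xC3\xA9";  # ends with U+00E9 as two UTF-8 bytes
my $truncated = "caf\xC3";      # lead byte without its trail byte

# A single legal string passes; a list fails if any element is illegal.
my $ok_one  = $utf8->islegal($good)             ? 1 : 0;
my $ok_list = $utf8->islegal($good, $truncated) ? 1 : 0;

print "one=$ok_one list=$ok_list\n";   # prints "one=1 list=0"
```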
- "$mbcs->length(STRING)"
- Returns the length in characters of the specified string.
- "$mbcs->strrev(STRING)"
- Returns the string with its characters in reverse order.
- "$mbcs->index(STRING, SUBSTR)"
- "$mbcs->index(STRING, SUBSTR, POSITION)"
- Returns the position of the first occurrence of
"SUBSTR" in
"STRING" at or after
"POSITION". If
"POSITION" is omitted, starts searching
from the beginning of the string.
If the substring is not found, returns
"-1".
- "$mbcs->rindex(STRING, SUBSTR)"
- "$mbcs->rindex(STRING, SUBSTR, POSITION)"
- Returns the position of the last occurrence of
"SUBSTR" in "STRING". If "POSITION" is specified, returns the
last occurrence at or before that position.
If the substring is not found, returns "-1".
- "$mbcs->strspn(STRING, SEARCHLIST)"
- Returns the position of the first occurrence of any character not
contained in the search list.
$mbcs->strspn("+0.12345*12", "+-.0123456789");
# returns 8.
If the specified string does not contain any character in the
search list, returns 0.
If the string consists only of characters in the search list, the
returned value equals the length of the string.
"SEARCHLIST" can be an "ARRAYREF". E.g. if a charset treats
"CRLF" as a single character, "\r\n" is a one-element list containing
only "\r\n". A two-element list of "\r" and "\n" can be given as
["\r", "\n"] (of course "\n\r" is also ok, since the
character order of "SEARCHLIST" doesn't matter in "strspn").
- "$mbcs->strcspn(STRING, SEARCHLIST)"
- Returns the position of the first occurrence of any character
contained in the search list.
If the specified string does not contain any character in the
search list, the returned value equals the length of the string.
"SEARCHLIST" can be an "ARRAYREF". E.g. if a charset treats
"CRLF" as a single character, "\r\n" is a one-element list containing
only "\r\n". A two-element list of "\r" and "\n" can be given as
["\r", "\n"] (of course "\n\r" is also ok, since the
character order of "SEARCHLIST" doesn't matter in "strcspn").
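The two methods are complementary, which the following sketch tries to show on made-up ASCII input handled through the 'UTF8' charset (an assumption; any charset covering these characters would do).

```perl
use strict;
use warnings;
use String::Multibyte;

my $mbcs = String::Multibyte->new('UTF8');

# strspn: how many leading characters belong to the search list.
my $indent = $mbcs->strspn("   indented", " ");

# strcspn: how many leading characters do NOT belong to it.
my $key_len = $mbcs->strcspn("name = value", " =");

# The search list given as an ARRAYREF of single characters.
my $eol = $mbcs->strcspn("a\r\nb", ["\r", "\n"]);

# No character from the list present: the result is the string length.
my $none = $mbcs->strcspn("noseparator", ":");

print "indent=$indent key_len=$key_len eol=$eol none=$none\n";
# prints "indent=3 key_len=4 eol=1 none=11"
```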
- "$mbcs->substr(STRING or SCALAR REF, OFFSET)"
- "$mbcs->substr(STRING or SCALAR REF, OFFSET, LENGTH)"
- "$mbcs->substr(SCALAR, OFFSET, LENGTH, REPLACEMENT)"
- It works like "CORE::substr", but uses the
character semantics of the multibyte charset encoding.
If "REPLACEMENT" is specified as the fourth argument, replaces that
part of the "SCALAR" and returns what was there before.
If a reference to a scalar variable is used as the first argument, an
lvalue reference is returned, which you can assign through.
${ $mbcs->substr(\$str,$off,$len) } = $replace;
works like
CORE::substr($str,$off,$len) = $replace;
The returned lvalue is not multibyte-aware, so successive
assignments may lead to odd results.
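The character positions returned by "index" and "rindex" can be passed straight to "substr". The sketch below assumes the 'UTF8' charset and a made-up sample string written as UTF-8 bytes.

```perl
use strict;
use warnings;
use String::Multibyte;

my $utf8 = String::Multibyte->new('UTF8');

# Two-byte U+00E9, "t", two-byte U+00E9, a space, then "2024".
my $str    = "\xC3\xA9t\xC3\xA9 2024";
my $e_acute = "\xC3\xA9";

my $first = $utf8->index($str,  $e_acute);   # first occurrence
my $last  = $utf8->rindex($str, $e_acute);   # last occurrence
my $space = $utf8->index($str, ' ');         # position of the space

my $word = $utf8->substr($str, 0, $space);   # the three-character word
my $tail = $utf8->substr($str, $space + 1);  # everything after the space

# Assign once through the lvalue reference to replace the word in a copy.
my $copy = $str;
${ $utf8->substr(\$copy, 0, $space) } = 'hiver';

print "first=$first last=$last space=$space tail=$tail copy=$copy\n";
# prints "first=0 last=2 space=3 tail=2024 copy=hiver 2024"
```

All positions are counted in characters; the byte offset of the space in this string would be 5, not 3.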
- "$mbcs->strsplit(SEPARATOR, STRING)"
- "$mbcs->strsplit(SEPARATOR, STRING, LIMIT)"
- This function emulates "CORE::split",
but splits on the "SEPARATOR" string, not on a pattern.
If not in list context, it only returns the number of fields
found and does not split into the @_ array.
If an empty string is specified as "SEPARATOR", splits the specified
string into characters.
$bytes->strsplit('', 'This is perl.', 7);
# ('T', 'h', 'i', 's', ' ', 'i', 's perl.')
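Splitting on a multibyte separator might look like the sketch below, which assumes the 'UTF8' charset and uses U+3001 (ideographic comma) as a made-up separator.

```perl
use strict;
use warnings;
use String::Multibyte;

my $utf8 = String::Multibyte->new('UTF8');

my $sep = "\xE3\x80\x81";              # U+3001 as UTF-8 bytes
my $str = join $sep, 'a', 'b', 'c';

my @all   = $utf8->strsplit($sep, $str);      # every field
my @two   = $utf8->strsplit($sep, $str, 2);   # at most two fields
my $count = $utf8->strsplit($sep, $str);      # scalar context: field count

print scalar(@all), " fields, limited to ", scalar(@two),
      ", count=$count\n";
```

With a limit of 2, the second field keeps the remaining separator, just as the last field does in the example above.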
- "$mbcs->mkrange(CHARLIST, ALLOW_REVERSE)"
- Returns the character list (in non-list context, as a concatenated string)
obtained by parsing the specified character range.
The result depends on the character order of the charset concerned.
For the character order of each charset, see its definition file.
If the character order is undefined in the definition file,
returns a string identical to the specified string.
A character range is specified with a hyphen
('-', or strictly speaking, "$obj->{hyphen}").
The backslashed combinations '\-' and '\\' (strictly speaking,
"$obj->{escape}$obj->{hyphen}" and "$obj->{escape}$obj->{escape}")
are used instead of the characters '-' and '\', respectively.
A hyphen at the beginning or the end of the range is also treated as a
literal hyphen.
For example,
"$mbcs->mkrange('+\-0-9A-F')"
returns "('+', '-', '0', '1', '2', '3', '4', '5',
'6', '7', '8', '9', 'A', 'B', 'C', 'D', 'E',
'F')" and "scalar
$mbcs->mkrange('A-P')" returns
'ABCDEFGHIJKLMNOP'.
If a true value is specified as the second argument, reverse
character ranges such as '9-0' and 'Z-A' are allowed.
$bytes = String::Multibyte->new('Bytes');
$bytes->mkrange('p-e-r-l', 1); # ponmlkjihgfefghijklmnopqrqponml
- "$mbcs->strtr(STRING or SCALAR REF, SEARCHLIST,
REPLACEMENTLIST)"
- "$mbcs->strtr(STRING or SCALAR REF, SEARCHLIST, REPLACEMENTLIST,
MODIFIER)"
- Transliterates all occurrences of the characters found in the search list
into the corresponding characters in the replacement list.
If a reference to a scalar variable is specified as the first
argument, returns the number of characters replaced or deleted;
otherwise, returns the transliterated string and leaves the specified
string unaffected.
If the 'h' modifier is specified, returns a histogram hash in list
context, or a reference to a histogram hash in scalar context.
SEARCHLIST and REPLACEMENTLIST
Character ranges (internally utilizing "mkrange()") are supported.
If the "REPLACEMENTLIST" is empty (specified as '', not "undef",
because using an uninitialized value causes a warning under the -w
option), the "SEARCHLIST" is replicated.
If the replacement list is shorter than the search list, the
final character in the replacement list is replicated until it is long
enough (but this works differently when the 'd' modifier is used).
"SEARCHLIST" and "REPLACEMENTLIST" can be an "ARRAYREF". E.g. if a
charset treats "\r\n" ("CRLF") as a single character, "\r\n" is a
one-element list containing only "\r\n". A two-element list of "\r" and
"\n" should be given as ["\r", "\n"]. Of course "\n\r" is also ok, but
the character order is different; cf.
strtr($str, ["\r", "\n"], ["\n", "\r"]), which swaps "\n" and "\r".
Each element of an "ARRAYREF" can include character ranges (the
modifiers "R" and "r" affect their evaluation as usual).
["A-C", "h-z"] is evaluated like "A-Ch-z" if the "charset" does not
include the grapheme "Ch". The arrayref form prevents "C" and "h" from
being evaluated as "Ch" even if the "charset" includes the grapheme "Ch".
MODIFIER
c Complement the SEARCHLIST.
d Delete found but unreplaced characters.
s Squash duplicate replaced characters.
h Return a hash (or a hashref) of histogram.
R No use of character ranges.
r Allow reverse character ranges.
o Caches the conversion table internally.
If the 'R' modifier is specified, '-' is not evaluated as a
metacharacter but as a literal hyphen, as in "tr'''".
Compare:
$mbcs->strtr("90 - 32 = 58", "0-9", "A-J");
# output: "JA - DC = FI"
$mbcs->strtr("90 - 32 = 58", "0-9", "A-J", "R");
# output: "JA - 32 = 58"
# cf. ($str = "90 - 32 = 58") =~ tr'0-9'A-J';
# '0' to 'A', '-' to '-', and '9' to 'J'.
If the 'r' modifier is specified, reverse
character ranges are allowed. e.g.
$mbcs->strtr($str, "0-9", "9-0", "r")
is equivalent to
$mbcs->strtr($str, "0123456789", "9876543210")
Caching the conversion table
If the 'o' modifier is specified, the
conversion table is cached internally. e.g.
foreach (@source_strings) {
print $mbcs->strtr($_, $from_list, $to_list, 'o');
}
will be almost as efficient as this:
$trans = $mbcs->trclosure($from_list, $to_list);
foreach (@source_strings) {
print &$trans($_);
}
You can use whichever you like.
Without 'o',
foreach (@source_strings) {
print $mbcs->strtr($_, $from_list, $to_list);
}
will be very slow since the conversion table is rebuilt every time
the function is called.
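The two calling styles and the 'd' modifier can be sketched as follows, assuming the 'Bytes' charset and made-up sample strings.

```perl
use strict;
use warnings;
use String::Multibyte;

my $bytes = String::Multibyte->new('Bytes');

# Scalar reference: the variable is changed in place and the number of
# replaced characters is returned.
my $str   = "perl 5";
my $count = $bytes->strtr(\$str, 'a-z', 'A-Z');

# Plain string: the original is left alone and the result is returned.
# An empty replacement list with 'd' deletes every character found.
my $orig     = "hello world";
my $stripped = $bytes->strtr($orig, 'lo', '', 'd');

print "$str ($count replaced)\n";   # prints "PERL 5 (4 replaced)"
print "$stripped / $orig\n";        # prints "he wrd / hello world"
```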
- "$mbcs->trclosure(SEARCHLIST, REPLACEMENTLIST)"
- "$mbcs->trclosure(SEARCHLIST, REPLACEMENTLIST,
MODIFIER)"
- Returns a closure that transliterates the specified string. The return
value is a plain code reference, not a blessed object. Using this code ref
saves time, since you need not specify the arguments every time.
my $trans = $mbcs->trclosure($from_list, $to_list);
print &$trans ($string); # ok to perl 5.003
print $trans->($string); # perl 5.004 or better
The functionality of the closure made by "trclosure()" is equivalent to
that of "strtr()". In fact, "strtr()" calls "trclosure()" internally and
uses the returned closure.
"SEARCHLIST" and "REPLACEMENTLIST" can be an "ARRAYREF", the same as in
"strtr()".
- $[
- This module assumes $[ is always equal to
0, never 1.
- Grapheme manipulation
- Since v. 1.01, manipulation of sequences of graphemes is to be supported.
In a grapheme-aware manipulation, notice that the beginning
and the end of a string always lie on a grapheme boundary.
E.g. imagine a grapheme set where a grapheme comprises either
a leading Latin capital letter followed by one or more Latin small
letters, or a single byte. Such a set can be defined as below.
$gra = String::Multibyte->new({
regexp => '[A-Z][a-z]*|[\x00-\xFF]',
});
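A minimal sketch of what such a definition implies for counting and splitting; the sample strings are made up, and the definition is repeated so that the block stands on its own.

```perl
use strict;
use warnings;
use String::Multibyte;

# The same grapheme set: a capital plus small letters, or a single byte.
my $gra = String::Multibyte->new({
    regexp => '[A-Z][a-z]*|[\x00-\xFF]',
});

my $one   = $gra->length("Perl");          # one grapheme
my $three = $gra->length("Perl Monger");   # "Perl", " ", "Monger"

# An empty separator splits the string into its graphemes.
my @parts = $gra->strsplit('', "Perl Monger");

print "one=$one three=$three parts=", join('|', @parts), "\n";
# prints "one=1 three=3 parts=Perl| |Monger"
```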
Consider "$gra->index("Perl", "Pe")". Since "Perl" and "Pe" are each a
single grapheme, they are not equal to each other, so the result must be
"-1" (meaning no match).
SADAHIRO Tomoyuki <SADAHIRO@cpan.org>
Copyright(C) 2001-2015, SADAHIRO Tomoyuki. Japan. All rights
reserved.
This module is free software; you can redistribute it and/or
modify it under the same terms as Perl itself.