String(3) | User Contributed Perl Documentation | String(3)
Unicode::String - String of Unicode characters (UTF-16BE)
use Unicode::String qw(utf8 latin1 utf16be);
$u = utf8("string");
$u = latin1("string");
$u = utf16be("\0s\0t\0r\0i\0n\0g");
print $u->utf32be; # 4 byte characters
print $u->utf16le; # 2 byte characters + surrogates
print $u->utf8; # 1-4 byte characters
A "Unicode::String" object
represents a sequence of Unicode characters. Methods are provided to convert
between various external formats (encodings) and
"Unicode::String" objects, and methods are
provided for common string manipulations.
The functions utf32be(), utf32le(),
utf16be(), utf16le(), utf8(), utf7(),
latin1(), uhex(), uchr() can be imported from the
"Unicode::String" module and will work as
constructors initializing strings of the corresponding encoding.
The "Unicode::String" objects
overload various operators, which means that in most cases they can be
treated like plain strings.
Internally a "Unicode::String"
object is represented by a string of 2 byte numbers in network byte order
(big-endian). This representation is not exposed through the API, but it
is useful to know when trying to predict the efficiency of the provided
methods.
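As a minimal sketch of the operator overloading described above, the following builds the sample string from ISO-8859-1 bytes and then treats the object like a plain string. The variable names and the " scale" suffix are illustrative assumptions, and the default stringify_as() encoding (UTF-8) is assumed to be in effect.

```perl
use strict;
use warnings;
use Unicode::String qw(latin1);

# Build the two-character string "µm" from its ISO-8859-1 bytes.
my $u = latin1("\xB5m");

# The overloaded '.' returns a new Unicode::String. The plain string on
# the right is decoded with the stringify_as() encoding (UTF-8 here).
my $v = $u . " scale";

# Overloaded stringification encodes with the stringify_as() encoding.
my $bytes = "$v";

printf "%d characters, %d bytes when stringified\n",
    $v->length, length($bytes);
```

The character count and the byte count differ because "µ" occupies one 16-bit unit internally but two bytes in UTF-8.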
The following class methods are available:
- Unicode::String->stringify_as
- Unicode::String->stringify_as( $enc )
- This method is used to specify which encoding will be used when
"Unicode::String" objects are implicitly
converted to and from plain strings.
If an argument is provided it sets the current encoding. The
argument should be one of the following: "ucs4",
"utf32", "utf32be", "utf32le",
"ucs2", "utf16", "utf16be",
"utf16le", "utf8", "utf7",
"latin1" or "hex". The default is
"utf8".
The stringify_as() method returns a reference to the
current encoding function.
- $us = Unicode::String->new
- $us = Unicode::String->new( $initial_value )
- This is the object constructor. Without argument, it creates an empty
"Unicode::String" object. If an
$initial_value argument is given, it is decoded
according to the specified stringify_as() encoding, UTF-8 by
default.
In general it is recommended to import and use one of the
encoding specific constructor functions instead of invoking this
method.
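The interaction between stringify_as() and new() can be sketched as follows. This is an illustration under the assumption that no other code in the program depends on the default; the setting is class-wide, so the sketch puts the documented default back at the end.

```perl
use strict;
use warnings;
use Unicode::String;

# From here on, implicit conversions use ISO-8859-1 instead of UTF-8.
Unicode::String->stringify_as('latin1');

# new() decodes its argument with the current stringify_as() encoding.
my $us = Unicode::String->new("\xB5m");

# Implicit stringification uses the same encoding.
my $plain = "$us";

# Explicit accessors are not affected by the setting.
my $utf8 = $us->utf8;

# Restore the documented default for the rest of the program.
Unicode::String->stringify_as('utf8');
```

Because the setting is global, importing an encoding-specific constructor such as latin1() is usually the less surprising choice, as the text above recommends.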
These methods get or set the value of the
"Unicode::String" object by passing
strings in the corresponding encoding. If a new value is passed as argument
it will set the value of the
"Unicode::String", and the previous value
is returned. If no argument is passed then the current value is
returned.
To illustrate the encodings we show how the two-character sample
string "µm" (micrometer) is encoded in each one.
- $us->utf32be
- $us->utf32be( $newval )
- The string passed should be in the UTF-32 encoding with bytes in big
endian order. The sample "µm" is
"\0\0\0\xB5\0\0\0m" in this encoding.
Alternative names for this method are utf32() and
ucs4().
- $us->utf32le
- $us->utf32le( $newval )
- The string passed should be in the UTF-32 encoding with bytes in little
endian order. The sample "µm" is
"\xB5\0\0\0m\0\0\0" in this encoding.
- $us->utf16be
- $us->utf16be( $newval )
- The string passed should be in the UTF-16 encoding with bytes in big
endian order. The sample "µm" is "\0\xB5\0m" in
this encoding.
Alternative names for this method are utf16() and
ucs2().
If the string passed to utf16be() starts with the
Unicode byte order mark in little endian order, the result is as if
utf16le() was called instead.
- $us->utf16le
- $us->utf16le( $newval )
- The string passed should be in the UTF-16 encoding with bytes in little
endian order. The sample "µm" is "\xB5\0m\0"
in this encoding. This is the encoding used by the Microsoft Windows API.
If the string passed to utf16le() starts with the
Unicode byte order mark in big endian order, the result is as if
utf16be() was called instead.
- $us->utf8
- $us->utf8( $newval )
- The string passed should be in the UTF-8 encoding. The sample
"µm" is "\xC2\xB5m" in this encoding.
- $us->utf7
- $us->utf7( $newval )
- The string passed should be in the UTF-7 encoding. The sample
"µm" is "+ALU-m" in this encoding.
The UTF-7 encoding uses only plain US-ASCII characters for the
encoding. This makes it safe for transport through 8-bit stripping
protocols. Characters outside the US-ASCII range are base64-encoded and
'+' is used as an escape character. The UTF-7 encoding is described in
RFC 1642.
If the (global) variable
$Unicode::String::UTF7_OPTIONAL_DIRECT_CHARS is
TRUE, then a wider range of characters are encoded as themselves. This
is the default. The characters affected by this are:
! " # $ % & * ; < = > @ [ ] ^ _ ` { | }
- $us->latin1
- $us->latin1( $newval )
- The string passed should be in the ISO-8859-1 encoding. The sample
"µm" is "\xB5m" in this encoding.
Characters outside the "\x00" .. "\xFF"
range are simply removed from the return value of the latin1()
method. If you want more control over the mapping from Unicode to
ISO-8859-1, use the "Unicode::Map8"
class. This is also the way to deal with other 8-bit character sets.
- $us->hex
- $us->hex( $newval )
- The string passed should be plain ASCII where each Unicode character is
represented by the "U+XXXX" string and separated by a single
space character. The "U+" prefix is optional when setting the
value. The sample "µm" is "U+00b5 U+006d" in
this encoding.
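The encodings listed above can be compared side by side with a short script. This is a sketch only: hexdump() is a hypothetical helper defined here for display and is not part of the module.

```perl
use strict;
use warnings;
use Unicode::String qw(latin1);

# The sample string "µm", built from its ISO-8859-1 bytes.
my $us = latin1("\xB5m");

# Hypothetical display helper: show a byte string as hex pairs.
sub hexdump {
    my ($bytes) = @_;
    return join ' ', map { sprintf('%02X', ord($_)) } split //, $bytes;
}

# Binary encodings are shown as hex dumps.
for my $enc (qw(utf32be utf32le utf16be utf16le utf8 latin1)) {
    my $encoded = $us->$enc;
    printf "%-8s %s\n", $enc, hexdump($encoded);
}

# UTF-7 and the hex notation are already plain ASCII.
my $utf7 = $us->utf7;
my $hex  = $us->hex;
print "utf7     $utf7\n";
print "hex      $hex\n";
```

Calling each accessor without an argument only reads the value, so $us is unchanged by the loop.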
The following methods are available:
- $us->as_string
- Converts a "Unicode::String" to a plain
string according to the setting of stringify_as(). The default
stringify_as() encoding is "utf8".
- $us->as_num
- Converts a "Unicode::String" to a
number. Currently only the digits in the range 0x30 .. 0x39 are
recognized. The plan is to eventually support all Unicode digit
characters.
- $us->as_bool
- Converts a "Unicode::String" to a
boolean value. Only the empty string is FALSE. A string consisting of only
the character U+0030 is considered TRUE, even though Perl considers
"0" to be FALSE.
- $us->repeat( $count )
- Returns a new "Unicode::String" where
the content of $us is repeated
$count times. This operation is also overloaded
as:
$us x $count
- $us->concat( $other_string )
- Concatenates the string $us and the string
$other_string. If
$other_string is not a
"Unicode::String" object, then it is
first passed to the Unicode::String->new constructor function. This
operation is also overloaded as:
$us . $other_string
- $us->append( $other_string )
- Appends the string $other_string to the value of
$us. If $other_string is
not a "Unicode::String" object, then it
is first passed to the Unicode::String->new constructor function. This
operation is also overloaded as:
$us .= $other_string
- $us->copy
- Returns a copy of the current
"Unicode::String" object. This operation
is overloaded as the assignment operator.
- $us->length
- Returns the length of the
"Unicode::String". Surrogate pairs are
still counted as 2.
- $us->byteswap
- This method will swap the bytes in the internal representation of the
"Unicode::String" object.
Unicode reserves the character U+FEFF as a byte order
mark. This works because the swapped value, U+FFFE, is guaranteed
never to be a valid character. For strings that have the byte order mark
as the first character, we can guarantee the correct byte order with the
following code:
$ustr->byteswap if $ustr->ord == 0xFFFE;
- $us->unpack
- Returns a list of integers each representing a UCS-2 character code.
- $us->pack( @uchr )
- Sets the value of $us as a sequence of UCS-2
characters with the character codes given as parameters.
- $us->ord
- Returns the character code of the first character in
$us. The ord() method deals with surrogate
pairs, which gives us a result-range of 0x0 .. 0x10FFFF. If the
$us string is empty, undef is returned.
- $us->chr( $code )
- Sets the value of $us to be a string containing
the character assigned code $code. The argument
$code must be an integer in the range 0x0 ..
0x10FFFF. If the code is greater than 0xFFFF then a surrogate pair is
created.
- $us->name
- In scalar context returns the official Unicode name of the first character
in $us. In list context returns the names of all
characters in $us. Also see
Unicode::CharName.
- $us->substr( $offset )
- $us->substr( $offset, $length )
- $us->substr( $offset, $length, $subst )
- Returns a sub-string of $us. Works similarly to Perl's
builtin substr() function.
- $us->index( $other )
- $us->index( $other, $pos )
- Locates the position of $other within
$us, possibly starting the search at position
$pos.
- $us->chop
- Chops off the last character of $us and returns it
(as a "Unicode::String" object).
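Several of the methods above are easiest to follow on a concrete string. The sketch below uses "café au lait" and U+1D11E (MUSICAL SYMBOL G CLEF) as illustrative inputs; both are assumptions chosen for the example. Positions and lengths are counted in 16-bit units, which is why the character above 0xFFFF has length 2.

```perl
use strict;
use warnings;
use Unicode::String qw(utf8 uchr);

# "café au lait", passed as UTF-8 bytes.
my $us = utf8("caf\xC3\xA9 au lait");

my $len  = $us->length;         # 16-bit units, not bytes
my $pos  = $us->index("au");    # offset of "au", counted in characters
my $word = $us->substr(0, 4);   # new Unicode::String holding "café"
my $last = $us->chop;           # removes the final "t" and returns it

# A code point above 0xFFFF is stored as a surrogate pair.
my $clef  = uchr(0x1D11E);
my $code  = $clef->ord;         # ord() recombines the pair
my @units = $clef->unpack;      # the two individual 16-bit units

printf "len=%d pos=%d word=%s last=%s\n",
    $len, $pos, $word->utf8, $last->utf8;
printf "clef: length=%d ord=U+%X units=%s\n",
    $clef->length, $code, join(' ', map { sprintf('%04X', $_) } @units);
```

Note that chop() modifies $us in place, so its length is one less after the call than the value stored in $len.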
The following functions are provided. None of these are exported
by default.
- byteswap2( $str, ... )
- This function swaps the two bytes of each 2-byte unit in the strings
passed as arguments. If this function is called in void context, then it
will modify its arguments in-place. Otherwise, the swapped strings are
returned.
- byteswap4( $str, ... )
- The byteswap4 function works like byteswap2, but reverses the byte
order of each 4-byte unit.
- latin1( $str )
- utf7( $str )
- utf8( $str )
- utf16le( $str )
- utf16be( $str )
- utf32le( $str )
- utf32be( $str )
- Constructor functions for the various Unicode encodings. These return new
"Unicode::String" objects. The provided
argument should be encoded correspondingly.
- uhex( $str )
- Constructs a new "Unicode::String"
object from a string of hex values. See hex() method above for
description of the format.
- uchr( $num )
- Constructs a new one character
"Unicode::String" object from a Unicode
character code. This works similarly to Perl's builtin chr()
function.
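The functions above can be combined as in the following sketch, which converts little-endian input by hand and checks it against strings built with uhex() and uchr(). The input bytes and variable names are assumptions made for the example.

```perl
use strict;
use warnings;
use Unicode::String qw(utf16be uhex uchr byteswap2);

# UTF-16LE bytes for "µm", for example as read from a little-endian file.
my $le = "\xB5\0m\0";

# In non-void context byteswap2() returns swapped copies and leaves
# its arguments unmodified.
my ($be) = byteswap2($le);
my $us   = utf16be($be);

# uhex() accepts the "U+XXXX" notation; uchr() takes one character code.
my $same  = uhex("U+00b5 U+006d");
my $micro = uchr(0xB5);

print "same string\n" if $us->utf8 eq $same->utf8;
print $micro->hex, "\n";
```

In practice utf16le() would decode the little-endian input directly; byteswap2() is shown here only to make the byte order explicit.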
Unicode::CharName, Unicode::Map8
<http://www.unicode.org/>
perlunicode
Copyright 1997-2000,2005 Gisle Aas.
This library is free software; you can redistribute it and/or
modify it under the same terms as Perl itself.