o ASCII characters (single bytes in the range 0x00 - 0x7F) are passed through unchanged.
o Well-formed UTF-8 multi-byte characters are also passed through unchanged.
o UTF-8 multi-byte characters which are over-long but otherwise well-formed are converted to the shortest UTF-8 normal form.
o Bytes in the range 0xA0 - 0xFF are assumed to be Latin-1 characters (ISO8859-1 encoded) and are converted to UTF-8.
o Bytes in the range 0x80 - 0x9F are assumed to be Win-Latin-1 characters (CP1252 encoded) and are converted to UTF-8, except for the five bytes in this range which are not defined in CP1252 (see the ascii_hex option below).
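The two legacy-byte rules can be sketched using only the core Encode module. This is an illustration of the mapping, not the module's own implementation; the helper names legacy_byte_to_utf8 and hex_dump are hypothetical. The five bytes that CP1252 leaves undefined are deliberately skipped here, since their handling depends on the ascii_hex option.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The five bytes in 0x80-0x9F that CP1252 does not define.
my %undefined_in_cp1252 = map { $_ => 1 } (0x81, 0x8D, 0x8F, 0x90, 0x9D);

# Hypothetical helper (not part of Encoding::FixLatin): convert a single
# byte to the UTF-8 byte sequence for the character it is assumed to be.
sub legacy_byte_to_utf8 {
    my ($byte) = @_;
    my $ord = ord($byte);

    return $byte if $ord < 0x80;                 # ASCII passes through
    return       if $undefined_in_cp1252{$ord};  # left to ascii_hex

    # 0xA0-0xFF is treated as ISO8859-1, 0x80-0x9F as CP1252.
    my $encoding = $ord >= 0xA0 ? 'iso-8859-1' : 'cp1252';
    return encode('UTF-8', decode($encoding, $byte));
}

# Hypothetical helper: show a byte string as space-separated hex.
sub hex_dump {
    my ($string) = @_;
    return join ' ', map { sprintf '%02X', ord } split //, $string;
}

for my $byte ("\xE9", "\x80") {
    printf "%02X -> %s\n", ord($byte), hex_dump(legacy_byte_to_utf8($byte));
}
# prints:
#   E9 -> C3 A9
#   80 -> E2 82 AC
```

Here 0xE9 is the Latin-1 e-acute and 0x80 is the CP1252 Euro sign.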
If you pass in a string that is already a UTF-8 character string (the utf8 flag is set on the Perl scalar) then the string will simply be returned unchanged. However, if the bytes_only option is specified (see below), the returned string will be a byte string rather than a character string. The rules described above will not be applied in either case.
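The difference between a character string and a byte string can be seen with core Perl alone. The following is a minimal sketch that does not call fix_latin; the describe helper is hypothetical, and the sample text is only an example.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $bytes = "Caf\xC3\xA9";              # byte string holding UTF-8 bytes
my $chars = decode('UTF-8', $bytes);    # character string

# Hypothetical helper: report the length and utf8 flag of a scalar.
sub describe {
    my ($label, $string) = @_;
    return sprintf '%s: length %d, utf8 flag %s',
        $label, length($string), utf8::is_utf8($string) ? 'on' : 'off';
}

print describe('bytes', $bytes), "\n";   # bytes: length 5, utf8 flag off
print describe('chars', $chars), "\n";   # chars: length 4, utf8 flag on

# Encoding the character string gives back the original bytes.
print "round trip ok\n" if encode('UTF-8', $chars) eq $bytes;
```

A byte string is what you would normally write to a file handle that has no encoding layer.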
The fix_latin function accepts options as name => value pairs. Recognised options are:
|bytes_only => 1/0||
The value returned by fix_latin is normally a Perl character string and will have the utf8 flag set if it contains non-ASCII characters. If you set the bytes_only option to a true value, the returned string will be a binary string of UTF-8 bytes. The utf8 flag will not be set. This is useful if you're going to immediately use the string in an IO operation and wish to avoid the overhead of converting to and from Perl's internal representation.
|ascii_hex => 1/0||
Bytes in the range 0x80-0x9F are assumed to be CP1252; however, CP1252 does not
define a mapping for 5 of these bytes (0x81, 0x8D, 0x8F, 0x90 and 0x9D). Use
this option to specify how they should be handled.
When processing text strings you will almost certainly never encounter these bytes at all. The most likely reason you would see them is if a malicious attacker were feeding random bytes to your application. It is difficult to conceive of a scenario in which it makes sense to change this option from its default setting.
|overlong_fatal => 1/0||
An over-long UTF-8 byte sequence is one which uses more than the minimum number
of bytes required to represent the character. Use this option to specify how
overlong sequences should be handled.
There is a strong argument that overlong sequences are only ever encountered in malicious input and therefore they should always be rejected.
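To make the idea of an over-long sequence concrete, here is a small sketch for the two-byte case only. It is not the module's code; the helper names shortest_form and hex_dump are hypothetical, and longer sequences are ignored for brevity.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Encoding::FixLatin): take the two bytes
# of a two-byte UTF-8 sequence and return the shortest equivalent form.
sub shortest_form {
    my ($lead, $cont) = map { ord } @_;

    die "not a two-byte UTF-8 sequence\n"
        unless $lead >= 0xC0 && $lead <= 0xDF
            && $cont >= 0x80 && $cont <= 0xBF;

    # 110xxxxx 10yyyyyy carries the 11-bit code point xxxxxyyyyyy.
    my $code_point = (($lead & 0x1F) << 6) | ($cont & 0x3F);

    # Code points below 0x80 fit in a single byte, so spending two
    # bytes on them is over-long.
    return $code_point < 0x80
        ? chr($code_point)
        : chr($lead) . chr($cont);
}

# Hypothetical helper: show a byte string as space-separated hex.
sub hex_dump {
    my ($string) = @_;
    return join ' ', map { sprintf '%02X', ord } split //, $string;
}

for my $sequence ("\xC1\x81", "\xC3\xA9") {
    my $fixed = shortest_form(split //, $sequence);
    printf "%s -> %s\n", hex_dump($sequence), hex_dump($fixed);
}
# prints:
#   C1 81 -> 41
#   C3 A9 -> C3 A9
```

The first sequence is an over-long encoding of the letter A (0x41); the second is already in its shortest form and is left alone.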
|use_xs => auto | always | never||
This option controls whether or not the XS (compiled C) implementation of
fix_latin is used. Note that the Encoding::FixLatin::XS module must be
installed separately. The three possible values for this option are auto,
always and never.
This module is perfectly safe when handling data containing only ASCII and UTF-8 characters. Introducing ISO8859-1 or CP1252 characters does add a risk of data corruption (i.e. some characters in the input being converted to incorrect characters in the output). To quantify the risk it is necessary to understand its cause. First, let's break the input bytes into two categories.
o ASCII bytes fall into the range 0x00-0x7F - the most significant bit is always set to zero. I'll use the symbol a to represent these bytes.
o Non-ASCII bytes fall into the range 0x80-0xFF - the most significant bit is always set to one. I'll use the symbol B to represent these bytes.
A sequence of ASCII bytes (aaa) is always unambiguous and will not be misinterpreted.
The potential for error occurs with two (or more) consecutive non-ASCII bytes. For example, the sequence BB might be intended to represent two characters in one of the legacy encodings or a single character in UTF-8. Because this module gives precedence to the UTF-8 characters, it is possible that a random pair of legacy characters may be misinterpreted as a single UTF-8 character.
The risk is reduced by the fact that not all pairs of non-ASCII bytes form valid UTF-8 sequences. Every non-ASCII UTF-8 character is made up of two or more B bytes and no a bytes. For a two-byte character, the first byte must be in the range 0xC0-0xDF and the second must be in the range 0x80-0xBF.
Any pair of BB bytes that does not fall into the required ranges is unambiguous and will not be misinterpreted.
Pairs of BB bytes that are actually individual Latin-1 characters but happen to fall into the required ranges to be misinterpreted as a UTF-8 character are rather unlikely to appear in normal text. If you look those ranges up on a Latin-1 code chart you'll see that the first character would need to be an uppercase accented letter and the second would need to be a non-printable control character or a special punctuation symbol.
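The size of the ambiguous region can be counted directly from the two-byte ranges given above. This sketch only counts byte-value pairs; it says nothing about how often such pairs occur in real text, and the helper name could_be_utf8_pair is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: true if two byte values fall into the ranges
# required for a two-byte UTF-8 character (0xC0-0xDF then 0x80-0xBF).
sub could_be_utf8_pair {
    my ($first, $second) = @_;
    return $first  >= 0xC0 && $first  <= 0xDF
        && $second >= 0x80 && $second <= 0xBF;
}

# Walk every possible BB pair and count the ambiguous ones.
my ($ambiguous, $total) = (0, 0);
for my $first (0x80 .. 0xFF) {
    for my $second (0x80 .. 0xFF) {
        $total++;
        $ambiguous++ if could_be_utf8_pair($first, $second);
    }
}

printf "%d of %d BB pairs (%.1f%%) are ambiguous\n",
    $ambiguous, $total, 100 * $ambiguous / $total;
# prints "2048 of 16384 BB pairs (12.5%) are ambiguous"
```

So seven out of eight possible BB pairs cannot be mistaken for a two-byte UTF-8 character, and the remaining pairs are the unusual letter-plus-symbol combinations described above.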
One way to summarise the role of this module is that it guarantees to produce UTF-8 output, possibly at the cost of introducing the odd typo.
Please report any bugs to bug-encoding-fixlatin at rt.cpan.org, or through the web interface at <http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Encoding-FixLatin>. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.
You can also look for information at:
o Issue tracker
o AnnoCPAN: Annotated CPAN documentation
o CPAN Ratings
o Search CPAN
o Source code repository
Copyright 2009-2014 Grant McLean <firstname.lastname@example.org>
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
|perl v5.20.3||ENCODING::FIXLATIN (3)||2014-05-22|