Text::Unaccent::PurePerl(3) User Contributed Perl Documentation Text::Unaccent::PurePerl(3)

Text::Unaccent::PurePerl - remove accents from characters

  use Text::Unaccent::PurePerl qw(unac_string);

  $unaccented = unac_string($string);

  # For compatibility with Text::Unaccent, and
  # for dealing with strings of raw octets:

  $unaccented = unac_string($charset, $octets);
  $unaccented = unac_string_utf16($octets);

  # Provided for compatibility with Text::Unaccent;
  # these serve no useful purpose in this module.
  $version = unac_version();
  unac_debug($level);

Text::Unaccent::PurePerl is a module for “unaccenting” characters, i.e., removing accents and other diacritic marks from characters. The term unaccenting is used loosely here, since this module does a lot more than just remove accents. Some examples:

  Á → A    latin letter
  Æ → AE   single letter split in two
  ƒ → f    simpler variant of same letter
  Ĳ → IJ   ligature split in two
  ¹ → 1    superscript
  ½ → 1⁄2  fraction
  ώ → ω    Greek letter
  Й → И    Cyrillic letter
  ™ → TM   various symbols
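
The difference between plain accent removal and the looser conversions listed above can be illustrated with a small sketch that uses only the core Unicode::Normalize module. This is not how Text::Unaccent::PurePerl is implemented; strip_marks() is a hypothetical helper written for this illustration. It decomposes each character canonically and deletes the combining marks, which is enough for accented letters but leaves characters such as Æ and ¹ untouched.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);    # core module

# Hypothetical helper, NOT part of Text::Unaccent::PurePerl:
# canonically decompose, then delete nonspacing combining marks.
sub strip_marks {
    my ($str) = @_;
    my $decomposed = NFD($str);
    $decomposed =~ s/\p{Mn}//g;
    return $decomposed;
}

my @samples = (
    "\x{00C1}",   # LATIN CAPITAL LETTER A WITH ACUTE
    "\x{03CE}",   # GREEK SMALL LETTER OMEGA WITH TONOS
    "\x{0419}",   # CYRILLIC CAPITAL LETTER SHORT I
    "\x{00C6}",   # LATIN CAPITAL LETTER AE (no canonical decomposition)
    "\x{00B9}",   # SUPERSCRIPT ONE (compatibility decomposition only)
);

binmode STDOUT, ':encoding(UTF-8)';
for my $char (@samples) {
    my $result = strip_marks($char);
    printf "U+%04X %s -> %s (%s)\n", ord($char), $char, $result,
        $result eq $char ? "unchanged" : "changed";
}
```

The first three samples lose their marks, while the last two come back unchanged. Conversions like Æ → AE or ¹ → 1 therefore need something beyond canonical decomposition.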

Text::Unaccent::PurePerl is a pure Perl equivalent to the Text::Unaccent module, but with the additional feature of handling modern Perl character strings. Text::Unaccent only deals with raw octet strings with an associated character coding. In addition, this module, as the name suggests, does not require a C compiler to build. The disadvantage is that this module is slower than Text::Unaccent.

The conversions done by Text::Unaccent seem inconsistent. For instance,

  Æ → AE   Text::Unaccent will convert this ...
  Œ → OE   ... but not this

One might expect the following conversions

  … → ...  U+2026 HORIZONTAL ELLIPSIS
  Œ → OE   U+0152 LATIN CAPITAL LIGATURE OE
  œ → oe   U+0153 LATIN SMALL LIGATURE OE
  ′ → '    U+2032 PRIME
  ″ → "    U+2033 DOUBLE PRIME

and more, but these aren't implemented in Text::Unaccent, so they aren't implemented in Text::Unaccent::PurePerl either. This might change in the future.

If you want a full transliteration to ASCII, use the Text::Unidecode module.

  "Русский" (input)
  "Русскии" (output from Text::Unaccent::PurePerl::unac_string)
  "Russkii" (output from Text::Unidecode::unidecode)

  "Ελληνικά" (input)
  "Ελληνικα" (output from Text::Unaccent::PurePerl::unac_string)
  "Ellinika" (output from Text::Unidecode::unidecode)

Functions exported by default: "unac_string", "unac_string_utf16", "unac_version", and "unac_debug".

unac_string CHARACTER_STRING
unac_string ENCODING, OCTET_STRING
Return the unaccented equivalent of the input string. The one-argument version assumes the input is a Perl string, i.e., a sequence of characters. (A character has an ordinal value in the range 0...(2**32-1), or more.)

The two-argument version assumes the input is a sequence of octets, i.e., raw, encoded data. (An octet is eight bits of data with ordinal value in the range 0...255.) It is essentially equivalent to the following unaccent() function

  use Text::Unaccent::PurePerl qw(unac_string);
  use Encode qw(decode encode);

  sub unaccent {
      my ($enc, $oct) = @_;
      encode($enc, unac_string(decode($enc, $oct)));
  }
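
A minimal usage sketch of the two-argument form follows. It assumes Text::Unaccent::PurePerl is installed from CPAN, and the choice of ISO-8859-1 as the encoding is only for illustration. The input is a string of octets, and the result is expected to be octets in the same encoding.

```perl
use strict;
use warnings;
use Text::Unaccent::PurePerl qw(unac_string);

# "déjà vu" as ISO-8859-1 octets (e-acute = 0xE9, a-grave = 0xE0).
my $octets = "d\xE9j\xE0 vu";

# Two-argument form: encoding name first, then the raw octets.
my $unaccented = unac_string("ISO-8859-1", $octets);

print "$unaccented\n";
```

Given the "déjà vu" example elsewhere on this page, this should print the plain ASCII text "deja vu".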
    
unac_string_utf16 OCTET_STRING
This function is mainly provided for compatibility with Text::Unaccent. It is equivalent to

    unac_string("UTF-16BE", OCTET_STRING);
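
The stated equivalence can be checked with a short sketch, assuming Text::Unaccent::PurePerl is installed. The input octets are produced with the core Encode module. The sample word is the same "νέα" used in the examples on this page.

```perl
use strict;
use warnings;
use Encode qw(encode decode);
use Text::Unaccent::PurePerl qw(unac_string unac_string_utf16);

# "νέα" (nu, epsilon with tonos, alpha) encoded as UTF-16BE octets.
my $octets = encode("UTF-16BE", "\x{03BD}\x{03AD}\x{03B1}");

my $via_utf16  = unac_string_utf16($octets);
my $via_string = unac_string("UTF-16BE", $octets);

# Per the description above, both calls should return the same octets.
print $via_utf16 eq $via_string ? "same\n" : "different\n";

# Decode back to characters and list the code points of the result.
my $chars = decode("UTF-16BE", $via_utf16);
printf "U+%04X ", ord($_) for split //, $chars;
print "\n";
```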
    
unac_version
This function is provided only for compatibility with Text::Unaccent. It returns the version of this module.
unac_debug LEVEL
This function is provided only for compatibility with Text::Unaccent. It has no effect on the behaviour of this module.

  $str1 = "déjà vu";
  $str2 = unac_string($str1);
  #     = "deja vu";

  $str1 = "νέα";
  #     = "\x{03BD}\x{03AD}\x{03B1}";

  $str2 = unac_string($str1);
  #     = "νεα";
  #     = "\x{03BD}\x{03B5}\x{03B1}";

The unaccented string $str2 consists of the three letters nu, epsilon (without the tonos), and alpha.

In contrast, the version of unac_string() in the Text::Unaccent module returns a string of octets:

  $oct2 = unac_string("UTF-8", $str1);
  #     = "\xCE\xBD\xCE\xB5\xCE\xB1";

These octets are the UTF-8 encoded equivalent of "\x{03BD}\x{03B5}\x{03B1}".
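
The relationship between the character string and its UTF-8 octets can be verified with the core Encode module alone. The sketch below does not call either unaccenting module. It only shows that the three characters and the six octets are two representations of the same text.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Unaccented "νεα" as a character string (three characters).
my $chars = "\x{03BD}\x{03B5}\x{03B1}";

# The same text as UTF-8 octets (two octets per character here).
my $octets = encode("UTF-8", $chars);

printf "%d characters, %d octets\n", length($chars), length($octets);
print join(" ", map { sprintf "%02X", ord($_) } split //, $octets), "\n";

# Decoding the octets recovers the original character string.
print decode("UTF-8", $octets) eq $chars ? "round trip ok\n" : "mismatch\n";
```

This prints "3 characters, 6 octets", then "CE BD CE B5 CE B1", then "round trip ok".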

There are currently no known bugs.

Please report any bugs or feature requests to "bug-text-unaccent-pureperl at rt.cpan.org", or through the web interface at <http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Text-Unaccent-PurePerl>. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

You can find documentation for this module with the perldoc command.

    perldoc Text::Unaccent::PurePerl

You can also look for information at:

  • RT: CPAN's request tracker

    <http://rt.cpan.org/NoAuth/Bugs.html?Dist=Text-Unaccent-PurePerl>

  • AnnoCPAN: Annotated CPAN documentation

    <http://annocpan.org/dist/Text-Unaccent-PurePerl>

  • CPAN Ratings

    <http://cpanratings.perl.org/d/Text-Unaccent-PurePerl>

  • Search CPAN

    <http://search.cpan.org/dist/Text-Unaccent-PurePerl/>

  • CPAN PASS Matrix

    <http://www.cpantesters.org/stats/dist/Text-Unaccent-PurePerl.html>

Text::Unaccent(3).

Peter John Acklam, <pjacklam@online.no>

Copyright 2008,2013 Peter John Acklam.

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

2013-03-02 perl v5.32.1
