Manual Reference Pages  -  LINGUA::ZH::TOKE (3)

NAME

Lingua::ZH::Toke - Chinese Tokenizer

VERSION

This document describes version 0.02 of Lingua::ZH::Toke, released January 11, 2004.

SYNOPSIS



    use Lingua::ZH::Toke;

    # -- if inputs are unicode strings, use the two lines below instead
    # use utf8;
    # use Lingua::ZH::Toke 'utf8';

    # Create Lingua::ZH::Toke::Sentence object (->Sentence also works)
    my $token = Lingua::ZH::Toke->new( 'XXXX/XXXXX/XXXXXX' );

    # Easy tokenization via array dereferencing
    print $token->[0]           # Fragment       - XXXX
                ->[2]           # Phrase         - XX
                ->[0]           # Character      - X
                ->[0]           # Pronounciation - XXXX
                ->[2];          # Phonetic       - X

    # Magic histogram via hash dereferencing
    print $token->{XXXX};     # 1 - One such fragment there
    print $token->{XXXX};     # 1 - One such phrase there
    print $token->{XXXX};     # undef - That's not a phrase
    print $token->{X};        # 2 - Two such characters there
    print $token->{XX};       # 2 - Two such pronunciations: XX
    print $token->{X};        # 3 - Three such phonetics: XXX

    # Iteration over fragments
    while (my $fragment = <$token>) {
        # Iteration over phrases
        while (my $phrase = <$fragment>) {
            # ...
        }
    }



DESCRIPTION

This module puts a thin wrapper around Lingua::ZH::TaBE by blessing references to TaBE’s objects into their English counterparts.

Besides offering more readable class names, this module also offers various overloaded methods for tokenization; please see SYNOPSIS for the three major ones.
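
To illustrate the kind of overloading the SYNOPSIS relies on, here is a minimal, self-contained sketch built on Perl's core overload pragma. It does not use Lingua::ZH::Toke or TaBE, and it is not how this module is implemented: the Demo::Tokens class, its slash-separated input and its internal state are assumptions made up for the example.

```perl
use strict;
use warnings;

# Hypothetical class, not part of Lingua::ZH::Toke. It only shows how one
# object can answer ->[N], ->{KEY} and <$obj> through operator overloading.
package Demo::Tokens;

use overload
    '@{}' => sub { $_[0]->()->{items} },    # $obj->[N]
    '%{}' => sub { $_[0]->()->{count} },    # $obj->{KEY}
    '<>'  => sub {                          # <$obj>
        my $state = $_[0]->();
        my $items = $state->{items};
        if ($state->{pos} >= @$items) {
            $state->{pos} = 0;              # rewind for the next loop
            return undef;
        }
        return $items->[ $state->{pos}++ ];
    },
    '""'  => sub { join '/', @{ $_[0]->()->{items} } },
    fallback => 1;

sub new {
    my ($class, $text) = @_;
    my @items = split m{/}, $text;
    my %count;
    $count{$_}++ for @items;                # simple histogram of the pieces
    my %state = (items => \@items, count => \%count, pos => 0);
    # The object is a blessed closure, so the overloaded array and hash
    # dereferences never recurse into themselves.
    return bless sub { \%state }, $class;
}

package main;

my $t = Demo::Tokens->new('ab/cd/ab');
print $t->[1], "\n";                        # array dereference
print $t->{ab}, "\n";                       # hash dereference (histogram)
while (my $item = <$t>) {                   # iteration
    print "item: $item\n";
}
```

The closure-based object is one possible design choice; it keeps the real state out of reach of the overloaded dereference operators.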

Since Lingua::ZH::TaBE is a Big5-oriented module, we also provide a simple utf8 layer around it; if you have Perl version 5.6.1 or later, just use this:



    use utf8;
    use Lingua::ZH::Toke 'utf8';



With the utf8 flag set, all Toke objects will stringify to Unicode strings, and constructors will take either Unicode strings or Big5-encoded bytestrings.

Note that on Perl 5.6.x, Encode::compat is needed for the utf8 feature to work.

METHODS

The constructor methods correspond to the six object levels: ->Sentence, ->Fragment, ->Phrase, ->Character, ->Pronounciation and ->Phonetic. Each of them takes one string argument, representing the string to be tokenized.

The ->new method is an alias for ->Sentence.

All object methods, except ->new, are passed to the underlying Lingua::ZH::TaBE object.
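
Passing methods through to a wrapped object is commonly done in Perl with AUTOLOAD. The sketch below shows that general technique only; whether this module uses AUTOLOAD internally is not stated here, and the Demo::Backend and Demo::Wrapper classes and the size method are hypothetical names invented for the example.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the wrapped backend object.
package Demo::Backend;

sub new  { my ($class, $text) = @_; return bless { text => $text }, $class }
sub size { return length $_[0]{text} }

# Hypothetical wrapper: any method it does not define is forwarded.
package Demo::Wrapper;

our $AUTOLOAD;

sub new {
    my ($class, $text) = @_;
    return bless { inner => Demo::Backend->new($text) }, $class;
}

sub AUTOLOAD {
    my $self = shift;
    (my $name = $AUTOLOAD) =~ s/.*:://;     # strip the package prefix
    return if $name eq 'DESTROY';           # never forward destruction
    my $inner  = $self->{inner};
    my $method = $inner->can($name)
        or die "No such method: $name\n";
    return $inner->$method(@_);
}

package main;

my $w = Demo::Wrapper->new('hello');
print $w->size, "\n";                       # answered by Demo::Backend
```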

CAVEATS

Under utf8 mode, you may sometimes need to explicitly stringify the return values, so their utf8 flag can be properly set:



    $value = $token->[0];       # this may or may not work
    $value = "$token->[0]";     # this is guaranteed to work
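
The difference between the two assignments can be seen with any class that overloads stringification. The following sketch uses a hypothetical Demo::Text class rather than this module, and assumes UTF-8 input purely for illustration: copying the object keeps a reference, while interpolating it yields a plain string whose utf8 flag can be checked with utf8::is_utf8.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical class, not part of Lingua::ZH::Toke: it stores a decoded
# string and returns it whenever the object is used as a string.
package Demo::Text;

use overload '""' => sub { ${ $_[0] } }, fallback => 1;

sub new {
    my ($class, $bytes) = @_;
    my $chars = Encode::decode('UTF-8', $bytes);
    return bless \$chars, $class;
}

package main;

my $obj    = Demo::Text->new("\xe4\xb8\xad");   # UTF-8 bytes for U+4E2D
my $copy   = $obj;                              # still an object
my $string = "$obj";                            # explicit stringification

print ref($copy)             ? "copy is an object\n"    : "copy is a string\n";
print ref($string)           ? "string is an object\n"  : "string is a string\n";
print utf8::is_utf8($string) ? "utf8 flag is set\n"     : "utf8 flag is not set\n";
```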



This module does not care about efficiency or memory consumption yet, hence it’s likely to fail miserably if you demand either of them. Patches welcome.

As the name suggests, the chosen interface is very bizarre. Use it at the risk of your own sanity.

SEE ALSO

Lingua::ZH::TaBE, Encode::compat, Encode

AUTHORS

Autrijus Tang <autrijus@autrijus.org>

COPYRIGHT

Copyright 2003, 2004 by Autrijus Tang <autrijus@autrijus.org>.

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

See <http://www.perl.com/perl/misc/Artistic.html>

perl v5.20.3 LINGUA::ZH::TOKE (3) 2004-01-11
