Text::Language::Guess(3) User Contributed Perl Documentation Text::Language::Guess(3)

Text::Language::Guess - Trained module to guess a document's language

    use Text::Language::Guess;

    my $guesser = Text::Language::Guess->new();
    my $lang = $guesser->language_guess("bill.txt");

        # prints 'en'
    print "Best fit: $lang\n";

Text::Language::Guess guesses a document's language. Its implementation is simple: Using "Text::ExtractWords" and "Lingua::StopWords" from CPAN, it determines how many of the known stopwords the document contains for each language supported by "Lingua::StopWords".

Each word in the document that is recognized as a stopword of a particular language scores one point for that language.
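The scoring idea can be sketched in a few lines of plain Perl. This is only an illustration, not the module's implementation: the stopword lists below are tiny hypothetical stand-ins for the ones "Lingua::StopWords" provides, "sketch_scores" is a made-up name, and the sketch assumes that every occurrence of a stopword counts, which the description above does not spell out.

```perl
use strict;
use warnings;

# Hypothetical, heavily abbreviated stopword lists for illustration only;
# the module itself draws its lists from Lingua::StopWords.
my %stopwords = (
    en => [qw(the is and of to not this but are)],
    de => [qw(der die das und ist nicht zu aber sind)],
);

# Build a lookup table: language => { word => 1 }
my %is_stopword;
for my $lang (keys %stopwords) {
    $is_stopword{$lang}{$_} = 1 for @{ $stopwords{$lang} };
}

# One point per word occurrence that is a stopword in a given language
# (counting per occurrence is an assumption of this sketch).
sub sketch_scores {
    my ($text) = @_;
    my %score;
    for my $word (map { lc $_ } $text =~ /(\w+)/g) {
        for my $lang (keys %is_stopword) {
            $score{$lang}++ if $is_stopword{$lang}{$word};
        }
    }
    return \%score;
}

my $scores = sketch_scores("This is not the end of a story");
print "$_: $scores->{$_}\n" for sort keys %$scores;
    # prints "en: 5"
```

Languages without a single hit do not appear in the resulting hash at all, which matches the shape of the score output shown in the examples further down.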

The "language_guess()" method takes the name of a text file as a parameter and returns the two-letter abbreviation of the language the document is most likely written in.

Supported Languages:

  • English (en)
  • French (fr)
  • Spanish (es)
  • Portuguese (pt)
  • Italian (it)
  • German (de)
  • Dutch (nl)
  • Swedish (sv)
  • Norwegian (no)
  • Danish (da)

"new()"
Initializes the guesser with all stopwords available for all supported languages. If "new" has been called before, subsequent calls will return the same precomputed stoplist map, avoiding collecting all stopwords again (as long as the number of languages stays the same, see next paragraph).

You can limit the number of searched languages by specifying the "languages" parameter and passing it an array ref of wanted languages:

        # Only guess between English and German
    $guesser = Text::Language::Guess->new(languages => ['en', 'de']);
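
The reuse of a precomputed stoplist map described for "new()" could be arranged roughly as follows. This is a sketch under stated assumptions, not the module's code: "build_stopmap" and "stopmap_for" are hypothetical names, the map contents are placeholders, and the cache is keyed on the set of requested languages, whereas the text above only says the map is reused while the number of languages stays the same.

```perl
use strict;
use warnings;

my %cache;          # language-set key => stoplist map
my $builds = 0;     # counts how often a map is actually built

# Hypothetical stand-in for collecting stopwords from Lingua::StopWords
sub build_stopmap {
    my (@languages) = @_;
    $builds++;
    return { map { $_ => {} } @languages };
}

# Return the cached map if the same set of languages was requested before.
sub stopmap_for {
    my (@languages) = @_;
    my $key = join ',', sort @languages;
    $cache{$key} ||= build_stopmap(@languages);
    return $cache{$key};
}

stopmap_for('en', 'de');
stopmap_for('de', 'en');    # same set, served from the cache
stopmap_for('en');          # different set, built afresh
print "maps built: $builds\n";
    # prints "maps built: 2"
```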
    
"language_guess($textfile)"
Reads in a text file, extracts all words, scores them using the stopword maps and returns a single two-letter string indicating the language the document is most likely written in.

"language_guess_string($string)"
Just like "language_guess", but takes a string instead of a file name.

"scores($textfile)"
Like "language_guess($textfile)", but returns a reference to a hash mapping language strings (e.g. 'en') to their scores. The entry with the highest score is the most likely language.

"scores_string($string)"
Like "scores", but takes a string instead of a file name.

    use Text::Language::Guess;

        # Guess language in a string instead of a file
    my $guesser = Text::Language::Guess->new();
    my $lang = $guesser->language_guess_string("Make love not war");
        # 'en'


        # Limit number of languages to choose from
    my $guesser = Text::Language::Guess->new(languages => ['da', 'nl']);
    my $lang = $guesser->language_guess_string(
                   "Which is closer to English, danish or dutch?");
        # 'nl'


        # Show different scores
    my $guesser = Text::Language::Guess->new();
    my $scores = $guesser->scores_string(
        "This text is English, but other languages are scoring as well");
    use Data::Dumper;
    print Dumper($scores);

        # $VAR1 = {
        #   'pt' => 1,
        #   'en' => 6,
        #   'fr' => 1,
        #   'nl' => 1
        # };
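
A score hash like the one above can be ranked to see how clear the winner is. The following is a minimal sketch that works on any hash ref of that shape; "ranked_languages" is a hypothetical helper, not part of the module, and breaking ties alphabetically is an arbitrary choice made here.

```perl
use strict;
use warnings;

# Rank a score hash ref (shaped like the result of scores() or
# scores_string()) from highest to lowest score; ties are broken
# alphabetically by language code.
sub ranked_languages {
    my ($scores) = @_;
    return sort { $scores->{$b} <=> $scores->{$a} or $a cmp $b }
           keys %$scores;
}

# The score hash from the example above
my $scores = { pt => 1, en => 6, fr => 1, nl => 1 };

my ($best, $runner_up) = ranked_languages($scores);
my $margin = $scores->{$best} - $scores->{$runner_up};
print "Best fit: $best (ahead of $runner_up by $margin)\n";
    # prints "Best fit: en (ahead of fr by 5)"
```

A small margin suggests the text is short or mixes languages, in which case the guess deserves less trust.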

Copyright 2005 by Mike Schilli, all rights reserved. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

2005, Mike Schilli <cpan@perlmeister.com>
2005-11-20 perl v5.32.1
