# NAME

FEAR::API - Web Scraping Zen

# SYNOPSIS

```
FEAR = ∑( WWW Crawler, Data Extractor, Data Munger,
          (X|HT)ML Parser, ..., Yucky Overloading ) = ∞ = ☯ = 禪
```

# DESCRIPTION

FEAR::API is a tool that helps you spend less time writing site-scraping scripts, and lets you do it in a more elegant way. It combines strong and powerful features from various CPAN modules, such as LWP::UserAgent, WWW::Mechanize, Template::Extract, Encode, and HTML::Parser, and digests them into a deeper Zen.

However, this module probably violates every single rule of any Perl coding standard. Please stop here if you don't want to see yucky code.

This module originated from a short-term project in which I was asked to extract data from several commercial websites. During development I noticed a lot of redundant code, and I tried to reduce the code size and build something specialized for this one job: site scraping (also known as web scraping or screen scraping). Before writing this module I surveyed existing site scrapers and information-extraction tools, and none of them really satisfied my needs. I meditated on what my ideal tool should look like, the ideas gradually solidified in my mind, and then I created FEAR::API. It is a highly specialized module with a domain-specific syntax.

You may be used to building a browser emulator with WWW::Mechanize, but you still need to write extra code to parse the content, and, once you have extracted the data, more code to store it in a database or plain text files. That may be easy for you, but it is not always done quickly. That's why FEAR::API is here: it encapsulates the components needed in any site-scraping flow and tries to help you speed up the whole process.

# THE FIVE ELEMENTS

There are five essential elements in this module:

- FEAR::API::Agent
- FEAR::API::Document
- FEAR::API::Extract
- FEAR::API::Filter
- FEAR::API

FEAR::API::Agent is the crawler component: it fetches web pages and passes their contents to FEAR::API::Document. FEAR::API::Document stores the fetched documents. FEAR::API::Extract performs data extraction on documents. FEAR::API::Filter does pre-processing on documents and post-processing on extracted results; this component lets you clean up fetched pages and refine extracted results. FEAR::API is the public interface, and everything is handled and coordinated internally by it. Generally you interact only with this package, and it is supposed to solve most of your problems.

The architecture is not complicated. The most bewildering part is probably the over-simplified syntax: some users who have tried the example code still had no idea what was really going on with this module. After adding parallel prefetching based on Larbin, I decided to start writing this documentation. (And I started to regret, a little, that I created this module.)

# USAGE

## The first line

```perl
use FEAR::API -base;
```

To `-base`, or not to `-base`. That is no question. Using FEAR::API with `-base` means your current package becomes a subclass of FEAR::API, and `$_` is auto-initialized as a FEAR::API object. Using it without `-base` is like using any other OO Perl module: you instantiate the object yourself and call methods on it explicitly.

```perl
use strict;
use FEAR::API;

my $f = fear();
$f->url("blah");
# blah, blah, blah.....
```

## Fetch a page

```perl
url("google.com");
fetch();
```
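For readers who prefer to see a whole script at once, here is a minimal sketch that strings the calls above together under `-base`. It uses nothing beyond `url()`, `fetch()`, and `document->as_string`, all of which appear elsewhere in this document:

```perl
use strict;
use FEAR::API -base;

url("google.com");             # push the URL onto the internal queue
fetch();                       # pop it off the queue and fetch it
print document->as_string;     # dump the fetched document to STDOUT
```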
FEAR::API maintains a URL queue internally. Every time you call url(), your arguments are pushed onto the queue, and when you call fetch(), the URL at the front is popped and requested. If the request succeeds, the fetched document is stored in a FEAR::API::Document. fetch() not only pops the front element of the queue, it also takes arguments: if you pass a URL to fetch(), FEAR::API fetches the one you specify and temporarily ignores the URL queue.

## Fetch a page and store it in a scalar

```perl
fetch("google.com") > my $content;

my $content = fetch("google.com")->document->as_string;
```

## Fetch a page and print to STDOUT

```perl
getprint("google.com");

print fetch("google.com")->document->as_string;

fetch("google.com");
print $$_;

fetch("google.com") | _print;
```

## Fetch a page and save it to a file

```perl
getstore("google.com", 'google.html');

url("google.com")->() | _save_as("google.html");

fetch("google.com") | io('google.html');
```

## Dispatch Links

### Deal with links in a web page (I)

Once you have fetched a page, you will probably need to process the links it contains. FEAR::API provides the method dispatch_links() (or report_links()) for this job. dispatch_links() takes a list of (regular expression => action) pairs. For each link in the page, if it matches a given regular expression (a rule), the corresponding action is taken. You can also set fallthrough_report(1) to test all of the rules instead of stopping at the first match.

`>>` is overloaded and is equivalent to calling dispatch_links() or report_links(). fallthrough_report() is automatically set to 1 if `>>` is followed by an array reference `[]`, and to 0 if it is followed by a hash reference `{}`.

In the following examples the constant `_self` is used as an action, which means that all links matching the rule are pushed back onto the URL queue.

Verbose:

```perl
fetch("http://google.com")
  ->report_links( qr(^http:) => _self,
                  qr(google) => \my @l,
                  qr(google) => sub { print ">>>".$_[0]->[0],$/ } );
fetch while has_more_urls;
print Dumper \@l;
```

Minimal:

```perl
url("google.com")->()
  >> [ qr(^http:) => _self,
       qr(google) => \my @l,
       qr(google) => sub { print ">>>".$_[0]->[0],$/ } ];
$_->() while $_;
print Dumper \@l;
```

Equivalent code:

```perl
url("tw.yahoo.com")->();
my @l;
foreach my $link (links){
    $link->[0] =~ /^http:/   and url($link)               and next;
    $link->[0] =~ /tw.yahoo/ and push @l, $link            and next;
    $link->[0] =~ /tw.yahoo/ and print ">>>".$link->[0],$/ and next;
}
fetch while has_more_links;
print Dumper \@l;
```

### Deal with links in a web page (II)

Verbose:

```perl
fetch("http://google.com")
  ->fallthrough_report(1)
  ->report_links( qr(^http:) => _self,
                  qr(google) => \my @l,
                  qr(google) => sub { print ">>>".$_[0]->[0],$/ } );
fetch while has_more_urls;
print Dumper \@l;
```

Minimal:

```perl
url("google.com")->()
  >> { qr(^http:) => _self,
       qr(google) => \my @l,
       qr(google) => sub { print ">>>".$_[0]->[0],$/ } };
$_->() while $_;
print Dumper \@l;
```

Equivalent code:

```perl
url("tw.yahoo.com")->();
my @l;
foreach my $link (links){
    $link->[0] =~ /^http:/   and url($link);
    $link->[0] =~ /tw.yahoo/ and push @l, $link;
    $link->[0] =~ /tw.yahoo/ and print ">>>".$link->[0],$/;
}
fetch while has_more_links;
print Dumper \@l;
```

### Follow links in Google's homepage

```perl
url("google.com")->() >> _self;
&$_ while $_;
```

### Save links in Google's homepage

```perl
url("google.com")->() >> _self | _save_as_tree("./root");
$_->() | _save_as_tree("./root") while $_;
```

### Recursively get web pages from Google

```perl
url("google.com");
&$_ >> _self while $_;
```

In English: line 1 sets the initial URL. Line 2 says that while there are more links in the queue, FEAR::API keeps fetching and feeding the extracted links back to itself.
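If the overloaded form feels too terse, the following sketch spells the same spider out using the verbose calls shown earlier (`url()`, `fetch()`, `report_links()`, `has_more_urls`). It assumes, as the earlier examples suggest, that the boolean test on `$_` corresponds to `has_more_urls` and that fetch() returns the FEAR::API object so report_links() can be chained:

```perl
use strict;
use FEAR::API -base;

url("google.com");                               # seed the URL queue
# Fetch the next queued URL and push every http link it contains
# back onto the queue, until the queue is exhausted.
fetch()->report_links( qr(^http:) => _self )
    while has_more_urls;
```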
### Recursively get web pages from Google and save them

```perl
url("google.com");
&$_ >> _self | _save_as_tree("./root") while $_;
```

In English: line 1 sets the initial URL. Line 2 says that while there are more links in the queue, FEAR::API keeps fetching, feeds the extracted links back to itself, and saves each fetched document into a tree structure rooted at "root" on the file system. And guess what? This is the minimal web spider written in Perl. (Well, at least I am not aware of any other pure-Perl implementation this small.)

### Crawling with domain constraints

```perl
allow_domains( qr(google), qr(blahblah) );

deny_domains( qr(microsoft), qr(bazzbazz) );
```

## Mechanize fans?

FEAR::API borrows (or steals) some useful methods from WWW::Mechanize.

Follow the second link on Google's homepage:

```perl
url("google.com")->()->follow_link(n => 2);
```

Return the links from Google's homepage:

```perl
print Dumper fetch("google.com")->links;
```

Submit a query to Google:

```perl
url("google.com")->();
submit_form( form_number => 1,
             fields      => { q => "Kill Bush" } );
```

## Get links of some pattern

If you have used curl, you may have embedded multiple URLs in a single command line. FEAR::API offers similar functionality based on Template Toolkit. In the following code, the initial URLs are http://some.site/a, http://some.site/b, ..., http://some.site/z.

```perl
url("[% FOREACH i = ['a'..'z'] %] http://some.site/[% i %] [% END %]");
&$_ while $_;
```

## Extraction

Use template() to set up the template for extraction. Note that FEAR::API wraps your template in `[% FOREACH rec %]` and `[% END %]` if your extraction method is set to Template::Extract.

preproc() (or doc_filter()) helps you clean up the document before you apply your template. postproc() (or result_filter()) is called after extraction. The argument can be of two kinds: a string of Perl code to be evaluated, or a named filter. Named filters are documented in FEAR::API::Filters.

### Extract data from CPAN

```perl
url("http://search.cpan.org/recent")->();
submit_form( form_name => "f",
             fields    => { query => "perl" });
template("<!--item-->[% p %]<!--end item-->");
# [% FOREACH rec %]<!--item-->[% p %]<!--end item-->[% END %], actually.
extract;
print Dumper extresult;
```

### Extract data from CPAN after some HTML cleanup

```perl
url("http://search.cpan.org/recent")->();
submit_form( form_name => "f",
             fields    => { query => "perl" });
# Only the section between <!--results--> and <!--end results--> is wanted.
preproc(q(s/\A.+<!--results-->(.+)<!--end results-->.+\Z/$1/s));
print document->as_string;    # print the cleaned-up content to STDOUT
template("<!--item-->[% p %]<!--end item-->");
extract;
print Dumper extresult;
```

### HTML cleanup, extract data, and refine results

```perl
url("http://search.cpan.org/recent")->();
submit_form( form_name => "f",
             fields    => { query => "perl" });
preproc(q(s/\A.+<!--results-->(.+)<!--end results-->.+\Z/$1/s));
template("<!--item-->[% rec %]<!--end item-->");
extract;
postproc(q($_->{rec} =~ s/<.+?>//g));    # Strip HTML tags brutally
print Dumper extresult;
```

### Use the filtering syntax

```perl
fetch("http://search.cpan.org/recent");
submit_form( form_name => "f",
             fields    => { query => "perl" })
  | _doc_filter(q(s/\A.+<!--results-->(.+)<!--end results-->.+\Z/$1/s))
  | _template("<!--item-->[% rec %]<!--end item-->")
  | _result_filter(q($_->{rec} =~ s/<.+?>//g));
print Dumper \@$_;
```

This is like piping in a shell. Site scraping is really just a flow of data: a process that turns data into information. People pipe sort, wc, uniq, head, and so on in the shell to extract the thing they need. In FEAR::API, site scraping is data munging in the same spirit: every piece of information goes through multiple filters before the wanted result comes out.
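If you want to walk the extracted records yourself instead of dumping them, the array dereference of `$_` shown above (`print Dumper \@$_`) can be looped over directly. This is a hedged sketch, not canonical usage: it assumes each extracted record is a hash reference keyed by the template variables (here `rec`), and that piping filters off fetch() works as in the `fetch(...) | _print` example:

```perl
use strict;
use FEAR::API -base;

fetch("http://search.cpan.org/recent")
  | _doc_filter(q(s/\A.+<!--results-->(.+)<!--end results-->.+\Z/$1/s))
  | _template("<!--item-->[% rec %]<!--end item-->")
  | _result_filter(q($_->{rec} =~ s/<.+?>//g));

# Each element of @$_ is assumed to be one extracted record.
for my $record (@$_) {
    print $record->{rec}, "\n";
}
```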
## Invoke a handler for extracted results

Once you have extracted results, you can write handlers to process the data. invoke_handler() takes arguments such as "Data::Dumper", "YAML", a subroutine reference, an object-relational mapper, and so on; the list of accepted argument types is expected to grow.

```perl
fetch("http://search.cpan.org/recent");
submit_form( form_name => "f",
             fields    => { query => "perl" })
  | _doc_filter(q(s/\A.+<!--results-->(.+)<!--end results-->.+\Z/$1/s))
  | "<!--item-->[% rec %]<!--end item-->"
  | _result_filter(q($_->{rec} =~ s/<.+?>//g));
invoke_handler('Data::Dumper');
```

## Named Filters

Here are examples of using the named filters provided by FEAR::API itself.

Preprocess a document:

```perl
url("google.com")->()
  | _preproc(use => "html_to_null")
  | _preproc(use => "decode_entities")
  | _print;
```

Postprocess extraction results:

```perl
fetch("http://search.cpan.org/recent");
submit_form( form_name => "f",
             fields    => { query => "perl" })
  | _doc_filter(q(s/\A.+<!--results-->(.+)<!--end results-->.+\Z/$1/s))
  | _template("<!--item-->[% rec %]<!--end item-->")
  | _result_filter(use => "html_to_null", qw(rec))
  | _result_filter(use => "decode_entities", qw(rec));
print Dumper \@$_;
```

## ORMs

FEAR::API makes it easy to send your extracted data straight to a database. All you need to do is set up an ORM and invoke the mapper once new results have been extracted. (Though I still think this is not quick enough. Ideally you would not write any ORM classes at all; FEAR::API should secretly build them for you.)

```perl
template($template);
extract;
invoke_handler('Some::Module::based::on::Class::DBI');
# or
invoke_handler('Some::Module::based::on::DBIx::Class::CDBICompat');
```

## Scraping a file

You can also use FEAR::API to extract data from local files, which means you can let other web crawlers fetch the pages and use FEAR::API only for the scraping.

```perl
file('some_file');

url('file:///the/path/to/your/file');
```

Because the document is loaded from your local file system, you then need to tell FEAR::API what the content type is. By default, FEAR::API assumes local files are plain text.

```perl
force_content_type('text/html');
```
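Putting the pieces of this section together, here is a hedged sketch of scraping a local HTML file end to end, using only calls already shown in this document (`file()`, `force_content_type()`, `template()`, `extract`, `extresult`); the file name is, of course, hypothetical:

```perl
use strict;
use FEAR::API -base;
use Data::Dumper;

file('local_copy.html');                          # hypothetical local file
force_content_type('text/html');                  # it is HTML, not plain text
template("<!--item-->[% rec %]<!--end item-->");  # extraction template
extract;
print Dumper extresult;
```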
fetch("search.cpan.org"); # Fetch page in tab 1 tab 0; # Switch back to tab 0 template($template); # Continue processing in tab 0 extract(); keep_tab 1; # Keep tab 1 only and close others close_tab 1; # Close tab 1 RSSYou can create RSS feeds easily with FEAR::API.use FEAR::API -base, -rss; my $url = "http://google.com"; url($url)->(); rss_new( $url, "Google", "Google Search Engine" ); rss_language( 'en' ); rss_webmaster( 'xxxxx@yourdomain.com' ); rss_twice_daily(); rss_item(@$_) for map{ [ $_->url(), $_->text() ] } links; die "No items have been added." unless rss_item_count; rss_save('google.rss'); See also XML::RSS::SimpleGen Prefetching and document cachingHere I have designed two options for doing prefetching and document caching. One is purely written in Perl, and the other is a C++ web crawling engine. The perl solution is simple, easy-to-install, but not really efficient I think. The C++ crawler is extremely fast. It claims that it fetches 100 million pages on a home PC, with a good network. However, the C++ crawler is much more complex than the simple pure-perl prefetching.Native perl prefetching based on fork() use FEAR::API -base, -prefetching; Simple, and not efficient C++ parallel crawling based on pthread use FEAR::API -base, -larbin; Larbin is required. Amazingly fast. See also <http://larbin.sourceforge.net/index-eng.html> and larbin/README. The default document repository is at /tmp/fear-api/pf. (Non-changeable for now). ONE-LINERSfearperl -e 'fetch("google.com")' perl -M'FEAR::API -base' -e 'fetch("google.com")' ARTICLEThere is also an article about this module. Please see <http://www.perl.com/pub/a/2006/06/01/fear-api.html>.DEBATEThis module has been heavily criticized on Perlmonks. Please go to <http://perlmonks.org/?node_id=537504> for details.EXAMPLESThere are some example scrapers available with this module. Please go to examples/.SEE ALSOWWW::Mechanize, LWP::UserAgent, LWP::Simple, perlrequick, perlretut, perlre, perlreref, Regexp::Bind, Template::Extract, Template, IO::All, XML::Parser, XML::XPath, XML::RSS, XML::RSS::SimpleGen, Data::Dumper, YAML, Class::DBI, DBIx::ClassLarbin <http://larbin.sourceforge.net/index-eng.html> FEAR::Web, a web interface based on FEAR::API. <http://rt.openfoundry.org/Foundry/Project/?Queue=609> (But it needs much work.) AUTHOR & COPYRIGHTCopyright (C) 2006 by Yung-chung Lin (a.k.a. xern) <xern@cpan.org>This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself POD ERRORSHey! The above document had some coding errors, which are explained below: