Gungho::Engine::POE(3) User Contributed Perl Documentation Gungho::Engine::POE(3)

Gungho::Engine::POE - POE Engine For Gungho

  engine:
    module: POE
    config:
      loop_delay: 5 
      client:
        spawn: 2
        agent:
          - AgentName1
          - AgentName2
        max_size: 16384
        follow_redirect: 2
        proxy: http://localhost:8080
      keepalive:
        keep_alive: 10
        max_open: 200
        max_per_host: 20
        timeout: 10
      dns:
        # disable: 1 If you want to disable DNS resolution by Gungho

Gungho::Engine::POE gives Gungho the full power of POE.

You can configure the POE engine in many ways. For convenience, all second-level parameter names below are written as 'parent.child'. For example, 'client.agent' means

  engine:
    module: POE
    config:
      client:
        agent: XXXXX

Or in perl,

  engine => {
    module => 'POE',
    config => {
      client => {
        agent => "XXXXX"
      }
    }
  }
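
The 'parent.child' shorthand above can be pictured with a small sketch. The helper expand_dotted() below is hypothetical and not part of Gungho; it only shows how dotted names map onto the nested structure that goes under "config".

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Gungho): expand 'parent.child' keys
# into the nested hash that belongs under engine => { config => ... }.
sub expand_dotted {
    my (%flat) = @_;
    my %nested;
    for my $key (sort keys %flat) {
        my @path = split /\./, $key;
        my $leaf = pop @path;
        my $node = \%nested;
        # Walk down the path, creating intermediate hashes as needed.
        $node = $node->{$_} ||= {} for @path;
        $node->{$leaf} = $flat{$key};
    }
    return \%nested;
}

my $engine = {
    module => 'POE',
    config => expand_dotted(
        'loop_delay'   => 5,
        'client.spawn' => 2,
        'client.agent' => 'XXXXX',
    ),
};
```

Here 'client.agent' ends up at $engine->{config}{client}{agent}, matching the nested form shown above.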

If you're embedding Gungho into another POE application, you probably don't want Gungho to call POE::Kernel->run(). This option can control that behavior.

If you don't want to start the kernel, then specify 0 for this option. The default is 1.

"loop_delay" specifies the number of seconds to wait until calling "dispatch" again. If you feel like Gungho is running slow, try setting this parameter to a smaller amount.

Setting this too low will cause your crawler to spend its time looking for URLs to dispatch instead of fetching them. Always time the requests before going to extremes with this setting.

"spawn" specifies the number of POE::Component::Client::HTTP sessions to start. This will greatly affect your fetching speed, as PoCo::Client::HTTP tends to start jamming up after a certain number of requests have been pushed onto its queue.

If you feel like all of your other settings are correct but the actual HTTP fetch is taking too long, try setting this number to something higher.

By default this is set to 2.
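
As a sketch of how "loop_delay" and "spawn" might be tuned together, the Perl-style configuration below lowers the dispatch interval and raises the number of HTTP sessions. The values 2 and 4 are illustrative assumptions, not recommendations from the author; measure your own crawl before and after changing them.

```perl
use strict;
use warnings;

# Illustrative values only; time your requests before settling on them.
my %config = (
    engine => {
        module => 'POE',
        config => {
            loop_delay => 2,    # seconds to wait before calling "dispatch" again
            client     => {
                spawn => 4,     # number of PoCo::Client::HTTP sessions to start
            },
        },
    },
);
```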

Specifies the number of seconds to keep a connection in the Keepalive connection manager.

This is an important option to tweak if you're using proxies. Even though you might be accessing thousands of different URLs, POE will think that you are in fact trying to connect to the same host because you're accessing the same proxy.

Turn this to 0 if you are using a proxy.
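
A minimal sketch of a proxy setup follows. It assumes that the option described above is the "keepalive.keep_alive" entry shown in the synopsis; the proxy URL is the placeholder from the synopsis as well.

```perl
use strict;
use warnings;

# Assumption: the option described above is keepalive.keep_alive.
my %config = (
    engine => {
        module => 'POE',
        config => {
            client => {
                proxy => 'http://localhost:8080',   # placeholder proxy URL
            },
            keepalive => {
                keep_alive => 0,   # do not hold connections to the proxy open
            },
        },
    },
);
```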

Since version 0.80, POE::Component::Client::HTTP silently decodes the content of an HTTP response. This means that, even when the HTTP header states

  Content-Type: text/html; charset=euc-jp

the content you get from $response->content() will already be decoded into Perl's internal Unicode representation. This is a side effect of POE::Component::Client::HTTP trying to handle Content-Encoding for us, and HTTP::Request also trying to be clever.

We have devised workarounds for this. You can set either of the following variables in your environment (before Gungho::Engine::POE is loaded) to enable the workarounds:

  GUNGHO_ENGINE_POE_SKIP_DECODE_CONTENT = 1
  # or
  GUNGHO_ENGINE_POE_FORCE_ENCODE_CONTENT = 1

See ENVIRONMENT VARIABLES for details.
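
If you prefer to set the variable from within your script rather than from the shell, one possible approach is a BEGIN block, which runs before any later module is loaded. This is a sketch under that assumption; the eval around require is only there so the snippet also runs where Gungho is not installed.

```perl
use strict;
use warnings;

BEGIN {
    # Must happen before Gungho::Engine::POE is loaded.
    $ENV{GUNGHO_ENGINE_POE_FORCE_ENCODE_CONTENT} = 1;
}

# Load the engine only after the environment has been prepared.
my $loaded = eval { require Gungho::Engine::POE; 1 };
warn "Gungho::Engine::POE is not available: $@" unless $loaded;
```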

Gungho::Engine::POE uses PoCo::Client::Keepalive to control the connections. For the most part this has no visible effect on the user, but the "timeout" parameter dictates exactly how long the component waits for a new connection. This means that, after it has finished fetching all the requests, the engine waits for that amount of time before terminating. This is NORMAL.

When GUNGHO_ENGINE_POE_SKIP_DECODE_CONTENT is set to a non-null value, this will install a new subroutine in HTTP::Response's namespace, and will prevent HTTP::Response from decoding its content by explicitly passing charset => 'none' to HTTP::Response's decoded_content().

This workaround is ENABLED by default.

When GUNGHO_ENGINE_POE_FORCE_ENCODE_CONTENT is set to a non-null value, this will re-encode the content back to the charset specified in the Content-Type header.

By default this option is disabled.
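
To illustrate what re-encoding content back to the header's charset means, here is a minimal sketch using the core Encode module. It is not Gungho's actual implementation; charset_from_content_type() and reencode_content() are hypothetical helpers written for this example.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: extract the charset from a Content-Type header value.
sub charset_from_content_type {
    my ($content_type) = @_;
    return $content_type =~ /charset\s*=\s*"?([^\s;"]+)/i ? $1 : undef;
}

# Hypothetical helper: turn decoded Perl characters back into bytes in the
# charset the header announced. Content is returned untouched when the
# header names no charset.
sub reencode_content {
    my ($content_type, $decoded) = @_;
    my $charset = charset_from_content_type($content_type);
    return $decoded unless defined $charset;
    return encode($charset, $decoded);
}

my $header  = 'text/html; charset=euc-jp';
my $decoded = "\x{65e5}\x{672c}\x{8a9e}";    # three kanji as Perl characters
my $bytes   = reencode_content($header, $decoded);
```

Decoding $bytes again with decode('euc-jp', $bytes) gives back the original character string, which is the round trip this workaround relies on.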

Sets up the engine.

Instantiates a PoCo::Client::HTTP session and a main session that handles the main control.

Shuts down the engine.

Sends a request to the HTTP client.

The POE engine supports multiple values in the user-agent header, but this is an exception that other engines don't support. Please define your agent string in the top-level config:

  user_agent: my_user_agent
  engine:
    module: POE
    ...

If you don't do this, components such as RobotRules won't work properly.

Xango, Gungho's predecessor, tried really hard to overcome one of my pet peeves with PoCo::Client::HTTP: while it can handle hundreds or thousands of requests, all the requests are unnecessarily stored in memory. Xango tried to solve this, but it ended up bloating the software. We may try to tackle this later.
2008-01-24 perl v5.32.1
