Manual Reference Pages  -  CRAWL (1)

NAME

crawl - a small and efficient HTTP crawler

CONTENTS

Synopsis
Description
Examples
Acknowledgements
Authors

SYNOPSIS

crawl [-v level] [-u urlincl] [-e urlexcl] [-i imgincl] [-I imgexcl] [-d imgdir] [-m depth] [-c state] [-t timeout] [-A agent] [-R] [-E external] [url ...]

DESCRIPTION

The crawl utility starts a depth-first traversal of the web at the specified URLs. It stores all JPEG images that match the configured constraints.

The options are as follows:
-v level The verbosity level of crawl when printing information about URL processing. The default is 1.
-u urlincl A regex(3) expression that every URL has to match in order to be included in the traversal.
-e urlexcl A regex(3) expression that determines which URLs will be excluded from the traversal.
-i imgincl A regex(3) expression that all image URLs have to match in order to be stored on disk.
-I imgexcl A regex(3) expression that determines which images will not be stored.
-d imgdir Specifies the directory under which the images will be stored.
-m depth Specifies the maximum depth of the traversal. A depth of 0 means that only the URLs specified on the command line will be retrieved; a depth of -1 stands for unlimited traversal and should be used with caution.
-c state Continues a traversal that was interrupted previously. The remaining URLs will be read from the file state.
-t timeout Specifies the time in seconds that needs to pass between successive accesses of a single host. The parameter is a floating-point number. The default is five seconds.
-A agent Specifies the agent string that will be included in all HTTP requests.
-R Specifies that the crawler should ignore the robots.txt file.
-E external Specifies an external filter program that can refine which URLs are to be included in the traversal. The filter program reads URLs on stdin and outputs a single character on stdout for each: 'y' indicates that the URL may be included, 'n' that it should be excluded.
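The stdin/stdout protocol used by -E can be implemented by a short shell script. The following is a minimal sketch; the host policy and the name urlfilter are illustrative, not part of crawl itself:

```shell
#!/bin/sh
# urlfilter - hypothetical filter for crawl's -E option.
# crawl writes candidate URLs to the filter's stdin, one per line,
# and expects 'y' (include) or 'n' (exclude) on stdout for each.

# decide: print 'y' for URLs the crawl should follow, 'n' otherwise.
# This example policy keeps the traversal on www.example.com.
decide() {
    case "$1" in
        http://www.example.com/*) echo y ;;
        *) echo n ;;
    esac
}

# The real filter would answer crawl line by line until stdin closes:
#   while read -r url; do decide "$url"; done
# For demonstration, feed two sample URLs through the same loop:
printf 'http://www.example.com/a.html\nhttp://elsewhere.org/b.jpg\n' |
while read -r url; do decide "$url"; done
```

crawl would then be invoked with something like: crawl -E ./urlfilter http://www.example.com/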

The source code of existing web crawlers tends to be very complicated. crawl, by contrast, has a simple design and correspondingly simple source code.

A configuration file can be used instead of the command line arguments. The configuration file specifies the MIME type of the objects to be retrieved. To download objects other than images, the MIME type needs to be adjusted accordingly. For more information, see crawl.conf.

EXAMPLES

crawl -m 0 http://www.w3.org/

Searches for images on the index page of the World Wide Web Consortium without following any other links.
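A longer run might combine several of the options described above. The directory name, timeout, depth, and regex value below are illustrative:

```shell
# Follow links up to two levels deep, store matching JPEGs under
# ./images, restrict the traversal to URLs matching the include
# pattern, and wait 2.5 seconds between requests to the same host.
crawl -m 2 -d images -t 2.5 -u 'www\.example\.com' http://www.example.com/
```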

ACKNOWLEDGEMENTS

This product includes software developed by Ericsson Radio Systems.

This product includes software developed by the University of California, Berkeley and its contributors.

AUTHORS

The crawl utility has been developed by Niels Provos.


Powered by GSP. Visit the GSP FreeBSD Man Page Interface.
Output converted with manServer 1.07.