crawl — a small and efficient HTTP crawler
  
crawl [-v level] [-u urlincl] [-e urlexcl] [-i imgincl] [-I imgexcl]
      [-d imgdir] [-m depth] [-c state] [-t timeout] [-A agent] [-R]
      [-E external] [url ...]
  
The crawl utility starts a depth-first
    traversal of the web at the specified URLs. It stores all JPEG images that
    match the configured constraints.
The options are as follows:
-v level
      The verbosity level of crawl with regard to printing information
      about URL processing. The default is 1.

-u urlincl
      A regex(3) expression that all URLs included in the traversal
      have to match.

-e urlexcl
      A regex(3) expression that determines which URLs will be
      excluded from the traversal.

-i imgincl
      A regex(3) expression that all image URLs have to match in order
      to be stored on disk.

-I imgexcl
      A regex(3) expression that determines which images will not be
      stored.

-d imgdir
      Specifies the directory under which the images will be stored.

-m depth
      Specifies the maximum depth of the traversal. A depth of 0 means
      that only the URLs specified on the command line will be
      retrieved. A depth of -1 stands for unlimited traversal and
      should be used with caution.

-c state
      Continues a traversal that was interrupted previously. The
      remaining URLs will be read from the file state.

-t timeout
      Specifies the time in seconds that needs to pass between
      successive accesses of a single host. The parameter is a float.
      The default is five seconds.

-A agent
      Specifies the agent string that will be included in all HTTP
      requests.

-R    Specifies that the crawler should ignore the robots.txt file.

-E external
      Specifies an external filter program that can refine which URLs
      are to be included in the traversal. The filter program reads
      the URLs on stdin and outputs a single character on stdout. An
      output of ‘y’ indicates that the URL may be included; ‘n’ means
      that the URL should be excluded. A minimal example filter is
      sketched after this list.
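The following is a minimal sketch of such a filter, written in Python for
illustration. Any program that follows the stdin/stdout protocol described
above can be used; the inclusion policy here (matching the word "gallery")
and the script name below are merely assumptions for the example, and the
exact framing of the answer (for instance whether a trailing newline is
accepted) should be checked against the crawl source.

      #!/usr/bin/env python3
      # Hypothetical -E filter: reads URLs on stdin, one per line, and
      # answers each with a single character on stdout:
      # 'y' to include the URL, 'n' to exclude it.
      import sys

      def wanted(url):
          # Example policy (an assumption for illustration, not part of
          # crawl): include only URLs that contain the word "gallery".
          return "gallery" in url

      def main():
          while True:
              line = sys.stdin.readline()
              if not line:        # crawl closed the pipe; we are done
                  break
              sys.stdout.write("y" if wanted(line.strip()) else "n")
              sys.stdout.flush()  # answer at once so crawl is not left waiting

      if __name__ == "__main__":
          main()

Such a filter might be invoked as, for example,
crawl -E ./filter.py http://www.example.com/ (the script name and host are
placeholders).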
The source code of existing web crawlers tends to be very complicated.
crawl has a very simple design and straightforward source code.
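As an illustration of that simplicity, the core of such a traversal can be
written down in a few lines. The following is a conceptual Python sketch,
not the actual C implementation: robots.txt handling, the per-host timeout,
image matching and storage, and state saving are all omitted, and the URL
pattern and start URL are placeholders.

      import re
      import urllib.request
      from html.parser import HTMLParser
      from urllib.parse import urljoin

      class LinkParser(HTMLParser):
          """Collects href attributes from anchor tags."""
          def __init__(self):
              super().__init__()
              self.links = []

          def handle_starttag(self, tag, attrs):
              if tag == "a":
                  for name, value in attrs:
                      if name == "href" and value:
                          self.links.append(value)

      def crawl(start_urls, url_include, max_depth):
          """Depth-first traversal of links matching url_include."""
          include = re.compile(url_include)
          seen = set()
          # Stack of (url, depth) pairs; popping the newest entry first
          # makes the traversal depth-first.
          stack = [(u, 0) for u in start_urls]
          while stack:
              url, depth = stack.pop()
              if url in seen or not include.search(url):
                  continue
              seen.add(url)
              try:
                  page = urllib.request.urlopen(url, timeout=10)
                  text = page.read().decode("utf-8", "replace")
              except (OSError, ValueError):
                  continue
              print(depth, url)
              if max_depth >= 0 and depth >= max_depth:
                  continue   # -m semantics: do not descend any further
              parser = LinkParser()
              parser.feed(text)
              for link in parser.links:
                  stack.append((urljoin(url, link), depth + 1))

      if __name__ == "__main__":
          # Roughly corresponds to: crawl -m 1 -u 'example' http://www.example.com/
          crawl(["http://www.example.com/"], r"example", max_depth=1)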
A configuration file can be used instead of the command line arguments.
The configuration file contains the MIME type that is being used; to
download objects other than images, the MIME type needs to be adjusted
accordingly. For more information, see crawl.conf.
The command

      crawl -m 0 http://www.w3.org/

searches for images on the index page of the web consortium without
following any other links.
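A further, hypothetical example (the host name and image directory are
placeholders): the command

      crawl -m 2 -d /tmp/images -i '\.jpg$' http://www.example.com/

follows links up to two levels deep from the given page and stores only
images whose URLs end in .jpg under /tmp/images.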
This product includes software developed by Ericsson Radio
    Systems.
This product includes software developed by the University of
    California, Berkeley and its contributors.
The crawl utility has been developed by
    Niels Provos.