|-a||This causes the program to ask the user whether to download each page that it hasn't been otherwise instructed to fetch (by default, this means off-site pages).|
|-f string||This causes the program to always follow links to URLs that contain the string. You can use this, for example, to prevent a crawl from going beyond a single directory on a site (in conjunction with the -x option below); say you wanted to get http://www.web-sites.co.uk/jules but not any other site located on the same server. You could use the command
webcrawl -x -f /jules www.web-sites.co.uk/jules/ mirror
Another use: if a site contains links to pictures, videos or sound clips (for example) on a remote server, you could use the following command line to get them:
webcrawl -f .jpg -f .gif -f .mpg -f .wav -f .au www.site.com/ mirror
Note that webcrawl always downloads inline images.
|-d string||The opposite of -f, this option tells webcrawl never to get a URL containing the string. -d takes priority over all other URL selection options (except that it won't stop webcrawl from downloading inline images, which are always fetched).|
|Causes webcrawl to log unfollowed links to the file filename.|
|-x||Causes webcrawl not to automatically follow links to pages on the same server. This is useful in conjunction with the -f option to specify a subsection of an entire site to download.|
|-X||Causes webcrawl not to automatically download inline images (which it would otherwise do even when other options did not indicate that the image should be loaded). This is useful in conjunction with the -f option to specify a subsection of an entire site to download, when even the images concerned need careful selection.|
|-n||Turns off page rewriting completely.|
Select which URLs to rewrite. Only URLs that begin with / or http: are considered for rewriting; all others are always left unchanged. This option selects which of these URLs are rewritten to point to local files, depending on the value of
|-k||Keep original filenames - disables the rewriting of filenames that removes metacharacters which may confuse a web server, and that ensures the filename ends in a correct .html or .htm extension whenever the page has a text/html content type. (See Configuration Files below for a discussion of how to achieve this with other file types.)|
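As an illustration, the quoting that -k disables can be sketched in shell. This is a hypothetical reconstruction from the description in Configuration Files below (quote char @, metacharacter list ?&*%=#); the case of the hex digits is an assumption, not taken from webcrawl itself:

```shell
# Sketch of the default quoting scheme: each metacharacter is replaced
# by the quote char '@' followed by its hex ASCII value.
# 'name' is a made-up example; hex-digit case is assumed.
name='page.cgi?a=1&b=2'
quoted=$(printf '%s' "$name" | sed 's/?/@3f/g; s/&/@26/g; s/=/@3d/g')
printf '%s\n' "$quoted"
```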
|-q||Disable process ID insertion into query filenames. Whenever -k is not in use, webcrawl rewrites the filenames of queries (defined as any fetch from a web server whose filename includes a ? character) to include, after the (escaped) ? in the filename, the process ID in hexadecimal of the webcrawl instance fetching the query; this may be desirable when performing the same query multiple times to get different results. This flag disables that behaviour.|
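To make the query renaming concrete, here is a small shell sketch of the transformation described above. The PID value and the exact placement are illustrative assumptions; a real run uses webcrawl's own process ID:

```shell
# Illustrative only: rewrite a query filename the way the text
# describes, inserting a made-up PID in hex after the escaped '?'.
pid=4660                               # example PID; 4660 = 0x1234
name='search.cgi?q=foo'
escaped=$(printf '%s' "$name" | sed "s/?/@3f$(printf '%x' "$pid")/")
printf '%s\n' "$escaped"
```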
|This option is used to limit the depth to which webcrawl will search the tree (forest) of interlinked pages. Two limits may be set: with x as l, the initial limit is set; with x as r, the limit used after jumping to a remote site is set. If x is omitted, both limits are set.|
|-v||Increases the program's verbosity. Without this option, no status reports are made unless errors occur. Used once, webcrawl will report which URLs it is trying to download, and also which links it has decided not to follow. -v may be used more than once, but this is probably only useful for debugging purposes.|
|-o dir||Change the server root directory. This is the directory that the path specified at the end of the command line is relative to.|
|-p prefix||Change the URL rewriting prefix. This is prepended to rewritten URLs, and should be a (relative) URL that points to the current server root directory. An example of the use of the -o and -p options is given below:|
webcrawl -o /home/jules/public_html -p /~jules www.site.com/page.html mirrors
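The correspondence between -o and -p in the example above can be sketched as follows. The on-disk path here is a hypothetical illustration; only the root/prefix relationship is taken from the description:

```shell
# A file saved under the -o root is reachable via the -p prefix:
root=/home/jules/public_html     # -o: server root on disk
prefix=/~jules                   # -p: relative URL pointing at that root
file=$root/mirrors/page.html     # hypothetical saved page
printf '%s\n' "$prefix${file#"$root"}"   # URL form used in rewritten links
```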
|Causes webcrawl to send the specified string as the HTTP User-Agent value, rather than the compiled-in default (normally Mozilla/4.05 [en] (X11; I; Linux 2.0.27 i586; Nav), although this can be changed in the file web.h at compile time).|
|-t n||Specifies a timeout, in seconds. Default behaviour is to give up after this length of time from the initial connection attempt.|
|-T||Changes the timeout behaviour. With this flag, the timeout occurs only if no data is received from the server for the specified length of time.|
webcrawl uses configuration files at present to specify rules for the rewriting of filenames. It searches for files in /etc/webcrawl.conf, /usr/local/etc/webcrawl.conf, and $HOME/.webcrawl, and processes all files it finds in that order. Parameters set in one file may be overridden by subsequent files. Note that it is perfectly possible to use webcrawl without a configuration file; one is required only for advanced features that are too complex to configure on the command line.
The overall syntax of the webcrawl file is a set of sections, each headed by a line of the form [section-name].
At present, only the [rename] section is defined. This may contain the following commands:
meta string
Sets the metacharacter list. Any character in the list specified will be quoted in filenames produced (unless filename rewriting is disabled with the -k option). Quoting is performed by prepending the quoting character (default @) to the hexadecimal ASCII value of the character being quoted. The default metacharacter list is: ?&*%=#
quote char
Sets the quoting character, as described above. The default is: @
type content/type preferred [extra extra ...]
Sets the list of acceptable extensions for the specified MIME content type. The first item in the list is the preferred extension; if renaming is not disabled (with the -k option) and the extension of a file of this type is not on the list, the first extension on the list will be appended to its name.
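For instance, a configuration file restating the built-in defaults described above would look like this:

```
[rename]
meta ?&*%=#
quote @
type text/html html htm
```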
An implicit line is defined internally, which reads:
type text/html html htm
This could be overridden; if, say, you preferred the htm extension over html, you could use:
type text/html htm html
in a configuration file to cause .htm extensions to be used whenever a new extension was added.
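The effect of that override can be sketched in shell. The case statement mirrors the extension rule described above; the filename is hypothetical:

```shell
# With 'type text/html htm html' in effect, a text/html page whose
# name lacks an acceptable extension gets the preferred one appended.
name=products.asp                 # hypothetical saved filename
case $name in
  *.htm|*.html) ;;                # extension already acceptable
  *) name=$name.htm ;;            # append the preferred extension
esac
printf '%s\n' "$name"
```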
WebCrawl was written by Julian R. Hall <email@example.com> with suggestions and prompting by Andy Smith.
Bugs should be submitted to Julian Hall at the address above. Please include information about the architecture, version, etc., you are using.