Manual Reference Pages  -  WEBCHECK (1)

NAME

webcheck - website link checker

SYNOPSIS

webcheck [OPTION]... URL

DESCRIPTION

webcheck will check the document at the specified URL for links to other documents, follow these links recursively and generate an HTML report.

-i, --internal=PATTERN
  Mark URLs matching the PATTERN (perl-type regular expression) as internal links. Can be used multiple times. Note that the PATTERN is matched against the full URL. URLs matching this PATTERN will be considered internal, even if they match one of the --external PATTERNs.

-x, --external=PATTERN
  Mark URLs matching the PATTERN (perl-type regular expression) as external links. Can be used multiple times. Note that the PATTERN is matched against the full URL.

-y, --yank=PATTERN
  Do not check URLs matching the PATTERN (perl-type regular expression). This is similar to the -x flag, except that webcheck will not check the matched link at all, whereas -x checks the link but not its children. Can be used multiple times. Note that the PATTERN is matched against the full URL.

-b, --base-only
  Consider any URL not starting with the base URL to be external. For example, if you run
webcheck -b http://www.example.com/foo

then http://www.example.com/foo/bar will be considered internal whereas http://www.example.com/ will be considered external. By default all the pages on the site will be considered internal.

-a, --avoid-external
  Avoid external links. Normally if webcheck is examining an HTML page and it finds a link that points to an external document, it will check to see if that external document exists. This flag disables that action.

--ignore-robots
  Do not retrieve and parse robots.txt files. By default robots.txt files are retrieved and honored. If you are sure you want to ignore and override the webmaster’s decision this option can be used.
For more information on robots.txt handling see the NOTES section below.

-q, --quiet, --silent
  Do not print out progress as webcheck traverses a site.

-d, --debug
  Print debugging information while crawling the site. This option is mainly useful for developers.

-o, --output=DIRECTORY
  Output directory. Use this option to specify the directory where webcheck will write its reports. The default is the current directory or as specified by config.py. If this directory does not exist it will be created for you (if possible).

-c, --continue
  Try to continue from a previous run. When using this option webcheck will look for a webcheck.dat in the output directory. This file is read to restore the state from the previous run. This allows webcheck to continue a previously interrupted run. When this option is used, the --internal, --external and --yank options will be ignored as well as any URL arguments. The --base-only and --avoid-external options should be the same as the previous run.
Note that this option is experimental and its semantics may change with coming releases (especially in relation to other options). Also note that the stored files are not guaranteed to be compatible between releases.

-f, --force
  Overwrite files without asking. This option is required for running webcheck non-interactively.

-r, --redirects=N
  Redirect depth: the number of redirects webcheck should follow when following a link. 0 means follow all redirects.

-u, --userpass=URL
  Specify a URL with username and password information to use for basic authentication when visiting the site.
e.g. http://test:secret@example.com/
This option may be specified multiple times.

-w, --wait=SECONDS
  Wait SECONDS between document retrievals. Usually webcheck will process a URL and immediately move on to the next. However, on some loaded systems it may be desirable to have webcheck pause between requests. This option can be set to any non-negative number.

-v, --version
  Show version of program.

-h, --help
  Show short summary of options.
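
The --userpass option carries its credentials in the userinfo part of the URL. The following minimal Python sketch shows how such a URL breaks down into username, password and host using the standard urllib.parse module. It is an illustration only, not webcheck's own code; the helper name split_credentials is hypothetical.

```python
from urllib.parse import urlsplit


def split_credentials(url):
    """Return (username, password, host) for a URL such as
    http://test:secret@example.com/ (hypothetical helper)."""
    parts = urlsplit(url)
    # username and password are None when the URL has no userinfo part
    return parts.username, parts.password, parts.hostname


if __name__ == "__main__":
    print(split_credentials("http://test:secret@example.com/"))
```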

URL CLASSES

URLs are divided into two classes:

Internal URLs are retrieved and the retrieved item is checked for syntax. Also, the retrieved item is searched for links to other items (of any class) and these links are followed.

External URLs are only retrieved to test whether they are valid and to gather some basic information from them (title, size, content-type, etc). The retrieved items are not inspected for links to other items.

Apart from their class, URLs can also be considered yanked (as specified with the --yank or --avoid-external options). Yanked URLs can be either internal or external and will not be retrieved or checked at all. URLs of unsupported schemes are also considered yanked.
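
The classification rules above can be sketched in Python as follows. This is a rough model, not webcheck's actual implementation. It rests on several assumptions: Python's re module stands in for perl-type regular expressions, "on the site" is taken to mean the same scheme and host as the base URL, --internal patterns are assumed to take precedence over --base-only as well as --external, and the function name classify is hypothetical.

```python
import re
from urllib.parse import urlsplit


def classify(url, base_url, internal=(), external=(), yank=(),
             base_only=False, avoid_external=False):
    """Return (url_class, yanked) for url (hypothetical helper).

    internal, external and yank are iterables of regular expression
    strings that are searched against the full URL.
    """
    def matches(patterns):
        return any(re.search(p, url) for p in patterns)

    if base_only:
        # --base-only: anything not starting with the base URL is external
        default_internal = url.startswith(base_url)
    else:
        # Assumption: "on the site" means same scheme and host as the base
        base, target = urlsplit(base_url), urlsplit(url)
        default_internal = (base.scheme, base.netloc) == (target.scheme, target.netloc)

    if matches(internal):
        url_class = "internal"  # --internal wins over --external
    elif matches(external):
        url_class = "external"
    else:
        url_class = "internal" if default_internal else "external"

    # Yanked URLs keep their class but are not retrieved at all
    yanked = matches(yank) or (avoid_external and url_class == "external")
    return url_class, yanked


if __name__ == "__main__":
    base = "http://www.example.com/"
    print(classify("http://www.example.com/webcheck/x", base, external=["/webcheck"]))
```

Returning the class and the yanked flag separately mirrors the text: a yanked URL is still internal or external, it is just never fetched.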

EXAMPLES

Check the site www.example.com but consider any path with "/webcheck" in it to be external.
webcheck http://www.example.com/ -x /webcheck
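
Because --force is required for non-interactive use, a scripted run has to assemble a few options together. The Python sketch below builds such a command line from the options documented above (--force, --quiet, --output, --external, --wait). The function names and the "report" directory are hypothetical, and webcheck is assumed to be installed and on the PATH if the command is actually run.

```python
import subprocess


def webcheck_command(url, output_dir, external=(), wait=None):
    """Build an argument list for a non-interactive webcheck run."""
    cmd = ["webcheck", "--force", "--quiet", "--output=" + output_dir]
    for pattern in external:
        cmd.append("--external=" + pattern)
    if wait is not None:
        cmd.append("--wait=%s" % wait)
    cmd.append(url)  # the URL comes last, as in the SYNOPSIS
    return cmd


def run_webcheck(cmd):
    """Run the command; raises CalledProcessError on a non-zero exit."""
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    cmd = webcheck_command("http://www.example.com/", "report",
                           external=["/webcheck"], wait=1)
    # Only print the command here; call run_webcheck(cmd) to execute it.
    print(" ".join(cmd))
```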

NOTES

When checking internal URLs, webcheck honors the robots.txt file, identifying itself as user-agent webcheck. Disallowed links will not be checked at all, as if the -y option were specified for that URL. To allow webcheck to crawl parts of a site that are disallowed for other robots, use something like:
User-agent: *
Disallow: /foo

User-agent: webcheck
Allow: /foo
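
The effect of the rules above can be checked with Python's standard urllib.robotparser module. This sketch only shows how a standards-following parser reads the example file; whether webcheck parses robots.txt the same way is an assumption, and the agent name otherbot is made up for the comparison.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /foo

User-agent: webcheck
Allow: /foo
"""


def allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if user_agent may fetch url under robots_txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "http://www.example.com/foo/page.html"
    print("webcheck:", allowed("webcheck", url))
    print("otherbot:", allowed("otherbot", url))
```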

ENVIRONMENT

<scheme>_proxy
  Proxy URL to use for requests with the given <scheme> (for example, http_proxy for http URLs).
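
The <scheme>_proxy naming convention is the same one Python's standard library reads from the environment. The sketch below sets an http_proxy variable and looks it up with urllib.request.getproxies_environment(). The proxy address is a placeholder, and whether webcheck resolves its proxies through this exact call is an assumption.

```python
import os
import urllib.request

# Placeholder proxy address, set here only for the demonstration
os.environ["http_proxy"] = "http://proxy.example.com:3128"


def proxy_for(scheme):
    """Return the proxy URL configured for scheme, or None."""
    # getproxies_environment() scans the environment for <scheme>_proxy
    return urllib.request.getproxies_environment().get(scheme)


if __name__ == "__main__":
    print(proxy_for("http"))
```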

REPORTING BUGS

Bug reports should be sent to the current maintainer <arthur@ch.tudelft.nl>. More information on reporting bugs can be found on the webcheck homepage:
http://ch.tudelft.nl/~arthur/webcheck/

COPYRIGHT

Copyright © 1998, 1999 Albert Hopkins (marduk)
Copyright © 2002 Mike W. Meyer
Copyright © 2005, 2006, 2007, 2008 Arthur de Jong
webcheck is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
The files produced as output from the software do not automatically fall under the copyright of the software, unless explicitly stated otherwise.


Version 1.10.3 WEBCHECK (1) Jul 2008
