|<B>-hB> | <B>-helpB> | <B>-?B>||Prints a brief summary of all command line options and exits.|
|<B>-cfgfileB> file||Makes <B>w3mirB> read the given configuration file. See the next section for how to write such a file.|
|<B>-rB>||Puts <B>w3mirB> into recursive mode. The default is to fetch only one document and then quit. Recursive mode means that all documents linked from the given document are fetched, as are all documents they link to in turn, and so on, but only if they are in the same directory as the start document or below it. Any document that is in or under the starting document's directory is said to be within the scope of retrieval.|
|<B>-faB>||Fetch All. Normally <B>w3mirB> will only get the document if it has been updated since the last time it was fetched. This switch turns that check off.|
|<B>-fsB>||Fetch Some. Not the opposite of <B>-faB>, but rather: fetch only the documents we don't already have. This is handy for restarting the copy of a site that was left incomplete by earlier, interrupted runs of <B>w3mirB>.|
|<B>-pB> n||Pause for n seconds between getting each document. The default is 30 seconds.|
|<B>-rpB> n||Retry Pause, in seconds. When <B>w3mirB> fails to get a document for some technical reason (mainly timeouts) the document is queued for a later retry. The retry pause is how long <B>w3mirB> waits after finishing a mirror pass before starting a new one to get the still-missing documents. This should be a long time, so that network conditions have a chance to improve. The default is 600 seconds (10 minutes), which might be a bit too short; for batch runs of <B>w3mirB> I would suggest an hour (3600 seconds) or more.|
|<B>-tB> n||Number of reTries. If <B>w3mirB> cannot get all the documents by the nth retry <B>w3mirB> gives up. The default is 3.|
|<B>-drrB>||Disable Robot Rules. The robot exclusion standard is described in http://info.webcrawler.com/mak/projects/robots/norobots.html. By default <B>w3mirB> honors this standard. This option causes <B>w3mirB> to ignore it.|
|<B>-nncB>||No Newline Conversion. Normally w3mir converts the newline format of all files that the web server says are text files. However, not all web servers are reliable, so binary files may be corrupted by the newline conversion w3mir performs. Use this option to stop w3mir from converting newlines. It also causes files to be treated as binary when written to disk, which disables the implicit newline conversion applied when saving text files on most non-Unix systems.|
|<B>-RB>||Remove files. Normally <B>w3mirB> will not remove files that are no longer on the server/part of the retrieved web of files. When this option is specified all files no longer needed or found on the servers will be removed. If <B>w3mirB> fails to get a document for any other reason the file will not be removed.|
|<B>-BB>||Batch fetch documents whose URLs are given on the command line.|
In combination with the <B>-rB> and/or <B>-lB> switch all HTML and PDF documents will be mined for URLs, but the documents will be saved on disk unchanged. When used with the <B>-rB> switch only a single URL is allowed. When not used with the <B>-rB> switch no HTML/URL processing is performed at all. When the <B>-BB> switch is used with <B>-rB>, w3mir will not do repeated mirroring reliably, since the changes w3mir needs to make in the documents to work reliably are not made. In any case it's best not to use <B>-RB> in combination with <B>-BB>, since that can result in deleting rather more documents than expected. However, if the person writing the documents being copied is good about making references relative and placing the <HTML> tag at the beginning of documents, there is a fair chance that things will work even so. But I wouldn't bet on it. Repeated mirroring will, however, work reliably if the <B>-rB> switch is not used.
When the <B>-BB> switch is specified redirects for a given document will be followed no matter where they point. The redirected-to document will be retrieved in the place of the original document. This is a potential weakness, since w3mir can be directed to fetch any document anywhere on the web.
Unless used with <B>-rB> all retrieved files will be stored in one directory using the remote filename as the local filename. I.e., http://foo/bar/gazonk.html will be saved as gazonk.html. http://foo/bar/ will be saved as bar-index.html so as to avoid name collisions for the common case of URLs ending in /.
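The flat naming rule for batch mode can be sketched in Python. This is only an illustration of the two cases the text documents, not w3mir's actual code: the function name `batch_local_name` is hypothetical, and the handling of a bare server root (http://foo/) is an assumption, since the text does not cover it.

```python
from urllib.parse import urlsplit

def batch_local_name(url, index_name="index.html"):
    """Flat local filename for a URL fetched in batch mode without -r (sketch)."""
    path = urlsplit(url).path or "/"
    if not path.endswith("/"):
        # http://foo/bar/gazonk.html -> gazonk.html
        return path.rsplit("/", 1)[-1]
    parts = [p for p in path.split("/") if p]
    if parts:
        # http://foo/bar/ -> bar-index.html, so directory URLs do not collide
        return parts[-1] + "-" + index_name
    # Assumption: the man page does not say what a bare http://foo/ becomes.
    return index_name

print(batch_local_name("http://foo/bar/gazonk.html"))
print(batch_local_name("http://foo/bar/"))
```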
|<B>-IB>||This switch can only be used with the <B>-BB> switch, and only after it on the command line or in the configuration file. When given, w3mir will get URLs from standard input (i.e., w3mir can be used as the end of a pipe that produces URLs). There should be only one URL per line of input.|
|<B>-qB>||Quiet. Turns off all informational messages, only errors will be output.|
|<B>-cB>||Chatty. <B>w3mirB> will output more progress information. This can be used if you're watching <B>w3mirB> work.|
|<B>-vB>||Version. Output <B>w3mirB>'s version.|
|<B>-sB>||Copy the given document(s) to STDOUT.|
|<B>-fB>||Forget. The retrieved documents are not saved on disk, they are just forgotten. This can be used to prime the cache in proxy servers, or to avoid saving documents whose URLs you just want to list (see <B>-lB>).|
|<B>-lB>||List the URLs referred to in the retrieved document(s) on STDOUT.|
|<B>-umaskB> n||Sets the umask, which determines the permission bits of all retrieved files. The number is taken as octal unless it starts with 0x, in which case it's taken as hexadecimal. Whatever you set this to, make sure you keep write as well as read access to created files and directories.|
Typical values are:
This option has no meaning, or effect, on Win32 platforms.
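The parsing rule for the umask argument (octal by default, hexadecimal after 0x) can be sketched in Python. The function names are hypothetical. The sample values 022 and 077 are common Unix umasks chosen for illustration, and the requested mode 0666 is the usual Unix default for new files, not something the man page states about w3mir.

```python
def parse_umask(text):
    """Parse a umask argument: octal by default, hexadecimal after 0x."""
    if text.startswith("0x"):
        return int(text, 16)
    return int(text, 8)

def file_mode(umask, requested=0o666):
    """Permission bits a new file ends up with: requested bits minus the mask."""
    return requested & ~umask

for arg in ("022", "0x12", "077"):
    mask = parse_umask(arg)
    print(arg, oct(mask), oct(file_mode(mask)))
```

Note that 022 and 0x12 denote the same mask; both leave the owner with read and write access, which is what the text asks you to preserve.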
|<B>-PB> server:port||Use the given server and port as an HTTP proxy server. If no port is given port 80 is assumed (this is the normal HTTP port). This is useful if you are inside a firewall, or use a proxy server to save bandwidth.|
|<B>-pflushB>||Proxy flush. Forces the proxy server to flush its cache and re-get the document from the source. The Pragma: no-cache HTTP/1.0 header is used to implement this.|
|<B>-irB> referrer||Initial Referrer. Set the referrer of the first retrieved document. Some servers are reluctant to serve certain documents unless this is set right.|
|<B>-agentB> agent||Set the HTTP User-Agent field's value. Some servers will serve different documents according to the WWW browser's capabilities. <B>w3mirB> normally puts <B>w3mirB>/version in this header field. Netscape uses things like <B>Mozilla/3.01 (X11; I; Linux 2.0.30 i586)B> and MSIE uses things like <B>Mozilla/2.0 (compatible; MSIE 3.02; Windows NT)B>. Remember to enclose agent strings that contain spaces in double quotes (").|
|<B>-lcB>||Lower Case URLs. Some OSes, like W95 and NT, are not case sensitive when it comes to filenames. Thus web masters using such OSes can case filenames differently in different places (apps.html, Apps.html, APPS.HTML). If you mirror to a Unix machine this can result in one file on the server becoming many in the mirror. This option lowercases all filenames so the mirror corresponds better with the server.|
If given it must be the first option on the command line.
This option does not work perfectly, especially for mixed-case host names.
|<B>-dB> n||Set the debug level. A debug level higher than 0 will produce lots of extra output for debugging purposes.|
|<B>-absB>||Force all URLs to be absolute. If you retrieve http://www.ifi.uio.no/~janl/index.html and it references foo.html, the reference is absolutified into http://www.ifi.uio.no/~janl/foo.html. In other words, you get absolute references to the origin site if you use this option.|
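The resolution that -abs performs on a relative reference is ordinary URL joining. A minimal Python sketch, using the standard library rather than w3mir's own code (the function name `absolutify` is hypothetical):

```python
from urllib.parse import urljoin

def absolutify(document_url, reference):
    """Resolve a reference found in document_url into an absolute URL."""
    return urljoin(document_url, reference)

print(absolutify("http://www.ifi.uio.no/~janl/index.html", "foo.html"))
```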
Most things can be mirrored with a (long) command line, but multi-server mirroring, authentication and some other things are only available through a configuration file. A configuration file can be specified with the <B>-cfgfileB> switch; w3mir also looks for .w3mirc (w3mir.ini on Win32 platforms) in the directory where w3mir is started.
The configuration file consists of lines of comments and directives. A directive consists of a keyword followed by a colon (:) and then one or several arguments.
# This is a comment. And the next line is a directive:
Options: recurse, remove
A comment can only start at the beginning of a line. The directive keywords are not case-sensitive, but the arguments might be.
Options: recurse | no-date-check | only-nonexistent | list-urls | lowercase | remove | batch | input-urls | no-newline-conv | list-nonmirrored
This must be the first directive in a configuration file.
  recurse see <B>-rB> switch.
  no-date-check see <B>-faB> switch.
  only-nonexistent see <B>-fsB> switch.
  list-urls see <B>-lB> option.
  lowercase see <B>-lcB> option.
  remove see <B>-RB> option.
  batch see <B>-BB> option.
  input-urls see <B>-IB> option.
  no-newline-conv see <B>-nncB> option.
  list-nonmirrored List URLs not mirrored in a file called .notmirrored (notmir on Win32). It will contain a lot of duplicate lines and quite possibly be quite large.
URL: HTTP-URL [target-directory]
The URL directive may only appear once in any configuration file.
Without the optional target directory argument it corresponds directly to the single-HTTP-URL argument on the command line.
If the optional target directory is given all documents from under the given URL will be stored in that directory, and under. The target directory is most likely only specified if the <B>AlsoB> directive is also specified.
If the URL given refers to a directory it must end in a /, otherwise you might get quite surprised at what gets retrieved.
Either one URL: directive or the single-HTTP-URL at the command-line must be given.
Also: HTTP-URL directory
This directive is only meaningful if the recurse (or <B>-rB>) option is given.
The directive enlarges the scope of a recursive retrieval to contain the given HTTP-URL and all documents in the same directory or under. Any documents retrieved because of this directive will be stored in the given directory of the mirror.
In practice this means that if the documents to be retrieved are stored on several servers, or in several hierarchies on one server, or any combination of those, then the <B>AlsoB> directive ensures that we get everything into one single mirror.
but it has inline icons or images stored in http://www.foo.org/icons/ which you will also want to get, then that will be retrieved as well by entering
Also: http://www.foo.org/icons/ icons
As with the URL directive, if the URL refers to a directory it must end in a /.
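How the URL and Also directives together define the scope and the layout of the mirror can be sketched in Python. The prefix/directory pairs come from the multi-server example configuration in this page; the function name, the plain prefix comparison and the appended index name are assumptions made for illustration, not w3mir's implementation.

```python
SCOPE = [
    # (URL prefix, directory in the mirror), as given by URL: and Also:
    ("http://www.math.uio.no/~janl/", "math/janl"),
    ("http://www.math.uio.no/drift/personer/", "math/drift"),
    ("http://www.ifi.uio.no/~janl/", "ifi/janl"),
]

def local_path(url, scope=SCOPE, index_name="index.html"):
    """Return the mirror path for url, or None if it is outside the scope."""
    for prefix, directory in scope:
        if url.startswith(prefix):
            rest = url[len(prefix):]
            if rest == "" or rest.endswith("/"):
                rest += index_name  # directory URLs need a file name
            return directory + "/" + rest
    return None

print(local_path("http://www.ifi.uio.no/~janl/pics/me.gif"))
print(local_path("http://www.uio.no/"))
```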
Another use for it is when mirroring sites that have several names that all refer to the same (logical) server:
At this point in time <B>w3mirB> has no mechanism to easily enlarge the scope of a mirror after it has been established. That means that you should survey the documents you are going to retrieve to find out which icons, graphics and other things they refer to that you want, and which other sites you might like to retrieve. If you find out that something is missing you will have to delete the whole mirror, add the needed <B>AlsoB> directives and then reestablish the mirror. This lack of flexibility in what to retrieve will be addressed at a later date.
Also-quene: HTTP-URL directory
This is like Also, except that the URL itself is also queued. The Also directive will not cause any documents to be retrieved UNLESS they are referenced by some other document w3mir has already retrieved.
Quene: HTTP-URL
This queues the URL for retrieval, but does not enlarge the scope of the retrieval. If the URL is outside the scope of retrieval it will not be retrieved anyway.
Initial-referer: referer
See the <B>-irB> option.
Ignore: wildcard
Fetch: wildcard
Ignore-RE: regular-expression
Fetch-RE: regular-expression
These four are used to set up rules about which documents, within the scope of retrieval, should be fetched and which should not. The default is to get anything that is within the scope of retrieval. That may not be practical, though, particularly for CGI scripts, server-side image maps and other things that are executed or evaluated on the server. There might be other things you want left unfetched as well.
<B>w3mirB> stores the Ignore/Fetch rules in a list. When a document is considered for retrieval the URL is checked against the list in the same order that the rules appeared in the configuration file. If the URL matches any rule the search stops at once. If it matched an Ignore rule the document is not fetched, and any URLs in other documents pointing to it will point to the document at the original server (not inside the mirror). If it matched a Fetch rule the document is fetched. If it is not matched by any rule the document is fetched.
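The first-match-wins evaluation described above can be sketched in Python. The rule patterns and URLs are hypothetical, and the rules are written directly as Python regular expressions; how w3mir turns wildcards into patterns is not modelled here.

```python
import re

# (action, pattern) pairs in configuration-file order; all are made-up examples
RULES = [
    ("fetch", re.compile(r"/keep/")),
    ("ignore", re.compile(r"\.cgi")),
    ("ignore", re.compile(r"\.map$")),
]

def decide(url, rules=RULES):
    """Return 'fetch' or 'ignore' for url; the first matching rule wins."""
    for action, pattern in rules:
        if pattern.search(url):
            return action  # stop at the first match
    return "fetch"  # no rule matched: fetch by default

print(decide("http://foo/keep/x.cgi"))  # matched by the Fetch rule first
print(decide("http://foo/bin/x.cgi"))   # falls through to the Ignore rule
print(decide("http://foo/a.html"))      # no rule matches
```

Because the search stops at the first match, a narrow Fetch rule must come before the broader Ignore rule it is meant to override.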
The perl-regular-expression is Perl's superset of the normal Unix regular expression syntax. They must be completely specified, including the prefixed m, a delimiter of your choice (except the paired delimiters: parentheses, brackets and braces), and any of the RE modifiers. E.g.,
and so on. # cannot be used as delimiter as it is the comment character in the configuration file. This also has the bad side-effect of making you unable to match fragment names (#foobar) directly. Fortunately Perl allows writing # as \043.
You must be very careful when using the RE anchors (^ and $) with the RE versions of these and the Apply directive. Given the rules:
Fetch-RE: m/foobar.cgi$/
Ignore: *.cgi
then all files called foobar.cgi will be fetched. However, if the file is referenced as foobar.cgi?query=mp3 it will not be fetched, since the $ anchor will prevent it from matching the Fetch-RE directive and it will then match the Ignore directive instead. If you want to match foobar.cgi but not foobar.cgifu you can use Perl's \b assertion, which matches a word boundary:
Fetch-RE: m/foobar.cgi\b/
Ignore: *.cgi
which will get foobar.cgi as well as foobar.cgi?query=mp3 but not foobar.cgifu. BUT, you must keep in mind that a lot of different characters make a word boundary, so maybe something more subtle is needed.
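The difference between the $ anchor and \b can be checked with a short Python sketch. Python's `re` module handles these two constructs the same way Perl does for this case; the snippet tests only the patterns and says nothing about how w3mir evaluates its Ignore wildcard.

```python
import re

dollar = re.compile(r"foobar.cgi$")     # body of m/foobar.cgi$/
boundary = re.compile(r"foobar.cgi\b")  # body of m/foobar.cgi\b/

def matches(pattern, url):
    """True if the pattern matches anywhere in url, as an unanchored RE does."""
    return pattern.search(url) is not None

for url in ("foobar.cgi", "foobar.cgi?query=mp3", "foobar.cgifu"):
    print(url, matches(dollar, url), matches(boundary, url))
```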
Apply: regular-expression
This is used to change a URL into another URL. It is a potentially very powerful feature, and it also provides ample chance for you to shoot your own foot. The whole apparatus is somewhat tentative; if you find there is a need for changes in how Apply rules work, please e-mail. If you are going to use this feature, please read the documentation for Fetch-RE and Ignore-RE first.
The <B>ApplyB> expressions are applied, in sequence, to the URLs in their absolute form. I.e., with the whole http://host:port/dir/ec/tory/file URL. It is only after this w3mir checks if a document is within the scope of retrieval or not. That means that <B>ApplyB> rules can be used to change certain URLs to fall inside the scope of retrieval, and vice versa.
The regular-expression is Perl's superset of the usual Unix regular expressions for substitution. As with Fetch and Ignore rules it must be specified fully, with the s and delimiting character. It has the same restrictions with regard to delimiters. E.g.,
to translate the path element foo to bar in all URLs.
# cannot be used as delimiter as it is the comment character in the configuration file.
Please note that w3mir expects that URLs identifying directories keep identifying directories after application of Apply rules. Ditto for files.
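Applying a sequence of substitutions to an absolute URL can be sketched in Python. This is an illustration under assumptions, not w3mir's code: the rule shown (path element foo to bar, case-insensitive) is a hypothetical stand-in for an Apply rule, and `count=1` mimics a Perl s/// without the g modifier. Python writes back-references as \1 where Perl writes $1.

```python
import re

def apply_rules(url, rules):
    """Run each (pattern, replacement, flags) substitution over url in order."""
    for pattern, replacement, flags in rules:
        url = re.sub(pattern, replacement, url, count=1, flags=flags)
    return url

# Hypothetical rule: translate the path element foo to bar
RULES = [(r"/foo/", "/bar/", re.IGNORECASE)]

print(apply_rules("http://host/dir/foo/file.html", RULES))
```

The rewritten URL still names a file, and a directory URL such as http://host/foo/ still ends in /, which is the invariant the text asks Apply rules to keep.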
Agent: agent
See the <B>-agentB> option.
Pause: n
See the <B>-pB> option.
Retry-Pause: n
See the <B>-rpB> option.
Retries: n
See the <B>-tB> option.
debug: n
See the <B>-dB> option.
umask n
See the <B>-umaskB> option.
Robot-Rules: on | off
Turn robot rules on or off. See the <B>-drrB> option.
Remove-Nomirror: on | off
If this is enabled sections between two consecutive comments in a mirrored document will be removed. This editing is performed even if batch getting is specified.
Header: html/text
Insert this complete html/text into the start of the document. This will be done even if batch is specified.
File-Disposition: save | stdout | forget
What to do with a retrieved file. The save alternative is the default. The two others correspond to the <B>-sB> and <B>-fB> options. Only one may be specified.
Verbosity: quiet | brief | chatty
How much <B>w3mirB> informs you of its progress. Brief is the default. The two others correspond to the <B>-qB> and <B>-cB> switches.
Cd: directory
Change to the given directory before starting work. If it does not exist it will be quietly created. Using this option breaks the fixup code so consider not using it, ever.
HTTP-Proxy: server:port
See the <B>-PB> switch.
HTTP-Proxy-user: username
HTTP-Proxy-passwd: password
These two are used to activate authentication with the proxy server. w3mir only supports basic proxy authentication and is quite simpleminded about it: if proxy authentication is on, w3mir will always send the credentials to the proxy. The domain concept is not supported with proxy authentication.
Proxy-Options: no-pragma | revalidate | refresh | no-store
Set proxy options. There are two ways to pass proxy options, HTTP/1.0 compatible and HTTP/1.1 compatible. Newer proxy servers understand the 1.1 way as well as the 1.0 way. With old proxy servers only the 1.0 way will work. w3mir will prefer the 1.0 way.
The only 1.0 compatible proxy option is refresh. It corresponds to the <B>-pflushB> option and forces the proxy server to pass the request to an upstream server to retrieve a fresh copy of the document.
The no-pragma option forces w3mir to use the HTTP/1.1 proxy control header. Use this only with servers you know to be new, otherwise it won't work at all. Use of any option but refresh will also cause HTTP/1.1 to be used.
revalidate forces the proxy server to contact the upstream server to validate that it has a fresh copy of the document. This is nicer to the net than the refresh option, which forces a re-get of the document whether or not the server already has a fresh copy.
no-store forbids the proxy from storing the document in anything other than transient storage. This can be used when transferring sensitive documents, but it is by no means a guarantee that the document can't be found on some storage device on the proxy server after the transfer. Cryptography, if legal in your country, is the solution if you want the contents to be secret.
refresh corresponds to the HTTP/1.0 header Pragma: no-cache or the identical HTTP/1.1 Cache-control option. revalidate and no-store correspond to max-age=0 and no-store respectively.
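The mapping from Proxy-Options values to request headers, as described above, can be sketched in Python. The function name is hypothetical. Joining several Cache-Control directives with commas is standard HTTP list syntax but an assumption about what w3mir sends, and returning no header for no-pragma on its own is likewise an assumption.

```python
def proxy_headers(options):
    """Map a collection of Proxy-Options values to request headers (sketch)."""
    options = set(options)
    if not options or options == {"no-pragma"}:
        return {}
    if options == {"refresh"}:
        # Only refresh has an HTTP/1.0 form, and the 1.0 form is preferred.
        return {"Pragma": "no-cache"}
    # no-pragma, or any option besides refresh, switches to HTTP/1.1.
    directives = {
        "refresh": "no-cache",
        "revalidate": "max-age=0",
        "no-store": "no-store",
    }
    values = [directives[name]
              for name in ("refresh", "revalidate", "no-store")
              if name in options]
    return {"Cache-Control": ", ".join(values)}

print(proxy_headers(["refresh"]))
print(proxy_headers(["no-pragma", "refresh"]))
print(proxy_headers(["revalidate", "no-store"]))
```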
Authorization
<B>w3mirB> supports only the basic authentication of HTTP/1.0. This method can assign a password to a given user/server/realm. The user is your user-name on the server. The server is the server. The realm is an HTTP concept. It is simply a grouping of files and documents. One file or a whole directory hierarchy can belong to a realm. One server may have many realms. A user may have separate passwords for each realm, or the same password for all the realms the user has access to. A combination of a server and a realm is called a domain.
Auth-Domain: server:port/realm
Give the server and port, and the realm belonging to it (making a domain), that the following authentication data holds for. You may specify the * wildcard for either or both of server:port and realm; this works well if you have only one username and password on all the servers mirrored.
Auth-User: user
Your user-name.
Auth-Passwd: password
Your password.
These three directives may be repeated, in clusters, as many times as needed to give the necessary authentication information.
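Looking up credentials by domain and building the HTTP basic authentication header can be sketched in Python. The header format (Basic followed by base64 of user:password) is standard HTTP. The host www.foo.org:80, the realm members, the function names and the first-match lookup order are assumptions made for illustration.

```python
import base64

CREDENTIALS = [
    # (Auth-Domain, Auth-User, Auth-Passwd) clusters in configuration order
    ("www.foo.org:80/members", "me", "my_password"),  # hypothetical domain
    ("*/*", "otherme", "otherpassword"),
]

def domain_matches(pattern, server_port, realm):
    """Compare a server:port/realm pattern, where * matches anything."""
    want_server, want_realm = pattern.split("/", 1)
    return want_server in ("*", server_port) and want_realm in ("*", realm)

def authorization_header(server_port, realm, credentials=CREDENTIALS):
    """Return the Authorization header value for a domain, or None."""
    for pattern, user, password in credentials:
        if domain_matches(pattern, server_port, realm):
            token = base64.b64encode(f"{user}:{password}".encode()).decode("ascii")
            return "Basic " + token
    return None

print(authorization_header("www.foo.org:80", "members"))
```

Basic authentication only encodes the password, it does not encrypt it, so anyone who can read the request can recover it.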
Disable-Headers: referer | user
Stop <B>w3mirB> from sending the given headers. This can be used for anonymity, making your retrievals harder to track. It will be even harder if you specify a generic <B>AgentB>, like Netscape.
Fixup: ...
This directive controls some aspects of the separate program w3mfix. w3mfix uses the same configuration file as w3mir since it needs a lot of the information in the <B>w3mirB> configuration file to do its work correctly. <B>w3mfixB> is used to make mirrors more browseable on filesystems (disk or CDROM), and to fix redirected URLs and do some other URL editing. If you want a mirror to be browseable off disk or CDROM you almost certainly need to run w3mfix. In many cases it is not necessary when you run a mirror to be used through a WWW server.
w3mfix is documented in a separate man page in an effort to not prolong this manpage unnecessarily.
Index-name: name-of-index-file
When retrieving URLs ending in / w3mir needs to append a filename to store them locally. The default value for this is index.html (this is the most used; its use originated in the NCSA HTTPD as far as I know). Some WWW servers use the filename Welcome.html or welcome.html instead (this was the default in the old CERN HTTPD). And servers running on limited OSes frequently use index.htm. To keep things consistent and sane w3mir and the server should use the same name. Put
when mirroring from a site that uses that convention.
Here is an example of use in the two latter cases when Welcome.html is the preferred index name:
Index-name: Welcome.html
Apply: s~/index.html$~/Welcome.html~
Similarly, if index.html is the preferred index name.
Index-name is not needed since index.html is the default index name.
There are two rather extensive example files in the <B>w3mirB> distribution.
o Just get the latest Dr-Fun if it has been changed since the last time
o Recursively fetch everything on the Star Wars site, remove what is no longer at the server from the mirror:
w3mir -R -r http://www.starwars.com/
o Fetch the contents of the Sega site through a proxy, pausing for 30 seconds between each document
w3mir -r -p 30 -P www.foo.org:4321 http://www.sega.com/
o Do everything according to w3mir.cfg
w3mir -cfgfile w3mir.cfg
o A simple configuration file
# Remember, options first, as many as you like, comma separated
Options: recurse, remove
#
# Start here:
URL: http://www.starwars.com/
#
# Speed things up
Pause: 0
#
# Don't get junk
Ignore: *.cgi
Ignore: *-cgi
Ignore: *.map
#
# Proxy:
HTTP-Proxy: www.foo.org:4321
#
# You _should_ cd away from the directory where the config file is.
cd: starwars
#
# Authentication:
Auth-domain: server:port/realm
Auth-user: me
Auth-passwd: my_password
#
# You can use * in place of server:port and/or realm:
Auth-domain: */*
Auth-user: otherme
Auth-passwd: otherpassword

# Retrieve all of janl's home pages:
Options: recurse
#
# This is the two argument form of URL:. It fetches the first into the second
URL: http://www.math.uio.no/~janl/ math/janl
#
# These say that any documents referred to that live under these places
# should be gotten too. Into the named directories. Two arguments are
# required for Also:.
Also: http://www.math.uio.no/drift/personer/ math/drift
Also: http://www.ifi.uio.no/~janl/ ifi/janl
Also: http://www.mi.uib.no/~nicolai/ math-uib/nicolai
#
# The options above will result in this directory hierarchy under
# where you started w3mir:
# w3mir/math/janl files from http://www.math.uio.no/~janl
# w3mir/math/drift from http://www.math.uio.no/drift/personer/
# w3mir/ifi/janl from http://www.ifi.uio.no/~janl/
# w3mir/math-uib/nicolai from http://www.mi.uib.no/~nicolai/
o Ignore-RE and Fetch-RE
# Get only jpeg/jpg files, no gifs
Fetch-RE: m/\.jp(e)?g$/
Ignore-RE: m/\.gif$/
As I said earlier, <B>ApplyB> has not been used for Real Work yet, that I know of. But <B>ApplyB> could be used to map all web servers at the University of Oslo inside the scope of retrieval very easily:
# Start at the main server
URL: http://www.uio.no/
# Change http://*.uio.no and http://129.240.* to be a subdirectory
# of http://www.uio.no/.
Apply: s~^http://(.*\.uio\.no(?:\d+)?)/~http://www.uio.no/$1/~i
Apply: s~^http://(129\.240\.[^:]*(?:\d+)?)/~http://www.uio.no/$1/~i
The -lc switch does not work too well.
These are not bugs.
o URLs with two /es (//) in the path component do not work as some might expect. According to my reading of the URL spec it is an illegal construct, which is a Good Thing, because I don't know how to handle it if it's legal.
o If you start at http://foo/bar/ then index.html might be gotten twice.
o Some documents point to a point above the server root, i.e., http://some.server/../stuff.html. Netscape and other browsers, in defiance of the URL standard documents, will change the URL to http://some.server/stuff.html. W3mir will not.
o Authentication is only tried if the server requests it. This might lead to a lot of extra connections going up and down, but that's the way it's gotta work for now.
w3mir's authors can be reached at email@example.com. w3mir's home page is at http://www.math.uio.no/~janl/w3mir/
|perl v5.20.3||W3MIR (1)||2016-04-03|