Manual Reference Pages  -  INDEXER.CONF (5)

NAME

indexer.conf - configuration file for indexer

CONTENTS

Description
Variables

DESCRIPTION

This is the configuration file for indexer(1). The configuration file consists of commands and their arguments. All commands are case-insensitive. You can use # to comment out lines.
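
A small sketch of these two rules, using a command described under VARIABLES (the alternate spelling is left commented out because the database commands should be used only once):

```
# This line is a comment and is ignored by indexer
DBName udmsearch
# Command names are case-insensitive, so this spelling would be equivalent:
# dbname udmsearch
```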

VARIABLES

Commands to define SQL Database connection parameters
  These commands should be used only once and take global effect for the whole configuration file.
DBHost host
  SQL host name (Not required for ODBC)

Default: localhost

DBName udmsearch
  SQL database name or ODBC DSN

Default: udmsearch

DBUser foo
  User name used to connect to the database

Default: no user

DBPass bar
  Password used to connect to the database

Default: no password

Other options
FollowOutside yes|no
  Allow/disallow indexer to walk outside the current server. Use with care (see the MaxHops command).

Default: no

Period seconds
  Reindex period in seconds (604800 = 1 week). May be used before every Server command and takes effect till the end of config file or till next Period command.
Tag number
  Use this parameter for your own purposes, for example to group several servers together. May be used multiple times before every Server command and takes effect till the end of config file or till next Tag command.
MaxHops number
  Maximum distance in "mouse clicks" (link hops) from the start URL given in the Server command. May be used multiple times before every Server command and takes effect till the end of config file or till next MaxHops command.

Default: 256
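
Period, Tag and MaxHops all apply to the Server commands that follow them until they are set again. A minimal sketch of that scoping; the numeric values are illustrative choices and the URLs are placeholders:

```
# First server: reindex daily, group 1, at most 3 hops from the start URL
Period  86400
Tag     1
MaxHops 3
Server  http://www.yoursite.com/

# Second server: reindex weekly, group 2; MaxHops 3 is still in effect
Period  604800
Tag     2
Server  http://localhost/
```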

MaxNetErrors number
  Maximum number of network errors for each server. If there are too many network errors on some server (server is down, host unreachable etc.), indexer will make no more than number attempts to connect to this server. May be used multiple times before Server command and takes effect till the end of config file or till next MaxNetErrors command.

Default: 16

SyslogFacility facility
  Useful only if indexer is compiled with syslog support and you don’t like the default. The argument is the same as used in the syslog.conf file (for example: local7, daemon). For the list of possible facilities see syslog.conf(5). Takes global effect and should be used only once!

Default: depends on compilation

TitleWeight number
  Weight of the words in the <title>...</title> element. Can be set multiple times before Server command and takes effect till the end of config file or till next TitleWeight command.

Default: 2

BodyWeight number
  Weight of the words in the <body>...</body> of the html documents and in the contents of the text/plain documents. Can be set multiple times before Server command and takes effect till the end of config file or till next BodyWeight command.

Default: 1

DescWeight number
  Weight of the words in the <META NAME="Description" Content="..."> tag. Can be set multiple times before Server command and takes effect till the end of config file or till next DescWeight command.

Default: 2

KeywordWeight number
  Weight of the words in the <META NAME="Keywords" Content="..."> tag. Can be set multiple times before Server command and takes effect till the end of config file or till next KeywordWeight command.

Default: 2

UrlWeight number
  Weight of the words in the URL of the documents. Can be set multiple times before Server command and takes effect till the end of config file or till next UrlWeight command.

Default: 0
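
The weight commands follow the same scoping as the other per-server commands. In the sketch below the non-default values are arbitrary illustrations, not recommendations, and the URLs are placeholders:

```
# Rank title and URL words higher for this server
TitleWeight 4
UrlWeight   1
Server      http://www.yoursite.com/

# Return to the documented defaults for the servers that follow
TitleWeight 2
UrlWeight   0
Server      http://localhost/
```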

DeleteBad yes|no
  Allow/prevent indexer deleting bad (not found, forbidden etc.) URLs from the database. Useful if you want to check the ’integrity’ of your server(s): if you set it to no, "bad" URLs will remain in the database. Can be set multiple times before Server command and takes effect till the end of config file or till next DeleteBad command.

Default: yes

Robots yes|no
  Allows/disallows using robots.txt and <META NAME="robots"> exclusions. Useful if you want to check the ’integrity’ of your server(s). Can be set multiple times before Server command and takes effect till the end of config file or till next Robots command.

Default: yes

Index yes|no
  Allow/prevent indexer storing words into the database. Index no is useful if you want to check the ’integrity’ of your server(s). Can be set multiple times before Server command and takes effect till the end of config file or till next Index command.

Note: Instead of Index no you can use the alternate form NoIndex

Default: yes

Follow yes|no
  Allow/disallow indexer to store <a href="..."> links into the database. Can be set multiple times before Server command and takes effect till the end of config file or till next Follow command.

Note: Instead of Follow no you can use the alternate form NoFollow

Default: yes
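
Combining the commands above, a link-checking pass that keeps bad URLs and stores no words might look like the sketch below. The pairing of DeleteBad and Index is an illustrative choice and the URLs are placeholders:

```
# Integrity check only: keep "bad" URLs, store no words
DeleteBad no
Index     no
Server    http://www.yoursite.com/

# Normal indexing for the servers that follow
DeleteBad yes
Index     yes
Server    http://localhost/
```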

MaxDocSize size
  Limits the maximum document size. size is in bytes. If a document is larger than size, indexer parses only its first size bytes.

Default: 1048576 (which is 1 megabyte)
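
For instance, to lower the limit to 256 kilobytes (256 x 1024 = 262144 bytes; the value is only an illustration):

```
# Parse at most the first 262144 bytes of each document
MaxDocSize 262144
```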

Mime <from_mime> <to_mime>[;charset] [ command line [$1] ]

This is used to add support for parsing documents with mime types other than text/plain and text/html. It can be done via an external parser (which should produce plain text or html output) or just by substituting the mime type so that indexer understands it directly.

<from_mime> and <to_mime> are standard mime types. <to_mime> should be either text/plain or text/html, because these are the only types that indexer understands.
The external parser is assumed to write its results to stdout (if it does not, you have to write a little script that cats the results to stdout).

The optional charset parameter changes the charset if needed.

The command line parameter is optional. If there’s no command line, the command only changes the mime type. The command line may also contain a $1 parameter, which stands for a temporary file name. Some parsers cannot operate on stdin, so indexer creates a temporary file for the parser and passes its name in place of $1.
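
The two forms described above can be sketched as follows. The type text/x-log is a made-up example, and catdoc is assumed to be an installed converter that takes a file name and writes plain text to stdout; neither comes from this manual page. Quoting rules for the command line are not stated here, so check them against your indexer version.

```
# No command line: documents served as text/x-log are treated as plain text
Mime text/x-log text/plain

# With a command line: indexer puts the temporary file name in place of $1
Mime application/msword text/plain catdoc $1
```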

CharSet charset
  charset is the default character set of the server(s) in the following Server command(s). May be used before every Server command and takes effect till the end of config file or till next CharSet command.

Currently indexer supports Cyrillic koi8-r, cp1251, cp866, iso8859-5, x-mac-cyrillic, Arabic cp1256, Western iso-8859-1, Central Europe iso-8859-2 and cp1250 character sets.

This parameter is the default character set for "bad" servers that do not send charset information in the header (just "Content-type: text/html" instead of, for example, "Content-type: text/html; charset=koi8-r") and do not send charset information in META tags.

Examples:
 

CharSet koi8-r
CharSet windows-1250
CharSet ISO-8859-1

ForceIISCharset1251 yes|no
  This option is useful for users who deal with Cyrillic content and broken (or misconfigured?) Microsoft IIS web servers, which tend not to report the charset correctly. This is a really dirty hack, but if this option is turned on it is assumed that all servers which report themselves as ’Microsoft’ or ’IIS’ have content in the Windows-1251 codepage. This command should be used only once in the configuration file and takes global effect.

Default: no

Proxy your.proxy.host[:port]
  Use a proxy rather than connecting directly. FTP servers can be indexed only when using a proxy. If port is not specified, it defaults to 3128 (Squid). If the proxy host is not specified, a direct connection is made. Can be set before every Server command and takes effect till the end of config file or till next Proxy command.
Examples:
  Proxy atoll.anywhere.com
- proxy on atoll.anywhere.com, port 3128

Proxy lota.anywhere.com:8090
- proxy on lota.anywhere.com, port 8090

Proxy
- turn off proxy usage (direct connection)
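
Because Proxy takes effect till the next Proxy command, it can be switched on for some servers and off for others. A sketch reusing the placeholder host names from this page:

```
# Index the ftp server through the proxy on port 8090
Proxy  lota.anywhere.com:8090
Server ftp://ftp.yourdomain.com/pub/

# Empty Proxy: the servers that follow are fetched directly
Proxy
Server http://www.yoursite.com/
```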

Server URL
  This is the main configuration command. Use it to add the start URL of a server to be indexed. You may use many Server commands in the same indexer.conf file.
Examples:
 

Server http://localhost/
Server http://www.yoursite.com/
Server http://www.yoursite.com/~yourname/
Server ftp://ftp.yourdomain.com/pub/

AuthBasic login:passwd
  Use HTTP basic authorization. Can be set before every Server command and takes effect only for the next Server command.
Examples:
 

AuthBasic somebody:something

If you have password-protected directories but the rest of the server is open, use:

AuthBasic login1:passwd1
Server http://my.server.com/my/secure/directory1/
AuthBasic login2:passwd2
Server http://my.server.com/my/secure/directory2/
Server http://my.server.com/

CheckOnly regexp [regexp [...] ]
  Indexer will use the HEAD instead of the GET http method for URLs that match regexp. This means that the file will only be checked and will not be downloaded. Useful for zip, exe, arj etc. files. You can give several arguments to one CheckOnly command. You can use this command any number of times, but the number of filters must not exceed MAXFILTER in indexer.h. Takes global effect for config file.
Examples:
  #Use HEAD method for some known non-text extensions:
CheckOnly \.b$  \.sh$   \.md5$
CheckOnly \.arj$ \.tar$ \.zip$ \.tgz$ \.gz$
CheckOnly \.lha$ \.lzh$ \.tar\.Z$ \.rar$ \.zoo$
CheckOnly \.gif$ \.jpg$ \.jpeg$ \.bmp$ \.tiff$
CheckOnly \.vdo$ \.mpeg$ \.mpe$ \.mpg$ \.avi$ \.movie$
CheckOnly \.mid$ \.mp3$ \.rm$ \.ram$ \.wav$ \.aiff$
CheckOnly \.vrml$ \.wrl$
CheckOnly \.exe$ \.cab$ \.dll$ \.bin$ \.class$
CheckOnly \.tex$ \.texi$ \.xls$ \.doc$ \.texinfo$
CheckOnly \.rtf$ \.pdf$ \.cdf$ \.ps$
CheckOnly \.ai$ \.eps$ \.ppt$ \.hqx$
CheckOnly \.cpt$ \.bms$ \.oda$ \.tcl$
CheckOnly \.rpm$
HrefOnly regexp [regexp [...] ]
  Indexer scans html documents that match regexp as it would scan any other URLs, except that it will not index their contents. It will add any URLs it finds in the html document to the database. Useful when indexing mailing list archives with big index pages which contain mostly URLs. You can give several arguments to one HrefOnly command. You can use this command any number of times, but the number of filters must not exceed MAXFILTER in indexer.h. Takes global effect for config file.
Examples:
  #Scan these files for href tags only, but do not index their contents.
HrefOnly mail.*\.html$ thr.*\.html$
Disallow regexp [regexp [...] ]
  Use this to disallow indexing of documents with URLs that match the given regexp. You can give several arguments to one Disallow command. You can use this command several times, but the number of filters must not exceed MAXFILTER in indexer.h. Takes global effect for config file.
Example:
  #Exclude cgi-bin and non-parsed-headers
Disallow /cgi-bin/ \.cgi /nph

#Exclude some known extensions
Disallow \.b$   \.sh$   \.md5$
Disallow \.arj$ \.tar$ \.zip$ \.tgz$ \.gz$
Disallow \.lha$ \.lzh$ \.tar\.Z$ \.rar$ \.zoo$
Disallow \.gif$ \.jpg$ \.jpeg$ \.bmp$ \.tiff$
Disallow \.vdo$ \.mpeg$ \.mpe$ \.mpg$ \.avi$ \.movie$
Disallow \.mid$ \.mp3$ \.rm$ \.ram$ \.wav$ \.aiff$ \.ra$
Disallow \.vrml$ \.wrl$
Disallow \.exe$ \.cab$ \.dll$ \.bin$ \.class$
Disallow \.tex$ \.texi$ \.xls$ \.doc$ \.texinfo$
Disallow \.rtf$ \.pdf$ \.cdf$ \.ps$
Disallow \.ai$ \.eps$ \.ppt$ \.hqx$
Disallow \.cpt$ \.bms$ \.oda$ \.tcl$
Disallow \.rpm$

#Exclude Apache directory list in different sort order
Disallow \?D=A$ \?D=D$ \?M=A$ \?M=D$ \?N=A$ \?N=D$ \?S=A$ \?S=D$

#Exclude ./. and ./.. from Apache and Squid directory list
Disallow /[.]{1,2} /\%2e /\%2f

EXAMPLE

This is a minimal sample indexer config file:

DBHost          localhost
DBName          udmsearch
DBUser          foo
DBPass          bar
Server          http://localhost/
Disallow /cgi-bin/ \.cgi /nph
Disallow \.b$   \.sh$   \.md5$
Disallow \.arj$ \.tar$ \.zip$ \.tgz$ \.gz$
Disallow \.lha$ \.lzh$ \.tar\.Z$ \.rar$ \.zoo$
Disallow \.gif$ \.jpg$ \.jpeg$ \.bmp$ \.tiff$
Disallow \.vdo$ \.mpeg$ \.mpe$ \.mpg$ \.avi$ \.movie$
Disallow \.mid$ \.mp3$ \.rm$ \.ram$ \.wav$ \.aiff$ \.ra$
Disallow \.vrml$ \.wrl$
Disallow \.exe$ \.cab$ \.dll$ \.bin$ \.class$
Disallow \.tex$ \.texi$ \.xls$ \.doc$ \.texinfo$
Disallow \.rtf$ \.pdf$ \.cdf$ \.ps$
Disallow \.ai$ \.eps$ \.ppt$ \.hqx$
Disallow \.cpt$ \.bms$ \.oda$ \.tcl$
Disallow \.rpm$
Disallow \?D=A$ \?D=D$ \?M=A$ \?M=D$ \?N=A$ \?N=D$ \?S=A$ \?S=D$
Disallow /[.]{1,2} /\%2e /\%2f

SEE ALSO

indexer(1), syslog.conf(5)

UdmSearch 2.0 INDEXER.CONF (5) 21 April 1999
