Manual Reference Pages  -  IFILE (1)

NAME

ifile - core executable for the ifile mail filtering system

CONTENTS

Synopsis
Description
Options
Files
Author
Examples

SYNOPSIS

ifile [-b file] [-q|-Q] [-g] [-k] [-o] [-v num] [lexing options] file ...
ifile -c -q|-Q [-T threshold] [-b file] [-g] [-k] [-o] [lexing options] file ...
ifile [-b file] [-d folder] [-i folder|-u folder] [-g] [-k] [-o] [-v num] [lexing options] file ...
ifile -r [-b file]

DESCRIPTION

ifile is a mail filter client that uses machine learning to classify e-mail into folders (mailboxes). It uses the Naive Bayes algorithm, which treats each document as an unordered collection of words and assigns it to the folder whose word distribution most closely matches that of the document.
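
As a rough illustration of the idea (not ifile's actual code), the sketch below scores a message against per-folder word counts using multinomial Naive Bayes in log space. The add-one smoothing, the omission of folder priors, the function name score and the toy counts are all assumptions made for this example.

```python
import math
from collections import Counter

def score(tokens, folder_counts, vocab_size):
    """Return {folder: log-likelihood} for a bag of tokens.

    folder_counts maps folder name -> Counter of word frequencies.
    Add-one (Laplace) smoothing is an assumption of this sketch.
    """
    doc = Counter(tokens)  # word order is discarded
    scores = {}
    for folder, counts in folder_counts.items():
        total = sum(counts.values())
        s = 0.0
        for word, n in doc.items():
            p = (counts[word] + 1) / (total + vocab_size)
            s += n * math.log(p)
        scores[folder] = s
    return scores

# Toy training data (hypothetical)
folder_counts = {
    "spam": Counter({"free": 8, "money": 6, "meeting": 1}),
    "friends": Counter({"meeting": 5, "lunch": 4, "free": 1}),
}
vocab = {w for c in folder_counts.values() for w in c}
scores = score(["free", "money", "free"], folder_counts, len(vocab))
best = max(scores, key=scores.get)
print(best)  # spam
```

The scores are negative log-likelihoods in the same spirit as the ratings printed by -q: the folder with the highest (least negative) score wins.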

OPTIONS

-b, --db-file=file
  Location to read/store ifile database. Default is ~/.idata
-c, --concise
  Equivalent of "ifile -v 0 | head -1 | cut -f1 -d' '". Must be used with -q or -Q.
-d, --delete=folder
  Delete the statistics for each of the files from the category folder
-f, --folder-calcs=folder
  Show the word-probability calculations for folder
-g, --log-file
  Create and store debugging information in ~/.ifile.log
-i, --insert=folder
  Add the statistics for each of the files to the category folder
-k, --keep-infrequent
  Leave in the database words that occur infrequently (normally they are tossed)
-l, --query-loocv=folder
  For each of the files, temporarily removes file from folder, performs query and then reinserts file in folder. Database is not modified.
-o, --occur
  Uses document bit-vector representation. Count each word once per document.
-q, --query
  Output rating scores for each of the files
-Q, --query-insert
  For each of the files, output rating scores and add statistics for the folder with the highest score
-T, --threshold=threshold
  When used with both -c and -q, output the two highest ranking categories if their scores differ by at most threshold / 1000, which can be used to detect border cases. When used with -q only and any threshold > 0, output the score difference percentage. For example,
    ifile -T1 -q foo.txt
  might result in
    spam -15570.48640776
    non-spam -18728.00272369
    diff[spam,non-spam](%) 9.21
  If so, then
    ifile -T93 -q -c foo.txt
  will result in
    foo.txt spam,non-spam
  whereas
    ifile -T92 -q -c foo.txt
  will result in
    foo.txt spam
-r, --reset-data
  Erases all currently stored information
-u, --update=folder
  Same as ’insert’ except only adds stats if folder already exists
-v, --verbosity=num
  Amount of output while running: 0=silent, 1=quiet, 2=progress, 3=verbose, 4=debug
Lexing options:
-a, --alpha-lexer
  Lex words as sequences of alphabetic characters (default)
-A, --alpha-only-lexer
  Only lex space-separated character sequences which are composed entirely of alphabetic characters
-h, --strip-header
  Skip all of the header lines except Subject:, From: and To:
-m, --max-length=char
  Ignore portion of message after first char characters. Use entire message if char set to 0. Default is 50,000.
-p, --print-tokens
  Just tokenize and print, don’t do any other processing. Documents are returned as a list of word, frequency pairs.
-s, --no-stoplist
  Do not throw out overly frequent (stoplist) words when lexing
-S, --stemming
  Use ’Porter’ stemming algorithm when lexing documents
-w, --white-lexer
  Lex words as sequences of space separated characters
If no files are specified on the command line, ifile will use standard input as its message to process.
-?, --help
  Give this help list
--usage
  Give a short usage message
-V, --version
  Print program version
Mandatory or optional arguments to long options are also mandatory or optional for any corresponding short options.
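
The figures in the -T (--threshold) example earlier in this section can be checked by hand. The sketch below assumes the reported percentage is the absolute score difference divided by the sum of the absolute scores. That formula is inferred from the example numbers, not taken from the ifile source, and the helper names are hypothetical.

```python
def diff_percent(a, b):
    """Assumed formula: |a - b| / (|a| + |b|) * 100."""
    return abs(a - b) / (abs(a) + abs(b)) * 100

def is_border_case(a, b, threshold):
    """True when the relative difference is at most threshold / 1000."""
    return diff_percent(a, b) / 100 <= threshold / 1000

spam, non_spam = -15570.48640776, -18728.00272369
print(round(diff_percent(spam, non_spam), 2))  # 9.21
print(is_border_case(spam, non_spam, 93))      # True
print(is_border_case(spam, non_spam, 92))      # False
```

Under this assumption a threshold of 93 corresponds to 9.3%, which is just above the 9.21% difference, while 92 corresponds to 9.2%, which is just below it.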

FILES

~/.idata
  ifile database (default location). See FAQ included in ifile package for description of database format.

AUTHOR

Jason Rennie <jrennie@csail.mit.edu> and many others. See the ChangeLog for the full list.

EXAMPLES

Before using ifile, you need to train it. Let’s say that you have three folders, "spam", "ifile" and "friends", and the following directory structure:

/--+--spam----+--1
   |          +--2
   |          +--3
   |
   +--ifile---+--1
   |          +--2
   |          +--3
   |
   +--friends-+--1
              +--2
              +--3

The following commands build the ifile database in ~/.idata (use the -b option to specify a different location for the database):

ifile -h -i spam /spam/*
ifile -h -i ifile /ifile/*
ifile -h -i friends /friends/*

The -h option strips off headers besides "Subject:", "From:" and "To:". I find that -h improves ifile’s performance, but you may find otherwise for your personal collection.

Note that we have made the argument to -i the same as the corresponding folder name. This is not necessary. The argument to -i can be any word you want to use to identify a category of e-mails. The argument to -i must not include whitespace characters (space, tab, line feed, etc.).

At this point, your ~/.idata file should look something like this:

spam ifile friends
662 1020 6451
3 3 3
jrennie 9 0:3 1:18 2:16
mindspring 6 1:7 2:5
make 9 0:5 1:3
yahoo 9 0:1 1:22 2:2

The first line is the space-separated list of folders. Their ordering specifies a numbering (spam=0, ifile=1, friends=2). The second line is a token count for each folder (e.g. 662 tokens observed in the three spam messages). The third line is an e-mail count for each folder (e.g. 3 e-mails for each of spam, ifile and friends). Each following line specifies statistics for a word. The format of a line is

word age folder:count [folder:count ...]

where folder is the folder number determined by the first line ordering. Folders with a count of zero are not listed. So, the line beginning with "jrennie" indicates that "jrennie" appeared 3 times in "spam" e-mails, 18 times in "ifile" e-mails and 16 times in "friends" e-mails. The age is the number of e-mails that have been processed since the word was added to the database. Very infrequent words are pruned from the database to keep the database size down.
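
The format described above is simple enough to read with a few lines of code. The parser below is a sketch based only on the sample shown here. It assumes whitespace-separated fields and does not handle any details that the FAQ in the ifile package may describe. The function name parse_idata is hypothetical.

```python
def parse_idata(text):
    """Parse the database layout shown above into plain Python structures."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    folders = lines[0].split()
    token_counts = dict(zip(folders, map(int, lines[1].split())))
    email_counts = dict(zip(folders, map(int, lines[2].split())))
    words = {}
    for ln in lines[3:]:
        word, age, *pairs = ln.split()
        counts = {}
        for pair in pairs:
            idx, n = pair.split(":")
            # Folder numbers index into the first-line ordering.
            counts[folders[int(idx)]] = int(n)
        words[word] = {"age": int(age), "counts": counts}
    return folders, token_counts, email_counts, words

sample = """\
spam ifile friends
662 1020 6451
3 3 3
jrennie 9 0:3 1:18 2:16
mindspring 6 1:7 2:5
make 9 0:5 1:3
yahoo 9 0:1 1:22 2:2
"""
folders, tokens, emails, words = parse_idata(sample)
print(words["jrennie"]["counts"])   # {'spam': 3, 'ifile': 18, 'friends': 16}
print(words["mindspring"]["counts"].get("spam", 0))  # 0
```

Folders that are missing from a word's line simply do not appear in its counts, which matches the rule that zero counts are not listed.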

Now that you have a database, you might want to filter some e-mails. Say you have the following incoming e-mails:

/--inbox--+--1
          +--2
          +--3

To find out what folders ifile thinks these e-mails belong in, run

ifile -c -q /inbox/1
ifile -c -q /inbox/2
ifile -c -q /inbox/3

Let’s say that 1 is about ifile, 2 is spam and 3 is from a friend. Assuming ifile does its job correctly, you’ll see output like this:

/inbox/1 ifile
/inbox/2 spam
/inbox/3 friends

With so little training data, ifile is unlikely to get the labels correct, but you should get the idea :-)
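
If you drive ifile from a script, the concise output is easy to split into a file name and one or more folder names. The helper below is a sketch that assumes each output line has the form shown above: a path, whitespace, then a folder name, or a comma-separated pair when -T reports a border case. The name parse_concise is hypothetical.

```python
def parse_concise(output):
    """Map each file name to the list of folders reported for it."""
    result = {}
    for line in output.splitlines():
        if not line.strip():
            continue
        # Folder names contain no whitespace, so split on the last gap.
        path, folders = line.rsplit(None, 1)
        result[path] = folders.split(",")
    return result

output = "/inbox/1 ifile\n/inbox/2 spam\n/inbox/3 friends\n"
labels = parse_concise(output)
print(labels["/inbox/2"])  # ['spam']
```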

Now, if you move the e-mails to the folders suggested by ifile, you’ll want to update the database accordingly. You can do this with the -i option, like before. Or, you can simply use -Q in place of -q above. This automatically adds the e-mail to the folder ifile suggests.

Now, assume for a moment that e-mail 1 was actually spam. We have already added its statistics to the "ifile" category and put it in the ifile folder, so we need to move it to the spam folder and update the ifile database accordingly. We can update the database with the following command:

ifile -d ifile -i spam /inbox/1

This deletes the e-mail from "ifile" and adds it to "spam".
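
Conceptually, the command subtracts the message's word counts from one category and adds them to another. The sketch below mimics that bookkeeping on an in-memory table. It is an illustration only, not ifile's implementation, and the names relabel and db are hypothetical.

```python
from collections import Counter

def relabel(db, tokens, old, new):
    """Move one message's word counts from category old to category new."""
    doc = Counter(tokens)
    db[old].subtract(doc)
    db[old] = +db[old]  # drop words whose count fell to zero or below
    db.setdefault(new, Counter()).update(doc)

db = {
    "ifile": Counter({"filter": 4, "offer": 2}),
    "spam": Counter({"offer": 7}),
}
relabel(db, ["offer", "offer", "filter"], old="ifile", new="spam")
print(db["ifile"])  # Counter({'filter': 3})
print(db["spam"])   # Counter({'offer': 9, 'filter': 1})
```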

SEE ALSO

Examples of how to use ifile together with procmail(1) and metamail(1) can be found in the directory /usr/share/doc/ifile/examples.


ifile 1.3.4 IFILE (1) November 2004
