mls - Display useful statistics on email messages
mls [-hvq] [-l lang] [-i file] [-o
file] [-r|w|u file] [-t|T text] [-m
mode] [-n XX] [-g xxxx]
is program that prints some "useful" statistical
info on email messages. It's main usage is in email conferences - mailing
lists. Currently it displays both tables
. You can
select either TEXT or HTML output.
- print help text and exit
- be quiet (print only errors to stderr)
- turn on verbose mode - in this mode it will print more info to
stderr - indication of progress (will print every 10th, 20th, ...,
90th, 100th, 200th, ..., 900th, 1000th, 2000th ... message being
processed) and warnings about malformed headers found
- -l lang
- select output language; please note that this applies only to generated
statistics - program messages printed to stderr ale always in
English. These languages are currently supported: EN (English),
SK (Slovak), IT (Italian), FR (Francais), DE
(Deutsch), ES (Spanish), SR (Serbian), BR (Portugues
- -i file
- name of input file (if not specified, use stdin). This file should
be in MBOX format. It should exist and be readable.
- -o file
- name of output file (if not specified, use stdout). If exists, it
will be overwritten.
- -r file
- read input from cache file instead of mailbox. You can read input either
from mailbox or cache file, not both!
- -w file
- write cache file (no stats produced). You can either produce text output
or write cache file, not both! When writing cache file, output-related
options are ignored.
- -u file
- update cache file = read cache, read input, write cache. For use with
- -t text
- name of mailing list this statistics is computed for. If specified, it is
just appended to the title of statistics, so it will be like
"Statistics from 16.8.2001 to 7.9.2001 for text", where
text is whatever you put as this parameter (it could be name of the
mailing list or just its email, e.g. firstname.lastname@example.org).
- -T text
- title text (only this will be printed as title); this can be used to
supress normal title text (date of oldest/newest msg) and completely
replace it with your text.
- -m mode
- select mode of output (text, html, html2).
- -n XX
- show TOP XX tables (default TOP 10). By default, mls displays
tables of TOP 10 people, subjects, quoting or whatever. Using this
parameter, you can define how many lines shall these tables have.
- -g xxxx
- graphs to show (Day, Week, Month, Year, Xnone) - specify first letter
(e.g. -g dmy).
- Everything went OK and no error occurred.
- Error in sscanf() while reading & parsing cache file. It means that
the format of cache file is invalid. Try to create the cache file
- Invalid command-line option/language. You have specified an invalid
- Cannot open input/output file. Please check that you have typed correct
filename and that you have read permissions for input file and write
permissions to destination directory (because output file must be
created). If output file exists, it's overwritten.
- Not enough memory is available for dynamically allocated variables. This
could be caused by user-limits, because mls requires only few MBs
of memory (it depends on number of messages processed and number of
different subjects and authors).
- Error compiling regex. This error should not occur in world-available
On input, there should be mailbox file in standard MBOX format. If the file is
in different format, the results are unpredictable. There should be at least
one email message, otherwise no stats can be computed.
Be sure that no special messages are in input files (such as
that with "DON'T DELETE THIS MESSAGE -- FOLDER INTERNAL DATA"
subject), because they will be also analysed. Many programs (POP3/IMAP
daemons, email readers) put their special messages to the mailbox. This
message is only ignored when reporting oldest message found.
Statistics is put into output file (or stdout if unspecified) in specified
language. All diagnostic messages are written to stderr and are in English.
Output consists of several statistical data - tables, graphs and summaries.
The title has two formats depending on -t
parameter. If it's not
specified, it looks like "Statistics from 16.8.2001 to 7.9.2001",
where first date is date of the oldest message found in input and second is
date of the newest one. If there is for example -t email@example.com
parameter, it will look like "Statistics from 16.8.2001 to 7.9.2001 for
firstname.lastname@example.org". The problem is that date of oldest & newest msg is
often wrong (thanks to bad date/time settings on PC of msg author), so you can
specify entire title using command-line option -T
. When used, only your
text will be printed as title, nothing more. There you can put for example
something like "Statistics for email@example.com".
Now you have option ( -g
) to specify which graphs you want to show -
hours of Day, days of Week, days of Month, months of Year. Use 1st letters as
argument to -g option (so -g dw
will print just hours of Day and days
of Week). Use -g x
to disable printing of any graph. For example you
don't want to show graph for months of Year if you are presenting stats for
one month, but for full-year stats you probably want it.
You can choose between 2 modes of output - TEXT and HTML. When in HTML mode,
will produce the output as HTML page. When you specify HTML2 mode,
only the body of HTML document is produced (no header/footer) - it can be used
to have different HTML header/footer when calling mls
as CGI or when
using PHP wrapper. The output consists of HTML tables and bar graphs. Almost
every aspect of how it looks can be configured by modifying CSS style-sheet.
Please note that files style_mls.css
must be present
in the same directory as produced HTML file. You can, however, modify both to
best suit your needs. Everything should be clear after reading comments in CSS
file and looking at the produced HTML source.
I was unsure what type of graphs to produce. I have tried also horizontal bar
graphs and if you want to try them, just uncomment part of code in
PrintGraphHtml() in mls_text.c.
Instead of producing statistics in text format, you can save all the generated
values/results into "cache" file. Retrieving information from this
file is very fast, so it is useful for integration with web pages. Now you can
update the cache file just after new mail was received. Users can view actual
stats using mls
as CGI script. It has an advantage over static stats that user can choose
language and others options and it will be generated in a moment!
To update cache file, use the -u
option. It works like this: first, the
stats are loaded from cache file (doesn't have to exist) and then new
message(s) to be added are read from stdin (or from -i file) and added to the
stats. Finally the updated stats are written back to the cache file. The
process is really quick, because usually only one message is added at a time.
This is useful mainly for updating cache files upon receiving new message. In
the "examples/" subdir, you can find examples of integration with
your .forward and .procmailrc files. By running MLS more than once, you can
generate cache files for individual months and also for whole years (see
examples). Then use some PHP script to present list of these cache files to
Format of cache files was changed in version 1.3, because of new stats added.
Now it contains version info, so mls can inform you that you have to re-create
that cache file with new version. Unfortunately, you have to re-create them
also when you want new email clients to be recognized also in old (already
processed) messages. Note that email clients detection was buggy in 1.2.2 (a
lot of clients not recognized).
I have written also PHP wrapper for mls
to make it more
"interactive". It has two major advantages over plain HTML output
: User can choose output language and number of TOP items to
show. It works by running mls
with appopriate command-line options.
It's safe, because only two items from user are language and topXX which are
checked using regexp, so running arbitrary code is not possible. You can also
output - for example change @ in email addresses to (at) to
You can have normal MBOX file as input, but I recommend using cache file. When
using cache file, the stats are produced in a moment. You can see how long it
took to generate the page, see the last line of HTML source. However, there is
minor speed problem. It takes longer when you specify to show many topXX (like
999). The problem is regexp that searches for @. It has to search for it in
output together and when it is large, it takes a while (1.1
seconds on my 2.1GHz pentium4). I have added an option which should use
Perl-compatible regex function (preg_replace) instead of POSIX (ereg_replace),
if available. This will result in MUCH faster execution (50ms instead of
OK, so let's start from beginning - the format of MBOX file. It's plain text
file containing some email messages delimited with one empty line. Each
message starts with line like this From firstname.lastname@example.org Thu Aug 16 15:48:58
. After this line, there are few headers, one empty line and message
text. Storing emails in this format is quite common - your incoming mail is
usually saved in MBOX format and also your folders in mail-readers like
Who is author of an email message? It's taken from From:
header field and
everything except the actual email address (like your full name) is stripped
off using quite simple regular expression (regexp).
Subject is taken from Subject:
header field. If it contains some
, those will be stripped off. There can be up to 5 of them. Also
counted format ( Re:
) is supported. For example The Bat! email
client uses it. MIME-decoding is applied to subject lines (see below).
Date is just everything in the Date:
header. This header is generated by
the email client, so it's date of message creation and it doesn't have to be
present in each message. If it isn't, you are warned by message like
"Warning: 1 message(s) not counted." in output. Some clients don't
put full date there and usually the day of week is missing and you are warned.
No timezones are considered, the date is taken as-is.
Message size is everything between end of message header and beginning of new
email (or end of file). So only actual size of message text (body) is counted,
Email clients are taken from X-Mailer:
headers and some grouping is done to avoid different
versions of the same mailer to take the whole TOP 10. There is also
work-around for Pine mailer (MLS will search also Message-ID:
What is quoting? When you reply to some message, you can insert part of the
original message there, you quote the author of original message. Every line
of original text is usually prepended with >
MP are initials of the original sender's name (for example The Bat! uses this
And what is "quote ratio"? It's size of quoted text divided by total
size of message, specified in percent. It's included in stats, because many
people reply to message, add one line of text and leaving there for example 10
pages of original text, which makes the quote ratio even higher than 90%! In
times of FIDONET, there were conferences, where quote ratio higher than 50%
was forbidden. Try to think about it when replying to message in mailing list
where more than 300 people will download and read it.
At first, there are TOP 10 tables (or TOP XX when using -n XX
First table shows people who have written most messages, how much and how many
percent of total message count it is. Last row shows the "other" -
number of messages written by everyone not listed above and how many percent
it is. Second and third tables are similar to this one - they also show best
authors, but not by the number of messages written. Authors are sorted by
total (or average) size of all their messages, but without quoting (size of
message minus how much was quoted in that msg). Next table shows most
successful subjects and how many messages with this subject have been posted.
The other table shows most used email clients. The last table show people with
maximal quote ratio. It's computed as sum of quoted text in all his/her
messages divided by total size of those messages. Last row shows an average -
sum of quoted text in all messages divided by total size of all messages.
Next part of stats are some graphs. They show how much messages have been
written during different hours of day, days of month and days of week. From
these you can see for example when (and how much) people sleep :) or if they
work during the working-hours or just write tons of messages...
Next part contains info about messages which are BEST in something - message
with max. quote ratio, longest message and some details about most successful
At the end, there is final summary - total number of messages, their total and
average size and number of different authors and subjects.
What is it? Original implementation email permitted only 7bit ASCII messages.
But during the time, there was need to send international or even binary
files. MIME defines how can these be encoded into 7bit form suitable for
emailing and how to decode it back to human readable form.
In email message, you can have MIME-encoded text (body of message), but also
some headers - for example subject and From field. MLS
tries to find
out if subject lines are MIME-encoded and if so, it tries to decode it, to
present it to you in human-readable form. You can read more about MIME in RFC
1521 and 1522.
I was inspired by similar DOS program used before few years in FIDONET and
Slovak ULTRANET. It was created by Ivan Friedlander.
- doesn't support header fields splitted to more lines (you can use
formail(1) to put them to one line before using MLS)
- charset conversion in MIME-decoding
- more stats
This man page is written for mls
(MailListStat) is written by Marek -Marki- Podmaka
for more information and latest version of mls
- print useful statistics on email messages Copyright (C)
2001-2003 Marek Podmaka <email@example.com>
This program is free software; you can redistribute it and/or modify it under
the terms of the GNU General Public License as published by the Free Software
Foundation; either version 2 of the License, or (at your option) any later
This program is distributed in the hope that it will be useful, but WITHOUT ANY
WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR
A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with
this program; if not, write to the Free Software Foundation, Inc., 59 Temple
Place, Suite 330, Boston, MA 02111-1307 USA