Many slick, high profile corporate Web sites I visited seemed
to exhibit terrible grammar completely inconsistent with the
obvious investment in graphics and design. Apostrophes and
quote marks were frequently omitted, and every couple of
paragraphs words were run together which should have
been separated by a punctuation mark of some kind.
This remained a mystery to me until I wanted to convert a
presentation I'd developed in 1996 using Microsoft PowerPoint
into a set of Web pages. A friend was kind enough to run the
presentation through PowerPoint's "Save as HTML" feature
(I have abandoned all use of Microsoft products, so I did not have
a current version of PowerPoint which includes this feature).
When I got the PowerPoint-generated HTML back and viewed it
in my browser, I discovered that it contained
precisely the same grammatical errors I'd noted on so many Web sites, and which certainly
were not present in my original presentation.
A little detective work revealed that, as is usually the case when you
encounter something shoddy in the vicinity of a computer, Microsoft
incompetence and gratuitous incompatibility were to blame. Western language HTML
documents are written in the ISO 8859-1 Latin-1 character set, with a
specified set of escapes for special characters. Blithely ignoring
this prescription, as usual, Microsoft use their own "extension" to
Latin-1, in which a variety of characters which do not appear in
Latin-1 are inserted in the range 0x82 through 0x95--this having the
merit of being incompatible with both Latin-1 and Unicode, which
reserve this region for additional control characters.
These characters include open and close single and double quotes,
em and en dashes, an ellipsis and a variety of other things
you've been dying for, such as a capital Y umlaut and a
florin symbol. Well, okay, you say, if Microsoft want to have
their own little incompatible character set, why not? Because
it doesn't stop there--in their inimitable fashion (who would
want to?)--they aggressively pollute the Web pages of unknowing
and innocent victims worldwide with these characters, with the
result that the owners of these pages look like semi-literate
morons when their pages are viewed on non-Microsoft platforms
(or on Microsoft platforms, for that matter, if the user has
selected as the browser's font one of the many TrueType fonts
which do not include the incompatible Microsoft characters).
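
The clash between the two character sets can be seen by decoding the same bytes both ways. The following is a minimal Python sketch, not anything taken from the original investigation; the sample bytes and the helper name `describe` are made up for illustration, and "cp1252" is Python's name for the Microsoft extension of Latin-1.

```python
# The same five special bytes mean different things in the two character sets.
# SAMPLE is an invented example, not text from any real page.
SAMPLE = b"\x93Halt,\x94 he cried \x97 it\x92s the police\x85"

as_windows = SAMPLE.decode("cp1252")   # Microsoft's extension of Latin-1
as_latin1 = SAMPLE.decode("latin-1")   # ISO 8859-1, where 0x80-0x9F are controls


def describe(text):
    """Return (code point, is-C1-control) for every non-ASCII character."""
    return [
        (f"U+{ord(ch):04X}", 0x80 <= ord(ch) <= 0x9F)
        for ch in text
        if ord(ch) > 0x7F
    ]


windows_points = describe(as_windows)  # typographic punctuation
latin1_points = describe(as_latin1)    # nothing but control characters

print(windows_points)
print(latin1_points)
```

Read as the Microsoft character set, the bytes are quotes, a dash, an apostrophe and an ellipsis; read as Latin-1, every one of them is a control character with no glyph to display.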
You see, state of the art Microsoft Office applications
sport a nifty feature called "smart quotes." (Rule of thumb--every
time Microsoft use the word "smart," be on the lookout for something dumb.)
This feature is on by default in both Word and PowerPoint, and can
be disabled only by finding the little box buried among the
dozens of bewildering option panels these products contain.
If enabled, and you type the string,
"Halt," he cried, "this is the police!"
"smart quotes" transforms the ASCII quote characters automatically
into the incompatible Microsoft opening and closing quotes.
ASCII single quotes are similarly transformed (even
though ASCII already contains apostrophe and single open quote
characters), and double hyphens are replaced by the incompatible
em dash symbol. What other horrors occur, I know not.
If the user notices this happening at all, their reaction
might be "Thank you Billy-boy--that looks ever so much nicer,"
not knowing they've been set up to look like a moron to
folks all over the world.
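
The substitution described above can be imitated roughly as follows. This is a Python sketch under loud assumptions: the function name `smarten` and its rule (a quote at the start of the text or after whitespace or an open parenthesis is an opening quote, any other is a closing one) are guesses made for illustration, not Microsoft's actual algorithm.

```python
import re


def smarten(text):
    """Crudely imitate "smart quotes"; the rules here are an assumption."""
    # A quote not preceded by a non-space, non-"(" character opens.
    text = re.sub(r'(?<![^\s(])"', "\u201c", text)
    text = text.replace('"', "\u201d")      # every remaining double quote closes
    text = re.sub(r"(?<![^\s(])'", "\u2018", text)
    text = text.replace("'", "\u2019")      # apostrophes and closing single quotes
    return text.replace("--", "\u2014")     # double hyphen becomes an em dash


typed = '"Halt," he cried, "this is the police!"'
stored = smarten(typed).encode("cp1252")    # the bytes that land in the file
print(stored)
```

Encoding the result in the Microsoft character set shows the 0x93 and 0x94 bytes that take the place of the ASCII quote the user actually typed.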
You see, when you export a document as text for hand-editing
into HTML, or avail yourself of the "Save as HTML" features
in newer versions of Office applications, these
incompatible, Microsoft-specific characters
remain in place. When viewed by a user on a non-Microsoft platform, they
will not be displayed properly--most browsers seem to
just drop them, as opposed to including a symbol
indicating an undisplayable character. Hence, the
apparently ungrammatical text, which the author of the
page, editing on a Microsoft platform, will never
be aware of.
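
The effect is easy to reproduce. The Python sketch below assumes that a browser simply discards every byte in the range 0x80 through 0x9F, which is a simplification of the behaviour described above; the function name `drop_c1_bytes` and the sample text are invented for illustration.

```python
def drop_c1_bytes(raw):
    """Discard every byte in 0x80-0x9F, as a browser with no glyph for it might."""
    return bytes(b for b in raw if not 0x80 <= b <= 0x9F)


# Invented sample containing a Microsoft apostrophe, em dashes and quotes.
page = b"I\x92d noted that words\x97run together\x97lose their \x93punctuation.\x94"

shown = drop_c1_bytes(page).decode("latin-1")
print(shown)
```

The apostrophe, both dashes and both quote marks vanish without a trace, leaving "Id" and words run together.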
Having no desire to hand-edit the HTML for a long presentation
to correct a raft of Microsoft-induced incompatibilities, I
wrote a Perl program, the
demoroniser, to transform Microsoft's junk HTML into at least a starting
point for something I'd consider presentable on my site.
In addition to replacing the incompatible characters with
HTML-compliant equivalents wherever possible (a few rarely-encountered
characters which can't be translated result in warning messages
if encountered), the following sloppy or downright wrong HTML
is corrected:
The missing semicolon at the end of numeric character
escapes (&#61 in place of &#61;) is supplied.
Numeric renderings of special characters (&#60; &#62; &#38;)
are replaced with readable equivalents (&lt; &gt; &amp;).
Unquoted attribute values in <table> tags which contain
non-alphanumeric characters are quoted.
PowerPoint's mis-nesting of <font> and <strong> tags is corrected.
PowerPoint's boneheaded use of <ul> and </ul> tags to
accomplish paragraph breaks is corrected and the
proper <p> tags inserted.
Missing <tr> tags in text-only slides are inserted.
Nugatory </p> tags are removed.
Unmatched <li> tags in headings are removed.
Idiot paragraph-long lines are broken into
something suitable for editing with a normal text editor.
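
A few of these repairs can be sketched in a handful of lines. What follows is a Python illustration, not the Perl program itself: the function name `demoronise_sketch`, the replacement table and the ASCII stand-ins chosen for each character are all assumptions, and only three of the fixes are attempted (character replacement, missing semicolons, and numeric escapes for < > &).

```python
import re

# Hypothetical subset of replacements, keyed by byte value; the real
# program's table is not reproduced in the text.
REPLACEMENTS = {
    0x85: "...",  # ellipsis
    0x91: "`",    # open single quote
    0x92: "'",    # close single quote / apostrophe
    0x93: '"',    # open double quote
    0x94: '"',    # close double quote
    0x96: "-",    # en dash
    0x97: "--",   # em dash
}

NAMED = {"60": "&lt;", "62": "&gt;", "38": "&amp;"}


def demoronise_sketch(raw):
    """Apply three of the listed repairs to a byte string of HTML."""
    # Latin-1 maps each byte to the code point of the same value.
    text = raw.decode("latin-1").translate(REPLACEMENTS)
    # Supply the semicolon missing from numeric escapes such as "&#38".
    text = re.sub(r"&#(\d+)(?![\d;])", r"&#\1;", text)
    # Replace numeric escapes for < > & with their readable names.
    return re.sub(r"&#(60|62|38);", lambda m: NAMED[m.group(1)], text)


fixed = demoronise_sketch(b"\x93A &#38 B\x94 \x97 x&#60;y")
print(fixed)
```

Decoding as Latin-1 first keeps the table simple, since every offending byte becomes a single character with the same numeric value.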