catdoc - reads MS-Word file and puts its content as plain text on
    standard output
catdoc [-vlu8btawxV] [-m number] [
    -s charset] [ -d charset] [ -f
    output-format] file
catdoc behaves much like cat(1) but it reads MS-Word
    file and produces human-readable text on standard output. Optionally it can
    use latex(1) escape sequences for characters which have special
    meaning for LaTeX. It also makes some effort to recognize MS-Word tables,
    although it never tries to write correct headers for LaTeX tabular
    environment. Additional output formats, such is HTML can be easily
  defined.
catdoc doesn't attempt to extract formatting information
    other than tables from MS-Word document, so different output modes means
    mainly that different characters should be escaped and different ways used
    to represent characters, missing from output charset. See CHARACTER
    SUBSTITUTION below
catdoc uses internal unicode(4) representation of
    text, so it is able to convert texts when charset in source document doesn't
    match charset on target system. See CHARACTER SETS below.
If no file names supplied, catdoc processes its standard
    input unless it is terminal. It is unlikely that somebody could type Word
    document from keyboard, so if catdoc invoked without arguments and
    stdin is not redirected, it prints brief usage message and exits. Processing
    of standard input (even among other files) can be forced using dash '-' as
    file name.
By default, catdoc wraps lines which are more than 72 chars
    long and separates paragraphs by blank lines. This behavior can be turned of
    by -w switch. In wide mode catdoc prints each paragraph as
    one long line, suitable for import into word processors that perform
    word wrapping.
  - -a
 
  - - shortcut for -f ascii. Produces ASCII text as output. Separates table
      columns with TAB
 
  - -b
 
  - - process broken MS-Word file. Normally, catdoc checks if first 8
      bytes of file is Microsoft OLE signature. If so, it processes file,
      otherwise it just copies it to stdin. It is intended to use catdoc
      as filter for viewing all files with .doc extension.
 
  - -dcharset
 
  - - specifies destination charset name. Charset file has format described in
      CHARACTER SETS below and should have .txt extension and reside in
      catdoc library directory ( ${exec_prefix}/lib/catdoc). By default,
      current locale charset is used if langinfo support compiled in.
 
  - -fformat
 
  - - specifies output format as described in CHARACTER SUBSTITUTION below.
      catdoc comes with two output formats - ascii and tex. You can add
      your own if you wish.
 
  - -l
 
  - Causes catdoc to list names of available charsets to the stdout and
      exit successfully.
 
  - -mnumber
 
  - Specifies right margin for text (default 72). -m 0 is equivalent to
      -w
 
  - -scharset
 
  - Specifies source charset. (one used in Word document), if Word document
      doesn't contain UTF-16 text. When reading rtf documents, it is typically
      not necessary, because rtf documents contain ansicpg specification. But it
      can be set wrong by Word (I've seen RTF documents on Russian, where cp1252
      was specified). In this case this option would take precedence over
      charset, specified in the document. But source_charset statement in the
      configuration file have less priority than charset in the document.
 
  - -t
 
  - - shortcut for -f tex
    
     converts all printable chars, which have special meaning for
      LaTeX(1) into appropriate control sequences. Separates table
      columns by &. 
  - -u
 
  - - declares that Word document contain UNICODE (UTF-16) representation of
      text (as some Word-97 documents). If catdoc fails to correct Word document
      with default charset, try this option.
 
  - -8
 
  - - declares is Word document is 8 bit. Just in case that catdoc
    
     recognizes file format incorrectly. 
  - -w
 
  - disables word wrapping. By default catdoc output is split into
      lines not longer than 72 (or number, specified by -m option) characters
      and paragraphs are separated by blank line. With this option each
      paragraph is one long line.
 
  - -x
 
  - causes catdoc to output unknown UNICODE character as \xNNNN, instead of
      question marks.
 
  - -v
 
  - causes catdoc to print some useless information about word document
      structure to stdout before actual start of text.
 
  - -V
 
  - outputs catdoc version
    
  
 
When processing MS-Word file catdoc uses information about
    two character sets, typically different
  
   - input and output. They are stored in plain text files in catdoc
    library directory. Character set files should contain two
    whitespace-separated hexadecimal numbers - 8-bit code in character set and
    16-bit Unicode code. Anything from hash mark to end of line is ignored, as
    well as blank lines.
catdoc distribution includes some of these character sets.
    Additional character set definitions, directly usable by catdoc can
    be obtained from ftp.unicode.org. Charset files have .txt suffix,
    which shouldn't be specified in command-line or configuration files.
Note that catdoc is distributed with Cyrillic charsets as
    default. If you are not Russian, you probably don't want it, an should
    reconfigure catdoc at compile time or in runtime configuration file.
When dealing with documents with charsets other than default,
    remember that Microsoft never uses ISO charsets. While letters in, say
    cp1252 are at the same position as in ISO-8859-1, some punctuation signs
    would be lost, if you specify ISO-8859-1 as input charset. If you use
    cp1252, catdoc would deal with those signs as described in CHARACTER
    SUBSTITUTION below.
catdoc converts MS-Word file into following internal
    Unicode representation:
  - 1. Paragraphs are separated by ASCII Line Feed symbol (0x000A)
 
  
  - 2. Table cells within row are separated by ASCII Field Separator
    symbol
 
  - (0x001C)
 
  - 3. Table rows are separated by ASCII Record Separator (0x001E)
 
  
  - 4. All printable characters, including whitespace are represented with
    their
 
  - respective UNICODE codes.
 
This UNICODE representation is subsequently converted into 8-bit
    text in target character set using following four-step algorithm:
  - 1. List of special characters is searched for given Unicode
    character.
 
  - If found, then appropriate multi-character sequence is output instead of
      character.
 
  - 2. If there is an equivalent in target character set, it is output.
 
  
  - 3. Otherwise, replacement list is searched and, if there is
    multi-character
 
  - substitution for this UNICODE char, it is output.
 
  - 4. If all above fails, "Unknown char" symbol (question mark) is
    output.
 
  
Lists of special characters and list of substitution are character
    set-independent, because special chars should be escaped regardless of their
    existence in target character set (usually, they are parts of US-ASCII, and
    therefore exist in any character set) and replacement list is searched only
    for those characters, which are not found in target character set.
These lists are stored in catdoc library directory in files
    with prefix of format name. These files have following format:
Each line can be either comment (starting with hash mark) or
    contain hexadecimal UNICODE value, separated by whitespace from string,
    which would be substituted instead of it. If string contain no whitespace it
    can be used as is, otherwise it should be enclosed in single or double
    quotes. Usual backslash sequences like '\n','\t' can be used
    in these string.
Upon startup catdoc reads its system-wide configuration file (
    catdocrc in catdoc library directory) and then user-specific
    configuration file ${HOME}/.catdocrc.
These files can contain following directives:
  - source_charset
    = charset-name
 
  - Sets default source charset, which would be used if no -s option
      specified. Consult configuration of nearby windows workstation to find one
      you need.
 
  - target_charset
    = charset-name
 
  - 
    
     Sets default output charset. You probably know, which one you use. 
  - charset_path
    = directory-list
 
  - colon-separated list of directories, which are searched for charset files.
      This allows you to install additional charsets in your home directory. If
      first directory component of path is ~ it is replaced by contents of
      HOME environment variable. On MS-DOS platform, if directory name
      starts with %s, it is replaced with directory of executable file. Empty
      element in list (i.e. two consequitve colons) is considered current
      directory.
 
  - map_path =
    directory-list
 
  - colon-separated list of directories, which are searched for special
      character map and replacement map. Same substitution rules as in
      charset_path are applied.
 
  - format = format
    name
 
  - Output format which would be used by default. catdoc comes with two
      formats - ascii and tex but nothing prevents you from
      writing your own format (set two map files - special character map and
      replacement map).
 
  - unknown_char
    = character specification
 
  - sets character to output instead of unknown Unicode character (default
      '?') Character specification can have one of two form - character enclosed
      in single quotes or hexadecimal code.
 
  - use_locale
    =(yes|no)
 
  - Enables or disables automatic selection of output charset (default
      yes),
    
     based on system locale settings (if enabled at compile time). If automatic
      detection is enabled, than output charset settings in the configuration
      files (but not in the command line) are ignored, and current system locale
      charset is used instead. There are no automatic choice of input charset,
      based of locale language, because most modern Word files (since Word 97)
      are Unicode anyway
    
   
Doesn't handle fast-saves properly. Prints footnotes as separate
    paragraphs at the end of file, instead of producing correct LaTeX commands.
    Cannot distinguish between empty table cell and end of table row.
xls2csv(1), catppt(1), cat(1),
    strings(1), utf(4), unicode(4)
V.B.Wagner <vitus@45.free.net>