html2xhtml - Converts HTML files to XHTML
html2xhtml [ filename ] [ options ]
Html2xhtml is a command-line tool that converts HTML files to XHTML files. The
path of the HTML input file can be provided as a command-line argument. If
not, it is read from stdin.
Xhtml2xhtml tries always to generate valid XHTML files. It is able
to correct many common errors in input HTML files without loose of
information. However, for some errors, html2xhtml may decide to loose some
information in order to generate a valid XHTML output. This can be avoided
with the -e option, which allows html2xhtml to generate non-valid
output in these cases.
Html2xhtml can generate the XHTML output compliant to one of the
following document types: XHTML 1.0 (Transitional, Strict and Frameset),
XHTML 1.1, XHTML Basic and XHTML Mobile Profile.
The command line options/arguments are:
- filename
- Read the HTML input from filename (optional argument). If this
argument is not provided, the HTML input is read from standard input.
- -o filename
- Output XHTML file. The file is overwritten if it exists. If not provided,
the output is written to standard output.
- -e
- Instructs the program to propagate input chunks to the output even if it
is unable to adapt them to the output XHTML doctype. Using this option,
the XHTML output may be non-valid. Not using this option, some input data
could be removed from the output in some [rare] cases.
- -t output-doctype
- Doctype of the output XHTML file. If not specified, the program selects
automatically either XHTML 1.0 Transitional or XHTML 1.0 Frameset
depending on the input. Current available doctypes are:
o transitional XHTML 1.0 Transitional
o frameset XHTML 1.0 Frameset
o strict XHTML 1.0 Strict
o 1.1 XHTML 1.1
o basic-1.0 XHTML Basic 1.0
o basic-1.1 XHTML Basic 1.1
o mp XHTML Mobile Profile
o print-1.0 XHTML Print 1.0
- --ics input_charset
- Character set of the input document. This option overrides the default
input character set detection mechanism.
- --ocs output_charset
- Character set for the output XHTML document. If this option is not
present, the character set of the input is used as default.
- --lcs
- Dump the list of available character set aliases and exit html2xhtml. No
conversion is performed when this option is present.
- -l line_length
- Number of characters per line. The default value is 80. It must be greater
or equal to 40, otherwise the parameter is ignored.
- -b tab_length
- Tab length in number of characters. It must be a number between 0 and 16,
otherwise the parameter is ignored. Use 0 to avoid indentation in the
output.
- --preserve-space-comments
- Use this option to preserve white spaces, tabs and ends of lines in HTML
comments. The default, if not provided, is to rearrange spacing.
- --no-protect-cdata
- Enclose CDATA sections in "script" and "style"
following the XHTML 1.0 specification (using "<!CDATA[[" and
"]]>"). It might be incompatible with some browsers. The
default in this version is to enclose CDATA sections using
"//<!CDATA[[" and "//]]>", because major
browsers handle it properly.
- --compact-block-elements
- No white spaces or line breaks are written between the start tag of a
block element and the start tag of its first enclosed inline element (or
character data) and between the end tag of its last enclosed inline
element (or character data) and the end tag of the block element. By
default, if this option is not set, a new line character and indentation
is written between them.
- --compact-empty-elm-tags
- Do not write a whitespace before the slash for empty element tags (i.e.
write "<br/>" instead of the default "<br
/>"). Note that although both notations are correct in XML, the
XHTML 1.0 standard recommends the latter to improve compatibility with old
browsers.
- --empty-elm-tags-always
- By default, empty element tags are written only for elements declared as
empty in the DTD. This option makes any element not having content to be
written with the empty element tag, even if it is not declared as empty in
the DTD. This option may cause problems when the XHTML document is opened
by browsers in HTML (tag soup) mode.
- --dos-eol
- Write the output XHTML file with DOS--style (CRLF) end of line, instead of
the default UNIX--style end of line. Both end of line styles are allowed
by the XML recommendation.
- --generate-snippet
- Treat the input as an HTML fragment instead of a full document. The output
will also be a snippet and will not contain either the XML and doctype
declarations or the html, head and body elements.
- --help
- Show a brief help message and exit.
- --version
- Show the version number and exit.
Since version 1.1.2, html2xhtml is able to parse and write HTML and XHTML
documents using the most popular character sets / encodings. It is also able
to read the input document using a given character set and generate an output
that uses a different character set. The iconv implementation in the GNU C
library is used with that purpose.
Any IANA-registered character set that is supported by the iconv
library may be used. When naming a character set, any IANA--approved alias
for it may be used. The full list of aliases recognised by html2xhtml can be
obtained with the --lcs command-line option.
If the character set of the input document is not specified,
html2xhtml tries to guess it automatically. If the character set of the
output document is not specified, html2xhtml writes the output using the
same character set as the input document.
By default, the UNIX-style one-byte end of line is used. It can be changed to
DOS-style CRLF end of line with the --dos-eol command-line option.
However, when the program is compiled in the MinGW environment and
the output is sent to standard output, the output is automatically converted
by the environment to CRLF by default. Do not use the --dos-eol
command-line option in that situation. When the output is sent to a file
with the -o command-line option, the output is as expected
(UNIX-style by default), and the --dos-eol option may be used.
Program developer up to current version:
Jesus Arias Fisteus <jaf@it.uc3m.es>
The first working version of this program has been developed as
a Master Thesis at the University of Vigo (Spain) [http://www.uvigo.es],
advised by:
Rebeca Diaz Redondo
Ana Fernandez Vilas
Copyright 2000-2001 by Jesus Arias Fisteus, Rebeca Diaz Redondo, Ana
Fernandez Vilas.
Copyright 2002-2009 by Jesus Arias Fisteus