hxpipe - convert XML file to a format easier to parse with Perl or
    AWK
hxpipe [ -l ] [ -H ] [ -- ] [ file-or-URL ]
hxpipe parses an HTML or XML file and outputs a
    line-oriented representation of it that is well suited to further processing
    with AWK or similar tools. The format is similar to the ESIS (Element
    Structure Information Set) that is output by nsgmls/onsgmls.
The reverse operation, converting back to mark-up, is performed by
    the hxunpipe program.
The output format is as follows:
  - <!--comment-->
 
  - Comments are output as

*comment

    I.e., a single line starting with "*" followed by the text of the
    comment. Line feeds, carriage returns and tabs in the text are
    written as "\n", "\r" and "\t", respectively. Text that looks like a
    numerical character entity is written with the "&" replaced by
    "\". The line ends with a line feed.

    Note that onsgmls outputs comments starting with a "_" instead of a
    "*" and doesn't replace the "&" of numerical character entities by
    "\" (and by default it omits comments altogether).
 
  - <?processing instruction>
 
  - Processing instructions are output as

?processing instruction

    I.e., a single line starting with a "?" followed by the text of the
    processing instruction. The text is escaped as for comments (see
    above).
   
  - <!DOCTYPE root PUBLIC "-//foo//DTD bar//EN"
    "http://example.org/dtd">
 
  - DOCTYPEs are output as one of the following:

!root "-//foo//DTD bar//EN" http://example.org/dtd
!root "-//foo//DTD bar//EN"
!root "" http://example.org/dtd
!root ""

    for, respectively, a DOCTYPE with (1) both a public and a system
    identifier, (2) only a public identifier, (3) only a system
    identifier, or (4) neither of the two. I.e., a single line starting
    with a "!" followed by the name of the root element, then a space
    and a possibly empty quoted string, optionally followed by a space
    and arbitrary text. Note the quotes for the public identifier and
    the absence of quotes for the system identifier.
   
  - <elt att1="value1" att2="value2">
 
  - A start tag is output as

Aatt1 CDATA value1
Aatt2 CDATA value2
(elt

    I.e., as zero or more lines for the attributes and one line for the
    element type. Each line for an attribute starts with "A" followed by
    the name of the attribute, a space, the literal string "CDATA",
    another space, and the attribute value. The text of the attribute
    value is escaped as for comments (see above). The line for the
    element type starts with "(" followed by the element type.

    hxpipe does not read DTDs and assumes that attributes are always
    CDATA. It never generates other types (IMPLIED, TOKEN, ID, etc.),
    unlike onsgmls.
 
  - </elt>
 
  - End tags are output as

)elt

    I.e., as a line starting with ")" followed by the element type.
   
  - <empty att1="val1" att2="val2"/>
 
  - Empty elements are output as

Aatt1 CDATA val1
Aatt2 CDATA val2
|empty

    I.e., as zero or more lines for attributes and one line starting
    with "|" followed by the element type.

    Note that onsgmls never outputs "|". (However, it can optionally
    output a line consisting of a single "e" just before the "(" line,
    to indicate that the element is empty.)
 
  - text
 
  - Text is output as

-text

    I.e., as a single line starting with a "-". The text is escaped as
    for comments (see above).
   
  - line numbers
 
  - When the -l option is in effect, hxpipe will intersperse the output
    with lines of the form

L12

    where "12" is replaced with the line number in the source where the
    next output came from.
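
Because the first character of each line identifies the kind of record, a consumer can dispatch on that character alone. The following is a minimal sketch in Python and not part of hxpipe itself; the name parse_hxpipe and the layout of the yielded tuples are assumptions made for illustration. It collects "A" lines until the "(" or "|" line they belong to and remembers the most recent "L" line. Text and attribute values are left in their escaped form.

```python
from typing import Dict, Iterable, Iterator, Optional, Tuple

# (kind, data, attributes, source line or None); a hypothetical layout.
Event = Tuple[str, str, Dict[str, str], Optional[int]]

# First character of a line -> kind of record, as listed above.
KINDS = {
    "*": "comment",
    "?": "pi",
    "!": "doctype",
    "(": "start",
    ")": "end",
    "|": "empty",
    "-": "text",
}


def parse_hxpipe(lines: Iterable[str]) -> Iterator[Event]:
    """Yield one event per record from hxpipe-style output lines."""
    pending: Dict[str, str] = {}   # attributes waiting for their element
    source_line: Optional[int] = None
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        marker, rest = line[0], line[1:]
        if marker == "A":
            # "Aname CDATA value": the value may contain spaces or be empty.
            parts = rest.split(" ", 2)
            if len(parts) < 2:
                raise ValueError(f"malformed attribute line: {line!r}")
            pending[parts[0]] = parts[2] if len(parts) > 2 else ""
        elif marker == "L":
            # Only present when hxpipe was run with -l.
            source_line = int(rest)
        elif marker in KINDS:
            attrs: Dict[str, str] = {}
            if marker in "(|":
                attrs, pending = pending, {}
            yield KINDS[marker], rest, attrs, source_line
        else:
            raise ValueError(f"unrecognized record: {line!r}")
```

A caller would typically pass the standard output of hxpipe, or an open file holding it, as the lines argument.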
   
hxpipe normalizes the input only in the sense that it
    outputs attributes in a fixed order (alphabetical, but not
    locale-dependent). It does not read a DTD and thus cannot remove redundant
    white space and cannot add implied attributes. It does not expand character
    entities. (But you can pipe the input through hxunent beforehand.) It
    also does not add implied tags. (But see the -H option.)
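
Since hxpipe leaves character entities unexpanded and only escapes them, a consumer that wants the actual characters has to undo the escaping itself. The sketch below, in Python, reverses the escapes described for comments and can optionally expand numeric references. The function name unescape and its expand parameter are hypothetical. It also assumes that a literal backslash is written as a doubled backslash, which the text above does not state. Named entities such as "&amp;" are left alone.

```python
import re

# One escape sequence: \n, \r, \t, a doubled backslash (an assumption,
# see the lead-in), or a numeric reference whose "&" became "\".
_ESCAPE = re.compile(r"\\(n|r|t|\\|#[0-9]+;|#[xX][0-9A-Fa-f]+;)")

_SIMPLE = {"n": "\n", "r": "\r", "t": "\t", "\\": "\\"}


def unescape(text: str, expand: bool = False) -> str:
    """Undo the escaping of one record's text.

    With expand=False, "\\#65;" becomes "&#65;" again.
    With expand=True, it becomes the character itself ("A").
    """
    def replace(match: "re.Match[str]") -> str:
        body = match.group(1)
        if body in _SIMPLE:
            return _SIMPLE[body]
        if not expand:
            return "&" + body
        digits = body[1:-1]                      # strip "#" and ";"
        if digits[0] in "xX":
            code = int(digits[1:], 16)
        else:
            code = int(digits)
        if code > 0x10FFFF:
            return "&" + body                    # not a valid code point
        return chr(code)

    return _ESCAPE.sub(replace, text)
```

Backslash sequences that the pattern does not recognize are passed through unchanged.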
The following options are supported:
  - -l
 
  - Add "L" lines to the output to indicate the line numbers in the
      source. Currently does not work together with the -H option.
  - -H
 
  - Apply special rules for HTML. Normally, hxpipe assumes well-formed
      XML. With this option, hxpipe will assume the input is HTML and
      will add implied tags, recognize empty elements and treat the contents of
      <script> and <style> elements as literal text.
 
The following operand is supported:
  - file-or-URL
 
  - The name or URL of an HTML or XML file. If absent, standard input
      is read instead.
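
When hxpipe is called from a script, the options and the operand map directly onto an argument list. The following Python sketch builds such a list; the helper name hxpipe_argv and its parameters are hypothetical. It refuses the combination of -l and -H, because the two are documented as not working together.

```python
from typing import List, Optional


def hxpipe_argv(source: Optional[str] = None,
                line_numbers: bool = False,
                html: bool = False) -> List[str]:
    """Build an argument list following the synopsis of hxpipe.

    source is a file name or URL; None means standard input is read.
    """
    if line_numbers and html:
        raise ValueError("-l currently does not work together with -H")
    argv = ["hxpipe"]
    if line_numbers:
        argv.append("-l")
    if html:
        argv.append("-H")
    # "--" ends the options, so a file name starting with "-" is safe.
    argv.append("--")
    if source is not None:
        argv.append(source)
    return argv
```

The resulting list can be handed to any process-spawning function that accepts an argument vector.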
 
The following exit values are returned:
  - 0
 
  - Successful completion.
 
  - > 0
 
  - An error occurred in the parsing of the HTML or XML file. hxpipe
      will try to correct the error and produce output anyway.
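
Because a non-zero exit value may still come with usable output, a caller may prefer to keep the output and record the status rather than abort. A minimal sketch in Python follows; the function name run_tool is hypothetical, and the argument list is supplied by the caller (for example one that starts with "hxpipe").

```python
import subprocess
import sys
from typing import List, Tuple


def run_tool(argv: List[str]) -> Tuple[str, bool]:
    """Run a command and return (standard output, whether it exited with 0).

    The output is returned even when the exit value is non-zero, since
    hxpipe tries to recover from parse errors and still writes output.
    """
    result = subprocess.run(
        argv,
        stdout=subprocess.PIPE,
        text=True,
        check=False,   # do not raise on a non-zero exit value
    )
    ok = result.returncode == 0
    if not ok:
        print(f"{argv[0]}: exit value {result.returncode}, "
              "output may be the result of error recovery",
              file=sys.stderr)
    return result.stdout, ok
```

Whether recovered output is acceptable depends on the application; the flag lets the caller decide.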
 
To use a proxy to retrieve remote files, set the environment
    variables http_proxy and ftp_proxy. E.g.,
    http_proxy="http://localhost:8080/"
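
When hxpipe is started from another program, the proxy variables can be set for that one child process only. The Python sketch below builds such an environment; the helper name proxy_environment and its parameters are hypothetical, and the proxy URL is the example value from above.

```python
import os
from typing import Dict, Mapping, Optional


def proxy_environment(http_proxy: Optional[str] = None,
                      ftp_proxy: Optional[str] = None,
                      base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with the proxy variables set.

    base defaults to the current process environment; it is not modified.
    """
    env = dict(os.environ if base is None else base)
    if http_proxy is not None:
        env["http_proxy"] = http_proxy
    if ftp_proxy is not None:
        env["ftp_proxy"] = ftp_proxy
    return env


# Example: an environment that routes HTTP requests through a local proxy.
child_env = proxy_environment(http_proxy="http://localhost:8080/")
```

The returned dictionary can be passed as the env argument of subprocess.run.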
The error recovery for incorrect HTML is primitive.
hxpipe can currently only retrieve remote files over HTTP.
    It doesn't handle password-protected files, nor files whose content depends
    on HTTP "cookies."
Option -l ought to work also with HTML input (option
    -H).
hxunpipe(1), hxunent(1), onsgmls(1).