Manual Reference Pages - HTML (3)
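parsehtml, printitems, validitems, freeitems, freedocinfo, dimenkind, dimenspec, targetid, targetname, fromStr, toStr - HTML parser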
Item* parsehtml(uchar* data, int datalen, Rune* src, int mtype,
int chset, Docinfo** pdi)
void printitems(Item* items, char* msg)
int validitems(Item* items)
void freeitems(Item* items)
void freedocinfo(Docinfo* d)
int dimenkind(Dimen d)
int dimenspec(Dimen d)
int targetid(Rune* s)
Rune* targetname(int targid)
uchar* fromStr(Rune* buf, int n, int chset)
Rune* toStr(uchar* buf, int n, int chset)
This library implements a parser for HTML 4.0 documents.
The parsed HTML is converted into an intermediate representation that
describes how the formatted HTML should be laid out.
Parsehtml parses an entire HTML document contained in the buffer
data and having length
datalen. The URL of the document should be passed in as
src.
Mtype is the media type of the document, which should be either
TextHtml or
TextPlain. The character set of the document is described in
chset, which can be one of
US_Ascii, ISO_8859_1, UTF_8 or
Unicode. The return value is a linked list of
Item structures, described in detail below.
As a side effect,
*pdi is set to point to a newly created
Docinfo structure, containing information pertaining to the entire document.
The library expects two allocation routines to be provided by the
caller, emalloc and
erealloc. These routines are analogous to the standard malloc and realloc routines,
except that they should not return if the memory allocation fails.
In addition,
emalloc is required to zero the memory.
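A minimal sketch of the two caller-supplied routines in portable C follows. The use of size_t for the length argument and the choice to exit with status 1 are assumptions made for this example; the prototypes the library actually expects should be checked against its header.

```c
#include <stdio.h>
#include <stdlib.h>

/* Allocate n bytes of zeroed memory; never returns on failure. */
void *
emalloc(size_t n)
{
	void *p = calloc(1, n ? n : 1);

	if(p == NULL){
		fprintf(stderr, "emalloc: out of memory\n");
		exit(1);
	}
	return p;
}

/* Resize p to n bytes; never returns on failure. */
void *
erealloc(void *p, size_t n)
{
	void *q = realloc(p, n ? n : 1);

	if(q == NULL){
		fprintf(stderr, "erealloc: out of memory\n");
		exit(1);
	}
	return q;
}
```

Using calloc satisfies the zeroing requirement on emalloc; erealloc makes no promise about the contents of any newly added bytes.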
For debugging purposes,
printitems may be called to display the contents of an item list; individual items may
be printed using the
%I print verb, installed on the first call to
parsehtml.
Validitems traverses the item list, checking that all of the pointers are valid.
It returns
1 if everything is ok, and
0 if an error was found.
Normally, one would not call these routines directly.
Instead, one sets the global variable
dbgbuild and the library calls them automatically.
One can also set
warn, to cause the library to print a warning whenever it finds a problem with the
input document, and
dbglex, to print debugging information in the lexer.
When an item list is finished with, it should be freed with
freeitems. Then,
freedocinfo should be called on the pointer returned in
*pdi.
Dimenkind and
dimenspec are provided to interpret the
Dimen type, as described in the section on dimension specifications below.
Frame target names are mapped to integer ids via a global, permanent mapping.
To find the value for a given name, call
targetid, which allocates a new id if the name hasn't been seen before.
The name of a given, known id may be retrieved using
targetname. The library predefines
FTtop, FTself, FTparent and FTblank.
The library handles all text as Unicode strings (type
Rune*). Character set conversion is provided by
fromStr and
toStr.
FromStr takes
n Unicode characters from
buf and converts them to the character set described by
chset.
ToStr takes
n bytes from
buf, interpreted as belonging to character set
chset, and converts them to a Unicode string.
Both routines null-terminate the result, and use
emalloc to allocate space for it.
The return value of
parsehtml is a linked list of variant structures,
with the generic portion described by the following definition:
typedef struct Item Item;
The field
next points to the successor in the linked list of items, while
width, height, and
ascent are intended for use by the caller as part of the layout process.
Anchorid, if non-zero, gives the integer id assigned by the parser to the anchor that
this item is in (see the section on anchors below).
State is a collection of flags and values described as follows:
enum
{
	IFbrk =		0x80000000,
	IFbrksp =	0x40000000,
	IFnobrk =	0x20000000,
	IFcleft =	0x10000000,
	IFcright =	0x08000000,
	IFwrap =	0x04000000,
	IFhang =	0x02000000,
	IFrjust =	0x01000000,
	IFcjust =	0x00800000,
	IFsmap =	0x00400000,
	IFindentshift =	8,
	IFindentmask =	(255<<IFindentshift),
	IFhangmask =	255
};
IFbrk is set if a break is to be forced before placing this item.
IFbrksp is set if a 1 line space should be added to the break (in which case
IFbrk is also set).
IFnobrk is set if a break is not permitted before the item.
IFcleft is set if left floats should be cleared (that is, if the list of pending left floats should be placed)
before this item is placed, and
IFcright is set for right floats.
In both cases, IFbrk is also set.
IFwrap is set if the line containing this item is allowed to wrap.
IFhang is set if this item hangs into the left indent.
IFrjust is set if the line containing this item should be right justified, and
IFcjust is set for center justified lines.
IFsmap is used to indicate that an image is a server-side map.
The low 8 bits, represented by
IFhangmask, indicate the current hang into left indent, in tenths of a tabstop.
The next 8 bits, represented by
IFindentmask and
IFindentshift, indicate the current indent in tab stops.
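The packing of the indent and hang values into the state word can be illustrated with a few helper functions. This is only a sketch: the helper names are hypothetical, the constants are written as unsigned macros for portability, and IFindentshift is taken to be 8 because the text places the indent in the 8 bits just above the low 8.

```c
/* Values as listed above; IFindentshift = 8 is inferred from the text. */
#define IFbrk         0x80000000u
#define IFindentshift 8
#define IFindentmask  (255u << IFindentshift)
#define IFhangmask    255u

/* Current indent, in tab stops. */
unsigned
stateindent(unsigned state)
{
	return (state & IFindentmask) >> IFindentshift;
}

/* Current hang into the left indent, in tenths of a tab stop. */
unsigned
statehang(unsigned state)
{
	return state & IFhangmask;
}

/* Nonzero if a break must be forced before placing the item. */
int
statebreak(unsigned state)
{
	return (state & IFbrk) != 0;
}
```

A caller holding an item would pass its state field converted to unsigned, for example stateindent((unsigned)it->state).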
The field
genattr is an optional pointer to an auxiliary structure, described in the section on generic attributes below.
The field
tag describes which variant type this item has.
It can have one of the values
Itexttag, Iruletag, Iimagetag, Iformfieldtag, Itabletag, Ifloattag or
Ispacertag. For each of these values, there is an additional structure defined, which
includes Item as an unnamed initial substructure, and then defines additional
fields.
Items of type
Itexttag represent a piece of text, using the following structure:
s is a null-terminated Unicode string of the actual characters making up this text item,
fnt is the font number (described in the section
Font Numbers), and
fg is the RGB encoded color for the text.
Voff measures the vertical offset from the baseline; subtract
Voffbias to get the actual value (negative values represent a displacement down the page).
ul is the underline style:
ULnone if no underline,
ULunder for conventional underline, and
ULmid for strike-through.
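The variant scheme (a generic Item followed by type-specific fields, selected by tag) can be sketched in standard C as below. Standard C has no unnamed substructures, so this example uses a named first member and a cast instead. The struct layouts, tag values and function names are stand-ins for illustration, not the library's own declarations.

```c
#include <stddef.h>

/* Stand-in declarations; the real ones come from the library header. */
enum { Itexttag, Iruletag };

typedef struct Item Item;
struct Item {
	Item *next;
	int tag;
};

typedef struct Itext Itext;
struct Itext {
	Item item;          /* generic portion comes first */
	const char *s;      /* the library uses Rune* here */
};

/* Count the text items in a list by checking each item's tag. */
int
counttext(Item *items)
{
	int n = 0;
	Item *it;

	for(it = items; it != NULL; it = it->next)
		if(it->tag == Itexttag)
			n++;
	return n;
}

/* Return the text of a text item, or NULL for any other variant. */
const char *
itemtext(Item *it)
{
	if(it == NULL || it->tag != Itexttag)
		return NULL;
	return ((Itext*)it)->s;   /* valid because Item is the first member */
}
```

The tag must always be checked before the cast, since the list mixes all variant types.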
Items of type
Iruletag represent a horizontal rule, as follows:
align is the alignment specification (described in the corresponding section),
noshade is set if the rule should not be shaded,
size is the height of the rule (as set by the size attribute), and
wspec is the desired width (see the section on dimension specifications).
Items of type
Iimagetag describe embedded images, for which the following structure is defined:
imsrc is the URL of the image source,
imwidth and
imheight, if non-zero, contain the specified width and height for the image, and
altrep is the text to use as an alternative to the image, if the image is not displayed.
Map, if set, points to a structure describing an associated client-side image map.
Ctlid is reserved for use by the application, for handling animated images.
Align encodes the alignment specification of the image.
Hspace contains the number of pixels to pad the image with on either side, and
Vspace the padding above and below.
Border is the width of the border to draw around the image.
Nextimage points to the next image in the document (the head of this list is
Docinfo.images).
For items of type
Iformfieldtag, the following structure is defined:
This adds a single field,
formfield, which points to a structure describing a field in a form, described in the section on forms below.
For items of type
Itabletag, the following structure is defined:
Table points to a structure describing the table, described in the section on tables below.
For items of type
Ifloattag, the following structure is defined:
item points to a single item (either a table or an image) that floats (the text of the
document flows around it), and
side indicates the margin that this float sticks to; it is either
ALleft or ALright.
X and
y are reserved for use by the caller; these are typically used for the coordinates
of the top of the float.
Infloats is used by the caller to keep track of whether it has placed the float.
Nextfloat is used by the caller to link together all of the floats that it has placed.
For items of type
Ispacertag, the following structure is defined:
Spkind encodes the kind of spacer, and may be one of
ISPnull (zero height and width),
ISPvline (takes on height and ascent of the current font),
ISPhspace (has the width of a space in the current font) and
ISPgeneral (for all other purposes, such as between markers and lists).
The genattr field of an item, if non-nil, points to a structure that holds
the values of attributes not specific to any particular
item type, as they occur on a wide variety of underlying HTML tags.
The structure is as follows:
typedef struct Genattr Genattr;
The fields
id, class, style and
title, when non-nil, contain values of correspondingly named attributes of the HTML tag
associated with this item.
Events is a linked list of events (with corresponding scripted actions) associated with the item:
typedef struct SEvent SEvent;
next points to the next event in the list,
type identifies the kind of event, and
script is the text of the associated script.
Some structures include a dimension specification, used where
a number can be followed by a
% or a
* to indicate
percentage of total or relative weight.
This is encoded using the following structure:
typedef struct Dimen Dimen;
Separate kind and spec values are extracted using
dimenkind and
dimenspec.
Dimenkind returns one of
Dnone, Dpixels, Dpercent or Drelative.
Dnone means that no dimension was specified.
In all other cases,
dimenspec should be called to find the absolute number of pixels, the percentage of total,
or the relative weight.
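To make the three kinds of dimension concrete, the sketch below classifies an attribute string such as "120", "50%" or "3*" and resolves pixel and percentage values against an available width. It does not use the library's Dimen encoding; the function names are hypothetical, and the kind names other than Dnone are assumed here.

```c
#include <stdlib.h>

/* Kind names other than Dnone are assumed for this sketch. */
enum { Dnone, Dpixels, Dpercent, Drelative };

/*
 * Classify an attribute value such as "120", "50%" or "3*".
 * Stores the numeric part in *spec and returns the kind.
 */
int
parsedimen(const char *s, int *spec)
{
	char *end;
	long v;

	*spec = 0;
	v = strtol(s, &end, 10);
	if(end == s){
		if(*s == '*'){      /* a bare "*" is treated as "1*" */
			*spec = 1;
			return Drelative;
		}
		return Dnone;       /* no number present */
	}
	*spec = (int)v;
	if(*end == '%')
		return Dpercent;
	if(*end == '*')
		return Drelative;
	return Dpixels;
}

/* Resolve a pixel or percentage dimension against an available width. */
int
dimenpixels(int kind, int spec, int total)
{
	switch(kind){
	case Dpixels:
		return spec;
	case Dpercent:
		return total * spec / 100;
	default:
		return total;       /* Dnone and Drelative need layout context */
	}
}
```

Relative weights cannot be resolved in isolation, since they divide whatever space remains after the pixel and percentage dimensions have been satisfied; the fallback to the full width here is only a placeholder.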
It is possible to set the background of the entire document, and also
for some parts of the document (such as tables).
This is encoded as follows:
typedef struct Background Background;
Image, if non-nil, is the URL of an image to use as the background.
If this is nil,
color is used instead, as the RGB value for a solid fill color.
Certain items have alignment specifiers taken from the following
enumerated type:
ALnone = 0, ALleft, ALcenter, ALright, ALjustify,
ALchar, ALtop, ALmiddle, ALbottom, ALbaseline
These values correspond to the various alignment types named in the HTML 4.0
specification.
If an item has an alignment of
ALleft or
ALright, the library automatically encapsulates it inside a float item.
Tables, and the various rows, columns and cells within them, have a more
complex alignment specification, composed of separate vertical and
horizontal alignments:
typedef struct Align Align;
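Halign can be one of
ALnone, ALleft, ALcenter, ALright, ALjustify or ALchar.
Valign can be one of
ALnone, ALmiddle, ALbottom, ALtop or ALbaseline.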
Text items have an associated font number (the
fnt field), which is encoded as
style*NumSize+size. Here,
style is one of
FntR, FntI, FntB or
FntT, for roman, italic, bold and typewriter font styles, respectively, and size is
Tiny, Small, Normal, Large or
Verylarge. The total number of possible font numbers is
NumFnt, and the default font number is
DefFnt (which is roman style, normal size).
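As a sketch of how a font number might be composed and taken apart, assume it is computed as style*NumSize+size. The names FntT, Verylarge, NumFnt and DefFnt come from the text; the remaining constant names, all numeric values and the helper functions are assumptions made for this example.

```c
/* Assumed constants: names partly from the text, numeric values are guesses. */
enum { FntR, FntI, FntB, FntT, NumStyle };
enum { Tiny, Small, Normal, Large, Verylarge, NumSize };
enum {
	NumFnt = NumStyle*NumSize,
	DefFnt = FntR*NumSize + Normal
};

/* Compose a font number from a style and a size. */
int
fntmake(int style, int size)
{
	return style*NumSize + size;
}

/* Recover the style from a font number. */
int
fntstyle(int fnt)
{
	return fnt / NumSize;
}

/* Recover the size from a font number. */
int
fntsize(int fnt)
{
	return fnt % NumSize;
}
```

Under these assumptions a caller can index a table of NumFnt loaded fonts directly by the fnt field of a text item.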
Global information about an HTML page is stored in the following structure:
typedef struct Docinfo Docinfo;
// stuff from HTTP headers, doc head, and body tag
// info needed to respond to user actions
Src gives the URL of the original source of the document,
base is the base URL.
Doctitle is the document's title, as set by a
<title> element.
Background is as described in the section
Background Specifications, and
backgrounditem is set to be an image item for the document's background image (if given as a URL),
or else nil.
Text gives the default foreground text color of the document,
link the unvisited hyperlink color,
vlink the visited hyperlink color, and
alink the color for highlighting hyperlinks (all in 24-bit RGB format).
Target is the default target frame id.
Chset and
mediatype are as for the
chset and
mtype parameters to
parsehtml.
Scripttype is the type of any scripts contained in the document, and is always
TextJavascript.
Hasscripts is set if the document contains any scripts.
Scripting is currently unsupported.
Refresh is the contents of a
<meta http-equiv=Refresh ...> tag, if any.
Kidinfo is set if this document is a frameset (see the section on frames below).
Frameid is this document's frame id.
Anchors is a list of hyperlinks contained in the document, and
dests is a list of hyperlink destinations within the page (see the following section for details).
Forms, tables and
maps are lists of the various forms, tables and client-side maps contained
in the document, as described in subsequent sections.
Images is a list of all the image items in the document.
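The color fields text, link, vlink and alink hold 24-bit RGB values. A small helper for splitting such a value into components might look like this; the 0xRRGGBB layout (red in the high byte) and the function name are assumptions of the sketch.

```c
/* Split a 24-bit RGB color, assumed to be laid out as 0xRRGGBB. */
void
rgbsplit(int color, int *r, int *g, int *b)
{
	*r = (color >> 16) & 0xFF;
	*g = (color >> 8) & 0xFF;
	*b = color & 0xFF;
}
```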
The library builds two lists for all of the
<a> elements (anchors) in a document.
Each anchor is assigned a unique anchor id within the document.
For anchors which are hyperlinks (the
href attribute was supplied), the following structure is defined:
typedef struct Anchor Anchor;
Next points to the next anchor in the list (the head of this list is
Docinfo.anchors).
Index is the anchor id; each item within this hyperlink is tagged with this value
in its anchorid field.
Name and
href are the values of the correspondingly named attributes of the anchor
(in particular, href is the URL to go to).
Target is the value of the target attribute (if provided) converted to a frame id.
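Given an item's anchorid, the corresponding hyperlink can be found by walking the anchor list. The sketch below uses a cut-down stand-in for the Anchor structure containing only the fields it needs, with a char string in place of Rune*; findanchor is a hypothetical helper, not a library routine.

```c
#include <stddef.h>

/* Stand-in for the library's Anchor; only the fields used here are shown. */
typedef struct Anchor Anchor;
struct Anchor {
	Anchor *next;
	int index;          /* anchor id */
	const char *href;   /* the library stores a Rune* */
};

/* Find the hyperlink that an item belongs to, given the item's anchorid. */
Anchor *
findanchor(Anchor *anchors, int anchorid)
{
	Anchor *a;

	if(anchorid == 0)
		return NULL;    /* zero means the item is not inside an anchor */
	for(a = anchors; a != NULL; a = a->next)
		if(a->index == anchorid)
			return a;
	return NULL;
}
```

A viewer would typically call this when the user clicks on an item, then follow href in the frame named by target.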
Destinations within the document (anchors with the name attribute set)
are held in the
Docinfo.dests list, using the following structure:
typedef struct DestAnchor DestAnchor;
Next is the next element of the list,
index is the anchor id,
name is the value of the name attribute, and
item points to the item within the parsed document that should be considered
to be the destination.
Any forms within a document are kept in a list, headed by
Docinfo.forms. The elements of this list are as follows:
typedef struct Form Form;
Next points to the next form in the list.
Formid is a serial number for the form within the document.
Name is the value of the form's name or id attribute.
Action is the value of any action attribute.
Target is the value of the target attribute (if any) converted to a frame target id.
Method is one of
HGet or HPost.
Nfields is the number of fields in the form, and
fields is a linked list of the actual fields.
The individual fields in a form are described by the following structure:
typedef struct Formfield Formfield;
next points to the next field in the list.
Ftype is the type of the field, which can be one of
Fieldid is a serial number for the field within the form.
Form points back to the form containing this field.
cols each contain the values of corresponding attributes of the field, if present.
Flags contains per-field flags, of which
FFmultiple are defined.
Image is only used for fields of type
Fimage; it points to an image item containing the image to be displayed.
Ctlid is reserved for use by the caller, typically to store a unique id
of an associated control used to implement the field.
Events is the same as the corresponding field of the generic attributes
associated with the item containing this field.
Options is only used by fields of type
Fselect; it consists of a list of possible options that may be selected for that
field, using the following structure:
typedef struct Option Option;
Next points to the next element of the list.
Selected is set if this option is to be displayed initially.
Value is the value to send when the form is submitted if this option is selected.
Display is the string to display on the screen for this option.
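When building a control for a select field, the caller usually needs the option to show first. A sketch of that lookup follows, using a stand-in Option structure with char strings in place of Rune*; firstselected is a hypothetical helper.

```c
#include <stddef.h>

/* Stand-in for the library's Option structure. */
typedef struct Option Option;
struct Option {
	Option *next;
	int selected;
	const char *value;      /* sent when the form is submitted */
	const char *display;    /* shown on the screen */
};

/* Return the first option marked as initially selected, or NULL if none is. */
Option *
firstselected(Option *opts)
{
	Option *o;

	for(o = opts; o != NULL; o = o->next)
		if(o->selected)
			return o;
	return NULL;
}
```

What to display when no option is marked as selected is left to the caller; the library text does not say.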
The library builds a list of all the tables in the document,
Docinfo.tables. Each element of this list has the following format:
typedef struct Table Table;
Next points to the next element in the list of tables.
Tableid is a serial number for the table within the document.
Rows is an array of row specifications (described below) and
nrow is the number of elements in this array.
cols is an array of column specifications, and
ncol the size of this array.
Cells is a list of all cells within the table (structure described below) and
ncell is the number of elements in this list.
Note that a cell may span multiple rows and/or columns, thus
ncell may be smaller than
nrow*ncol.
Grid is a two-dimensional array of cells within the table; the cell
at row
i and column
j is
Table.grid[i][j]. A cell that spans multiple rows and/or columns will
be referenced by
grid multiple times, however it will only occur once in
cells.
Align gives the alignment specification for the entire table, and
width gives the requested width as a dimension specification.
Border, cellspacing and
cellpadding give the values of the corresponding attributes for the table, and
background gives the requested background for the table.
Caption is a linked list of items to be displayed as the caption of the
table, either above or below depending on whether its placement is ALtop or
ALbottom. Most of the remaining fields are reserved for use by the caller,
except
tabletok, which is reserved for internal use.
The type
Lay is not defined by the library; the caller can provide its
own definition.
The
Tablecol structure is defined for use by the caller.
The library ensures that the correct number of these
is allocated, but leaves them blank.
The fields are as follows:
typedef struct Tablecol Tablecol;
The rows in the table are specified as follows:
typedef struct Tablerow Tablerow;
Next is only used during parsing; it should be ignored by the caller.
Cells provides a list of all the cells in a row, linked through their
nextinrow fields (see below).
pos are reserved for use by the caller.
Align is the alignment specification for the row, and
background is the background to use, if specified.
Flags is used by the parser; ignore this field.
The individual cells of the table are described as follows:
typedef struct Tablecell Tablecell;
Next is used to link together the list of all cells within a table
nextinrow is used to link together all the cells within a single row
Cellid provides a serial number for the cell within the table.
Content is a linked list of the items to be laid out within the cell.
Lay is reserved for the user to describe how these items have
been laid out.
Rowspan and
colspan are the number of rows and columns spanned by this cell,
respectively.
Align is the alignment specification for the cell.
Flags is some combination of
TFparsing, TFnowrap and
TFisth or'd together.
TFparsing is used internally by the parser, and should be ignored.
TFnowrap means that the contents of the cell should not be
wrapped if they don't fit the available width;
rather, the table should be expanded if need be
(this is set when the nowrap attribute is supplied).
TFisth means that the cell was created by the
<th> element (rather than the
<td> element),
indicating that it is a header cell rather than a data cell.
Wspec provides a suggested width as a dimension specification, and
hspec provides a suggested height in pixels.
Background gives a background specification for the individual cell.
pos are reserved for use by the caller during layout.
Row and
col give the indices of the row and column of the top left-hand
corner of the cell within the table grid.
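Because a spanning cell is referenced from several grid positions, code that walks the grid usually wants to handle each cell once. One way, sketched below, is to act only at the position that matches the cell's own row and col. The structures are cut-down stand-ins containing only the fields used, and countcells is a hypothetical helper.

```c
#include <stddef.h>

/* Stand-ins for the library's structures; only the fields used here are shown. */
typedef struct Tablecell Tablecell;
struct Tablecell {
	int row;    /* row of the top left-hand corner */
	int col;    /* column of the top left-hand corner */
};

typedef struct Table Table;
struct Table {
	int nrow;
	int ncol;
	Tablecell ***grid;  /* grid[i][j] is the cell at row i, column j */
};

/* Count each cell once, at its top left-hand grid position. */
int
countcells(Table *t)
{
	int i, j, n = 0;
	Tablecell *c;

	for(i = 0; i < t->nrow; i++)
		for(j = 0; j < t->ncol; j++){
			c = t->grid[i][j];
			if(c != NULL && c->row == i && c->col == j)
				n++;
		}
	return n;
}
```

For a well-formed table the result should equal ncell; walking the cells list through the next fields is the more direct way to visit every cell when grid position does not matter.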
The library builds a list of client-side maps, headed by
Docinfo.maps, and having the following structure:
typedef struct Map Map;
Next points to the next element in the list,
name is the name of the map (used to bind it to an image), and
areas is a list of the areas within the image that comprise the map,
using the following structure:
typedef struct Area Area;
Next points to the next element in the map's list of areas.
Shape describes the shape of the area, and is one of
Href is the URL associated with this area in its role as
a hypertext link, and
target is the target frame it should be loaded in.
Coords is an array of coordinates for the shape, and
ncoords is the size of this array (number of elements).
If the
Docinfo.kidinfo field is set, the document is a frameset.
In this case, it is typical for
parsehtml to return nil, as a document which is a frameset should have no actual
items that need to be laid out (such will appear only in subsidiary documents).
It is possible that items will be returned by a malformed document; the caller
should check for this and free any such items.
The
Kidinfo structure itself reflects the fact that framesets can be nested within a document.
It is defined as follows:
typedef struct Kidinfo Kidinfo;
// fields for "frame"
// fields for "frameset"
Next is only used if this structure is part of a containing frameset; it points to the next
element in the list of children of that frameset.
Isframeset is set when this structure represents a frameset; if clear, it is an individual frame.
Some fields are used only for framesets.
Rows is an array of dimension specifications for rows in the frameset, and
nrows is the length of this array.
Cols is the corresponding array for columns, of length
ncols.
Kidinfos points to a list of components contained within this frameset, each
of which may be a frameset or a frame.
Nextframeset is only used during parsing, and should be ignored.
The remaining fields are used if the structure describes a frame, not a frameset.
Src provides the URL for the document that should be initially loaded into this frame.
Note that this may be a relative URL, in which case it should be interpreted
using the containing document's URL as the base.
Name gives the name of the frame, typically supplied via a name attribute in the HTML.
If no name was given, the library allocates one.
Marginw, marginh and
framebd are the values of the marginwidth, marginheight and frameborder attributes, respectively.
Flags can contain some combination of the following:
FRnoresize (the frame had the noresize attribute set, and the user should not be allowed to resize it),
FRnoscroll (the frame should not have any scroll bars),
FRhscroll (the frame should have a horizontal scroll bar),
FRvscroll (the frame should have a vertical scroll bar),
FRhscrollauto (the frame should be automatically given a horizontal scroll bar if its contents
would not otherwise fit), and
FRvscrollauto (the frame gets a vertical scrollbar only if required).
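Since framesets nest, code that processes a Kidinfo tree is naturally recursive. The sketch below counts the individual frames under a frameset, using a stand-in structure that keeps only the next, isframeset and kidinfos fields; countframes is a hypothetical helper.

```c
#include <stddef.h>

/* Stand-in for the library's Kidinfo; only the fields used here are shown. */
typedef struct Kidinfo Kidinfo;
struct Kidinfo {
	Kidinfo *next;      /* next sibling within the containing frameset */
	int isframeset;     /* set for a frameset, clear for a frame */
	Kidinfo *kidinfos;  /* children, when this is a frameset */
};

/* Count the individual frames reachable from a list of Kidinfo nodes. */
int
countframes(Kidinfo *k)
{
	int n = 0;

	for(; k != NULL; k = k->next){
		if(k->isframeset)
			n += countframes(k->kidinfos);
		else
			n++;
	}
	return n;
}
```

The same traversal shape applies when loading documents: for each frame reached, the caller would resolve src against the containing document's URL and parse the result separately.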
W3C World Wide Web Consortium,
HTML 4.01 Specification.
The entire HTML document must be loaded into memory before
any of it can be parsed.