NAME

SGML::Parser::OpenSP - Parse SGML documents using OpenSP

SYNOPSIS

    use SGML::Parser::OpenSP;

    my $p = SGML::Parser::OpenSP->new;
    my $h = ExampleHandler->new;

    $p->catalogs(qw(xhtml.soc));
    $p->warnings(qw(xml valid));
    $p->handler($h);

    $p->parse("example.xhtml");

DESCRIPTION

This module provides an interface to the OpenSP SGML parser. OpenSP and this module are event based: as the parser recognizes parts of the document (say, the start or end of an element), any handlers registered for that type of event are called with suitable parameters.

COMMON METHODS
CONFIGURATION

BOOLEAN OPTIONS
ERROR MESSAGE FORMAT
GENERATED EVENTS
IO SETTINGS
PROCESSING OPTIONS
ENABLING WARNINGS

Additional warnings can be enabled using

    $p->warnings([@warnings])

The following values can be used to enable warnings:
DISABLING WARNINGS

A warning can be disabled by using its name prefixed with "no-". Thus calling

    warnings(qw(all no-duplicate))

will enable all warnings except those about duplicate entity declarations. The following values for warnings() disable errors:
XML WARNINGS

The following warnings are turned on for the "xml" warning described above:
PROCESSING FILES

To start processing a document and receive events, call the "parse" method. It takes one argument specifying the path to a file (not a file handle). You must set an event handler using the "handler" method before calling it. The return value of "parse" is currently undefined.

EVENT HANDLERS

To receive data from the parser you need to write an event handler. For example,

    package ExampleHandler;

    sub new { bless {}, shift }

    sub start_element {
        my ($self, $elem) = @_;
        printf " * %s\n", $elem->{Name};
    }

This handler prints all the element names as they are found in the document. For a typical XHTML document this might result in something like

     * html
     * head
     * title
     * body
     * p
     * ...

The events closely match those in the generic interface to OpenSP; see <http://openjade.sf.net/doc/generic.htm> for more information. The event names have been changed to lowercase with underscores separating words, and property names are capitalized. Arrays are represented as Perl array references. "Position" information is not passed to the handler but is made available through the "get_location" method, which can be called from event handlers. Some redundant information has also been stripped, and the generic identifier of an element is stored in the "Name" hash entry.

For example, for an EndElementEvent the "end_element" handler gets called with a hash reference

    {
        Name => 'gi'
    }

The following events are defined:

    * appinfo
    * processing_instruction
    * start_element
    * end_element
    * data
    * sdata
    * external_data_entity_ref
    * subdoc_entity_ref
    * start_dtd
    * end_dtd
    * end_prolog
    * general_entity       # set $p->output_general_entities(1)
    * comment_decl         # set $p->output_comment_decls(1)
    * marked_section_start # set $p->output_marked_sections(1)
    * marked_section_end   # set $p->output_marked_sections(1)
    * ignored_chars        # set $p->output_marked_sections(1)
    * error
    * open_entity_change

If the documentation of the generic interface to OpenSP states that certain data is not valid, it will not be available through this interface (i.e., the respective key does not exist in the hash ref).

POSITIONING INFORMATION

Event handlers can call the "get_location" method on the parser object to retrieve positioning information. The method returns a hash reference with the following properties:

    LineNumber   => ..., # line number
    ColumnNumber => ..., # column number
    ByteOffset   => ..., # number of preceding bytes
    EntityOffset => ..., # number of preceding bit combinations
    EntityName   => ..., # name of the external entity
    FileName     => ..., # name of the file

These can be "undef" or an empty string.

POST-PROCESSING ERROR MESSAGES

OpenSP returns error messages in the form of a string rather than as individual components of the message such as line numbers or message text. The "split_message" method on the parser object can be used to post-process these error message strings as reliably as possible. It can be used, for example, from an error event handler if the parser object is accessible:

    sub error {
        my $self = shift;
        my $erro = shift;
        my $mess = $self->{parser}->split_message($erro);
    }

See the documentation of "split_message" in the SGML::Parser::OpenSP::Tools documentation.

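The positioning and error-handling descriptions above can be combined in one handler. The following is a minimal sketch, not code from the distribution: the package name "LocatingHandler", the "parser" constructor argument, and the "seen" and "errors" fields are hypothetical names chosen for illustration. It assumes only what is described above, namely that "get_location" returns a hash reference whose values may be "undef" or empty, and that "split_message" accepts the value passed to the "error" handler.

```perl
package LocatingHandler;
use strict;
use warnings;

# Hypothetical constructor: the parser object is stored in the handler
# so that event handlers can call back into it.
sub new {
    my ($class, %args) = @_;
    return bless { parser => $args{parser}, seen => [], errors => [] }, $class;
}

# Location values can be undef or an empty string; show '?' in that case.
sub _or_unknown {
    my ($value) = @_;
    return (defined $value && length $value) ? $value : '?';
}

# Record each element name together with the current line and column.
sub start_element {
    my ($self, $elem) = @_;
    my $loc = $self->{parser}->get_location;
    push @{ $self->{seen} }, sprintf '%s at %s:%s',
        $elem->{Name},
        _or_unknown($loc->{LineNumber}),
        _or_unknown($loc->{ColumnNumber});
    return;
}

# Keep whatever split_message returns; its structure is documented in
# SGML::Parser::OpenSP::Tools and is not assumed here.
sub error {
    my ($self, $erro) = @_;
    my $mess = $self->{parser}->split_message($erro);
    push @{ $self->{errors} }, $mess;
    return;
}

# Wiring it up requires the module to be installed:
#   my $p = SGML::Parser::OpenSP->new;
#   $p->handler(LocatingHandler->new(parser => $p));
#   $p->parse("example.xhtml");

1;
```

Collecting results in the handler object instead of printing them keeps the handler easy to inspect after "parse" returns, which matters because the return value of "parse" itself is undefined.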
UNICODE SUPPORT

All strings returned from event handlers and helper routines are UTF-8 encoded with the UTF-8 flag turned on. Helper functions like "split_message" expect (but don't check) that string arguments are UTF-8 encoded and have the UTF-8 flag turned on. The behavior of helper functions is undefined when you pass unexpected input, so this should be avoided.

"parse" has limited support for binary input, but the binary input must be compatible with OpenSP's generic interface requirements, and you must specify the encoding through means available to OpenSP so that it can properly decode the input. Any encoding metadata about such binary input that is specific to Perl (such as encoding disciplines for file handles when you pass a file descriptor) will be ignored. For more specific information refer to the OpenSP manual.
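Because the helper functions do not check their input, a caller holding strings from another source (for example, messages read back from a log file) may want to normalize them first. The following is a sketch using only core Perl facilities; "as_flagged_utf8" is a hypothetical helper that is not part of this module, and it assumes that any unflagged string holds UTF-8 encoded bytes.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper, not part of SGML::Parser::OpenSP.
# Returns a character string that has the UTF-8 flag turned on.
sub as_flagged_utf8 {
    my ($s) = @_;
    return $s if utf8::is_utf8($s);    # already a flagged character string

    # Assumption: unflagged input holds UTF-8 encoded bytes.
    # FB_CROAK makes malformed input die instead of being silently repaired;
    # LEAVE_SRC keeps the source string untouched.
    my $chars = Encode::decode('UTF-8', $s,
        Encode::FB_CROAK | Encode::LEAVE_SRC);

    # utf8::upgrade turns the flag on even for pure ASCII text.
    utf8::upgrade($chars);
    return $chars;
}
```

Strings received in event handlers should not need this step, since they are already flagged. If the unflagged input is Latin-1 text and not UTF-8 bytes, the assumption above does not hold and the decode step would have to change.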
ENVIRONMENT VARIABLES

OpenSP supports a number of environment variables to control specific processing aspects, such as "SGML_SEARCH_PATH" or "SP_CHARSET_FIXED". Portable applications need to ensure that these are set before the OpenSP library is loaded into memory, which happens when the XS code is loaded. This means you need to wrap the code in a "BEGIN" block:

    BEGIN { $ENV{SP_CHARSET_FIXED} = 1; }
    use SGML::Parser::OpenSP;
    # ...

Otherwise changes to the environment might not propagate to OpenSP. This applies specifically to Win32 systems.

Note that you can use the "search_dirs" method instead of "SGML_SEARCH_PATH", the "catalogs" method instead of "SGML_CATALOG_FILES", and attributes on storage object specifications in place of "SP_BCTF" and "SP_ENCODING" respectively. For example, if "SP_CHARSET_FIXED" is set to 1 you can use

    $p->parse("<OSFILE encoding='UTF-8'>example.xhtml");

to process "example.xhtml" using the "UTF-8" character encoding.

KNOWN ISSUES

OpenSP must be compiled with "SP_MULTI_BYTE" defined and with "SP_WIDE_SYSTEM" undefined; otherwise this module will break at runtime or fail to compile.

BUG REPORTS

Please report bugs in this module via <http://rt.cpan.org/NoAuth/Bugs.html?Dist=SGML-Parser-OpenSP>

Please report bugs in OpenSP via <http://sf.net/tracker/?group_id=2115&atid=102115>

Please send comments and questions to the spo-devel mailing list, see <http://lists.sf.net/lists/listinfo/spo-devel> for details.

SEE ALSO
AUTHORS

Terje Bless <link@cpan.org> wrote version 0.01. Bjoern Hoehrmann <bjoern@hoehrmann.de> wrote version 0.02+.

COPYRIGHT AND LICENSE

Copyright (c) 2006-2008 Bjoern Hoehrmann <bjoern@hoehrmann.de>. This module is licensed under the same terms as Perl itself.