|LocalName||The name of the element minus any namespace prefix it may have come with in the document.|
|NamespaceURI||The URI of the namespace associated with this element, or the empty string for none.|
|Attributes||A set of attributes as described below.|
|Name||The name of the element as it was seen in the document (i.e. including any prefix associated with it).|
|Prefix||The prefix used to qualify this element's namespace, or the empty string if none.|
The value of each entry in the attributes hash is another hash structure consisting of:
|LocalName||The name of the attribute minus any namespace prefix it may have come with in the document.|
|NamespaceURI||The URI of the namespace associated with this attribute. If the attribute had no prefix, then this consists of just the empty string.|
|Name||The attribute's name as it appeared in the document, including any namespace prefix.|
|Prefix||The prefix used to qualify this attribute's namespace, or the empty string if none.|
|Value||The value of the attribute.|
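As a rough illustration of these structures, here is a sketch of a handler whose start_element method reads the element and attribute hashes. The package name MyHandler and the sample data are made up for the example, and the event is built by hand rather than coming from a real parser. The "{NamespaceURI}LocalName" key used for the Attributes entry is the usual Perl SAX 2 convention; the loop does not rely on it.

```perl
use strict;
use warnings;

package MyHandler;   # hypothetical handler class, for illustration only

sub new { return bless { seen => [] }, shift }

# Called once per opening tag with the element structure described above.
sub start_element {
    my ($self, $el) = @_;
    my $line = "<$el->{Name}> local=$el->{LocalName} ns=$el->{NamespaceURI}";
    # Each value in Attributes is itself a hash with Name, Value and so on.
    for my $key (sort keys %{ $el->{Attributes} }) {
        my $attr = $el->{Attributes}{$key};
        $line .= " $attr->{Name}=\"$attr->{Value}\"";
    }
    push @{ $self->{seen} }, $line;
}

package main;

# Hand-built event, standing in for what a parser would pass.
my $handler = MyHandler->new;
$handler->start_element({
    Name         => 'x:item',
    LocalName    => 'item',
    Prefix       => 'x',
    NamespaceURI => 'http://example.com/ns',
    Attributes   => {
        '{}id' => {
            Name         => 'id',
            LocalName    => 'id',
            Prefix       => '',
            NamespaceURI => '',
            Value        => '42',
        },
    },
});
print "$_\n" for @{ $handler->{seen} };
```

Passing an instance of such a class as the Handler to a real parser would trigger the same method once for every opening tag.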
The end_element handler is called either when a parser sees a closing tag, or after start_element has been called for an empty element. (Do note, however, that a parser may, if it is so inclined, call characters with an empty string when it sees an empty element.) There is no simple way in SAX to determine whether the parser in fact saw an empty element or a start and end element with no content.
The end_element handler receives exactly the same structure as start_element, minus the Attributes entry. Note, though, that it should not be a reference to the same data that start_element receives, so you may change the values in start_element without affecting the values later seen by end_element.
The characters callback may be called in several circumstances. The most obvious one is when seeing ordinary character data in the markup, but it is also called for text in a CDATA section, and in other situations too. A SAX parser makes no guarantees whatsoever about how many times it may call characters for a stretch of text in an XML document - it may call once, or it may call once for every character in the text. To work around this, it is often important for the SAX developer to use a bundling technique, where text is gathered up and processed in one of the other callbacks. This is not always necessary, but it is a worthwhile technique to learn, which we will cover in XML::SAX::Advanced (when I get around to writing it).
The characters handler is called with a very simple structure - a hash reference consisting of just one entry:
|Data||The text data that was received.|
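The bundling technique mentioned above can be sketched roughly as follows. This is only one way to do it, under the simplifying assumption that the elements of interest contain text and nothing else. The package name TextBundler is invented for the example, and the events are fed in by hand to mimic a parser that delivers the text in pieces.

```perl
use strict;
use warnings;

package TextBundler;   # hypothetical handler class, for illustration only

sub new { return bless { buffer => '', chunks => [] }, shift }

# Start each element with an empty buffer.
sub start_element {
    my ($self, $el) = @_;
    $self->{buffer} = '';
}

# May be called any number of times for one run of text, so only append.
sub characters {
    my ($self, $data) = @_;
    $self->{buffer} .= $data->{Data};
}

# The text is known to be complete only once the closing tag arrives.
sub end_element {
    my ($self, $el) = @_;
    push @{ $self->{chunks} }, "$el->{Name}: $self->{buffer}"
        if length $self->{buffer};
    $self->{buffer} = '';
}

package main;

my $bundler = TextBundler->new;
$bundler->start_element({ Name => 'title' });
# The same text, delivered in three separate characters calls.
$bundler->characters({ Data => $_ }) for 'Hel', 'lo ', 'SAX';
$bundler->end_element({ Name => 'title' });
print "$_\n" for @{ $bundler->{chunks} };   # prints "title: Hello SAX"
```

Because the work happens in end_element, the handler behaves the same whether the parser calls characters once or once per character.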
The comment callback is called for comment text. Unlike with characters(), the comment callback *must* be invoked just once for an entire comment string. It receives a single simple structure - a hash reference containing just one entry:
|Data||The text of the comment.|
The processing instruction handler is called for all processing instructions in the document. Note that these processing instructions may appear before the document root element, or after it, or anywhere where text and elements would normally appear within the document, according to the XML specification.
The handler is passed a structure containing just two entries:
|Target||The target of the processing instruction.|
|Data||The text data in the processing instruction. Can be an empty string for a processing instruction that has no data element. For example <?wiggle?> is a perfectly valid processing instruction.|
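To make the last two structures concrete, here is a small sketch of a handler that records comments and processing instructions. The package name NoteCollector and the sample events are invented for the example, and the handler is called by hand instead of by a parser.

```perl
use strict;
use warnings;

package NoteCollector;   # hypothetical handler class, for illustration only

sub new { return bless { notes => [] }, shift }

# Receives the whole comment text in a single call.
sub comment {
    my ($self, $comment) = @_;
    push @{ $self->{notes} }, "comment:$comment->{Data}";
}

# Data is the empty string when the instruction has only a target.
sub processing_instruction {
    my ($self, $pi) = @_;
    my $data = length($pi->{Data}) ? $pi->{Data} : '(no data)';
    push @{ $self->{notes} }, "pi: target=$pi->{Target} data=$data";
}

package main;

my $collector = NoteCollector->new;
$collector->processing_instruction({ Target => 'wiggle', Data => '' });
$collector->comment({ Data => ' generated file ' });
$collector->processing_instruction({
    Target => 'xml-stylesheet',
    Data   => 'href="a.xsl"',
});
print "$_\n" for @{ $collector->{notes} };
```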
What we have discussed above is really the tip of the SAX iceberg. And so far it looks like there's not much of interest to SAX beyond what we have seen with XML::Parser. But it does go much further than that, I promise.
People who hate Object Oriented code for the sake of it may be thinking here that creating a new package just to parse something is a waste when they've been parsing things just fine up to now using procedural code. But there's reason to all this madness. And that reason is SAX Filters.
As you saw right at the very start, to let the parser know about our class, we pass it an instance of our class as the Handler to the parser. But now imagine what would happen if our class could also take a Handler option, and simply do some processing and pass on our data further down the line? That in a nutshell is how SAX filters work. It's Unix pipes for the 21st century!
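A hand-rolled sketch of that idea might look like the following. Both class names (UpcaseFilter and Recorder) are invented for the example, only three events are forwarded, and the events are fired by hand where a parser would normally sit. It is meant to show the shape of a filter, not to be a complete one.

```perl
use strict;
use warnings;

package UpcaseFilter;   # hypothetical filter, for illustration only

# Like a parser, the filter takes a Handler option.
sub new {
    my ($class, %opt) = @_;
    return bless { Handler => $opt{Handler} }, $class;
}

# Element events are passed down the line untouched.
sub start_element {
    my ($self, $el) = @_;
    $self->{Handler}->start_element($el);
}

sub end_element {
    my ($self, $el) = @_;
    $self->{Handler}->end_element($el);
}

# Text is processed, then passed on in a fresh structure.
sub characters {
    my ($self, $data) = @_;
    $self->{Handler}->characters({ Data => uc($data->{Data}) });
}

package Recorder;   # hypothetical end of the chain: writes events to a string

sub new { return bless { out => '' }, shift }

sub start_element { my ($self, $el)   = @_; $self->{out} .= "<$el->{Name}>" }
sub end_element   { my ($self, $el)   = @_; $self->{out} .= "</$el->{Name}>" }
sub characters    { my ($self, $data) = @_; $self->{out} .= $data->{Data} }

package main;

my $sink   = Recorder->new;
my $filter = UpcaseFilter->new(Handler => $sink);

# Stand-in for a parser: fire the events by hand.
$filter->start_element({ Name => 'p' });
$filter->characters({ Data => 'pipes' });
$filter->end_element({ Name => 'p' });
print "$sink->{out}\n";   # prints "<p>PIPES</p>"
```

Since the filter only talks to whatever object it was given as Handler, another filter could be slotted in between it and the Recorder without either of them knowing.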
There are two downsides to this. Firstly, writing SAX filters can be tricky. If you look into the future and read the advanced tutorial I'm writing, you'll see that Handler can come in several shapes and sizes, so making sure your filter does the right thing can be tricky. Secondly, constructing complex filter chains can be difficult, and simple thinking tells us that we only get one pass at our document, when often we'll need more than that.
Two modules help with these downsides. The first is XML::SAX::Base. This is a VITAL SAX module that acts as a base class for all SAX parsers and filters. It provides an abstraction away from calling the handler methods, which makes sure your filter or parser does the right thing, and it does it FAST. So, if you ever need to write a SAX filter (and if you're processing XML -> XML, or XML -> HTML, then you probably do), you need to be writing it as a subclass of XML::SAX::Base. Really - this is not advice to ignore lightly. I will not go into the details of writing a SAX filter here. Kip Hampton, the author of XML::SAX::Base, has covered this nicely in his article on XML.com here <URI>.
To construct SAX pipelines, Barrie Slaymaker, a long-time Perl hacker whose modules you will probably have heard of or used, wrote a very clever module called XML::SAX::Machines. This combines some really clever SAX filter-type modules with a construction toolkit for filters that makes building pipelines easy. But before we see how it makes things easy, first let's see how tricky it looks to build complex SAX filter pipelines.
    use XML::SAX::ParserFactory;
    use XML::Filter::Filter1;
    use XML::Filter::Filter2;
    use XML::SAX::Writer;

    my $output_string;

    # Build the chain back to front: each stage needs its Handler to exist first.
    my $writer  = XML::SAX::Writer->new(Output => \$output_string);
    my $filter2 = XML::Filter::Filter2->new(Handler => $writer);
    my $filter1 = XML::Filter::Filter1->new(Handler => $filter2);
    my $parser  = XML::SAX::ParserFactory->parser(Handler => $filter1);

    # Events flow parser -> filter1 -> filter2 -> writer.
    $parser->parse_uri("foo.xml");
This is a lot easier with XML::SAX::Machines:
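A minimal sketch of such a pipeline, assuming XML::SAX::Machines, XML::SAX::Writer and at least one XML::SAX parser are installed from CPAN. My::Filter1 and My::Filter2 are stand-in filters defined inline so the example is complete; they are not real modules, and the input is a literal string rather than foo.xml.

```perl
use strict;
use warnings;
use XML::SAX::Machines qw(Pipeline);

# Stand-in filters (hypothetical); real code would load its own filter classes.
package My::Filter1 {
    use parent 'XML::SAX::Base';

    # Upper-case all text, then hand it to the next stage.
    sub characters {
        my ($self, $data) = @_;
        $self->SUPER::characters({ Data => uc($data->{Data}) });
    }
}

package My::Filter2 {
    use parent 'XML::SAX::Base';

    # Replace the letter O with a zero, then hand the text on.
    sub characters {
        my ($self, $data) = @_;
        (my $text = $data->{Data}) =~ tr/O/0/;
        $self->SUPER::characters({ Data => $text });
    }
}

my $output_string;

# Stages are listed in the order the events flow through them.
my $parser = Pipeline(
    My::Filter1->new(),
    My::Filter2->new(),
    \$output_string,
);

$parser->parse_string('<doc>hello</doc>');
print "$output_string\n";
```

Pipeline also accepts class names in place of objects, in which case it loads and instantiates the filters itself.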
One of the main benefits of XML::SAX::Machines is that the pipelines are constructed in natural order, rather than the reverse order we saw with manual pipeline construction. XML::SAX::Machines takes care of all the internals of pipe construction, providing you at the end with just a parser you can use (and you can re-use the same parser as many times as you need to).
Just a final tip. If you ever get stuck and are confused about what is being passed from one SAX filter or parser to the next, then Devel::TraceSAX will come to your rescue. This Perl debugger plugin will allow you to dump the SAX stream of events as it goes by. Usage is really very simple: just call your Perl script that uses SAX as follows:
$ perl -d:TraceSAX <scriptname>
Matt Sergeant, email@example.com
|perl v5.20.3||SAX::INTRO (3)||2009-10-10|