The Tokenizer maintains the current token class at all times. Much of the
time is just going to be the Whitespace class, which is what the base of
a document is. As the tokenizer executes the various character handlers,
the class changes a lot as it moves a long. In fact, in some instances,
the character handler may not handle the character directly itself, but
rather change the current class and then hand off to the character
handler for the new class.
Because of this, and some other things Ill deal with later, the number of times the character handlers are called does not in fact have a direct relationship to the number of actual characters in the document.
Rather than create a class stack to allow for infinitely nested layers of
classes, the Tokenizer recognises just a single layer.
To put it a different way, in various parts of the file, the Tokenizer will recognise different base or substrate classes. When a Token such as a comment or a number is finalised by the tokenizer, it falls back to the base state.
This allows proper tokenization of special areas such as __DATA__ and __END__ blocks, which also contain things like comments and POD, without allowing the creation of any significant Tokens inside these areas.
For the main part of a document we use PPI::Token::Whitespace for this, with the idea being that code is floating in a sea of whitespace.
The final main state variable is the current token. This is the Token
that is currently being built by the Tokenizer. For certain types, it
can be manipulated and morphed and change class quite a bit while being
assembled, as the Tokenizers understanding of the token content changes.
When the Tokenizer is confident that it has seen the end of the Token, it will be finalized, which adds it to the output token array and resets the current class to that of the zone that we are currently in.
I should also note at this point that the current token variable is optional. The Tokenizer is capable of knowing what class it is currently set to, without actually having accumulated any characters in the Token.
As Im sure you can imagine, calling several different methods for each character and running regexes and other complex heuristics made the first fully working version of the tokenizer extremely slow.
During testing, I created a metric to measure parsing speed called LPGC, or lines per gigacycle . A gigacycle is simple a billion CPU cycles on a typical single-core CPU, and so a Tokenizer running at 1000 lines per gigacycle should generate around 1200 lines of tokenized code when running on a 1200 MHz processor.
The first working version of the tokenizer ran at only 350 LPGC, so to tokenize a typical large module such as ExtUtils::MakeMaker took 10-15 seconds. This sluggishness made it unpractical for many uses.
So in the current parser, there are multiple layers of optimisation very carefully built in to the basic. This has brought the tokenizer up to a more reasonable 1000 LPGC, at the expense of making the code quite a bit twistier.
The first step in the optimisation process was to add a hew handler to enable several of the more basic classes (whitespace, comments) to be able to be parsed a line at a time. At the start of each line, a special optional handler (only supported by a few classes) is called to check and see if the entire line can be parsed in one go.
This is used mainly to handle things like POD, comments, empty lines, and a few other minor special cases.
The second stage of the optimisation involved inlining a small number of critical methods that were repeated an extremely high number of times. Profiling suggested that there were about 1,000,000 individual method calls per gigacycle, and by cutting these by two thirds a significant speed improvement was gained, in the order of about 50%.
You may notice that many methods in the PPI::Tokenizer code look very nested and long hand. This is primarily due to this inlining.
At around this time, some statistics code that existed in the early versions of the parser was also removed, as it was determined that it was consuming around 15% of the CPU for the entire parser, while making the core more complicated.
A judgment call was made that with the difficulties likely to be encountered with future planned enhancements, and given the relatively high cost involved, the statistics features would be removed from the Tokenizer.
Once inlining had reached diminishing returns, it became obvious from the profiling results that a huge amount of time was being spent stepping a char at a time though long, simple and syntactically boring code such as comments and strings.
The existing regex engine was expanded to also encompass quotes and other quote-like things, and a special abstract base class was added that provided a number of specialised parsing methods that would scan ahead, looking out ahead to find the end of a string, and updating the cursor to leave it in a valid position for the next call.
This is also the point at which the number of character handler calls began to greatly differ from the number of characters. But it has been done in a way that allows the parser to retain the power of the original version at the critical points, while skipping through the boring bits as needed for additional speed.
The addition of this feature allowed the tokenizer to exceed 1000 LPGC for the first time.
As it became evident that great speed increases were available by using this skipping ahead mechanism, a new handler method was added that explicitly handles the parsing of an entire token, where the structure of the token is relatively simple. Tokens such as symbols fit this case, as once we are passed the initial sigil and word char, we know that we can skip ahead and complete the rest of the token much more easily.
A number of these have been added for most or possibly all of the common cases, with most of these complete handlers implemented using regular expressions.
In fact, so many have been added that at this point, you could arguably reclassify the tokenizer as a hybrid regex, char-by=char heuristic tokenizer. More tokens are now consumed in complete methods in a typical program than are handled by the normal char-by-char methods.
Many of the these complete-handlers were implemented during the writing of the Lexer, and this has allowed the full parser to maintain around 1000 LPGC despite the increasing weight of the Lexer.
While it would be extraordinarily difficult to port all of the Tokenizer to C, work has started on a PPI::XS accelerator package which acts as a separate and automatically-detected add-on to the main PPI package.
PPI::XS implements faster versions of a variety of functions scattered over the entire PPI codebase, from the Tokenizer Core, Quote Engine, and various other places, and implements them identically in XS/C.
In particular, the skip-ahead methods from the Quote Engine would appear to be extremely amenable to being done in C, and a number of other functions could be cherry-picked one at a time and implemented in C.
Each method is heavily tested to ensure that the functionality is identical, and a versioning mechanism is included to ensure that if a function gets out of sync, PPI::XS will degrade gracefully and just not replace that single method.
- Add an option to reset or seek the token stream...
- Implement more Tokenizer functions in PPI::XS
See the support section in the main module.
Adam Kennedy <email@example.com>
Copyright 2001 - 2011 Adam Kennedy.
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
The full text of the license can be found in the LICENSE file included with this module.
|perl v5.20.3||PPI::TOKENIZER (3)||2014-11-11|