- Analyze the contents of a Perl document to automatically generate documentation, in parallel to, or as a replacement for, POD documentation.
- Allow an indexer to locate and process all the comments and documentation from code for full text search applications.
Structural and Quality Analysis
- Determine quality or other metrics across a body of code, and identify situations relating to particular phrases, techniques or locations.
- Index functions, variables and packages within Perl code, and perform search and graph (in the node/edge sense) analysis of large code bases.
Refactoring
    Make structural, syntax, or other changes to code in an automated manner, either independently or in assistance to an editor. This sort of task list includes backporting, forward porting, partial evaluation, improving code, or whatever. All the sort of things you'd want from a Perl::Editor.
Layout
    Change the layout of code without changing its meaning. This includes techniques such as tidying (like perltidy), obfuscation, compressing and squishing, or implementing formatting preferences or policies.
Presentation
    This includes methods of improving the presentation of code without changing the content of the code: modifying, improving, or syntax colouring the presentation of a Perl document, and generating IntelliText-like functions.
PPI seeks to be good enough to achieve all of the above tasks, or to provide a sufficiently good API on which to allow others to implement modules in these and related areas.
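As a small illustration of the documentation and search tasks listed above, the following sketch collects every comment and POD block from a piece of source so an indexer could process them. The sample source is invented for the example; only `PPI::Document->new`, `find` and `content` are used.

```perl
use strict;
use warnings;
use PPI;

# Hypothetical sample source, built line by line.
my $source = join "\n",
    '# Compute a total',
    'my $total = 1 + 2;    # inline note',
    '',
    '=pod',
    '',
    'Some documentation.',
    '',
    '=cut',
    '',
    'print $total;',
    '';

my $doc = PPI::Document->new( \$source )
    or die "Could not parse the sample source";

# find() returns an array reference on success and a false value
# when nothing matches, hence the fallback to an empty list.
my $comments = $doc->find('PPI::Token::Comment') || [];
my $pod      = $doc->find('PPI::Token::Pod')     || [];

# Hand the raw text of each element to whatever does the indexing.
for my $element ( @$comments, @$pod ) {
    my $text = $element->content;
    chomp $text;
    print ref($element), ": $text\n";
}
```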
However, there are going to be limits to this process. Because PPI cannot adapt to changing grammars, any code written using source filters should not be assumed to be parsable.
At one extreme, this includes anything munged by Acme::Bleach, as well as (arguably) more common cases like Switch. We do not pretend to be able to always parse code using these modules, although as long as it still follows a format that looks like Perl syntax, it may be possible to extend the lexer to handle them.
The ability to extend PPI to handle lexical additions to the language is on the drawing board, to be done some time post-1.0.
The goal for success was originally to be able to successfully parse 99% of all Perl documents contained in CPAN. This means the entire file in each case.
PPI has succeeded in this goal far beyond the expectations of even the author. At time of writing there are only 28 non-Acme Perl modules in CPAN that PPI is incapable of parsing. Most of these are so badly broken they do not compile as Perl code anyway.
So unless you are actively going out of your way to break PPI, you should expect that it will handle your code just fine.
PPI provides partial support for internationalisation and localisation.
Specifically, it allows characters from the Latin-1 character set to be used in quotes, comments, and POD. Primarily, this covers languages from Europe and South America.
If you need Unicode support, and would like to help stress test the Unicode support so we can move it to the main branch and enable it in the main release, you should contact the author (contact details below).
When PPI parses a file it builds everything into the model, including whitespace. This is needed in order to make the Document fully Round Trip safe.
The general concept behind a Round Trip parser is that it knows what it is parsing is somewhat uncertain, and so expects to get things wrong from time to time. In the cases where it parses code wrongly, the tree will still serialize back out to the same string of code that was read in, repairing the parser's mistake as it heads back out to the file.
The end result is that if you parse in a file and serialize it back out without changing the tree, you are guaranteed to get the same file you started with. PPI does this correctly and reliably for 100% of all known cases.
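A minimal sketch of that guarantee, assuming the source uses the platform's native newlines (here `\n`): parse a string, serialize the untouched tree, and compare the result with the input.

```perl
use strict;
use warnings;
use PPI;

# Sample source with Unix newlines, invented for this example.
my $source = "#!/usr/bin/perl\n\nprint( \"Hello World!\" );\n\nexit();\n";

my $doc = PPI::Document->new( \$source )
    or die "Could not parse the sample source";

# Serialize the unmodified tree back into a string.
my $output = $doc->serialize;

print $output eq $source
    ? "round trip preserved the source\n"
    : "round trip changed the source\n";
```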
The one minor exception at this time is that if the newlines for your file are wrong (meaning not matching the platform newline format), PPI will localise them for you. (This isn't done for convenience; supporting arbitrary newlines would make some of the code more complicated.)
Better control of the newline type is on the wish list though, and anyone wanting to help out is encouraged to contact the author.
PPI is built upon two primary parsing components, PPI::Tokenizer and PPI::Lexer, and a large tree of about 50 classes which implement the Perl Document Object Model (PDOM).
The PDOM is conceptually similar in style and intent to the regular DOM or other code Abstract Syntax Trees (ASTs), but contains some differences to handle perl-specific cases, and to assist in treating the code as a document. Please note that it is not an implementation of the official Document Object Model specification, only somewhat similar to it.
On top of the Tokenizer, Lexer and the classes of the PDOM, sit a number of classes intended to make life a little easier when dealing with PDOM trees.
Both the major parsing components were hand-coded from scratch with only plain Perl code and a few small utility modules. There are no grammar or patterns mini-languages, no YACC or LEX style tools and only a small number of regular expressions.
This is primarily because of the sheer volume of accumulated cruft that exists in Perl. Not even perl itself is capable of parsing Perl documents (remember, it just parses and executes it as code).
As a result, PPI needed to be cruftier than perl itself. Feel free to shudder at this point, and hope you never have to understand the Tokenizer codebase. Speaking of which...
The Tokenizer takes source code and converts it into a series of tokens. It does this using a slow but thorough character by character manual process, rather than using a pattern system or complex regexes.
Or at least it does so conceptually. If you were to actually trace the code you would find it's not truly character by character, due to a number of regexps and optimisations throughout the code. This lets the Tokenizer skip ahead when it can find shortcuts, so it tends to jump around a line a bit wildly at times.
In practice, the number of times the Tokenizer will actually move the character cursor itself is only about 5% - 10% higher than the number of tokens contained in the file. This makes it about as optimal as it can be made without implementing it in something other than Perl.
In 2001 when PPI was started, this structure made PPI quite slow, and not really suitable for interactive tasks. This situation has improved greatly with multi-gigahertz processors, but can still be painful when working with very large files.
The target parsing rate for PPI is about 5000 lines per gigacycle. It is currently believed to be at about 1500, and the main avenue for reaching the target speed has now become PPI::XS, a drop-in XS accelerator for PPI.
Since PPI::XS has only just gotten off the ground and is currently only at proof-of-concept stage, this may take a little while. Anyone interested in helping out with PPI::XS is highly encouraged to contact the author. In fact, the design of PPI::XS means it's possible to port one function at a time safely and reliably. So every little bit will help.
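To get a feel for what the Tokenizer produces, here is a short sketch that feeds it a single line of source and prints the resulting token stream. The sample line is invented; `get_token` is called until it returns a false value, which it does both at the end of the input and on error.

```perl
use strict;
use warnings;
use PPI;
use PPI::Tokenizer ();

# One line of sample source, invented for this example.
my $source = 'my $answer = 42;' . "\n";

my $tokenizer = PPI::Tokenizer->new( \$source )
    or die "Could not create a tokenizer";

# Pull tokens off the stream one at a time.
my @tokens;
while ( my $token = $tokenizer->get_token ) {
    push @tokens, $token;
}

# Print each token's class and content, with newlines made visible.
for my $token (@tokens) {
    ( my $text = $token->content ) =~ s/\n/\\n/g;
    printf "%-28s '%s'\n", ref($token), $text;
}
```

Note that whitespace comes back as tokens too, which is what allows the stream to be joined back into the original text.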
The Lexer takes a token stream, and converts it to a lexical tree. Because we are parsing Perl documents, this includes whitespace, comments, and all manner of weird things that have no relevance when code is actually executed.
An instantiated PPI::Lexer consumes PPI::Tokenizer objects and produces PPI::Document objects. However you should probably never be working with the Lexer directly. You should just be able to create PPI::Document objects and work with them directly.
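The two routes can be compared side by side. This sketch, using an invented one-line source, builds a document directly with `PPI::Document->new` and again through an explicit `PPI::Lexer` and its `lex_source` method; it is only meant to show that the direct route is the simpler one.

```perl
use strict;
use warnings;
use PPI;
use PPI::Lexer ();

# Sample source, invented for this example.
my $source = "print 1;\n";

# The usual route: let PPI::Document drive the Tokenizer and Lexer.
my $direct = PPI::Document->new( \$source )
    or die "Could not parse via PPI::Document";

# The manual route: create a Lexer and hand it the source string.
my $lexer     = PPI::Lexer->new;
my $via_lexer = $lexer->lex_source($source)
    or die "Could not parse via PPI::Lexer";

print "Both routes produced a ", ref($via_lexer), "\n"
    if ref($direct) eq ref($via_lexer);
```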
The PDOM is a structured collection of data classes that together provide a correct and scalable model for documents that follow the standard Perl syntax.
The following lists all of the 67 current PDOM classes, listing with indentation based on inheritance.
    PPI::Element
       PPI::Node
          PPI::Document
             PPI::Document::Fragment
          PPI::Statement
             PPI::Statement::Package
             PPI::Statement::Include
             PPI::Statement::Sub
                PPI::Statement::Scheduled
             PPI::Statement::Compound
             PPI::Statement::Break
             PPI::Statement::Given
             PPI::Statement::When
             PPI::Statement::Data
             PPI::Statement::End
             PPI::Statement::Expression
                PPI::Statement::Variable
             PPI::Statement::Null
             PPI::Statement::UnmatchedBrace
             PPI::Statement::Unknown
          PPI::Structure
             PPI::Structure::Block
             PPI::Structure::Subscript
             PPI::Structure::Constructor
             PPI::Structure::Condition
             PPI::Structure::List
             PPI::Structure::For
             PPI::Structure::Given
             PPI::Structure::When
             PPI::Structure::Unknown
       PPI::Token
          PPI::Token::Whitespace
          PPI::Token::Comment
          PPI::Token::Pod
          PPI::Token::Number
             PPI::Token::Number::Binary
             PPI::Token::Number::Octal
             PPI::Token::Number::Hex
             PPI::Token::Number::Float
                PPI::Token::Number::Exp
             PPI::Token::Number::Version
          PPI::Token::Word
          PPI::Token::DashedWord
          PPI::Token::Symbol
             PPI::Token::Magic
          PPI::Token::ArrayIndex
          PPI::Token::Operator
          PPI::Token::Quote
             PPI::Token::Quote::Single
             PPI::Token::Quote::Double
             PPI::Token::Quote::Literal
             PPI::Token::Quote::Interpolate
          PPI::Token::QuoteLike
             PPI::Token::QuoteLike::Backtick
             PPI::Token::QuoteLike::Command
             PPI::Token::QuoteLike::Regexp
             PPI::Token::QuoteLike::Words
             PPI::Token::QuoteLike::Readline
          PPI::Token::Regexp
             PPI::Token::Regexp::Match
             PPI::Token::Regexp::Substitute
             PPI::Token::Regexp::Transliterate
          PPI::Token::HereDoc
          PPI::Token::Cast
          PPI::Token::Structure
          PPI::Token::Label
          PPI::Token::Separator
          PPI::Token::Data
          PPI::Token::End
          PPI::Token::Prototype
          PPI::Token::Attribute
          PPI::Token::Unknown
To summarize the above layout, all PDOM objects inherit from the PPI::Element class.
At the top of all complete PDOM trees is a PPI::Document object. It represents a complete file of Perl source code as you might find it on disk.
A PPI::Statement is any series of Tokens and Structures that are treated as a single contiguous statement by perl itself. You should note that a Statement is as close as PPI can get to parsing the code in the sense that perl itself parses Perl code when it is building the op-tree.
Because of the isolation and Perl's syntax, it is provably impossible for PPI to accurately determine precedence of operators or which tokens are implicit arguments to a sub call.
So rather than lead you on with a bad guess that has a strong chance of being wrong, PPI does not attempt to determine precedence or sub parameters at all.
At a fundamental level, it only knows that this series of elements represents a single Statement as perl sees it, but it can do so with enough certainty that it can be trusted.
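The practical consequence is that a simple statement comes back as a flat run of tokens rather than a nested call tree. The sketch below, using an invented statement, lists the significant children of the first statement in a document; nothing in the tree records which of the tokens are arguments to which call.

```perl
use strict;
use warnings;
use PPI;

# A statement whose argument grouping only perl itself could resolve.
my $source = 'print foo 1, 2;' . "\n";

my $doc = PPI::Document->new( \$source )
    or die "Could not parse the sample source";

my $statement = $doc->find_first('PPI::Statement')
    or die "No statement found";

# schildren() returns only the significant children,
# skipping whitespace and comments.
my @parts = $statement->schildren;

printf "%-24s %s\n", ref($_), $_->content for @parts;
```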
However, for specific Statement types the PDOM is able to derive additional useful information about their meaning. For the best, most useful, and most heavily used example, see PPI::Statement::Include.
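As a rough sketch of what that looks like in practice, the following lists the modules loaded by an invented piece of source, using the `type`, `module` and `pragma` accessors of PPI::Statement::Include.

```perl
use strict;
use warnings;
use PPI;

# Sample source, invented for this example.
my $source = join "\n",
    'use strict;',
    'use List::Util qw(sum);',
    'require Carp;',
    'no warnings;',
    '';

my $doc = PPI::Document->new( \$source )
    or die "Could not parse the sample source";

# find() returns a false value when nothing matches.
my $includes = $doc->find('PPI::Statement::Include') || [];

for my $include (@$includes) {
    # type() is 'use', 'no' or 'require'; pragma() is empty for
    # ordinary modules.
    my $kind = $include->pragma ? 'pragma' : 'module';
    printf "%-8s %-7s %s\n", $include->type, $kind, $include->module;
}
```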
A PPI::Structure is any series of tokens contained within matching braces. This includes code blocks, conditions, function argument braces, anonymous array and hash constructors, lists, scoping braces and all other syntactic structures represented by a matching pair of braces, including (although it may not seem obvious at first) <READLINE> braces.
Each Structure contains none, one, or many Tokens and Structures (the rules for which vary for the different Structure subclasses).
Under the PDOM structure rules, a Statement can never directly contain another child Statement, a Structure can never directly contain another child Structure, and a Document can never contain another Document anywhere in the tree.
Aside from these three rules, the PDOM tree is extremely flexible.
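These rules can be checked against a real tree. The sketch below walks every Statement and Structure in an invented document and looks for a direct child of the same kind; `has_direct_child` is a helper written for this example, not part of PPI.

```perl
use strict;
use warnings;
use PPI;

# Sample source with nested brackets and blocks, invented for this example.
my $source = 'my @list = ( [ 1, 2 ], { a => 1 } ); if ( $list[0] ) { print "ok" }' . "\n";

my $doc = PPI::Document->new( \$source )
    or die "Could not parse the sample source";

# Hypothetical helper: counts direct children of $node that are of $class.
sub has_direct_child {
    my ( $node, $class ) = @_;
    return scalar grep { $_->isa($class) } $node->children;
}

my $statements = $doc->find('PPI::Statement') || [];
my $structures = $doc->find('PPI::Structure') || [];
my $documents  = $doc->find('PPI::Document')  || [];

my @bad_statements = grep { has_direct_child( $_, 'PPI::Statement' ) } @$statements;
my @bad_structures = grep { has_direct_child( $_, 'PPI::Structure' ) } @$structures;

printf "statements checked: %d, violations: %d\n",
    scalar(@$statements), scalar(@bad_statements);
printf "structures checked: %d, violations: %d\n",
    scalar(@$structures), scalar(@bad_structures);
printf "nested documents:   %d\n", scalar(@$documents);
```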
To demonstrate the PDOM in use, let's start with an example showing how the tree might look for the following chunk of simple Perl code.
    #!/usr/bin/perl

    print( "Hello World!" );

    exit();
Translated into a PDOM tree it would have the following structure (as shown via the included PPI::Dumper).
    PPI::Document
      PPI::Token::Comment                '#!/usr/bin/perl\n'
      PPI::Token::Whitespace             '\n'
      PPI::Statement
        PPI::Token::Word                 'print'
        PPI::Structure::List             ( ... )
          PPI::Token::Whitespace         ' '
          PPI::Statement::Expression
            PPI::Token::Quote::Double    '"Hello World!"'
          PPI::Token::Whitespace         ' '
        PPI::Token::Structure            ';'
      PPI::Token::Whitespace             '\n'
      PPI::Token::Whitespace             '\n'
      PPI::Statement
        PPI::Token::Word                 'exit'
        PPI::Structure::List             ( ... )
        PPI::Token::Structure            ';'
      PPI::Token::Whitespace             '\n'
The PPI::Dumper module can be used to generate similar trees yourself.
We can make that PDOM dump a little easier to read if we strip out all the whitespace. Here it is again, sans the distracting whitespace tokens.
    PPI::Document
      PPI::Token::Comment                '#!/usr/bin/perl\n'
      PPI::Statement
        PPI::Token::Word                 'print'
        PPI::Structure::List             ( ... )
          PPI::Statement::Expression
            PPI::Token::Quote::Double    '"Hello World!"'
        PPI::Token::Structure            ';'
      PPI::Statement
        PPI::Token::Word                 'exit'
        PPI::Structure::List             ( ... )
        PPI::Token::Structure            ';'
As you can see, the tree can get fairly deep at times, especially when every isolated token in a bracket becomes its own statement. This is needed to allow anything inside the tree the ability to grow. It also makes the search and analysis algorithms much more flexible.
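Dumps like the two above can be regenerated with PPI::Dumper. This sketch parses the same sample program, prints the full dump, and then builds a second dump with the `whitespace` option switched off to hide the whitespace tokens.

```perl
use strict;
use warnings;
use PPI;
use PPI::Dumper ();

# The sample program from the demonstration above.
my $source = "#!/usr/bin/perl\n\nprint( \"Hello World!\" );\n\nexit();\n";

my $doc = PPI::Document->new( \$source )
    or die "Could not parse the sample source";

# Full dump, including every whitespace token.
my $full = PPI::Dumper->new($doc);
$full->print;

# Compact dump: the same tree with whitespace tokens left out.
my $compact = PPI::Dumper->new( $doc, whitespace => 0 );
my $text    = $compact->string;
print "\n", $text;
```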
Because of the depth and complexity of PDOM trees, a vast number of very easy to use methods have been added wherever possible to help people working with PDOM trees do normal tasks relatively quickly and efficiently.
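A few of those methods in action, as a sketch over an invented module: `find` with a class name, `find_first`, and `find` with a code reference. When `find` is given a code reference, the element under test is passed as the second argument (`$_[1]`).

```perl
use strict;
use warnings;
use PPI;

# Sample module source, invented for this example.
my $source = join "\n",
    'package My::Module;',
    'sub alpha { return 1 }',
    'sub beta { return alpha() + 1 }',
    '1;',
    '';

my $doc = PPI::Document->new( \$source )
    or die "Could not parse the sample source";

# All named subroutines, in document order.
my $subs  = $doc->find('PPI::Statement::Sub') || [];
my @names = map { $_->name } @$subs;

# The first package statement and the namespace it declares.
my $package   = $doc->find_first('PPI::Statement::Package');
my $namespace = $package ? $package->namespace : '';

# Every bareword "alpha", whether in a declaration or a call.
my $mentions = $doc->find(
    sub { $_[1]->isa('PPI::Token::Word') and $_[1]->content eq 'alpha' }
) || [];

print "package:  $namespace\n";
print "subs:     @names\n";
print "mentions: ", scalar(@$mentions), "\n";
```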
The main PPI classes, and links to their own documentation, are listed here in alphabetical order.
PPI::Document
    The Document object, the root of the PDOM.
PPI::Document::Fragment
    A cohesive fragment of a larger Document. Although not of any real current use, it is needed for use in certain internal tree manipulation algorithms.
    For example, doing things like cut/copy/paste etc. Very similar to a PPI::Document, but has some additional methods and does not represent a lexical scope boundary.
    A document fragment is also non-serializable, and so cannot be written out to a file.
PPI::Dumper
    A simple class for dumping readable debugging versions of PDOM structures, such as in the demonstration above.
PPI::Element
    The Element class is the abstract base class for all objects within the PDOM.
PPI::Find
    Implements an instantiable object form of a PDOM tree search.
PPI::Lexer
    The PPI Lexer. Converts Token streams into PDOM trees.
PPI::Node
    The Node object, the abstract base class for all PDOM objects that can contain other Elements, such as the Document, Statement and Structure objects.
PPI::Statement
    The base class for all Perl statements. Generic evaluate-for-side-effects statements are of this actual type. Other more interesting statement types belong to one of its children.
    See its own documentation for a longer description and list of all of the different statement types and sub-classes.
PPI::Structure
    The abstract base class for all structures. A Structure is a language construct consisting of matching braces containing a set of other elements.
    See the PPI::Structure documentation for a description and list of all of the different structure types and sub-classes.
PPI::Token
    A token is the basic unit of content. At its most basic, a Token is just a string tagged with metadata (its class, and some additional flags in some cases).
PPI::Token::_QuoteEngine
    The PPI::Token::Quote and PPI::Token::QuoteLike classes provide abstract base classes for the many and varied types of quote and quote-like things in Perl. However, much of the actual quote logic is implemented in a separate quote engine, based at PPI::Token::_QuoteEngine.
PPI::Tokenizer
    The PPI Tokenizer. One Tokenizer consumes a chunk of text and provides access to a stream of PPI::Token objects.
    The Tokenizer is very, very complicated, to the point where even the author treads carefully when working with it.
    Most of the complication is the result of optimizations which have tripled the tokenization speed, at the expense of maintainability. We cope with the spaghetti by heavily commenting everything.
PPI::Transform
    The Perl Document Transformation API. Provides a standard interface and abstract base class for objects and classes that manipulate Documents.
The core PPI distribution is pure Perl and has been kept as tight as possible and with as few dependencies as possible.
It should download and install normally on any platform from within the CPAN and CPANPLUS applications, or directly using the distribution tarball. If installing by hand, you may need to install a few small utility modules first. The exact ones will depend on your version of perl.
There are no special install instructions for PPI, and the normal Perl Makefile.PL, make, make test, make install instructions apply.
The PPI namespace itself is reserved for the sole use of the modules under the umbrella of the Parse::Perl SourceForge project.
You are recommended to use the PPIx:: namespace for PPI-specific modifications or prototypes thereof, or Perl:: for modules which provide general Perl language-related functions.
If what you wish to implement looks like it fits into PPIx:: namespace, you should consider contacting the Parse::Perl mailing list (detailed on the SourceForge site) first, as what you want may already be in progress, or you may wish to consider joining the team and doing it within the Parse::Perl project itself.
- Many more analysis and utility methods for PDOM classes
- Creation of a PPI::Tutorial document
- Add many more key functions to PPI::XS
- Complete the full implementation of ->literal (1.200)
- Full understanding of scoping (due 1.300)
This module is stored in an Open Repository at the following address.
Write access to the repository is made available automatically to any published CPAN author, and to most other volunteers on request.
If you are able to submit your bug report in the form of new (failing) unit tests, or can apply your fix directly instead of submitting a patch, you are strongly encouraged to do so, as the author currently maintains over 100 modules and it can take some time to deal with non-critical bug reports or patches.
This will also guarantee that your issue will be addressed in the next release of the module.
For large changes though, please consider creating a branch so that they can be properly reviewed and trialed before being applied to the trunk.
If you cannot provide a direct test or fix, or don't have time to do so, then regular bug reports are still accepted and appreciated via the GitHub bug tracker.
For other issues or questions, contact the Parse::Perl project mailing list.
For commercial or media-related enquiries, or to have your SVN commit bit enabled, contact the author.
Adam Kennedy <firstname.lastname@example.org>
A huge thank you to Phase N Australia (<http://phase-n.com/>) for permitting the original open sourcing and release of this distribution from what was originally several thousand hours of commercial work.
Another big thank you to The Perl Foundation (<http://www.perlfoundation.org/>) for funding for the final big refactoring and completion run.
Also, to the various co-maintainers that have contributed both large and small with tests and patches and especially to those rare few who have deep-dived into the guts to (gasp) add a feature.
- Dan Brook : PPIx::XPath, Acme::PerlML
- Audrey Tang : "Line Noise" Testing
- Arjen Laarhoven : Three-element ->location support
- Elliot Shank : Perl 5.10 support, five-element ->location
And finally, thanks to those brave ( and foolish :) ) souls willing to dive in and use, test drive and provide feedback on PPI before version 1.000, in some cases before it made it to beta quality, and still did extremely distasteful things (like eating 50 meg of RAM a second).
I owe you all a beer. Corner me somewhere and collect at your convenience. If I missed someone who wasn't in my email history, thank you too :)
# In approximate order of appearance
- Claes Jacobsson
- Michael Schwern
- Jeff T. Parsons
- CPAN Author "CHOCOLATEBOY"
- Robert Rotherberg
- CPAN Author "PODMASTER"
- Richard Soderberg
- Nadim ibn Hamouda el Khemir
- Graciliano M. P.
- Leon Brocard
- Jody Belka
- Curtis Ovid
- Yuval Kogman
- Michael Schilli
- Slaven Rezic
- Lars Thegler
- Tony Stubblebine
- Tatsuhiko Miyagawa
- CPAN Author "CHROMATIC"
- Matisse Enzer
- Roy Fulbright
- Dan Brook
- Johnny Lee
- Johan Lindstrom
And to single one person out, thanks go to Randal Schwartz, who spent a great number of hours in IRC over a critical 6 month period explaining why Perl is impossibly unparsable and constantly shoving evil and ugly corner cases in my face. He remained a tireless devil's advocate, and without his support this project genuinely could never have been completed.
So for my schooling in the Deep Magiks, you have my deepest gratitude Randal.
Copyright 2001 - 2011 Adam Kennedy.
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
The full text of the license can be found in the LICENSE file included with this module.
|perl v5.20.3||PPI (3)||2014-11-11|