PPI (3)
NAME
PPI - Parse, Analyze and Manipulate Perl (without perl)
SYNOPSIS
    use PPI;

    # Create a new empty document
    my $Document = PPI::Document->new;

    # Create a document from source
    $Document = PPI::Document->new(\'print "Hello World!\n"');

    # Load a Document from a file
    $Document = PPI::Document->new('Module.pm');

    # Does it contain any POD?
    if ( $Document->find_any('PPI::Token::Pod') ) {
        print "Module contains POD\n";
    }

    # Get the name of the main package
    $pkg = $Document->find_first('PPI::Statement::Package')->namespace;

    # Remove all that nasty documentation
    $Document->prune('PPI::Token::Pod');
    $Document->prune('PPI::Token::Comment');

    # Save the file
    $Document->save('Module.pm.stripped');
DESCRIPTION
About this Document
This is the
Background
The ability to read, and manipulate Perl (the language) programmatically other than with perl (the application) was one that caused difficulty for a long time.
The cause of this problem was Perl's complex and dynamic grammar. Although there is typically not a huge diversity in the grammar of most Perl code, certain issues cause large problems when it comes to parsing.
Indeed, quite early in Perl's history Tom Christiansen introduced the Perl community to the quote ``Nothing but perl can parse Perl'', or as it is more often stated now as a truism:
``Only perl can parse Perl''
One example of the sort of thing that prevents Perl from being easily parsed is function signatures, as demonstrated by the following.
    @result = (dothis $foo, $bar);

    # Which of the following is it equivalent to?
    @result = (dothis($foo), $bar);
    @result = dothis($foo, $bar);
The first line above can be interpreted in two different ways, depending on whether the &dothis function is expecting one argument, or two, or several.
A ``code parser'' (something that parses for the purpose of execution) such as perl needs information that is not found in the immediate vicinity of the statement being parsed.
The information might not just be elsewhere in the file, it might not even be in the same file at all. It might also not be able to determine this information without the prior execution of a "BEGIN {}" block, or the loading and execution of one or more external modules. Or worse the &dothis function may not even have been written yet.
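The ambiguity described above can be observed by running perl itself. The following is a minimal sketch, not taken from the original text: dothis_one and dothis_list are hypothetical functions introduced purely for illustration. It assumes only standard Perl prototype behaviour, where a sub declared with a ($) prototype is parsed like a named unary operator.

```perl
use strict;
use warnings;

# Hypothetical one-argument function: the ($) prototype makes perl parse
# "dothis_one $foo, $bar" as "dothis_one($foo), $bar".
sub dothis_one ($) { return "one($_[0])" }

# Hypothetical list function: with no prototype the call swallows every
# following argument, so "dothis_list $foo, $bar" is "dothis_list($foo, $bar)".
sub dothis_list { return 'list(' . join( ',', @_ ) . ')' }

my ( $foo, $bar ) = ( 'a', 'b' );

# Textually identical call shapes, parsed differently by perl
my @first  = ( dothis_one $foo, $bar );
my @second = ( dothis_list $foo, $bar );

print scalar(@first),  " element(s): @first\n";
print scalar(@second), " element(s): @second\n";
```

The two calls differ only in the function name, yet the first yields a two-element list and the second a one-element list. A parser that cannot see (or execute) the function declarations has no way to tell the two cases apart.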
When parsing Perl as code, you must also execute it
Even perl itself never really fully understands the structure of the source code after and indeed as it processes it, and in that sense doesn't ``parse'' Perl source into anything remotely like a structured document. This makes it of no real use for any task that needs to treat the source code as a document, and do so reliably and robustly.
For more information on why it is impossible to parse Perl, see Randal Schwartz's seminal response to the question of ``Why can't you parse Perl''.
<www.perlmonks.org/index.pl?node_id=44722>
The purpose of
Historically, using an embedded perl parser was widely considered to be the most likely avenue for finding a solution to parsing Perl. It has been investigated from time to time, but attempts have generally failed or suffered from sufficiently bad corner cases that they were abandoned.
What Does PPI Stand For?
"PPI" is an acronym for the longer original module name
"Parse::Perl::Isolated". And in the spirit of the silly acronym games
played by certain unnamed Open Source projects you may have hurd of,
it is also a reverse backronym of ``I Parse Perl''.
Of course, I could just be lying and have just made that second bit up 10 minutes before the release of
Why don't you just think of it as the Perl Parsing Interface for simplicity.
The original name was shortened to prevent the author (and you the users) from contracting
In acknowledgment that someone may some day come up with a valid solution for the grammar problem it was decided at the commencement of the project to leave the "Parse::Perl" namespace free for any such effort.
Since that time I've been able to prove to my own satisfaction that it is truly impossible to accurately parse Perl as both code and document at once. For the academics, parsing Perl suffers from the ``Halting Problem''.
Why Parse Perl?
Once you can accept that we will never be able to parse Perl well enough to meet the standards of things that treat Perl as code, it is worth re-examining "why" we want to ``parse'' Perl at all.
What are the things that people might want a ``Perl parser'' for?
- Documentation
    Analyzing the contents of a Perl document to automatically generate documentation, in parallel to, or as a replacement for, POD documentation.
    Allow an indexer to locate and process all the comments and documentation from code for ``full text search'' applications.
- Structural and Quality Analysis
    Determine quality or other metrics across a body of code, and identify situations relating to particular phrases, techniques or locations.
    Index functions, variables and packages within Perl code, and doing search and graph (in the node/edge sense) analysis of large code bases.
    Perl::Critic, based on PPI, is a large, thriving tool for bug detection and style analysis of Perl code.
- Refactoring
    Make structural, syntax, or other changes to code in an automated manner, either independently or in assistance to an editor. This sort of task list includes backporting, forward porting, partial evaluation, ``improving'' code, or whatever. All the sort of things you'd want from a Perl::Editor.
- Layout
    Change the layout of code without changing its meaning. This includes techniques such as tidying (like perltidy), obfuscation, compressing and ``squishing'', or to implement formatting preferences or policies.
- Presentation
    This includes methods of improving the presentation of code, without changing the content of the code. Modify, improve, syntax colour etc the presentation of a Perl document. Generating ``IntelliText''-like functions.
If we treat this as a baseline for the sort of things we are going to have to build on top of Perl, then it becomes possible to identify a standard for how good a Perl parser needs to be.
How good is Good Enough(TM)
However, there are going to be limits to this process. Because
At one extreme, this includes anything munged by Acme::Bleach, as well as (arguably) more common cases like Switch. We do not pretend to be able to always parse code using these modules, although as long as it still follows a format that looks like Perl syntax, it may be possible to extend the lexer to handle them.
The ability to extend
The goal for success was originally to be able to successfully parse 99% of all Perl documents contained in
So unless you are actively going out of your way to break
Internationalisation
Specifically, it allows characters from the Latin-1 character set to be used in quotes, comments, and
Round Trip Safe
When
The general concept behind a ``Round Trip'' parser is that it knows what it is parsing is somewhat uncertain, and so expects to get things wrong from time to time. In the cases where it parses code wrongly the tree will serialize back out to the same string of code that was read in, repairing the parser's mistake as it heads back out to the file.
The end result is that if you parse in a file and serialize it back out without changing the tree, you are guaranteed to get the same file you started with.
What goes in, will come out. Every time.
The one minor exception at this time is that if the newlines for your file are wrong (meaning not matching the platform newline format),
Better control of the newline type is on the wish list though, and anyone wanting to help out is encouraged to contact the author.
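The round-trip property can be checked with a few lines of code. This is a minimal sketch rather than an official test: the sample source is invented for illustration, and it assumes the file uses the platform's native newlines (see the exception noted above). It relies on the PPI::Document constructor shown in the SYNOPSIS and on the document's serialize method.

```perl
use strict;
use warnings;
use PPI;

# Arbitrary sample source, including odd spacing and a comment that a
# code-only parser would throw away.
my $source = <<'END_PERL';
#!/usr/bin/perl

print(   "Hello World!"   );   # keep me

exit();
END_PERL

# Parse the source into a PDOM tree (note the reference to the string)
my $Document = PPI::Document->new( \$source )
    or die "Failed to parse source";

# Serialize the unchanged tree straight back to a string
my $output = $Document->serialize;

print $output eq $source
    ? "Round trip OK\n"
    : "Round trip changed the source\n";
```

Because whitespace and comments are kept as tokens in the tree, nothing needs to be regenerated on the way out; the serializer simply concatenates what was read in.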
IMPLEMENTATION
General Layout
The
On top of the Tokenizer, Lexer and the classes of the
Both the major parsing components were hand-coded from scratch with only plain Perl code and a few small utility modules. There are no grammar or patterns mini-languages, no
This is primarily because of the sheer volume of accumulated cruft that exists in Perl. Not even perl itself is capable of parsing Perl documents (remember, it just parses and executes it as code).
As a result,
The Tokenizer
The Tokenizer takes source code and converts it into a series of tokens. It does this using a slow but thorough character by character manual process, rather than using a pattern system or complex regexes.
Or at least it does so conceptually. If you were to actually trace the code you would find it's not truly character by character due to a number of regexps and optimisations throughout the code. This lets the Tokenizer ``skip ahead'' when it can find shortcuts, so it tends to jump around a line a bit wildly at times.
In practice, the number of times the Tokenizer will actually move the character cursor itself is only about 5% - 10% higher than the number of tokens contained in the file. This makes it about as optimal as it can be made without implementing it in something other than Perl.
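To see the token stream the Tokenizer produces, it can be driven by hand. The following is a small sketch under stated assumptions: the one-line sample source is made up for illustration, and the code uses the PPI::Tokenizer constructor (which takes a reference to the source string) and its get_token method, which returns false once the input is exhausted.

```perl
use strict;
use warnings;
use PPI;

# Hypothetical one-line sample source
my $source = "my \$x = 1;\n";

my $Tokenizer = PPI::Tokenizer->new( \$source )
    or die "Failed to create tokenizer";

# Pull tokens one at a time until get_token returns a false value
my @tokens;
while ( my $Token = $Tokenizer->get_token ) {
    push @tokens, $Token;
}

# Show each token's class and content, with newlines made visible
for my $Token (@tokens) {
    ( my $content = $Token->content ) =~ s/\n/\\n/g;
    printf "%-24s '%s'\n", ref $Token, $content;
}
```

Note that whitespace comes back as tokens in its own right, which is what lets the later stages rebuild the document exactly.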
In 2001 when
The target parsing rate for
Since
The Lexer
The Lexer takes a token stream, and converts it to a lexical tree. Because we are parsing Perl documents this includes whitespace, comments, and all number of weird things that have no relevance when code is actually executed.
An instantiated PPI::Lexer consumes PPI::Tokenizer objects and produces PPI::Document objects. However you should probably never be working with the Lexer directly. You should just be able to create PPI::Document objects and work with them directly.
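For the curious, the relationship between the two entry points can be sketched as follows. This is illustrative only, and the sample source is hypothetical. It assumes the PPI::Lexer lex_source method, which takes a plain string of source code, and compares its result with the recommended PPI::Document constructor.

```perl
use strict;
use warnings;
use PPI;

# Hypothetical sample source
my $source = "print 'Hello World!';\n";

# The long way round: drive the Lexer explicitly
my $Lexer     = PPI::Lexer->new;
my $via_lexer = $Lexer->lex_source($source)
    or die "Lexer failed to produce a document";

# The recommended way: let PPI::Document do the work
my $via_document = PPI::Document->new( \$source )
    or die "Failed to parse source";

print ref($via_lexer), "\n";
print $via_lexer->serialize eq $via_document->serialize
    ? "Both routes give the same document content\n"
    : "The two routes differ\n";
```

In everyday use the second form is all that is needed; the first is shown only to make the division of labour visible.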
The Perl Document Object Model
The
The PDOM Class Tree
The following lists all of the 67 current
    PPI::Element
       PPI::Node
          PPI::Document
             PPI::Document::Fragment
          PPI::Statement
             PPI::Statement::Package
             PPI::Statement::Include
             PPI::Statement::Sub
                PPI::Statement::Scheduled
             PPI::Statement::Compound
             PPI::Statement::Break
             PPI::Statement::Given
             PPI::Statement::When
             PPI::Statement::Data
             PPI::Statement::End
             PPI::Statement::Expression
                PPI::Statement::Variable
             PPI::Statement::Null
             PPI::Statement::UnmatchedBrace
             PPI::Statement::Unknown
          PPI::Structure
             PPI::Structure::Block
             PPI::Structure::Subscript
             PPI::Structure::Constructor
             PPI::Structure::Condition
             PPI::Structure::List
             PPI::Structure::For
             PPI::Structure::Given
             PPI::Structure::When
             PPI::Structure::Unknown
       PPI::Token
          PPI::Token::Whitespace
          PPI::Token::Comment
          PPI::Token::Pod
          PPI::Token::Number
             PPI::Token::Number::Binary
             PPI::Token::Number::Octal
             PPI::Token::Number::Hex
             PPI::Token::Number::Float
                PPI::Token::Number::Exp
             PPI::Token::Number::Version
          PPI::Token::Word
          PPI::Token::DashedWord
          PPI::Token::Symbol
             PPI::Token::Magic
          PPI::Token::ArrayIndex
          PPI::Token::Operator
          PPI::Token::Quote
             PPI::Token::Quote::Single
             PPI::Token::Quote::Double
             PPI::Token::Quote::Literal
             PPI::Token::Quote::Interpolate
          PPI::Token::QuoteLike
             PPI::Token::QuoteLike::Backtick
             PPI::Token::QuoteLike::Command
             PPI::Token::QuoteLike::Regexp
             PPI::Token::QuoteLike::Words
             PPI::Token::QuoteLike::Readline
          PPI::Token::Regexp
             PPI::Token::Regexp::Match
             PPI::Token::Regexp::Substitute
             PPI::Token::Regexp::Transliterate
          PPI::Token::HereDoc
          PPI::Token::Cast
          PPI::Token::Structure
          PPI::Token::Label
          PPI::Token::Separator
          PPI::Token::Data
          PPI::Token::End
          PPI::Token::Prototype
          PPI::Token::Attribute
          PPI::Token::Unknown
To summarize the above layout, all
Under this are PPI::Token, strings of content with a known type, and PPI::Node, syntactically significant containers that hold other Elements.
The three most important of these are the PPI::Document, the PPI::Statement and the PPI::Structure classes.
The Document, Statement and Structure
At the top of all complete
There are some specialised types of document, such as PPI::Document::File and PPI::Document::Normalized but for the purposes of the
Each Document will contain a number of Statements, Structures and Tokens.
A PPI::Statement is any series of Tokens and Structures that are treated as a single contiguous statement by perl itself. You should note that a Statement is as close as
Because of the isolation and Perl's syntax, it is provably impossible for
So rather than lead you on with a bad guess that has a strong chance of being wrong,
At a fundamental level, it only knows that this series of elements represents a single Statement as perl sees it, but it can do so with enough certainty that it can be trusted.
However, for specific Statement types the
A PPI::Structure is any series of tokens contained within matching braces. This includes code blocks, conditions, function argument braces, anonymous array and hash constructors, lists, scoping braces and all other syntactic structures represented by a matching pair of braces, including (although it may not seem obvious at first) "<READLINE>" braces.
Each Structure contains none, one, or many Tokens and Structures (the rules for which vary for the different Structure subclasses).
Under the
Aside from these three rules, the
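The relationship between Document, Statement and Structure can be explored interactively. The sketch below is illustrative only and uses a made-up one-line source. It assumes the find_first method shown in the SYNOPSIS, the start and finish methods of PPI::Structure (which return the brace tokens), and the schildren method of PPI::Node (which returns the significant, non-whitespace children).

```perl
use strict;
use warnings;
use PPI;

# Hypothetical one-statement sample source
my $source = qq{print( "Hello World!" );\n};

my $Document = PPI::Document->new( \$source )
    or die "Failed to parse source";

# The first Statement in the Document
my $Statement = $Document->find_first('PPI::Statement')
    or die "No statement found";

# The bracketed argument list inside that Statement
my $List = $Statement->find_first('PPI::Structure::List')
    or die "No list structure found";

print "Statement: ", $Statement->content, "\n";
print "Braces:    ", $List->start->content, " ", $List->finish->content, "\n";

# Significant children of the Structure (whitespace is skipped)
for my $Child ( $List->schildren ) {
    print "Child:     ", ref($Child), " ", $Child->content, "\n";
}
```

The Statement holds the Structure, and the Structure in turn holds further Statements, which is the nesting pattern described in this section.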
The PDOM at Work
To demonstrate the
    #!/usr/bin/perl

    print( "Hello World!" );

    exit();
Translated into a
    PPI::Document
      PPI::Token::Comment                '#!/usr/bin/perl\n'
      PPI::Token::Whitespace             '\n'
      PPI::Statement
        PPI::Token::Word                 'print'
        PPI::Structure::List             ( ... )
          PPI::Token::Whitespace         ' '
          PPI::Statement::Expression
            PPI::Token::Quote::Double    '"Hello World!"'
          PPI::Token::Whitespace         ' '
        PPI::Token::Structure            ';'
      PPI::Token::Whitespace             '\n'
      PPI::Token::Whitespace             '\n'
      PPI::Statement
        PPI::Token::Word                 'exit'
        PPI::Structure::List             ( ... )
        PPI::Token::Structure            ';'
      PPI::Token::Whitespace             '\n'
Please note that in this example, strings are only listed for the actual PPI::Token that contains that string. Structures are listed with the type of brace characters they represent noted.
The PPI::Dumper module can be used to generate similar trees yourself.
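As a rough sketch of how such a dump might be produced, the following parses the same small program and prints its tree. It assumes the PPI::Dumper constructor, its string method, and its whitespace option for hiding whitespace tokens; the exact column layout of the output may differ from the listing above.

```perl
use strict;
use warnings;
use PPI;
use PPI::Dumper;

# The sample program from the text above
my $source = qq{#!/usr/bin/perl\n\nprint( "Hello World!" );\n\nexit();\n};

my $Document = PPI::Document->new( \$source )
    or die "Failed to parse source";

# Full dump, including whitespace tokens
my $full = PPI::Dumper->new($Document)->string;

# Dump with whitespace tokens hidden
my $compact = PPI::Dumper->new( $Document, whitespace => 0 )->string;

print "--- full ---\n",    $full;
print "--- compact ---\n", $compact;
```

Hiding the whitespace tokens changes only what is displayed; the Document itself is left untouched.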
We can make that
    PPI::Document
      PPI::Token::Comment                '#!/usr/bin/perl\n'
      PPI::Statement
        PPI::Token::Word                 'print'
        PPI::Structure::List             ( ... )
          PPI::Statement::Expression
            PPI::Token::Quote::Double    '"Hello World!"'
        PPI::Token::Structure            ';'
      PPI::Statement
        PPI::Token::Word                 'exit'
        PPI::Structure::List             ( ... )
        PPI::Token::Structure            ';'
As you can see, the tree can get fairly deep at times, especially when every isolated token in a bracket becomes its own statement. This is needed to allow anything inside the tree the ability to grow. It also makes the search and analysis algorithms much more flexible.
Because of the depth and complexity of
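Deep trees are normally searched rather than walked by hand. A minimal sketch, using an invented sample source: it assumes the find method of PPI::Node, which accepts a class name, matches subclasses as well, and returns an array reference of matching elements (or a false value when nothing matches).

```perl
use strict;
use warnings;
use PPI;

# Hypothetical sample source with quotes buried a few levels deep
my $source = qq{my %h = ( list => [ 'a', "b" ] );\n};

my $Document = PPI::Document->new( \$source )
    or die "Failed to parse source";

# find returns an array reference, or a false value if nothing matched,
# so fall back to an empty list to keep the loop simple
my $quotes = $Document->find('PPI::Token::Quote') || [];

for my $Quote (@$quotes) {
    print ref($Quote), ": ", $Quote->content, "\n";
}
```

The search does not care how deeply each quote is nested inside Structures and Statements, which is the flexibility the deep tree is meant to buy.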
Overview of the Primary Classes
The main
- PPI::Document
    The Document object, the root of the PDOM.
- PPI::Document::Fragment
    A cohesive fragment of a larger Document. Although not of any real current use, it is needed for use in certain internal tree manipulation algorithms.
    For example, doing things like cut/copy/paste etc. Very similar to a PPI::Document, but has some additional methods and does not represent a lexical scope boundary.
    A document fragment is also non-serializable, and so cannot be written out to a file.
- PPI::Dumper
    A simple class for dumping readable debugging versions of PDOM structures, such as in the demonstration above.
- PPI::Element
    The Element class is the abstract base class for all objects within the PDOM.
- PPI::Find
    Implements an instantiable object form of a PDOM tree search.
- PPI::Lexer
    The PPI Lexer. Converts Token streams into PDOM trees.
- PPI::Node
    The Node object, the abstract base class for all PDOM objects that can contain other Elements, such as the Document, Statement and Structure objects.
- PPI::Statement
    The base class for all Perl statements. Generic ``evaluate for side-effects'' statements are of this actual type. Other more interesting statement types belong to one of its children.
    See its own documentation for a longer description and list of all of the different statement types and sub-classes.
- PPI::Structure
    The abstract base class for all structures. A Structure is a language construct consisting of matching braces containing a set of other elements.
    See the PPI::Structure documentation for a description and list of all of the different structure types and sub-classes.
- PPI::Token
    A token is the basic unit of content. At its most basic, a Token is just a string tagged with metadata (its class, and some additional flags in some cases).
- PPI::Token::_QuoteEngine
    The PPI::Token::Quote and PPI::Token::QuoteLike classes provide abstract base classes for the many and varied types of quote and quote-like things in Perl. However, much of the actual quote logic is implemented in a separate quote engine, based at PPI::Token::_QuoteEngine.
    Classes that inherit from PPI::Token::Quote, PPI::Token::QuoteLike and PPI::Token::Regexp are generally parsed only by the Quote Engine.
- PPI::Tokenizer
    The PPI Tokenizer. One Tokenizer consumes a chunk of text and provides access to a stream of PPI::Token objects.
    The Tokenizer is very very complicated, to the point where even the author treads carefully when working with it.
    Most of the complication is the result of optimizations which have tripled the tokenization speed, at the expense of maintainability. We cope with the spaghetti by heavily commenting everything.
- PPI::Transform
    The Perl Document Transformation API. Provides a standard interface and abstract base class for objects and classes that manipulate Documents.
INSTALLING
The core
It should download and install normally on any platform from within the
There are no special install instructions for
EXTENDING
The
If what you wish to implement looks like it fits into the PPIx:: namespace, you should consider contacting the
TO DO
- Many more analysis and utility methods for
- Creation of a PPI::Tutorial document
- Add many more key functions to
- We can always write more and better unit tests
- Complete the full implementation of ->literal (1.200)
- Full understanding of scoping (due 1.300)
SUPPORT
The most recent version of
<search.cpan.org/~mithaldu/PPI>
Contributions via GitHub pull request are welcome.
Bug fixes in the form of pull requests or bug reports with new (failing) unit tests have the best chance of being addressed by busy maintainers, and are strongly encouraged.
If you cannot provide a test or fix, or don't have time to do so, then regular bug reports are still accepted and appreciated via the GitHub bug tracker.
<github.com/adamkennedy/PPI/issues>
The "ppidump" utility that is part of the Perl::Critic distribution is a useful tool for demonstrating how
For other issues, questions, or commercial or media-related enquiries, contact the author.
AUTHOR
Adam Kennedy <adamk@cpan.org>
ACKNOWLEDGMENTS
A huge thank you to Phase N Australia (<phase-n.com>) for permitting the original open sourcing and release of this distribution from what was originally several thousand hours of commercial work.
Another big thank you to The Perl Foundation (<www.perlfoundation.org>) for funding for the final big refactoring and completion run.
Also, to the various co-maintainers that have contributed both large and small with tests and patches and especially to those rare few who have deep-dived into the guts to (gasp) add a feature.
- Dan Brook : PPIx::XPath, Acme::PerlML
- Audrey Tang : "Line Noise" Testing
- Arjen Laarhoven : Three-element ->location support
- Elliot Shank : Perl 5.10 support, five-element ->location
And finally, thanks to those brave ( and foolish :) ) souls willing to dive in and use, test drive and provide feedback on
I owe you all a beer. Corner me somewhere and collect at your convenience. If I missed someone who wasn't in my email history, thank you too :)
# In approximate order of appearance
- Claes Jacobsson
- Michael Schwern
- Jeff T. Parsons
- CPAN Author "CHOCOLATEBOY"
- Robert Rotherberg
- CPAN Author "PODMASTER"
- Richard Soderberg
- Nadim ibn Hamouda el Khemir
- Graciliano M. P.
- Leon Brocard
- Jody Belka
- Curtis Ovid
- Yuval Kogman
- Michael Schilli
- Slaven Rezic
- Lars Thegler
- Tony Stubblebine
- Tatsuhiko Miyagawa
- CPAN Author "CHROMATIC"
- Matisse Enzer
- Roy Fulbright
- Dan Brook
- Johnny Lee
- Johan Lindstrom
And to single one person out, thanks go to Randal Schwartz who spent a great number of hours in
So for my schooling in the Deep Magiks, you have my deepest gratitude Randal.
COPYRIGHT
Copyright 2001 - 2011 Adam Kennedy.
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
The full text of the license can be found in the