|
|
| |
Const(3) |
User Contributed Perl Documentation |
Const(3) |
Lingua::Treebank::Const - Object modeling constituent from a treebank
use Lingua::Treebank::Const;
my $text = <<EOTREE
(S
(NP-SBJ (DT this) )
(VP (VBZ is)
(NP-PRD (NNP Lisa) ))
(. .) )
TREE
my $utt = Lingua::Treebank::Const->new->from_penn_string($text)
print $utt->as_penn_text(), "\n";;
Results:
(S
(NP-SBJ
(DT this))
(VP
(VBZ is)
(NP-PRD
(NNP Lisa) ))
(. .))
This is configurable (TO DO: document how so).
module defines methods for accessing syntactic constituents; it
identifies its parents and its children, and can write itself out in
a variety of formats (currently Penn treebank style).
Module for describing simple constituents of the Penn Treebank. Recursive
behaviors are implied.
Note assumption that terminal nodes (those with defined
"word" values) will not have
"children", and vice versa. This
assumption is currently unchecked by the code.
For a number of these methods, the jargonish notion of
domination plays a large role, so for those who might not know:
a node "A" dominates
another node "B" if
"B" is a descendant of
"A".
- new
- Constructs a new (uninitialized) token. If starting from text, can be used
together with the "from_penn_string"
initialization method, as below:
my $text = <<EOTREE
(S
(NP-SBJ (DT this) )
(VP (VBZ is)
(NP-PRD (NNP Lisa) ))
(. .) )
TREE
my $utt = Lingua::Treebank::Const->new->from_penn_string($text)
Otherwise, resulting new unit will have no values
("parent",
"children",
"tag" or
"word" set by default.
These methods help to populate the fields of these objects from external data.
- from_penn_string
- given a string of the Penn format, e.g.,
(S
(NP-SBJ (DT this) )
(VP (VBZ is)
(NP-PRD (NNP Lisa) ))
(. .) )
populates the current node with
"tag"
"S" and the
"children" field with new objects
("tag"
"NP",
"tag"
"VP", and
"tag"
"."). This method recurses on
"new" and
"from_penn_string" to do its job.
- tag
- Records the tag of this constituent (0-domination). (This is the part of
the constituent-label before the hyphen -- after the hyphen is the
"annot", not the
"tag").
TO DO: example here.
- annot
- Returns whatever comes after the hyphen in the constituent label.
- word
- If this constituent is terminal, then
"word" should contain the lexical item
that is represented.
- text
- A string containing the word values of the terminal nodes donminated by
this constituent. For example, calling text on a node created from the
Penn text given in the description of the new function returns the string
"this is lisa .".
- parent
- Returns the parent of the current node.
- children
- Returns a reference to an array of
"Lingua::Treebank::Const" objects that
are the children of the current node.
Currently does not check whether
"word" is populated.
These methods ask questions about the dominating ancestors and direct children
of the current node. Think of them as navigating up-and-down the tree.
- is_terminal
- No arguments
Returns whether self is a leaf. Does not check whether
"children" are populated; if
automatically generated from the
"from_penn_string" method then this
will always be correct.
- is_root
- No arguments. Boolean. Returns whether the instance is a root node (has no
parents).
- root
- No arguments.
Returns the root node for the instance in question (might be
itself)
- path_up_to
- Takes an ancestor node as argument.
Returns a list of all the nodes (distal first) between the
instance and the root.
Returns undefined and
"carp"s when the given node is not an
ancestor of the instance.
- is_descendant_of
- Takes presumed ancestor as argument.
Returns whether the ancestor is indeed an ancestor of the
current instance.
- is_ancestor_of
- Takes presumed descendant as argument.
Returns whether current instance is an ancestor of the
presumed descendant.
- is_sibling
- Takes presumed sibling as argument.
Returns whether current instance shares an immediate parent
with the presumed sibling.
- height
- No arguments.
Returns the farthest distance from the current node to
a terminal node.
- depth
- No arguments.
Returns the distance from the instance to the root.
- depth_from
- what's the distance from the current node up to the node given as
argument? (return undefined if the node given is not the ancestor
of the instance node)
These methods ask questions about siblings, and left-right movement in the tree.
Think of them as moving left-and-right around in the tree.
- get_index
- One argument (daughter).
Returns the index of the daughter in the instance's
"children" list. Zero-based, of
course.
- prev_sib
- next_sib
- No arguments. Returns next (or previous) sibling at the same level
(dependent on the same parent), or the empty list if no such leaf
exists.
- prev_leaf
- next_leaf
- No arguments. Returns the leaf just before (or after) any of the leaves of
this node, or the empty list if no such leaf exists.
- left_leaf
- right_leaf
- No arguments. Returns leftmost (rightmost) leaf dominated by the
instance.
- get_all_terminals
- No arguments. Returns left-to-right list of all terminal nodes at or below
the current node.
- find_common_ancestor
- One argument: a presumed cousin.
returns the lowest ancestor the instance and the cousin share
(or undefined if they do not share an ancestor)
- select_ancestors
- select_children
- Both these methods take a subroutine as an argument and return those
[child/ancestor] nodes that return true when the sub is called with the
node as an argument.
The expectation is that the sub will not modify the node.
These methods are ways of exposing and comparing regions of local structure.
- equiv_to
- Tests whether the argument has the same structure (and words) as the
instance. These need not be the same object, just the same tags and words
in the same structure.
- equiv_tags
- equiv_words
- Handy -- and unimplemented -- shorthands for checking certain kinds of
matching structure.
- detach
- Argument is DAUGHTER node.
Removes the DAUGHTER from the
"children" list of the current
instance. DAUGHTER node will still be a valid node, but it will
no longer have a "parent"; it will be
a "root".
Note that "detach" may leave
a degenerate tree: it may have no terminal node (one with words) at the
end of a branch. To avoid this, use
"wither" instead.
- wither
- No arguments.
Detaches self from parent. self will become an independent
root. If the parent has no other children, will recursively call
"parent-"wither>, making a possibly
zero-length list of degenerate roots above it until an ancestor has a
different child than the one in this line of descent.
A A
/ \ C B |
B X / \ X
/ \ => D E |
C Y Y
/ \
D E
Before After
calling C->wither()
- prepend
- append
- Arguments are a LIST of new daughters to be inserted at the
beginning/end of the "children"
list.
- replace
- Arguments are (DAUGHTER, LIST). Replaces DAUGHTER
with the elements of LIST in the
"children" of the current instance.
DAUGHTER is now its own
"root"; see
"detach".
- flatten
- pull up all terminals to be children of the instance, regardless of how
deep they are. Re-attach them to the current node, preserving leaf order.
A->flatten()
/ /
A ==> A__
/ \ /|\ \
X B C F D G
/|\ \
C F D E
\
G
- retract
- pulls in and removes one non-terminal node (which node is specified
by argument), attaching its children directly to the current node,
retaining what surface order the children originally had, e.g.:
A->retract(X)
/ /
A ==> A
/ \ /|\
X B C D B
/ \ \ / \ \
C D E F G E
/ \
F G
- detach_at
- Argument is INDEX. Removes the daughter at INDEX. Will
"carp" if there is no daughter at
INDEX.
The daughter at INDEX remains well-formed, though if
you do not maintain your own pointer to it, it will probably be
collected by the garbage collector.
- insert_at
- Arguments are INDEX, LIST of daughters. LIST
daughters will be inserted beginning at position INDEX of the
current instances "children".
These methods are methods that may (or not) be useful in programming with these
objects. These methods are used internally, but are exposed for the programmer
who might need them.
"stringify" overloading is
certainly helpful in debugging, since the perl debugger representation of
these objects is complicated by their up-reference to parents.
- as_penn_text
- Returns a text string representing this constituent.
To do: document additional parameters to this, and the
possible effects of changing them
- $Lingua::Treebank::Const::CHILD_PROLOG
- $Lingua::Treebank::Const::INDENT_CHAR
- $Lingua::Treebank::Const::CHILD_EPILOG
- stringify
- This is the method called by default when the object handle is used in a
string (see "perldoc overload").
Depending on the value of
$Lingua::Treebank::Const::STRINGIFY (see below),
the string representation of the object varies. The default behavior is
"as_penn_text", above.
Note that like any object-ref, copying its stringification
does NOT work to retain all its behaviors. Nor does an identical string
representation necessarily mean the two objects are the same
object; merely, that they have the same structure. (see
"equiv_to").
- numerify
- This is the mthod called by default when the object handle is used in a
numeric context (usually "==" or
"!=").
Returns an integer representing the unique object. Identity on
this method does indicate identity of the objects.
Rarely used in client code. The numeric inequality operators
are unlikely to have any useful meaning on these objects, though they
should behave consistently (you should get consistent answers given any
two objects, regardless of methods called on those objects).
- walk ( &action, &stop_crit, $state, $bf_traversal )
- An instance method. &action argument is
required, others are optional.
Calls &action (a subroutine ref)
as a method on node and its children, recursively, passing the node
under consideration and the $state value (if
provided).
If &stop_crit is defined, calls it
on each node; when &stop_crit returns true,
children of that node are not pursued.
For both "action" and
"stop_crit" commands, if a string is
passed, it will be called if a method by that name can be found in the
object.
$state is passed into each of the
child method calls. This is convenient for things like pushing
interesting elements onto a list, or updating a counter. It must be a
scalar, but can be a reference.
Passing a true value as $bf_traversal
tells "walk()" to explore the tree
breadth-first rather than depth-first. passing a false (but defined)
value forces depth-first. Undefined values default to the value of
$Lingua::Treebank::Const::BF_TRAVERSAL, which is
"undef" (false) -- and thus
depth-first by default.
# find out how many children each NP has, but don't count anything
# inside an EDITED node
my $action = sub {
my ($self, $state) = @_;
return unless $self->tag() eq 'NP';
# just print it
print scalar @{$self->children}, "\n";
# or store it in the state variable
push @{$state}, scalar @{$self->children()};
};
my $stop_crit = sub {$_[0]->tag() eq 'EDITED'};
$tree->walk( $action, $stop_crit, \@counts );
use List::Util 'sum';
print "there were ", sum (@counts),
" total children of NP nodes\n";
- $Lingua::Treebank::Const::BF_TRAVERSAL
- Defaults to undefined. If true, changes the default behavior of the
walk() method to be breadth-first rather than depth-first.
- $Lingua::Treebank::STRINGIFY
- Changes the default stringification of the objects. Can be set to any of
the following three values:
- as_penn_text
- default value.
e.g.
(S
(NP
(NNP Joe)
)
(VP
(VB likes)
(NP
(NNP Bach)
)
)
(. .)
)
- words
- e.g.
Joe likes Bach .
- preterm_tags
- e.g.,
NNP VB NNP .
check that destroy doesn't leak (undo parent links?)
dump as latex tree
read in other treebank formats (latex trees?)
- 0.01
- Original version; created by h2xs 1.22 with options
-CAX
Lingua::Treebank::Const
- 0.02
- Improved comparison code by caching numerify results.
- Should give minor speed improvements for data that works with the same
tree over more than one operation. Little if any degradation (tiny
increase in size) for those who only use each tree once.
- Improved documentation.
- Now lists all instance methods. Instance method documentation also
organized better -- now falls into categories.
- 0.03
- new interface variable
- added $VERBOSE variable for suppressing non-fatal
errors.
- improved parsing
- now copes with examples like (e.g.):
((FRAG (FOO bar))
critically, earlier versions failed when the tag was empty and
not followed by whitespace
- 0.08
- added new methods
- select_ancestors
- select_children
- is_empty_root
- walk
- new interface variable
- added $BF_TRAVERSAL for changing walk()
method defaults
- 0.09
- added new methods
- TODO: document these, add test cases, update version number
- edges
- now with new ignore feature!
- shared_edges
- list_constituents
- 0.16
- Version number jump to keep up with Lingua::Treebank
Documentation for Penn treebank <http://www.cis.upenn.edu/~treebank/>.
Jeremy Gillmor Kahn, <kahn@cpan.org>
Copyright 2003 by Jeremy Gillmor Kahn
This library is free software; you can redistribute it and/or
modify it under the same terms as Perl itself.
Visit the GSP FreeBSD Man Page Interface. Output converted with ManDoc. |