vocabulary -- extract vocabularies from Penn treebank files
vocabulary [-NT ntfile] [-POS posfile] [-word wordfile] [-count] [-binarized]
[-verbose] file1 [file2...]
File1, file2 etc. are the names of Penn treebank files. If none are specified,
STDIN is used.
-NT ntfile      Write the non-terminal node vocabulary to ntfile.
-POS posfile    Write the part of speech vocabulary to posfile.
-word wordfile  Write the word vocabulary to wordfile.
-count          Print the frequency counts for each of the categories.
-binarized      The input files are in binarized format.
-verbose        Print filenames as they are processed.
Given a list of Penn treebank files, this script extracts the words, parts of
speech, and non-terminal node names and emits each in a separate file in order.
Note that giving "-" as the argument for any of ntfile, posfile, or wordfile
causes the results to be written to STDOUT.
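
The extraction described above can be sketched in Python. This is only an
illustration, not the script's actual implementation. The function names
extract_vocabularies and format_vocabulary are hypothetical. The sketch
assumes plain (non-binarized) bracketed trees and alphabetical output
order, neither of which this document confirms.

```python
import re
from collections import Counter

# Split bracketed treebank text into "(", ")" and bare symbols.
TOKEN = re.compile(r"\(|\)|[^\s()]+")


def extract_vocabularies(text):
    """Return (non-terminal, part-of-speech, word) Counters for the trees in text.

    Hypothetical helper: a label followed by "(" is treated as a non-terminal,
    and a label followed by a bare symbol is treated as a part of speech whose
    symbol is the word.
    """
    nonterminals, pos_tags, words = Counter(), Counter(), Counter()
    tokens = TOKEN.findall(text)
    for i, token in enumerate(tokens):
        if token != "(" or i + 1 >= len(tokens):
            continue
        label = tokens[i + 1]
        if label in ("(", ")"):
            # Unlabeled wrapper bracket such as "( (S ...) )" or an empty "()".
            continue
        following = tokens[i + 2] if i + 2 < len(tokens) else ")"
        if following == "(":
            nonterminals[label] += 1
        elif following != ")":
            pos_tags[label] += 1
            words[following] += 1
    return nonterminals, pos_tags, words


def format_vocabulary(counter, with_counts=False):
    """Return one output line per item, sorted alphabetically (an assumption).

    with_counts mimics the effect of the -count option by appending the
    frequency after a tab; the real output format may differ.
    """
    if with_counts:
        return [f"{item}\t{counter[item]}" for item in sorted(counter)]
    return sorted(counter)


if __name__ == "__main__":
    sample = "( (S (NP (DT the) (NN cat)) (VP (VBD sat))) )"
    nts, pos, wds = extract_vocabularies(sample)
    print("\n".join(format_vocabulary(nts)))
    print("\n".join(format_vocabulary(pos, with_counts=True)))
    print("\n".join(format_vocabulary(wds)))
```

Keeping the three vocabularies in separate counters mirrors the three output
files, and the counts are available whether or not they are printed.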
W.P. McNeill <firstname.lastname@example.org>