NAME
ucto - Unicode Tokenizer

SYNOPSIS
ucto [[options]] [input-file] [[output-file]]

DESCRIPTION
ucto tokenizes text files: it separates words from punctuation, splits sentences (and optionally paragraphs), and finds paired quotes. Ucto is preconfigured with tokenisation rules for several languages. Those rules are provided by uctodata.

OPTIONS
-c configfile read settings from a 'configfile'
-B run in batch mode: process all input files and write the results to the
output directory specified with -O.
-d value set debug mode to 'value'
-e value set input encoding. (default UTF8)
-I value set the input directory to 'value'. (batch mode
only)
-O value set the output directory to 'value'. (Required for batch
mode)
-N value set UTF8 output normalization. (default NFC)
--filter=[YES|NO] enable or disable filtering of special characters. (default YES)
These special characters can be specified in the [FILTER] block of the
configuration file.
-L language Automatically selects a configuration file by language
code. The language code is generally a three-letter ISO 639-3 code. For
example, 'fra' will select the file tokconfig-fra from the installation
directory.
--detectlanguages=<lang1,lang2,...,langn> try to detect all the specified languages. The default
language will be 'lang1'. (only useful for FoLiA output)
All values must be ISO 639-3 codes. You can also use the special language code `und`. This ensures there is NO default language, and any language that is NOT in the list will remain unanalyzed.
Warning: To be able to handle utterances of mixed language, Ucto uses a simple sentence splitter based on the markers '.' '?' and '!'. This may occasionally lead to surprising results.
-l Convert output text to all lowercase
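The warning under --detectlanguages mentions a simple sentence splitter based on '.', '?' and '!'. The Python sketch below is not ucto's code; it is a deliberately naive splitter, written under the assumption that a split happens at whitespace following one of those markers, to show the kind of surprising result the warning refers to. The function name naive_split and the sample text are made up for this example.

```python
import re

def naive_split(text):
    """Split at whitespace that follows '.', '?' or '!'.

    Assumption: this mimics a marker-based splitter in spirit only;
    it is not how ucto is implemented.
    """
    parts = re.split(r"(?<=[.?!])\s+", text.strip())
    return [part for part in parts if part]

# The abbreviation 'Dr.' ends in a marker, so it is cut off as a sentence.
for sentence in naive_split("Dr. Smith arrived. Was he late? No!"):
    print(sentence)
```

A splitter of this kind has no notion of abbreviations, which is one plausible source of unexpected sentence boundaries.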
-u Convert all input text to all uppercase
-n Emit one sentence per line on output
-m Assume one sentence per line on input
--normalize=class1,class2,...,classn map all occurrences of tokens of class1,...,classn to
their generic names, e.g. --normalize=DATE will map all dates to the word
{{DATE}}. Very useful to normalize tokens like URLs, dates, e-mail addresses
and so on.
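As a rough illustration of the effect of --normalize, the Python sketch below replaces tokens of the selected classes by their generic names. It is not ucto's implementation: the (text, class) pairs, the class names other than DATE, and the function name normalize_tokens are assumptions made for this example.

```python
def normalize_tokens(tokens, classes):
    """Replace every token whose class is in `classes` by {{CLASS}}."""
    wanted = set(classes)
    return [
        "{{" + cls + "}}" if cls in wanted else text
        for text, cls in tokens
    ]

# Hypothetical tokenizer output as (token text, token class) pairs.
# Only the class name DATE is taken from the option description above.
tokens = [
    ("Meeting", "WORD"),
    ("on", "WORD"),
    ("2024-01-31", "DATE"),
    (".", "PUNCTUATION"),
]

print(" ".join(normalize_tokens(tokens, ["DATE"])))
# prints: Meeting on {{DATE}} .
```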
-T value or --textredundancy=value set text redundancy level for text nodes in FoLiA output:
'full' - add text to all levels: <p> <s> <w> etc.
'minimal' - don't introduce text on higher levels, but retain what is already there.
'none' - only introduce text on <w>, AND remove all text from higher levels
--allow-word-correction Allow ucto to tokenize inside FoLiA Word elements,
creating FoLiA Corrections
--ignore-tag-hints Skip all tag=token hints from the FoLiA input.
These hints can be used to signal text markup like subscript and
superscript
--add-tokens="file" Add additional tokens to the [TOKENS] block of the
default language. The file should contain one TOKEN per line.
--passthru Don't tokenize, but perform input decoding and simple
token role detection
--filterpunct remove most of the punctuation from the output. (not from
abbreviations and embedded punctuation like John's)
-P Disable Paragraph Detection
-Q Enable Quote Detection. (this is experimental and may
lead to unexpected results)
-s <string> Set End‐of‐sentence marker. (Default
<utt>)
-V or --version Show version information
-v set Verbose mode
-F The input file(s) are assumed to be FoLiA XML. Text in
the correct 'inputclass' will be tokenized. For files with an '.xml'
extension, -F is the default.
In batch mode, this forces ucto to select only files with the '.xml' extension from the input directory.
--inputclass="cls" When tokenizing a FoLiA XML document, search for text
nodes of class 'cls'. The default is "current".
--outputclass="cls" When tokenizing a FoLiA XML document, output the
tokenized text in text nodes with 'cls'. The default is "current".
It is recommended to have different classes for input and output.
--textclass="cls" (obsolete) use 'cls' for input and output of text from FoLiA.
Equivalent to both --inputclass='cls' and --outputclass='cls'.
This option is obsolete and NOT recommended. Please use the separate --inputclass= and --outputclass= options.
--copyclass when ucto is used on FoLiA with fully tokenized text in
inputclass='inputclass', no text in textclass 'outputclass' is produced (a
warning will be given). To circumvent this, add the --copyclass option,
which ensures that text will be emitted in that class.
-X All output will be FoLiA XML. Document IDs are
autogenerated.
Works in batch mode too.
--id <DocId> Use the specified Document ID for the FoLiA XML (not
allowed in batch mode). When not provided, a document ID is generated based on
the name of the input file.
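To show how the batch-mode options fit together, the Python sketch below assembles a ucto command line from the flags documented above (-L, -B, -I, -O and -X). It only builds and prints the command and does not run ucto. The function name ucto_batch_command and the directory names 'in' and 'out' are hypothetical, and the particular combination of flags is an assumption that should be checked against the installed version.

```python
import shlex

def ucto_batch_command(language, input_dir, output_dir, folia=False):
    """Build (but do not run) a ucto batch-mode command line."""
    args = [
        "ucto",
        "-L", language,    # select the configuration file by language code
        "-B",              # batch mode
        "-I", input_dir,   # input directory (batch mode only)
        "-O", output_dir,  # output directory (required for batch mode)
    ]
    if folia:
        args.append("-X")  # all output will be FoLiA XML
    return args

# 'in' and 'out' are placeholder directory names for this example.
cmd = ucto_batch_command("fra", "in", "out", folia=True)
print(shlex.join(cmd))
# prints: ucto -L fra -B -I in -O out -X
```

Passing the returned list to subprocess.run would execute the command, provided ucto is installed. The --id option is left out on purpose, since it is not allowed in batch mode.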

BUGS
likely

AUTHORS
Maarten van Gompel
Ko van der Sloot
e-mail: lamasoftware@science.ru.nl