|TERMS||The amount of search terms|
|TERM_MIN||The minimum length of a term|
|TERM_MAX||The maximum length of a term|
|INPUT||The count of input strings which will be checked to see if they are prefixed with any of the TERMS. The strings are each exactly one character longer than TERM_MAX|
A few methods were benchmarked, and are listed as keys:
|TMFA||Text::Match::FastAlternatives match_at function|
|perl-re||Generic perl regex. The capturing version is qr/^(term1|term2)/, and the non-capturing version is qr/^(?:term1|term2)/, where the terms are joined together in a list2re fashion.|
|RE2||Same as perl-re, except using re::engine::RE2|
|TXS||Using a loop of prefix_search over the input items|
|TXS-Multi||Using a single function call to prefix_search_multi|
Generated INPUT=2000000 TERMS=20 TERM_MIN=3 TERM_MAX=6 CAP NAME DUR MATCH [N] TMFA 1.12s M=23578 [N] perl-re 1.24s M=23578 [N] RE2 0.91s M=23578 [Y] perl-re 1.44s M=23578 [Y] RE2 2.91s M=23578 [Y] TXS 0.53s M=23578 [Y] TXS-Multi 0.18s M=23578 Generated INPUT=2000000 TERMS=50 TERM_MIN=10 TERM_MAX=16 CAP NAME DUR MATCH [N] TMFA 1.14s M=50 [N] perl-re 1.20s M=50 [N] RE2 0.90s M=50 [Y] perl-re 1.43s M=50 [Y] RE2 1.10s M=50 [Y] TXS 0.53s M=50 [Y] TXS-Multi 0.17s M=50 Generated INPUT=2000000 TERMS=49 TERM_MIN=2 TERM_MAX=16 CAP NAME DUR MATCH [N] TMFA 1.18s M=241799 [N] perl-re 1.44s M=241799 [N] RE2 1.01s M=241799 [Y] perl-re 1.77s M=241799 [Y] RE2 4.94s M=241799 [Y] TXS 1.47s M=241799 [Y] TXS-Multi 1.15s M=241799 Generated INPUT=2000000 TERMS=10 TERM_MIN=5 TERM_MAX=10 CAP NAME DUR MATCH [N] TMFA 1.12s M=131 [N] perl-re 1.27s M=131 [N] RE2 0.97s M=131 [Y] perl-re 1.50s M=131 [Y] RE2 2.76s M=131 [Y] TXS 0.46s M=131 [Y] TXS-Multi 0.09s M=131 Generated INPUT=2000000 TERMS=100 TERM_MIN=3 TERM_MAX=25 CAP NAME DUR MATCH [N] TMFA 1.15s M=15734 [N] perl-re 1.26s M=15734 [N] RE2 0.94s M=15734 [Y] perl-re 1.49s M=15734 [Y] RE2 1.69s M=15734 [Y] TXS 1.06s M=15734 [Y] TXS-Multi 0.68s M=15734 Generated INPUT=2000000 TERMS=200 TERM_MIN=5 TERM_MAX=25 CAP NAME DUR MATCH [N] TMFA 1.15s M=1300 [N] perl-re 1.22s M=1300 [N] RE2 1.00s M=1300 [Y] perl-re 1.43s M=1300 [Y] RE2 1.27s M=1300 [Y] TXS 0.73s M=1300 [Y] TXS-Multi 0.24s M=1300 Generated INPUT=2000000 TERMS=8 TERM_MIN=2 TERM_MAX=5 CAP NAME DUR MATCH [N] TMFA 1.12s M=88025 [N] perl-re 1.22s M=88025 [N] RE2 0.82s M=88025 [Y] perl-re 1.64s M=88025 [Y] RE2 1.13s M=88025 [Y] TXS 0.63s M=88025 [Y] TXS-Multi 0.27s M=88025
Ive mainly tested this on Debians 5.10 - for newer perls, this module performs better, and for el5 5.8, The differences are a bit lower. TBC
There are quite a few modules out there which aim for a Trie-like search, but they are all either not written in C, or would not be performant enough for this application.
These two modules are implemented in pure perl, and are not part of the comparison.
The performance gains from this algorithm are less when matches are more likely (optimistic). Nevertheless, when matches are likely, chances are you want to figure out what it was that matched - in which case the performance benefits are still reaped - as this is the fastest performing capturing method.
Do not use prefix_search_multi with optimistic matching, as it will provide minimal speed boosts and increase your memory usage.
In the future, an interface to provide a function callback would be handy - but in the case of optimistic matching, would be bad for performance as well
Prefixes may not exceed 256 bytes. You can increase this limit (at the cost of more memory) by changing the #define of CHARTABLE_MAX in the XS code and recompiling.
Doing some basic tests with use utf8; and non-ascii input, it seems to work as expected. The character tables and prefix tries work at the byte level, so no conversion is done for you. This is usually not an issue, but I am no unicode expert, so nag me if you find something wrong
In the case where the prefix count becomes insanely high (i.e. over 5,000), the performance of this module will begin to drop. This could probably be solved in the future by a bunch of different methods. Even at a 10,000 prefix count, it still remains on-par with perl regular expressions.
Copyright (C) 2011 M. Nunberg
You may use and distribute this software under the same terms, conditions, and licensing as Perl itself.
|perl v5.20.3||TEXT::PREFIX::XS (3)||2011-12-28|