Manual Reference Pages  -  MCE::GREP (3)

NAME

MCE::Grep - Parallel grep model similar to the native grep function

VERSION

This document describes MCE::Grep version 1.703

SYNOPSIS



   ## Exports mce_grep, mce_grep_f, and mce_grep_s
   use MCE::Grep;

   ## Array or array_ref
   my @a = mce_grep { $_ % 5 == 0 } 1..10000;
   my @b = mce_grep { $_ % 5 == 0 } [ 1..10000 ];

   ## File_path, glob_ref, or scalar_ref
   my @c = mce_grep_f { /pattern/ } "/path/to/file";
   my @d = mce_grep_f { /pattern/ } $file_handle;
   my @e = mce_grep_f { /pattern/ } \$scalar;

   ## Sequence of numbers (begin, end [, step, format])
   my @f = mce_grep_s { $_ % 3 == 0 } 1, 10000, 5;
   my @g = mce_grep_s { $_ % 3 == 0 } [ 1, 10000, 5 ];

   my @h = mce_grep_s { $_ % 3 == 0 } {
      begin => 1, end => 10000, step => 5, format => undef
   };



DESCRIPTION

This module provides a parallel grep implementation via Many-Core Engine (MCE). MCE incurs a small overhead from passing data between processes, so a code block that is already fast will run faster with the native grep. The overhead matters less as the complexity of the code block increases.



   my @m1 =     grep { $_ % 5 == 0 } 1..1000000;          ## 0.065 secs
   my @m2 = mce_grep { $_ % 5 == 0 } 1..1000000;          ## 0.194 secs



Chunking, enabled by default, greatly reduces this overhead behind the scenes. The time for mce_grep below also includes the time for data exchanges between the manager and worker processes. The benefit of parallelization grows as the code block consumes more CPU time.



   my @m1 =     grep { /[2357][1468][9]/ } 1..1000000;    ## 0.353 secs
   my @m2 = mce_grep { /[2357][1468][9]/ } 1..1000000;    ## 0.218 secs



Even faster is mce_grep_s, which is useful when the input data is a range of numbers. Workers generate sequences mathematically among themselves without any interaction from the manager process. Two arguments are required for mce_grep_s (begin, end). Step defaults to 1 if begin is smaller than end, otherwise -1.



   my @m3 = mce_grep_s { /[2357][1468][9]/ } 1, 1000000;  ## 0.165 secs
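
The begin, end, and step rules above can be illustrated in plain Perl. This is only a sketch of the documented semantics, not how MCE workers compute sequences internally; expand_sequence is a hypothetical helper and is not part of MCE.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of MCE): expands (begin, end [, step])
# following the rule described above. Step defaults to 1 when begin is
# smaller than end, otherwise -1.
sub expand_sequence {
    my ($beg, $end, $step) = @_;

    $step = ($beg < $end) ? 1 : -1 unless defined $step;
    die "step must not be zero\n" if $step == 0;

    my @seq;
    if ($step > 0) {
        for (my $n = $beg; $n <= $end; $n += $step) { push @seq, $n }
    }
    else {
        for (my $n = $beg; $n >= $end; $n += $step) { push @seq, $n }
    }

    return @seq;
}

my @up   = expand_sequence(1, 10);       # ascending, default step 1
my @down = expand_sequence(10, 1);       # descending, default step -1
my @by5  = expand_sequence(1, 30, 5);    # explicit step

print "@up\n@down\n@by5\n";
```

Because each value depends only on begin, step, and an index, every worker can compute its own share of the range without receiving data from the manager.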



Although this document is about MCE::Grep, the MCE::Stream module can write results immediately without waiting for all chunks to complete. This is made possible by passing the reference to an array (in this case @m4 and @m5).



   use MCE::Stream default_mode => 'grep';

   my @m4; mce_stream \@m4, sub { /[2357][1468][9]/ }, 1..1000000;

      ## Completed in 0.203 secs. This is amazing considering the
      ## overhead for passing data between the manager and workers.

   my @m5; mce_stream_s \@m5, sub { /[2357][1468][9]/ }, 1, 1000000;

      ## Completed in 0.120 secs. Like with mce_grep_s, specifying a
      ## sequence specification turns out to be faster due to lesser
      ## overhead for the manager process.



A common scenario is grepping for pattern(s) inside a massive log file. Notice how parallelism increases as complexity increases for the pattern. Testing was done against a 300 MB file containing 250k lines.



   use MCE::Grep;

   my @m; open my $LOG, "<", "/path/to/log/file" or die "$!\n";

   @m = grep { /pattern/ } <$LOG>;                      ##  0.756 secs
   @m = grep { /foobar|[2357][1468][9]/ } <$LOG>;       ## 24.681 secs

   ## Parallelism with mce_grep. This involves the manager process
   ## due to processing a file handle.

   @m = mce_grep { /pattern/ } <$LOG>;                  ##  0.997 secs
   @m = mce_grep { /foobar|[2357][1468][9]/ } <$LOG>;   ##  7.439 secs

   ## Even faster with mce_grep_f. Workers access the file directly
   ## with zero interaction from the manager process.

   my $LOG = "/path/to/file";
   @m = mce_grep_f { /pattern/ } $LOG;                  ##  0.112 secs
   @m = mce_grep_f { /foobar|[2357][1468][9]/ } $LOG;   ##  6.840 secs



PARSING HUGE FILES

The MCE::Grep module does not know the pattern used inside the code block, so it cannot quickly check whether a chunk contains a match before processing it line by line. Use the following snippet as a template to achieve better performance. Also, take a look at examples/egrep.pl, included with the distribution.



   use MCE::Loop;

   MCE::Loop::init {
      max_workers => 8, use_slurpio => 1
   };

   my $pattern  = 'karl';
   my $hugefile = 'very_huge.file';

   my @result = mce_loop_f {
      my ($mce, $slurp_ref, $chunk_id) = @_;

      ## Quickly determine if a match is found.
      ## Process slurped chunk only if true.

      if ($$slurp_ref =~ /$pattern/m) {
         my @matches;

         ## The following is fast on Unix. Performance degrades
         ## drastically on Windows beyond 4 workers.

         open my $MEM_FH, '<', $slurp_ref;
         binmode $MEM_FH, ':raw';
         while (<$MEM_FH>) { push @matches, $_ if (/$pattern/); }
         close   $MEM_FH;

         ## Therefore, use the following construct on Windows.

         while ( $$slurp_ref =~ /([^\n]+\n)/mg ) {
            my $line = $1; # save $1 to not lose the value
            push @matches, $line if ($line =~ /$pattern/);
         }

         ## Gather matched lines.

         MCE->gather(@matches);
      }

   } $hugefile;

   print join('', @result);
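
The two-stage idea in the template (test the whole chunk once, then scan lines only on a hit) can be tried without MCE on an in-memory string. The following is a minimal sketch; matching_lines and the sample chunks are hypothetical and only stand in for the slurped chunk a worker would receive.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of MCE: returns the lines of one slurped
# chunk that match $pattern, skipping the line scan entirely when the
# chunk contains no match.
sub matching_lines {
    my ($slurp_ref, $pattern) = @_;

    # Stage 1: a single regex test over the whole chunk.
    return () unless $$slurp_ref =~ /$pattern/m;

    # Stage 2: read the chunk line by line through an in-memory handle.
    my @matches;
    open my $mem_fh, '<', $slurp_ref
        or die "Cannot open in-memory handle: $!\n";
    binmode $mem_fh, ':raw';
    while (my $line = <$mem_fh>) {
        push @matches, $line if $line =~ /$pattern/;
    }
    close $mem_fh;

    return @matches;
}

my $chunk = "alpha\nkarl one\nbeta\nkarl two\n";
my $nohit = "gamma\ndelta\n";

my @found = matching_lines(\$chunk, 'karl');
my @none  = matching_lines(\$nohit, 'karl');

print @found;
print scalar(@none), " matches in the second chunk\n";
```

The saving comes from chunks like the second one: a single regex pass rejects it, and no per-line work is done.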



OVERRIDING DEFAULTS

The following lists the options which may be overridden when loading the module.



   use Sereal qw( encode_sereal decode_sereal );
   use CBOR::XS qw( encode_cbor decode_cbor );
   use JSON::XS qw( encode_json decode_json );

   use MCE::Grep
         max_workers => 4,                ## Default auto
         chunk_size => 100,               ## Default auto
         tmp_dir => "/path/to/app/tmp",   ## $MCE::Signal::tmp_dir
         freeze => \&encode_sereal,       ## \&Storable::freeze
         thaw => \&decode_sereal          ## \&Storable::thaw
   ;



There is a simpler way to enable Sereal. The following will attempt to use Sereal if available, otherwise defaults to Storable for serialization.



   use MCE::Grep Sereal => 1;
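
The fallback behavior described above can be pictured with the following sketch. It is not MCE's internal code, only an illustration of the "use Sereal if available, otherwise Storable" pattern, and the variable names are hypothetical.

```perl
use strict;
use warnings;
use Storable ();

# Start with the Storable pair that the text above names as the fallback.
my ($freeze, $thaw) = (\&Storable::freeze, \&Storable::thaw);
my $serializer = 'Storable';

# Switch to Sereal only if both halves can be loaded.
if (eval { require Sereal::Encoder; require Sereal::Decoder; 1 }) {
    $freeze     = \&Sereal::Encoder::encode_sereal;
    $thaw       = \&Sereal::Decoder::decode_sereal;
    $serializer = 'Sereal';
}

# Round-trip a small structure through whichever pair was selected.
my $copy = $thaw->( $freeze->( [ 1, 2, 3 ] ) );

print "Using $serializer: @$copy\n";
```

A pair selected this way could then be passed through the freeze and thaw options shown earlier in this section.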



CUSTOMIZING MCE

MCE::Grep->init ( options )
MCE::Grep::init { options }

The init function accepts a hash of MCE options. The gather option, if specified, is ignored due to being used internally by the module.



   use MCE::Grep;

   MCE::Grep::init {
      chunk_size => 1, max_workers => 4,

      user_begin => sub {
         print "## ", MCE->wid, " started\n";
      },

      user_end => sub {
         print "## ", MCE->wid, " completed\n";
      }
   };

   my @a = mce_grep { $_ % 5 == 0 } 1..100;

   print "\n", "@a", "\n";

   -- Output

   ## 2 started
   ## 3 started
   ## 1 started
   ## 4 started
   ## 3 completed
   ## 4 completed
   ## 1 completed
   ## 2 completed

   5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100



API DOCUMENTATION

MCE::Grep->run ( sub { code }, iterator )
mce_grep { code } iterator

An iterator reference can be specified for input_data. Iterators are described under SYNTAX for INPUT_DATA at MCE::Core.



   my @a = mce_grep { $_ % 3 == 0 } make_iterator(10, 30, 2);
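
The make_iterator function is not defined in this document. A minimal sketch of what it might look like follows, assuming the arguments are (begin, end, step) and that the iterator returns one item per call and an empty list once the input is exhausted. The iterator is drained here with the native grep so that the snippet runs without MCE.

```perl
use strict;
use warnings;

# Hypothetical make_iterator: the argument order (begin, end, step) is an
# assumption based on the call shown above.
sub make_iterator {
    my ($n, $max, $step) = @_;

    return sub {
        return if $n > $max;     # empty list signals the end of input
        my $current = $n;
        $n += $step;
        return $current;
    };
}

# Drain the iterator to show the values it yields.
my $iter = make_iterator(10, 30, 2);
my @items;
while (defined(my $item = $iter->())) {
    push @items, $item;
}

my @a = grep { $_ % 3 == 0 } @items;
print "@a\n";
```

The closure keeps its position in $n between calls, which lets a consumer pull input on demand instead of building the whole list first.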



MCE::Grep->run ( sub { code }, list )
mce_grep { code } list

Input data can be defined using a list.



   my @a = mce_grep { /[2357]/ } 1..1000;
   my @b = mce_grep { /[2357]/ } [ 1..1000 ];



MCE::Grep->run_file ( sub { code }, file )
mce_grep_f { code } file

Of the three input forms shown below, passing the file path is the fastest. Workers communicate the next offset position among themselves without any interaction from the manager process.



   my @c = mce_grep_f { /pattern/ } "/path/to/file";
   my @d = mce_grep_f { /pattern/ } $file_handle;
   my @e = mce_grep_f { /pattern/ } \$scalar;



MCE::Grep->run_seq ( sub { code }, $beg, $end [, $step, $fmt ] )
mce_grep_s { code } $beg, $end [, $step, $fmt ]

Sequence can be defined as a list, an array reference, or a hash reference. The functions require both begin and end values to run. Step and format are optional. The format is passed to sprintf (% may be omitted below).



   my ($beg, $end, $step, $fmt) = (10, 20, 0.1, "%4.1f");

   my @f = mce_grep_s { /[1234]\.[5678]/ } $beg, $end, $step, $fmt;
   my @g = mce_grep_s { /[1234]\.[5678]/ } [ $beg, $end, $step, $fmt ];

   my @h = mce_grep_s { /[1234]\.[5678]/ } {
      begin => $beg, end => $end, step => $step, format => $fmt
   };
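
To see what the code block receives when a format is given, the same sequence can be built by hand with sprintf and filtered with the native grep. This is a sketch of the documented behavior rather than MCE's implementation; computing each value as begin plus index times step is a choice made here to avoid accumulating floating-point error.

```perl
use strict;
use warnings;

my ($beg, $end, $step, $fmt) = (10, 20, 0.1, "%4.1f");

# Number of steps from begin to end; the small epsilon guards against
# the division landing just below a whole number.
my $count = int( ($end - $beg) / $step + 1e-9 );

# Each value is formatted with sprintf before the pattern sees it.
my @seq = map { sprintf $fmt, $beg + $_ * $step } 0 .. $count;

my @f = grep { /[1234]\.[5678]/ } @seq;

print scalar(@seq), " values, ", scalar(@f), " matches\n";
print "first: $f[0], last: $f[-1]\n";
```

The pattern matches on the formatted strings such as "11.5", which is why the format matters for a regex-based code block.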



MANUAL SHUTDOWN

MCE::Grep->finish
MCE::Grep::finish

Workers remain persistent as much as possible after running. Shutdown occurs automatically when the script terminates. Call finish when workers are no longer needed.



   use MCE::Grep;

   MCE::Grep::init {
      chunk_size => 20, max_workers => 'auto'
   };

   my @a = mce_grep { ... } 1..100;

   MCE::Grep::finish;



INDEX

MCE, MCE::Core

AUTHOR

Mario E. Roy, <marioeroy AT gmail DOT com>

perl v5.20.3 MCE::GREP (3) 2016-03-20
