rwsplit(1) SiLK Tool Suite rwsplit(1)

rwsplit - Divide a SiLK file into a (sampled) collection of subfiles

  rwsplit --basename=BASENAME
        { --ip-limit=LIMIT | --flow-limit=LIMIT
          | --packet-limit=LIMIT | --byte-limit=LIMIT }
        [--seed=NUMBER] [--sample-ratio=SAMPLE_RATIO]
        [--file-ratio=FILE_RATIO] [--max-outputs=MAX_OUTPUTS]
        [--note-add=TEXT] [--note-file-add=FILE]
        [--compression-method=COMP_METHOD]
        [--print-filenames] [--site-config-file=FILENAME]
        [--xargs[=FILE] | FILE [FILES...]]

  rwsplit --help

  rwsplit --version

rwsplit reads SiLK Flow records from the standard input or from files named on the command line and writes the flows into a set of subfiles based on the splitting criterion. In its simplest form, rwsplit partitions the file, meaning that each input flow will appear in one (and only one) of the subfiles.

In addition to splitting the file, rwsplit can generate files containing sample flows. Sampling is specified by using the --sample-ratio and --file-ratio switches.

rwsplit reads SiLK Flow records from the files named on the command line or from the standard input when no file names are specified and --xargs is not present. To read the standard input in addition to the named files, use "-" or "stdin" as a file name. If an input file name ends in ".gz", the file is uncompressed as it is read. When the --xargs switch is provided, rwsplit reads the names of the files to process from the named text file or from the standard input if no file name argument is provided to the switch. The input to --xargs must contain one file name per line.

If you wish to use the size of the output files as the splitting criterion, use the --flow-limit switch. The parameter to this switch should be the size of the desired output files divided by the record size. The record size can be determined by rwfileinfo(1). When the output files are compressed (see the description of --compression-method below), you should assume about a 50% compression ratio.
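The division described above can be sketched as a small shell helper. The function name flow_limit_for_size and the 52-byte record size are assumptions made for illustration only; use the record size that rwfileinfo(1) reports for your own files. The sketch ignores the file header and reads "a 50% compression ratio" as "about twice as many records fit in the same space", which is only an approximation.

```shell
# Hypothetical helper: estimate a --flow-limit value for a target file size.
#   $1 = desired output file size in bytes
#   $2 = record size in bytes (look this up with rwfileinfo(1))
#   $3 = "compressed" to apply the rough 50% compression assumption (optional)
flow_limit_for_size() {
    target_bytes=$1
    record_bytes=$2
    # Integer division: whole records that fit in the target size.
    limit=$(( target_bytes / record_bytes ))
    if [ "${3:-}" = "compressed" ]; then
        # Assumption: compressed records take about half the space,
        # so roughly twice as many fit in the same file size.
        limit=$(( limit * 2 ))
    fi
    echo "$limit"
}

# Example: 1 MiB output files with an illustrative 52-byte record.
flow_limit_for_size 1048576 52
flow_limit_for_size 1048576 52 compressed
```

The printed number would then be passed as the LIMIT of --flow-limit.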

Option names may be abbreviated if the abbreviation is unique or is an exact match for an option. A parameter to an option may be specified as --arg=param or --arg param, though the first form is required for options that take optional parameters.

The splitting criterion is defined using one of the limit specifiers; one and only one must be specified. They are:

--ip-limit=LIMIT
Close the current subfile and begin a new subfile when the count of unique source and destination IPs in the current subfile meets or exceeds LIMIT. The next-hop-IP does not count toward LIMIT.
--flow-limit=LIMIT
Close the current subfile and begin a new subfile when the number of SiLK Flow records in the current subfile meets LIMIT.
--packet-limit=LIMIT
Close the current subfile and begin a new subfile when the sum of the packet counts across all SiLK Flow records in the current subfile meets or exceeds LIMIT.
--byte-limit=LIMIT
Close the current subfile and begin a new subfile when the sum of the byte counts across all SiLK Flow records in the current subfile meets or exceeds LIMIT. This switch does not specify the size of the subfiles.

The other switches are:

--basename=BASENAME
Specifies the basename of the output files; this switch is required. The flows are written sequentially to a set of subfiles whose names follow the format BASENAME.ORDER.rwf, where ORDER is an 8-digit zero-formatted sequence number (i.e., 00000000, 00000001, and so on). The sequence number will begin at zero and increase by one for every file written, unless --file-ratio is specified.
--seed=NUMBER
Use NUMBER to seed the pseudo-random number generator for the --sample-ratio or --file-ratio switch. This can be used to put the random number generator into a known state, which is useful for testing.
--sample-ratio=SAMPLE_RATIO
Writes one flow record, chosen at random, from every SAMPLE_RATIO flows that are read.
--file-ratio=FILE_RATIO
Picks one subfile, chosen at random, out of every FILE_RATIO names generated, for writing to disk.
--max-outputs=NUMBER
Limits the number of files that are written to disk to NUMBER.
--note-add=TEXT
Add the specified TEXT to the header of the output file as an annotation. This switch may be repeated to add multiple annotations to a file. To view the annotations, use the rwfileinfo(1) tool.
--note-file-add=FILENAME
Open FILENAME and add the contents of that file to the header of the output file as an annotation. This switch may be repeated to add multiple annotations. Currently the application makes no effort to ensure that FILENAME contains text; be careful that you do not attempt to add a SiLK data file as an annotation.
--compression-method=COMP_METHOD
Specify the compression library to use when writing output files. If this switch is not given, the value in the SILK_COMPRESSION_METHOD environment variable is used if the value names an available compression method. When no compression method is specified, the output files are compressed using the default chosen when SiLK was compiled. The valid values for COMP_METHOD are determined by which external libraries were found when SiLK was compiled. To see the available compression methods and the default method, use the --help or --version switch. SiLK can support the following COMP_METHOD values when the required libraries are available.
none
Do not compress the output using an external library.
zlib
Use the zlib(3) library for compressing the output. Using zlib produces the smallest output files at the cost of speed.
lzo1x
Use the lzo1x algorithm from the LZO real-time compression library for compression. This compression provides good compression with less memory and CPU overhead.
snappy
Use the snappy library for compression, and always compress the output regardless of the destination. This compression provides good compression with less memory and CPU overhead. Since SiLK 3.13.0.
best
Use lzo1x if available, otherwise use snappy if available, otherwise use zlib if available.
--print-filenames
Print to the standard error the names of input files as they are opened.
--site-config-file=FILENAME
Read the SiLK site configuration from the named file FILENAME. When this switch is not provided, rwsplit searches for the site configuration file in the locations specified in the "FILES" section.
--xargs
--xargs=FILENAME
Read the names of the input files from FILENAME or from the standard input if FILENAME is not provided. The input is expected to have one filename per line. rwsplit opens each named file in turn and reads records from it as if the filenames had been listed on the command line.
--help
Print the available options and exit.
--version
Print the version number and information about how SiLK was configured, then exit the application.

In the following examples, the dollar sign ("$") represents the shell prompt. The text after the dollar sign represents the command line. Lines have been wrapped for improved readability, and the backslash ("\") is used to indicate a wrapped line.

Assume a source file source.rwf; to split that file into files that each contain about 100 unique IP addresses:

 $ rwsplit --basename=result --ip-limit=100 source.rwf

To split source.rwf into files that each contain 100 flows:

 $ rwsplit --basename=result --flow-limit=100 source.rwf

The following causes rwsplit to sample 1 out of every 10 records from source.rwf; i.e., rwsplit will read 1000 flow records to produce each subfile:

 $ rwsplit --basename=result --flow-limit=100 --sample-ratio=10 source.rwf

When --file-ratio is specified, the file names are generated as usual (e.g., base-00000000, base-00000001, ...); however, one of these names will be chosen randomly from each set of --file-ratio candidates, and only that file will be written to disk.

 $ rwsplit --basename=result --flow-limit=100 --file-ratio=5 source.rwf
 $ ls
 result-00000002.rwf
 result-00000008.rwf
 result-00000013.rwf
 result-00000016.rwf

rwsplit can take exactly 1 partitioning switch per invocation.

Partitioning is not exact: rwsplit keeps appending flow records to a file until it meets or exceeds the specified LIMIT. For example, if you specify --ip-limit=100, then rwsplit will fill up the file until it has 100 IP addresses in it; if the file has 99 addresses and a new record with 2 previously unseen addresses is received, rwsplit will put this in the current file, resulting in a 101-address file. Similarly, if you specify --byte-limit=2000, and rwsplit receives a 10kb flow record, that flow record will be placed in the current subfile.
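The "meets or exceeds" behavior can be illustrated with a simplified model in plain shell. This is not rwsplit's actual code: limit_split is a hypothetical function, and each argument after LIMIT stands for the amount (new IP addresses, packets, or bytes) that one record adds toward the limit.

```shell
# Simplified model of the "meets or exceeds" rule, not rwsplit's actual code.
#   $1     = LIMIT
#   $2 ... = amount each successive record adds toward LIMIT
# Prints the total held by each subfile, in order.
limit_split() {
    limit=$1
    shift
    total=0
    totals=""
    for amount in "$@"; do
        # The record always goes into the current subfile first...
        total=$(( total + amount ))
        # ...and only then is the subfile closed if LIMIT is met or exceeded.
        if [ "$total" -ge "$limit" ]; then
            totals="$totals $total"
            total=0
        fi
    done
    # Any remaining records form a final, partly filled subfile.
    if [ "$total" -gt 0 ]; then
        totals="$totals $total"
    fi
    # Unquoted on purpose so the leading space is dropped.
    echo $totals
}

# A file holding 99 addresses receives a record with 2 new addresses.
limit_split 100 99 2
# A 10000-byte record arrives while the byte limit is 2000.
limit_split 2000 500 10000 800 800 800
```

In this model the first call reports a single subfile of 101 addresses, and the second reports a subfile of 10500 bytes followed by one of 2400 bytes, so a subfile can overshoot LIMIT by up to one record's contribution.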

The switches --sample-ratio, --file-ratio, and --max-outputs are processed in that order. So, when you specify

 $ rwsplit --basename=result --sample-ratio=10 --ip-limit=100    \
        --file-ratio=10 --max-outputs=20 source.rwf

rwsplit will pick 1 out of every 10 flow records, write that to a file until it has 100 IPs per file, pick 1 out of every 10 files to write, and write up to 20 files. If there are 1000 records, each with 2 unique IPs in them, then rwsplit will write at most 1 file (it will write 200 unique IP addresses, but it may not pick one of the files from the set to write).
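The arithmetic behind that example can be worked through step by step. The variable names below are illustrative, and the calculation assumes every sampled record contributes exactly 2 previously unseen addresses; it yields an upper bound on the number of files, not a prediction of what a given run writes.

```shell
records=1000        # flow records in the input
ips_per_record=2    # previously unseen IPs contributed by each record
sample_ratio=10
ip_limit=100
file_ratio=10
max_outputs=20

# Step 1: --sample-ratio keeps 1 record out of every 10.
sampled=$(( records / sample_ratio ))
# Step 2: the sampled records carry this many unique IPs in total.
unique_ips=$(( sampled * ips_per_record ))
# Step 3: --ip-limit closes a subfile at 100 IPs (rounded up).
subfiles=$(( (unique_ips + ip_limit - 1) / ip_limit ))
# Step 4: --file-ratio keeps at most 1 name per set of 10 candidates.
# An incomplete set may or may not yield a file, so round up for a bound.
written_max=$(( (subfiles + file_ratio - 1) / file_ratio ))
# Step 5: --max-outputs caps the result.
if [ "$written_max" -gt "$max_outputs" ]; then
    written_max=$max_outputs
fi

echo "sampled=$sampled unique_ips=$unique_ips subfiles=$subfiles written_max=$written_max"
```

With these inputs the script reports 100 sampled records, 200 unique IPs, 2 candidate subfiles, and at most 1 file written, which matches the description above.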

SILK_CLOBBER
The SiLK tools normally refuse to overwrite existing files. Setting SILK_CLOBBER to a non-empty value removes this restriction.
SILK_COMPRESSION_METHOD
This environment variable is used as the value for --compression-method when that switch is not provided. Since SiLK 3.13.0.
SILK_CONFIG_FILE
This environment variable is used as the value for the --site-config-file when that switch is not provided.
SILK_DATA_ROOTDIR
This environment variable specifies the root directory of the data repository. As described in the "FILES" section, rwsplit may use this environment variable when searching for the SiLK site configuration file.
SILK_PATH
This environment variable gives the root of the install tree. When searching for configuration files, rwsplit may use this environment variable. See the "FILES" section for details.

${SILK_CONFIG_FILE}
${SILK_DATA_ROOTDIR}/silk.conf
/data/silk.conf
${SILK_PATH}/share/silk/silk.conf
${SILK_PATH}/share/silk.conf
/usr/local/share/silk/silk.conf
/usr/local/share/silk.conf
Possible locations for the SiLK site configuration file which are checked when the --site-config-file switch is not provided.
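To see which of these locations exists on a given machine, a lookup along the following lines may help. It is only a sketch: find_silk_conf is a hypothetical function, and it assumes the candidates are tried in the order listed, that entries depending on an unset environment variable are skipped, and that the first existing file wins. rwsplit's own search may differ in detail.

```shell
# Sketch only: print the first candidate that exists as a regular file.
# Assumes the candidates are tried in the order listed, and that entries
# whose environment variable is unset or empty are skipped.
find_silk_conf() {
    for candidate in \
        "${SILK_CONFIG_FILE:-}" \
        "${SILK_DATA_ROOTDIR:+$SILK_DATA_ROOTDIR/silk.conf}" \
        /data/silk.conf \
        "${SILK_PATH:+$SILK_PATH/share/silk/silk.conf}" \
        "${SILK_PATH:+$SILK_PATH/share/silk.conf}" \
        /usr/local/share/silk/silk.conf \
        /usr/local/share/silk.conf
    do
        if [ -n "$candidate" ] && [ -f "$candidate" ]; then
            echo "$candidate"
            return 0
        fi
    done
    return 1
}

find_silk_conf || echo "no silk.conf found in the listed locations" >&2
```

The path it prints could be passed explicitly with --site-config-file to remove any doubt about which file is in use.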

rwfileinfo(1), silk(7), zlib(3)
2022-04-12 SiLK 3.19.1
