NAME

Genezzo::Block::RDBlock.pm - Row Directory Block tied hash class. A class that lets you treat the contents of a block (byte buffer) as a hash.

Note: This implementation is almost, but not quite, a pushhash. The push hash implementation is Genezzo::Row::RSBlock. It also forms the basis of a tied array in Genezzo::Block::RDBArray.

SYNOPSIS

 use Genezzo::Block::RDBlock;
 use Genezzo::Block::Std;

 local $Genezzo::Block::Std::DEFBLOCKSIZE = 500;

 my $buff = "\0" x 500; # construct an empty byte buffer

 my %tied_hash = ();

 my $tie_val = 
     tie %tied_hash, 'Genezzo::Block::RDBlock', (refbufstr => \$buff);

 # pushhash style 
 # (note that the "PUSH" pseudo key is not supported)...
 my $newkey = $tie_val->HPush("this is a test");

 # or array style, your choice
 my $pushcount = $tie_val->PUSH(qw(push lots of data));

 $tied_hash{$newkey} = "update this entry";

 # a hash that supports array style FETCHSIZE
 my $getcount = $tie_val->FETCHSIZE(); # Note: not HCount

DESCRIPTION

RDBlock is the basis for persistent tied hashes, pushhashes, and tied arrays. After the hash is tied to the byte buffer, the buffer can be written to persistent storage. The storage is designed such that inserts/appends/pushes are fairly efficient, and deletes are inexpensive. The pctfree/pctused parameters allow some tuning to reserve space in the buffer for updates that "grow" existing values. Updates that do not change the packed size of data are about as efficient as insert/appends -- just the cost to copy your bytes into the buffer -- but updates that do change the size of stored values can require a large amount of byte shifting to open up storage space. Also, the buffer does not grow to accomodate large values. Wrapper classes are necessary to specify mechanisms for packing complex data structures and techniques to split objects across multiple buffers.

ARGUMENTS

refbufstr (Required) - a reference to the byte buffer used for storage.
blocksize (Optional) - the size of the supplied byte buffer. Default is $Genezzo::Block::Std::DEFBLOCKSIZE.
pctfree (Optional) - the percentage of space kept free for future updates. Default is 30 (percent).
pctused (Optional) - after the block is full, the percentage of space that must be open before inserts are re-enabled. Default is 50 (percent).

CONCEPTS

The structure and techniques for the Row Directory Block are described in Chapter 14, "The Tuple-Oriented File System", of "Transaction Processing: Concepts and Techniques" by Jim Gray and Andreas Reuter, 1993.

A tuple is a collection of values -- in the standard vernacular you would call it a "row" in a database. The refbufstr argument to the hash constructor is a "block", a fixed-size contiguous buffer of bytes. When you write ("STORE") a value into the RDBlock hash, it writes an entry into the block as a byte string, and reads ("FETCH") work in an analogous fashion.

The RDBlock data structures refer to stored values as "rows", but the basic "STORE" and "FETCH" only understand how to store and retrieve individual byte strings. Wrapper classes for RDBlock must marshall/unmarshall (Freeze/Thaw) between simple strings and more complex data structures.

The block has some header and footer information, plus a row directory, a data structure that records the offsets, extents, and status information of the stored row data. While the physical location of row data in a block may change as other rows are added, deleted or modified, the row keeps the same hash key.

Each row has an associated "status" bitfield, which is some combination of the following values:

deleted: set if row is deleted. Deleted rows are simply marked as deleted, but the physical storaged is not immediately recouped.
data: set for data rows, unset for metadata. All information stored via the standard public interfaces is data. You can manipulate the private interfaces to store "metadata", additional rows that describe, for example, block contents, transaction information, or data relationships, but are invisible to the public interfaces. By convention, row 0 is always a metadata row.
lock: set if row is locked. Not used in this base class -- provided for subclasses that must supply and maintain the appropriate metadata to identify locker and transaction information.
head: set if the stored value is the very first part of a row.
tail: set if the stored value is the very last part of a row. If "STORE" writes a complete value it sets both head and tail to true. The base class only writes rows that fit in a single block, so both head and tail are always set.
These flags are useful if you wish to write subclasses with rows that span multiple blocks. Neither head nor tail is set if only the middle section of a multi-part row is stored.
isnull: If you supply "STORE" with a value of undef, it writes a marker for a zero-length string and sets this flag. "FETCH" will correctly return an undef.
Note: When packing more complex data structures, make sure to use an encoding that distinguishes between undefs and zero-length strings. A simple scheme for packing an array of strings is to prefix the packed array with a bitstring that specifies which entries are null.

FUNCTIONS

RDBlock support all standard hash operations, with the exception that you cannot create or insert a user key -- you must push new entries and use the generated key or basic iteration to retrieve your data.

It also supports three additional public methods: an array style "PUSH" and "FETCHSIZE", plus a PushHash style HPush. Note that these methods are associated with the tie value (i.e. the blessed ref for the RDBlock class), not the tied hash. Finally, it has five "private" methods that may be of use in constructing subclasses: push_one, packdeleted, offset2hkey, lastkey, prevkey

PUSH this, LIST

PUSH appends the list to the end of the hash and returns the number of items it pushed.

FETCHSIZE

Returns the total number of valid, undeleted data items in the hash.

HPush

HPush returns the new key for each pushed value. It only accepts a single argument, not a list.

my $newkey = $tie_val->HPush("this is a test");

Note that there is not a corresponding "pop" operation.

EXPORT

DATAROW, RowStat

LIMITATIONS

The storage mechanism uses network longs (32 bits?) to describe the lengths of rows and offsets within the block. (That seems pretty large -- maybe it should use shorts to restrict blocksize and row piece length to 64K? Or init should take an optional module name for block type that lets us vary the row directory, header and footer sizing).

TODO

use row directory rowlen vs len/value for row storage
meta row - should binary search for meta id
unicode support

AUTHOR

Jeffrey I. Cohen, jcohen@genezzo.com