NAME

VCP::Filter::changesets - Group revs into changesets

SYNOPSIS

  ## From the command line:
   vcp <source> changesets: ...options... -- <dest>

  ## In a .vcp file:

    ChangeSets:
       time                     <=60     ## seconds
       user_id                  equal    ## case-sensitive equality
       comment                  equal    ## case-sensitive equality
       source_filebranch_id     notequal ## case-sensitive inequality

DESCRIPTION

This filter is automatically loaded when there is no sort filter loaded (both this and VCP::Filter::sort count as sort filters).

Sorting by change_id, etc.

When all revs from the source have change numbers, this filter sorts by change_id, branch_id, and name, regardless of the rules set. The name sort is case sensitive, though it should not be for Win32. This sort by change_id is necessary for sources that supply change_id because revisions are not usually (so far, never) scanned in change set order.

Aggregating changes

If one or more revisions arrive from the source with an empty change_id, the rules for this filter establish the conditions that determine what revisions may be grouped into each change.

In this case, this filter rewrites all change_id fields so that the (eventual) destination can use the change_id field to break the revisions into changes. This is sometimes used by non-changeset oriented destinations to aggregate "changes" as though a user were performing them and to reduce the number of individual operations the destination driver must perform (for instance: VCP::Dest::cvs prefers to not call cvs commit all the time; cvs commit is slow).

Revisions are aggregated into changes using a set of rules that determine what revisions may be combined. One rule is implicit in the algorithm; the others are explicitly specified as a set of defaults that may be altered by the user.

The Implicit Rule

The implicit rule is that no change may contain two revisions where one is a descendant of another. The algorithm starts with the set of revisions that have no parents in this transfer, chooses a set of them to be a change according to the explicit conditions, and emits it. Only when a revision is emitted does this filter consider its offspring for emission. This cannot be changed.
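
The ordering constraint above can be illustrated with a short Python sketch. It is not this filter's Perl implementation: the function name and the parent_of mapping are hypothetical, and for simplicity every candidate is emitted at once, whereas the real filter chooses a subset of the candidates according to the explicit rules.

```python
from collections import defaultdict

def emission_waves(parent_of):
    """Group rev ids into waves; a rev becomes a candidate only
    after its parent has been emitted.

    parent_of maps rev id -> parent id, or None when the rev has
    no parent in this transfer (hypothetical data layout).
    """
    children = defaultdict(list)
    roots = []
    for rev_id, parent in parent_of.items():
        if parent is None:
            roots.append(rev_id)
        else:
            children[parent].append(rev_id)

    waves = []
    candidates = sorted(roots)
    while candidates:
        waves.append(candidates)
        # Only now are the offspring of the emitted revs considered.
        next_candidates = []
        for rev_id in candidates:
            next_candidates.extend(children[rev_id])
        candidates = sorted(next_candidates)
    return waves

parent_of = {"a#1": None, "a#2": "a#1", "a#3": "a#2", "b#1": None}
waves = emission_waves(parent_of)
```

Because a revision is never a candidate in the same wave as its parent, no wave can contain both a revision and one of its descendants.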

(EXPERIMENTAL) The only time this implicit rule is not enough is in a cloning situation. In CVS and VSS, it is possible to "share" files between branches. VSS supports and promotes this model in its user interface and documentation, while CVS allows it more subtly by allowing the same branch to have multiple branch tags. In either case, there are multiple branches of a file that are changed simultaneously. The CVS source recognizes this (and the VSS source may by the time you read this) and chooses a master revision from which to "clone" other revisions. These cloned revisions appear on the child branch as children of the master revision, not as children of the preceding revision on the child branch. This is confusing, but it works. To keep it from confusing the destinations, however, it can be important to make sure that two revisions to a given branch of a given file do not occur in the same change; this is the purpose of the explicit rule "source_filebranch_id notequal", covered below.

The Explicit Rules

Rules may be specified for the ChangeSets filter. If no rules are specified, a set of default rules is used. If any rules are specified, none of the default rules are used. The default rules are explained after rule conditions are explained.

Each rule is a pair of words: a data field and a condition.

There are three conditions: "notequal", "equal" and "<=N" (where N is a number; note that no spaces are allowed before the number unless the spec is quoted somehow):

equal

The "equal" condition is valid for all fields and states that all revisions in the same change must have identical values for the indicated field. So:

    user_id                  equal

states that all revisions in a change must be submitted by the same user.

All "equal" conditions are applied before any other conditions, regardless of the order in which they are specified, to categorize revisions into prototype changes. Once all revisions have been categorized into prototype changes, the "<=N" and "notequal" rules are applied in order to split the prototype changes into as many changes as are needed to satisfy them.

notequal

The "notequal" condition is also valid for all fields and specifies that no two revisions in a change may have equal values for a field. It does not make sense to apply this to time fields, and it is usually only needed to ensure that two revisions to the same file on the same branch do not get bundled into the same change.

<=N

The "<=N" specification is only available for the "time" field. It specifies that no gaps larger than N seconds may exist in a change.
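
As a rough illustration of the gap rule, the Python sketch below splits a list of timestamps wherever two consecutive values are more than N seconds apart. It assumes that "gap" means the distance between neighbouring revisions in time order, which is one reading of the description above; split_on_gaps is a hypothetical name, not part of VCP.

```python
def split_on_gaps(times, max_gap):
    """Split timestamps (in seconds) into runs with no gap larger
    than max_gap between consecutive values."""
    runs = []
    for t in sorted(times):
        if runs and t - runs[-1][-1] <= max_gap:
            runs[-1].append(t)   # close enough to the previous rev
        else:
            runs.append([t])     # gap too large: start a new run
    return runs

runs = split_on_gaps([0, 30, 80, 200, 250], 60)
```

Under this reading, the first run spans 80 seconds in total even though N is 60, because no single gap inside it exceeds 60 seconds.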

The default rules are:

    time                     <=60     ## seconds
    user_id                  equal    ## case-sensitive equality
    comment                  equal    ## case-sensitive equality
    source_filebranch_id     notequal ## case-sensitive inequality

These rules work as follows:

The "time <=60" condition sets a maximum allowable difference between two revisions; revisions that are more than this number of seconds apart are considered to be in different changes.

The "user_id equal" and "comment equal" conditions assert that two revisions must be by the same user and have the same comment in order to be in the same change.

The "source_filebranch_id notequal" condition prevents cloned revs of a file from appearing in the same change as each other (see the discussion above for more details).
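
To make the rule syntax concrete, here is a small Python sketch of a parser for "field condition" pairs. It is only an illustration, not VCP's own parser: parse_rules is a hypothetical name, "##" is assumed to start a comment (as in the examples above), and N is assumed to be a whole number.

```python
import re

def parse_rules(text):
    """Parse lines of 'field condition' into (field, op, n) tuples."""
    rules = []
    for line in text.splitlines():
        line = line.split("##", 1)[0].strip()   # drop trailing comment
        if not line:
            continue
        # Exactly two words per rule; "time <= 60" (with a space)
        # yields three words and is rejected with a ValueError.
        field, condition = line.split()
        if condition in ("equal", "notequal"):
            rules.append((field, condition, None))
            continue
        match = re.fullmatch(r"<=(\d+)", condition)
        if match is None:
            raise ValueError(f"unknown condition: {condition!r}")
        if field != "time":
            raise ValueError("'<=N' is only available for the time field")
        rules.append((field, "<=", int(match.group(1))))
    return rules

DEFAULT_RULES = """
time                     <=60     ## seconds
user_id                  equal    ## case-sensitive equality
comment                  equal    ## case-sensitive equality
source_filebranch_id     notequal ## case-sensitive inequality
"""
rules = parse_rules(DEFAULT_RULES)
```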

ALGORITHM

handle_rev()

As revs are received by handle_rev(), they are stored on disk. Several RAM-efficient (well, for Perl) data structures are built, however, that describe each revision's children and its membership in a changeset. Some or all of these structures may be moved to disk when we need to handle truly large data sets.

The ALL_HAVE_CHANGE_IDS statistic

One statistic that handle_rev() gathers is whether or not all revisions arrived with a non-empty change_id field.

The REV_COUNT statistic

How many revisions have been received. This is used only for UI feedback; primarily it is to forewarn the downstream filter(s) and destination of how many revisions will constitute a 100% complete transfer.

The CHANGES list

As each rev arrives, it is placed in a "protochange" determined solely by the revision's fields in the rules list with an "equal" condition. Protochanges are likely to have too many revisions in them, including revisions that descend from one another and revisions that are too far apart in time.

The CHANGES_BY_KEY index

The categorization of each revision into changes is done by forming a key string from all the fields in the rules list with the "equal" condition. This index maps unique keys to changes.
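
The keyed categorization can be sketched in Python as follows. The function names, the dictionary layout of a revision, and the NUL separator used to join field values are all assumptions made for illustration; the actual key format used by this filter is not documented here.

```python
def protochange_key(rev, equal_fields):
    """Join the values of all 'equal' fields into one key string."""
    return "\x00".join(str(rev[field]) for field in equal_fields)

def build_protochanges(revs, equal_fields):
    """Place each rev in the protochange selected by its key."""
    changes = []          # list of protochanges (lists of rev ids)
    changes_by_key = {}   # key string -> index into changes
    for rev in revs:
        key = protochange_key(rev, equal_fields)
        if key not in changes_by_key:
            changes_by_key[key] = len(changes)
            changes.append([])
        changes[changes_by_key[key]].append(rev["id"])
    return changes, changes_by_key

revs = [
    {"id": "a#1", "user_id": "alice", "comment": "fix"},
    {"id": "b#1", "user_id": "alice", "comment": "fix"},
    {"id": "c#1", "user_id": "bob", "comment": "fix"},
]
changes, changes_by_key = build_protochanges(revs, ["user_id", "comment"])
```

Note that nothing here looks at time or ancestry, which is why protochanges usually need to be split further.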

The CHILDREN index

This is an index of all revisions that are direct offspring of a revision.

The PREDECESSOR_COUNT statistic

Counts the number of parents a revision has that haven't been submitted yet. A revision may have a previous_id and, optionally, also have a from_id (can't have a from_id without a previous_id, however).
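
A minimal Python sketch of how such a count might be kept alongside the CHILDREN index is shown below. The function names and the dictionary layout are hypothetical, and parents that are not part of this transfer are assumed not to count.

```python
def index_parents(revs):
    """Return (counts, children): unsent-parent counts per rev and
    a parent id -> list of child ids index."""
    in_transfer = {rev["id"] for rev in revs}
    counts, children = {}, {}
    for rev in revs:
        # A rev has at most two parents: previous_id and from_id.
        parents = {rev.get("previous_id"), rev.get("from_id")} & in_transfer
        counts[rev["id"]] = len(parents)
        for parent in sorted(parents):
            children.setdefault(parent, []).append(rev["id"])
    return counts, children

def mark_sent(rev_id, counts, children):
    """Decrement the counts of rev_id's children; return the
    children that have just become sendable."""
    ready = []
    for child in children.get(rev_id, []):
        counts[child] -= 1
        if counts[child] == 0:
            ready.append(child)
    return ready

revs = [
    {"id": "main#1"},
    {"id": "main#2", "previous_id": "main#1"},
    {"id": "dev#1", "previous_id": "main#1"},
    {"id": "main#3", "previous_id": "main#2", "from_id": "dev#1"},
]
counts, children = index_parents(revs)
initial_counts = dict(counts)
first_ready = mark_sent("main#1", counts, children)
```

In this example main#3 starts with a count of 2 and only becomes sendable after both main#2 and dev#1 have been sent.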

The REVS_BY_CHANGE_ID index

If all revs do indeed arrive with change_ids, they need to be sorted and sent out in order. This index is gathered until the first rev with an empty change_id arrives.

The ROOT_IDS list

This is a list of the IDs of all revisions that have no parent revisions in this transfer. This is used as the starting point for send_changes(), below.

The CHANGES_BY_REV index

As the large protochanges are split into smaller ones, the resulting CHANGES list is indexed by, among other things, which revs are in the change. This is so the algorithms can quickly find what change a revision is in when it's time to consider sending that revision.

handle_footer()

All the real work occurs when handle_footer() is called. handle_footer() glances at the change_id statistic gathered by handle_rev() and determines whether it can sort by change_id or whether it has to perform change aggregation.

If all revisions arrived with a change_id, handle_footer() calls sort_by_change_id_and_send(). If at least one revision did not, handle_footer() performs change aggregation by calling split_protochanges() and then send_changes().

Any source or upstream filter may perform change aggregation by assigning change_ids to all revisions. VCP::Source::p4 does this. At the time of this writing, no others do.

Likewise, a filter like VCP::Filter::StringEdit may be used to clear out all the change_ids and force change aggregation.
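
The decision made in handle_footer() amounts to a single test, sketched here in Python. The function name and return values are hypothetical, and an "empty" change_id is assumed to mean a missing value or an empty string.

```python
def choose_strategy(revs):
    """Return 'sort' when every rev has a change_id, else 'aggregate'."""
    all_have_change_ids = all(
        rev.get("change_id") not in (None, "") for rev in revs
    )
    return "sort" if all_have_change_ids else "aggregate"

with_ids = [{"change_id": "12"}, {"change_id": "13"}]
one_cleared = [{"change_id": "12"}, {"change_id": ""}]
strategies = (choose_strategy(with_ids), choose_strategy(one_cleared))
```

A single revision with an empty change_id is enough to switch the whole transfer to aggregation.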

sort_by_change_id_and_send()

If all revisions arrived with a change_id, they are sorted by the values of ( change_id, time, branch_id, name ) and sent on. This filter has no option for ignoring change_id; the sort is skipped only when at least one revision arrives with an empty change_id.
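
A multi-field sort of this kind can be sketched in Python with a tuple key. The revision dictionaries are hypothetical and change_id is assumed to be numeric here; Python compares strings case-sensitively, so uppercase names sort before lowercase ones.

```python
def sort_key(rev):
    """Order revs by change_id, then time, branch_id, and name."""
    return (rev["change_id"], rev["time"], rev["branch_id"], rev["name"])

revs = [
    {"change_id": 2, "time": 100, "branch_id": "main", "name": "a.c"},
    {"change_id": 1, "time": 50, "branch_id": "main", "name": "b.c"},
    {"change_id": 1, "time": 50, "branch_id": "main", "name": "Makefile"},
    {"change_id": 1, "time": 40, "branch_id": "dev", "name": "z.c"},
]
ordered_names = [rev["name"] for rev in sorted(revs, key=sort_key)]
```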

split_and_send_changes()

Once all revisions have been placed into protochanges, a change is selected and sent like so:

1. Get an oldest change in which every rev can already be sent. If none is found, select one oldest change and remove any revs that can't be sent yet.
2. Select as many revs as can legally be sent in a change by sorting them into time order and then using the <=N and notequal rules to determine whether each rev can be sent given the revs that have already passed the rules. Delay all other revs for a later change.
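
The second step can be sketched in Python as a single pass over the time-ordered candidates. This is an illustration under stated assumptions rather than the filter's actual code: select_revs and the field names in the example are hypothetical, and the time gap is measured against the most recently accepted rev.

```python
def select_revs(candidates, max_gap, notequal_fields):
    """Split candidates into (selected, delayed) using the <=N and
    notequal rules against the revs already selected."""
    selected, delayed = [], []
    seen = {field: set() for field in notequal_fields}
    for rev in sorted(candidates, key=lambda r: r["time"]):
        gap_ok = not selected or rev["time"] - selected[-1]["time"] <= max_gap
        unique_ok = all(rev[f] not in seen[f] for f in notequal_fields)
        if gap_ok and unique_ok:
            selected.append(rev)
            for f in notequal_fields:
                seen[f].add(rev[f])
        else:
            delayed.append(rev)   # held back for a later change
    return selected, delayed

candidates = [
    {"id": "r1", "time": 0, "source_filebranch_id": "X"},
    {"id": "r2", "time": 20, "source_filebranch_id": "Y"},
    {"id": "r3", "time": 30, "source_filebranch_id": "X"},
    {"id": "r4", "time": 200, "source_filebranch_id": "Z"},
]
selected, delayed = select_revs(candidates, 60, ["source_filebranch_id"])
selected_ids = [rev["id"] for rev in selected]
delayed_ids = [rev["id"] for rev in delayed]
```

Here r3 is delayed because it repeats the source_filebranch_id of r1, and r4 is delayed because it is more than 60 seconds after r2.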

LIMITATIONS

This filter does not take the source_repo_id into account: if somehow you are merging multiple repositories into one and want to interleave the commits/submits "properly", ask for advice.

AUTHOR

Barrie Slaymaker <barries@slaysys.com>

COPYRIGHT

See VCP::License ("vcp help license") for the terms of use.