cassandra-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Benedict (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-6271) Replace SnapTree in AtomicSortedColumns
Date Thu, 02 Jan 2014 19:07:51 GMT


Benedict commented on CASSANDRA-6271:

It sort of does, it just does it all in a batch and does it to a new copy.

1. It finds the appropriate leaf (but scanning from the previous insertion point, instead
of the root of the tree)
1a. It copies any elements between the previous insertion point and the new insertion point
into the new tree (necessary because we're immutable, but does not affect semantics)
2. Puts it in the leaf if there is room (if not splits), but since it can be inserted ahead
of some pre-existing elements being copied to the new version, the split may be delayed until
*they* are copied forwards in the next (1a) step. These later copies can also be thought of
as inserts.
3. Splits the parent if necessary  (in the addChild method, which is called whenever we have
a completed node to pass to the parent)

It can also, perhaps more easily, be thought of as performing an optimised insert of the union
of all elements in the original tree with the updating collection. Instead of going through
each item in the original tree, when a sub-tree range doesn't intersect with the updating
collection it doesn't descend into the tree, but copies it as is (semantically equivalent
to inserting them all). 

If they do intersect, it descends and performs the optimised insert on the sub-tree. The optimised
insert on a leaf is simply a binary search for the insertion point, with an array copy to
move any leaves between the last insertion point and the new insertion point. If our next
insert takes as back up out of a sub-tree/leaf, we just finish inserting all of the remaining
elements from the original tree, spilling up when necessary as with any insert.

If at any point any of the inserts causes an overflow we "split" by creating a node with half
of all elements we've buffered to a new node and adding it immediately to the parent. This
can happen an arbitrary number of times for a sub-tree, since we don't bound the size of the
updating collection, a huge portion of which which could intersect with a given sub-tree.

> Replace SnapTree in AtomicSortedColumns
> ---------------------------------------
>                 Key: CASSANDRA-6271
>                 URL:
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Benedict
>            Assignee: Benedict
>              Labels: performance
>         Attachments: oprate.svg
> On the write path a huge percentage of time is spent in GC (>50% in my tests, if accounting
for slow down due to parallel marking). SnapTrees are both GC unfriendly due to their structure
and also very expensive to keep around - each column name in AtomicSortedColumns uses >
100 bytes on average (excluding the actual ByteBuffer).
> I suggest using a sorted array; changes are supplied at-once, as opposed to one at a
time, and if < 10% of the keys in the array change (and data equal to < 10% of the size
of the key array) we simply overlay a new array of changes only over the top. Otherwise we
rewrite the array. This method should ensure much less GC overhead, and also save approximately
80% of the current memory overhead.
> TreeMap is similarly difficult object for the GC, and a related task might be to remove
it where not strictly necessary, even though we don't keep them hanging around for long. TreeMapBackedSortedColumns,
for instance, seems to be used in a lot of places where we could simply sort the columns.

This message was sent by Atlassian JIRA

View raw message