cassandra-commits mailing list archives

From "Sylvain Lebresne (Commented) (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-3708) Support "composite prefix" tombstones
Date Fri, 20 Apr 2012 15:20:41 GMT


Sylvain Lebresne commented on CASSANDRA-3708:

I've updated my branch at to add efficient
on-disk handling of the new range tombstones.

The idea is that we don't want to have to read every range tombstone for each query, but only
the ones corresponding to the columns queried. To achieve that, we write the range tombstones
along with the columns themselves. So the basic principle of the patch is that if we have
a range tombstone RT[x, y] deleting all columns between x and y, we write a tombstone marker
on disk before column x. Of course, in practice it's more complicated, because we want to
be sure to read that tombstone even if we read only, say, y. To ensure that, the tombstone
marker is repeated at the beginning of every column block (index block) the range covers (the
code is smart enough not to repeat a marker that is superseded by other ones, so there won't
be many such repeated markers at the beginning of each block in practice).
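
To make the layout concrete, here is a minimal sketch of the idea, not the code from the patch. Everything in it is an assumption made for illustration: column names are plain ints, a "block" is a fixed number of columns, the class and method names are hypothetical, and the optimization that skips superseded markers is left out.

```java
import java.util.ArrayList;
import java.util.List;

class RangeTombstoneLayoutSketch {

    /** Hypothetical range tombstone RT[start, end]; both bounds are inclusive here. */
    static final class RangeTombstone {
        final int start;
        final int end;

        RangeTombstone(int start, int end) {
            this.start = start;
            this.end = end;
        }

        String marker() {
            return "RT[" + start + "," + end + "]";
        }
    }

    /**
     * Lays out sorted column names into blocks of at most blockSize columns.
     * A marker is written before the first column at or after the tombstone's
     * start, and repeated at the beginning of every later block whose first
     * column is still covered by the range.
     */
    static List<List<String>> layout(int[] sortedColumns, List<RangeTombstone> tombstones, int blockSize) {
        if (blockSize < 1) {
            throw new IllegalArgumentException("blockSize must be at least 1");
        }
        List<List<String>> blocks = new ArrayList<>();
        List<String> current = new ArrayList<>();
        int columnsInBlock = 0;

        for (int i = 0; i < sortedColumns.length; i++) {
            int col = sortedColumns[i];
            if (columnsInBlock == blockSize) {
                // The current block is full: close it and start a new one.
                blocks.add(current);
                current = new ArrayList<>();
                columnsInBlock = 0;
            }
            for (RangeTombstone rt : tombstones) {
                // First write: this is the first column at or after the range start.
                boolean startsHere = rt.start <= col && (i == 0 || rt.start > sortedColumns[i - 1]);
                // Repeat: we are at the start of a block and the range covers its first column.
                boolean repeatAtBlockStart = columnsInBlock == 0 && rt.start <= col && rt.end >= col;
                if (startsHere || repeatAtBlockStart) {
                    current.add(rt.marker());
                }
            }
            current.add(Integer.toString(col));
            columnsInBlock++;
        }

        // Tombstones that start after the last column still have to be written somewhere.
        for (RangeTombstone rt : tombstones) {
            if (sortedColumns.length == 0 || rt.start > sortedColumns[sortedColumns.length - 1]) {
                current.add(rt.marker());
            }
        }
        if (!current.isEmpty()) {
            blocks.add(current);
        }
        return blocks;
    }
}
```

With columns 1 to 9, a block size of 3 and a single RT[2, 8], this sketch writes the marker once before column 2 and again at the start of the blocks beginning at 4 and 7, so a reader that only opens the last block still sees the tombstone.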

Note that those tombstone markers are specific to the on-disk format (in memory we use
an interval tree), which has 2 consequences for the patch:
# the on-disk format now diverges a little bit from the wire format. So the code separates
(hopefully) cleanly the serialization functions that deal with the on-disk format from the others.
I don't think it's a bad idea to have that distinction anyway, since we don't want to break
the wire protocol but it's ok to change the on-disk one.
# on-disk column iterators (SSTable{Slice,Name}Iterator) have to handle those tombstone markers,
which are not columns per se. I.e., after having read them from disk we want to store them in
the interval tree of the ColumnFamily object, not as an IColumn in the ColumnFamily map. To
make this distinction, the code introduces an interface called OnDiskAtom, which basically represents
either a column or a range tombstone. The sstable iterators return those OnDiskAtoms, which
are then ultimately added correctly to the resulting ColumnFamily object. I do think this
is the clean way to handle this, but it is responsible for quite a bit of the code diff.
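
As a rough illustration of the second point, the dispatch could look something like the sketch below. This is not the interface from the patch: the types Atom, Column, RangeTombstoneMarker and RowSketch are hypothetical stand-ins, names are plain strings, timestamps are ignored, and a simple list takes the place of the interval tree.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

class OnDiskAtomSketch {

    /** Hypothetical stand-in for OnDiskAtom: either a column or a range tombstone. */
    interface Atom {}

    static final class Column implements Atom {
        final String name;
        final String value;

        Column(String name, String value) {
            this.name = name;
            this.value = value;
        }
    }

    static final class RangeTombstoneMarker implements Atom {
        final String start;
        final String end;

        RangeTombstoneMarker(String start, String end) {
            this.start = start;
            this.end = end;
        }
    }

    /** Hypothetical stand-in for the in-memory ColumnFamily object. */
    static final class RowSketch {
        final TreeMap<String, String> columns = new TreeMap<>();
        // A plain list stands in for the interval tree used in memory.
        final List<RangeTombstoneMarker> rangeTombstones = new ArrayList<>();

        /** Routes each atom read from disk to the structure it belongs in. */
        void addAtom(Atom atom) {
            if (atom instanceof Column) {
                Column c = (Column) atom;
                columns.put(c.name, c.value);
            } else if (atom instanceof RangeTombstoneMarker) {
                rangeTombstones.add((RangeTombstoneMarker) atom);
            } else {
                throw new IllegalArgumentException("unknown atom type: " + atom);
            }
        }

        /** True if some range tombstone covers the given column name (bounds inclusive). */
        boolean isDeleted(String name) {
            for (RangeTombstoneMarker rt : rangeTombstones) {
                if (rt.start.compareTo(name) <= 0 && name.compareTo(rt.end) <= 0) {
                    return true;
                }
            }
            return false;
        }
    }

    /** Drains what an sstable iterator would return into a row. */
    static RowSketch collect(Iterable<Atom> atoms) {
        RowSketch row = new RowSketch();
        for (Atom atom : atoms) {
            row.addAtom(atom);
        }
        return row;
    }
}
```

The point of the common supertype is that the iterators can return a single stream of things read from disk, and only the final consumer has to know that markers and columns end up in different structures.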

I'll also note that both those changes should be useful for CASSANDRA-4180 too to handle the
end-of-row marker described in that issue.

Now I admit this patch is not a small one, but unit tests are passing and there are a few basic
tests at

Lastly, I'll add that the support for this in CQL3 is minimal as of this patch. We only allow
what is basically the equivalent of the 'delete a whole super column' behavior. But it would
be simple to allow for more generic use of range tombstones, i.e. to allow stuff like:
DELETE FROM test WHERE k=0 AND c > 3 AND c <= 10
But the patch is big enough that we can look at that later.

> Support "composite prefix" tombstones
> -------------------------------------
>                 Key: CASSANDRA-3708
>                 URL:
>             Project: Cassandra
>          Issue Type: Sub-task
>            Reporter: Jonathan Ellis
>            Assignee: Sylvain Lebresne


