cassandra-commits mailing list archives

From Björn Hegerfors (JIRA) <>
Subject [jira] [Commented] (CASSANDRA-8243) DTCS can leave time-overlaps, limiting ability to expire entire SSTables
Date Thu, 13 Nov 2014 17:31:34 GMT


Björn Hegerfors commented on CASSANDRA-8243:

An expired column is equivalent to a tombstone with the same timestamp in Cassandra's eyes,
right? Compactions even turn them into tombstones, if they can't be immediately purged. So
to simplify, we're dealing with all-tombstone SSTables. Both the old and new implementations
agree that removing an SSTable can only happen if the oldest SSTable (the one with the lowest
minTimestamp) is all-tombstones (= has fully expired). Both implementations also agree that
this oldest SSTable may not overlap (in time span) with an SSTable containing any non-tombstone
data. If there is no such overlap, everything in any SSTable (with an overlapping row range,
anyway) written with a timestamp less than or equal to this oldest SSTable's maxTimestamp is
guaranteed to be a tombstone.

Also, since any SSTable that either of the implementations removes is an all-tombstone SSTable,
the only thing that can go wrong is that something is resurrected. Combined with the reasoning
in my previous paragraph, the only thing that could be resurrected when a tombstone for column
x with timestamp t is removed is another tombstone for column x, with a lower timestamp t'!
When could that matter? Only if some other SSTable makes a constructive write to column x
in the interval (t', t]. But that's impossible, because that would then be an SSTable containing
some non-tombstone data with a minTimestamp less than or equal to the oldest SSTable's maxTimestamp,
which goes against the assumption that no such SSTable exists!

There you have a proof by contradiction that the oldest SSTable can be safely removed if it
is all-tombstones and doesn't overlap with any SSTable containing any non-tombstone data.
If we then consider the oldest SSTable free to remove, the same rules apply to the oldest
remaining SSTable and so on. This is the rule that my implementation uses. From the comments
it looks like we already agree intuitively on this, but I thought a more formal proof like
this might help get this committed. [~slebresne] any reason to still not submit this patch
to 2.0?
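
As a rough illustration of the rule argued for above, here is a minimal Python sketch. It is
not the code from the attached patch: the SSTable class and the droppable_sstables function
are hypothetical stand-ins, and the sketch assumes that all SSTables overlap in row range, so
only time spans matter.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SSTable:
    # Hypothetical stand-in for SSTable metadata, not Cassandra's own class.
    name: str
    min_timestamp: int
    max_timestamp: int
    fully_expired: bool  # True if the SSTable holds only tombstones/expired cells


def droppable_sstables(sstables: List[SSTable]) -> List[SSTable]:
    """Repeatedly apply the 'oldest SSTable' rule and return what may be dropped."""
    # Lowest minTimestamp among SSTables that still contain non-tombstone data.
    live_mins = [s.min_timestamp for s in sstables if not s.fully_expired]
    min_live = min(live_mins, default=None)

    dropped = []
    # Walk from the oldest SSTable (lowest minTimestamp) upwards.
    for candidate in sorted(sstables, key=lambda s: s.min_timestamp):
        if not candidate.fully_expired:
            break  # the oldest remaining SSTable still has live data
        if min_live is not None and min_live <= candidate.max_timestamp:
            break  # time span overlaps an SSTable with non-tombstone data
        dropped.append(candidate)
    return dropped


if __name__ == "__main__":
    tables = [
        SSTable("a", 0, 10, True),
        SSTable("b", 8, 20, True),    # overlaps "a", but is itself all tombstones
        SSTable("c", 21, 30, False),  # still contains live data
        SSTable("d", 25, 40, True),   # expired, but newer than live data in "c"
    ]
    print([s.name for s in droppable_sstables(tables)])
```

In this example "a" and "b" overlap each other in time, yet both can go, because neither
overlaps an SSTable with non-tombstone data. "d" stays, since "c" is older and still live.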

Oh, and I noticed that I didn't update the Javadoc, so here comes a new patch.

> DTCS can leave time-overlaps, limiting ability to expire entire SSTables
> ------------------------------------------------------------------------
>                 Key: CASSANDRA-8243
>                 URL:
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Björn Hegerfors
>            Assignee: Björn Hegerfors
>            Priority: Minor
>              Labels: compaction, performance
>             Fix For: 2.0.12, 2.1.3
>         Attachments: cassandra-trunk-CASSANDRA-8243-aggressiveTTLExpiry.txt, cassandra-trunk-CASSANDRA-8243-aggressiveTTLExpiry.txt
>
> CASSANDRA-6602 (DTCS) and CASSANDRA-5228 are supposed to be a perfect match for tables
> where every value is written with a TTL. DTCS makes sure to keep old data separate from new
> data. So shortly after the TTL has passed, Cassandra should be able to throw away the whole
> SSTable containing a given data point.
> CASSANDRA-5228 deletes the very oldest SSTables, and only if they don't overlap (in terms
> of timestamps) with another SSTable which cannot be deleted.
> DTCS however, can't guarantee that SSTables won't overlap (again, in terms of timestamps).
> In a test that I ran, every single SSTable overlapped with its nearest neighbors by a very
> tiny amount. My reasoning for why this could happen is that the dumped memtables were already
> overlapping from the start. DTCS will never create an overlap where there is none. I surmised
> that this happened in my case because I sent parallel writes which must have come out of order.
> This was just locally, and out of order writes should be much more common non-locally.
> That means that the SSTable removal optimization may never get a chance to kick in!
> I can see two solutions:
> 1. Make DTCS split SSTables on time window borders. This will essentially only be done
>    on a newly dumped memtable once every base_time_seconds.
> 2. Make TTL SSTable expiry more aggressive. Relax the conditions on which an SSTable
>    can be dropped completely, of course without affecting any semantics.
