cassandra-commits mailing list archives

From "Stefan Podkowinski (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-11349) MerkleTree mismatch when multiple range tombstones exists for the same partition and interval
Date Wed, 23 Mar 2016 12:37:25 GMT


Stefan Podkowinski commented on CASSANDRA-11349:

I gave the patch some more thought and I'm now confident that the proposed change is the
best way to address the issue.

Basically what happens during validation compaction is that a scanner is created for each
sstable. The {{CompactionIterable.Reducer}} will then create a {{LazilyCompactedRow}} with
an iterable of {{OnDiskAtom}} for the same key in each sstable. The purpose of {{LazilyCompactedRow}}
during validation compaction is to create a digest of the compacted version of all atoms that
would represent a single row. This is done cell by cell, where each collection of atoms for
a single cell name is consumed by {{LazilyCompactedRow.Reducer}}.
The decision on whether {{LazilyCompactedRow.Reducer}} should finish merging a cell and
move on to the next one is currently made by {{AbstractCellNameType.onDiskAtomComparator}},
as evaluated by {{MergeIterator.ManyToOne}}. However, the comparator does not compare only
by name, but also by {{DeletionTime}} in the case of {{RangeTombstone}}s. As a consequence,
{{MergeIterator.ManyToOne}} will advance when two {{RangeTombstone}}s with different deletion
times are read, which breaks the "_will be called one or more times with cells that share
the same column name_" contract in {{LazilyCompactedRow.Reducer}} (see the sketch below).
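
To make the contract break concrete, here is a minimal, self-contained sketch (the {{Tombstone}} record and comparator below are stand-ins I made up, not Cassandra's real classes): a comparator that falls back to deletion time reports two tombstones over the same interval as unequal, so a {{ManyToOne}}-style merge advances instead of reducing them together.

{code:java}
import java.util.Comparator;

public class MergeContractSketch {
    // Stand-in for a range tombstone: a clustering interval plus the
    // time at which the deletion was issued.
    record Tombstone(String min, String max, long deletionTime) {}

    public static void main(String[] args) {
        // Mirrors what the validation merge effectively does today:
        // compare the interval first, then fall back to deletion time.
        Comparator<Tombstone> likeOnDiskAtomComparator =
                Comparator.comparing(Tombstone::min)
                          .thenComparing(Tombstone::max)
                          .thenComparingLong(Tombstone::deletionTime);

        Tombstone older = new Tombstone("b", "b", 1000L);
        Tombstone newer = new Tombstone("b", "b", 2000L);

        // Non-zero result: the merge treats the two tombstones as
        // distinct cells and hands them to the reducer separately,
        // so both end up in the digest instead of the merged one.
        System.out.println(likeOnDiskAtomComparator.compare(older, newer)); // < 0
    }
}
{code}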

The submitted patch introduces a new {{Comparator<OnDiskAtom>}} that basically works
like {{onDiskAtomComparator}}, but does not compare deletion times. As simple as that.
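
In the same toy terms as the sketch above (again, these names are mine, not the patch's actual code), the fix boils down to dropping the deletion-time tiebreak:

{code:java}
import java.util.Comparator;

public class FixedComparatorSketch {
    record Tombstone(String min, String max, long deletionTime) {}

    public static void main(String[] args) {
        // Same as the sketch above, minus the deletion-time tiebreak:
        // two tombstones covering the same interval now compare equal.
        Comparator<Tombstone> intervalOnly =
                Comparator.comparing(Tombstone::min)
                          .thenComparing(Tombstone::max);

        Tombstone older = new Tombstone("b", "b", 1000L);
        Tombstone newer = new Tombstone("b", "b", 2000L);

        // Zero result: the merge hands both tombstones to the reducer
        // in a single reduction, and only the merged tombstone is
        // digested, matching what a compacted node would produce.
        System.out.println(intervalOnly.compare(older, newer)); // 0
    }
}
{code}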


The only other places where {{LazilyCompactedRow}} is used besides validation compaction
are the cleanup and scrub functions, which shouldn't be affected, as those work on
individual sstables and I assume there's no case where a single sstable can contain
multiple identical range tombstones with different timestamps.

> MerkleTree mismatch when multiple range tombstones exists for the same partition and interval
> ---------------------------------------------------------------------------------------------
>                 Key: CASSANDRA-11349
>                 URL:
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Fabien Rousseau
>            Assignee: Stefan Podkowinski
> We observed that repair, for some of our clusters, streamed a lot of data and many partitions were "out of sync".
> Moreover, the read repair mismatch ratio is around 3% on those clusters, which is really high.
> After investigation, it appears that, if two range tombstones exist for a partition for the same range/interval, they're both included in the merkle tree computation.
> But, if for some reason, on another node, the two range tombstones were already compacted into a single range tombstone, this will result in a merkle tree difference (see the toy digest sketch after this description).
> Currently, this is clearly bad because MerkleTree differences are dependent on compactions (and if a partition is deleted and created multiple times, the only way to ensure that repair "works correctly"/"doesn't overstream data" is to major compact before each repair... which is not really feasible).
> Below is a list of steps allowing to easily reproduce this case:
> {noformat}
> ccm create test -v 2.1.13 -n 2 -s
> ccm node1 cqlsh
> CREATE KEYSPACE test_rt WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 2};
> USE test_rt;
> CREATE TABLE table1 (
>     c1 text,
>     c2 text,
>     c3 float,
>     c4 float,
>     PRIMARY KEY ((c1), c2)
> );
> INSERT INTO table1 (c1, c2, c3, c4) VALUES ( 'a', 'b', 1, 2);
> DELETE FROM table1 WHERE c1 = 'a' AND c2 = 'b';
> ctrl ^d
> # now flush only one of the two nodes
> ccm node1 flush 
> ccm node1 cqlsh
> USE test_rt;
> INSERT INTO table1 (c1, c2, c3, c4) VALUES ( 'a', 'b', 1, 3);
> DELETE FROM table1 WHERE c1 = 'a' AND c2 = 'b';
> ctrl ^d
> ccm node1 repair
> # now grep the log and observe that some inconsistencies were detected between nodes (while it shouldn't have detected any)
> ccm node1 showlog | grep "out of sync"
> {noformat}
> Consequences of this are a costly repair and the accumulation of many small SSTables (up to thousands for a rather short period of time when using vnodes, until compaction absorbs those small files), but also an increased size on disk.
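
For intuition on the mismatch described above, a toy digest sketch (plain JDK {{MessageDigest}}; the string encodings stand in for Cassandra's actual atom serialization): the uncompacted node hashes both tombstones, the compacted node hashes only one, so the resulting Merkle tree leaves differ.

{code:java}
import java.security.MessageDigest;
import java.util.Arrays;

public class DigestSketch {
    public static void main(String[] args) throws Exception {
        // Node A still holds both range tombstones for the interval;
        // node B has already compacted them down to the newer one.
        MessageDigest nodeA = MessageDigest.getInstance("MD5");
        nodeA.update("RT[a:b]@1000".getBytes());
        nodeA.update("RT[a:b]@2000".getBytes());

        MessageDigest nodeB = MessageDigest.getInstance("MD5");
        nodeB.update("RT[a:b]@2000".getBytes());

        // Logically identical data, different digests: the Merkle
        // trees disagree and repair streams the partition needlessly.
        System.out.println(Arrays.equals(nodeA.digest(), nodeB.digest())); // false
    }
}
{code}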
