cassandra-commits mailing list archives

From "Sam Tunnicliffe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-14513) Reverse order queries in presence of range tombstones may cause permanent data loss
Date Tue, 12 Jun 2018 15:57:00 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-14513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16509790#comment-16509790 ]

Sam Tunnicliffe commented on CASSANDRA-14513:
---------------------------------------------

The problem manifests when executing a slice query with reverse ordering against an indexed
partition if the upper bound of the query precedes the first clustering in the partition for
a given SSTable.

The initial search of the index correctly identifies that the slice bounds are not contained
within the partition and {{ReverseIndexedReader::setForSlice}} returns an empty iterator.
However, it doesn’t update the pointer to the current index block in {{IndexState}}. The
pointer remains set to the size of the column index, so that when the initial empty iterator
is exhausted {{ReverseIndexedReader::hasNextInternal}} incorrectly assumes that there is more
to do, bumps the pointer back one to the last index block and starts reading.
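
The pointer behaviour described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not the Cassandra code: {{ReverseIndexSketch}} and {{nextBlockToRead}} are hypothetical names, the index is reduced to a block count, and the "fix" shown (moving the pointer before the first block) is an assumed remedy for illustration rather than the committed patch.

```java
// Toy model of the stale index block pointer; all names are hypothetical.
class ReverseIndexSketch {

    /**
     * Returns the index block a reverse reader would read next after the
     * initial empty iterator is exhausted, or -1 if it reads nothing more.
     *
     * @param blockCount     number of blocks in the column index
     * @param pointerUpdated whether the slice setup step repositioned the
     *                       pointer after finding the slice out of range
     */
    static int nextBlockToRead(int blockCount, boolean pointerUpdated) {
        // The pointer starts at the size of the column index,
        // i.e. one past the last block.
        int currentBlock = blockCount;

        // Slice setup: the slice lies before the first clustering, so the
        // iterator is empty. Assumed remedy: also mark that no blocks remain.
        if (pointerUpdated) {
            currentBlock = -1;
        }

        // "Has next" check: a positive pointer is taken to mean that
        // earlier blocks are still waiting to be read.
        if (currentBlock <= 0) {
            return -1;
        }
        return currentBlock - 1; // step back one block and start reading
    }

    public static void main(String[] args) {
        System.out.println(nextBlockToRead(4, false)); // stale pointer: reads block 3
        System.out.println(nextBlockToRead(4, true));  // updated pointer: reads nothing
    }
}
```

With a stale pointer the model reads the last index block even though the slice matched nothing, which is the unintended read described above.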

If a range tombstone spans the boundary between the penultimate and final index blocks, the
iterator will emit the end marker after first altering the bounds to match those of the query.
The assumption made is that only data that falls within the bounds of the query slice will
be read from disk and so adjusting the tombstone bounds in this way is simply a narrowing
of the range tombstone. The index block pointer bug invalidates this assumption and so a wholly
new and invalid marker is generated.
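
A minimal sketch of why the narrowing assumption matters, using integers as stand-ins for clusterings. The class and method names ({{MarkerClampSketch}}, {{clamp}}, {{covers}}) are hypothetical and do not correspond to Cassandra APIs; the sketch only shows that clamping a bound to the slice is a narrowing when the tombstone overlaps the slice, and produces a position the tombstone never covered when it does not.

```java
// Toy model of adjusting a range tombstone bound to the query slice.
// Integers stand in for clusterings; all names are hypothetical.
class MarkerClampSketch {

    /** Moves a tombstone bound inside the slice [sliceStart, sliceEnd]. */
    static int clamp(int bound, int sliceStart, int sliceEnd) {
        return Math.max(sliceStart, Math.min(bound, sliceEnd));
    }

    /** True if the tombstone [tombStart, tombEnd] covers the position. */
    static boolean covers(int tombStart, int tombEnd, int position) {
        return tombStart <= position && position <= tombEnd;
    }

    public static void main(String[] args) {
        int sliceStart = 0, sliceEnd = 10;

        // Tombstone [5, 15] overlaps the slice: the clamped end bound (10)
        // is still inside the tombstone, so this is a true narrowing.
        int narrowed = clamp(15, sliceStart, sliceEnd);
        System.out.println(narrowed + " covered: " + covers(5, 15, narrowed));

        // Tombstone [50, 70] lies wholly outside the slice and should never
        // have been read. Clamping yields 10, which the tombstone does not
        // cover, so the emitted marker is new and invalid.
        int invented = clamp(70, sliceStart, sliceEnd);
        System.out.println(invented + " covered: " + covers(50, 70, invented));
    }
}
```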

On a single node this new marker alone can shadow live data in other SSTables, but the effect
is transient. A tombstone never gets written to disk, and when the SSTable is compacted the
layout of the partition on disk will _likely_ no longer trigger the bug (though there is no
guarantee of this).

In a multi-node scenario read repair can cause the erroneous marker to be matched to an (unrelated)
marker from another replica, creating a new tombstone, potentially with a very wide range.
This is then propagated to all replicas, causing data loss from the partition.

> Reverse order queries in presence of range tombstones may cause permanent data loss
> -----------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-14513
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14513
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core, CQL, Local Write-Read Paths
>            Reporter: Sam Tunnicliffe
>            Assignee: Sam Tunnicliffe
>            Priority: Blocker
>             Fix For: 3.0.x, 3.11.x, 4.0.x
>
>
> Slice queries in descending sort order can create oversized artificial range tombstones.
> At CL > ONE, read repair can propagate these tombstones to all replicas, wiping out vast
> data ranges that they mistakenly cover.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


