cassandra-commits mailing list archives

From "Peter Schuller (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-2643) read repair/reconciliation breaks slice based iteration at QUORUM
Date Tue, 17 May 2011 20:35:47 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-2643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13035030#comment-13035030 ]

Peter Schuller commented on CASSANDRA-2643:
-------------------------------------------

You're right of course - my example was bogus. I'll also agree that re-trying is reasonable
under the circumstances, though perhaps not optimal.

With regard to the fix, let me just make sure I understand you correctly. Given a read
command with a limit N that yields <N columns (post-reconciliation), we may need to re-request
from one or more nodes. But how do we distinguish between a legitimate short read and a spurious
short read? The criterion seems to me to be that a read is potentially spuriously short if
"one or more of the nodes involved returned a NON-short read". If all of them returned short
reads, it's fine; only if we have results from a node that we cannot prove did indeed exhaust
its list of available columns do we need to check.

That is my understanding of your proposed solution, and it does seem doable on the co-ordinator
side without protocol changes, since we obviously know what we actually got from each node;
it's just a matter of coding acrobatics (not sure how much work).
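
To make sure we're talking about the same check, here is a minimal sketch of that
coordinator-side test (the types and method names are hypothetical, not actual Cassandra
code): a reconciled result that is shorter than the requested limit is only *potentially*
spurious if at least one replica returned a non-short response, i.e. one we cannot prove
was exhausted.

import java.util.List;

public class ShortReadCheck
{
    /** Hypothetical per-replica response: just the number of columns it returned. */
    public static class ReplicaResponse
    {
        public final int columnCount;
        public ReplicaResponse(int columnCount) { this.columnCount = columnCount; }
    }

    /**
     * @param responses       the raw responses received from each replica
     * @param reconciledCount number of live columns left after reconciliation
     * @param limit           the slice limit the client asked for
     * @return true if we may need to re-request from one or more replicas
     */
    public static boolean maybeSpuriousShortRead(List<ReplicaResponse> responses,
                                                 int reconciledCount,
                                                 int limit)
    {
        if (reconciledCount >= limit)
            return false; // full page post-reconciliation; nothing to do

        for (ReplicaResponse response : responses)
        {
            // A replica that returned 'limit' columns was cut off at the limit, so we
            // cannot prove it was exhausted; the short reconciled result may be spurious.
            if (response.columnCount >= limit)
                return true;
        }
        // Every replica returned fewer than 'limit' columns, so each one is known to
        // have been exhausted and the short result is legitimate.
        return false;
    }
}
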

However, would you agree with this claim: This would fix the spurious short read problem specifically,
but does not address the more general problem of consistency - i.e., one might receive columns
that have not gone through reconciliation by QUORUM?

If we are to solve that, while still not requiring protocol changes, I believe we need to
re-try whenever a more general condition is true: that we do not have confirmed QUORUM for
the full range implied by the start+limit range that we are being asked for. In other words,
if one or more of the nodes participating in the read returned a response that satisfies:

  (1) The response was *not* short.
    AND
  (2) The response's "last" column was < the "last" column that we are to return post-reconciliation.

Lacking a protocol change to communicate the authoritative ranges of responses, and given that
the premise is that we *must* deliver start+limit unless there are fewer than 'limit' columns
available, we necessarily can only consider the full range (first-to-last column) of a response
as authoritative (except in the case of a short read, in which case it's authoritative to
infinity).
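
For the sake of discussion, here is a rough sketch of that more general retry condition,
again with made-up types, and assuming (for illustration only) column names that compare as
Strings rather than real column comparators. A replica's response is treated as authoritative
only over the range it actually covered, with a short response authoritative to infinity:

import java.util.List;

public class QuorumRangeCheck
{
    /** Hypothetical view of one replica's slice response. */
    public static class ReplicaSlice
    {
        public final List<String> columnNames; // in sorted order, tombstones included
        public final boolean shortRead;        // returned fewer columns than the limit
        public ReplicaSlice(List<String> columnNames, boolean shortRead)
        {
            this.columnNames = columnNames;
            this.shortRead = shortRead;
        }
        public String last()
        {
            return columnNames.isEmpty() ? null : columnNames.get(columnNames.size() - 1);
        }
    }

    /**
     * @param slices         responses from the replicas that participated in the read
     * @param reconciledLast the last column name we are about to return post-reconciliation
     * @return true if some replica's authoritative range ends before the reconciled range,
     *         i.e. we lack confirmed QUORUM for the tail and should re-request.
     */
    public static boolean needsRetry(List<ReplicaSlice> slices, String reconciledLast)
    {
        if (reconciledLast == null)
            return false; // empty reconciled result: nothing to confirm

        for (ReplicaSlice slice : slices)
        {
            // (1) the response was *not* short, and
            // (2) its last column sorts before the last column we are going to return
            if (!slice.shortRead
                && slice.last() != null
                && slice.last().compareTo(reconciledLast) < 0)
                return true;
        }
        return false;
    }
}
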

Without revisiting the code to figure out the easiest way to implement this, one thought is
that if you agree a clean long-term fix would be to communicate authoritativeness in responses,
perhaps we can at least make the logic that handles this compatible with that way of thinking.
It's just that until protocol changes can happen, we'd (1) infer authoritativeness from the
columns/tombstones in the result instead of from explicit indicators in a response, and
(2) since we cannot propagate short ranges to clients, re-request instead of cleanly returning
a short-but-not-EOF-indicating range to the client.

Thoughts?

> read repair/reconciliation breaks slice based iteration at QUORUM
> -----------------------------------------------------------------
>
>                 Key: CASSANDRA-2643
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2643
>             Project: Cassandra
>          Issue Type: Bug
>    Affects Versions: 0.7.5
>            Reporter: Peter Schuller
>            Priority: Critical
>         Attachments: short_read.sh, slicetest.py
>
>
> In short, I believe iterating over columns is impossible to do reliably with QUORUM due to the way reconciliation works.
> The problem is that the SliceQueryFilter is executing locally when reading on a node, but no attempts seem to be made to consider limits when doing reconciliation and/or read-repair (RowRepairResolver.resolveSuperset() and ColumnFamily.resolve()).
> If a node slices and comes up with 100 columns, and another node slices and comes up with 100 columns, some of which are unique to each side, reconciliation results in > 100 columns in the result set. In this case the effect is limited to "client gets more than asked for", but the columns still accurately represent the range. This is easily triggered by my test-case.
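
As a toy illustration of that over-delivery case (plain Java, nothing Cassandra-specific):
two replicas each return exactly 100 columns for the same slice, but because some columns are
unique to each side, the reconciled union exceeds the requested limit of 100.

import java.util.TreeSet;

public class OverDeliveryExample
{
    public static void main(String[] args)
    {
        TreeSet<Integer> fromA = new TreeSet<>();
        TreeSet<Integer> fromB = new TreeSet<>();
        for (int i = 0; i < 100; i++)
        {
            fromA.add(i);        // A's slice: columns 0..99
            fromB.add(i + 10);   // B's slice: columns 10..109 (10 columns unique to each side)
        }

        // Reconciliation of the two slices is effectively a union of the column sets.
        TreeSet<Integer> reconciled = new TreeSet<>(fromA);
        reconciled.addAll(fromB);
        System.out.println(reconciled.size()); // 110 -- more than the 100 the client asked for
    }
}
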
> In addition to the client receiving "too many" columns, I believe some of them will not be satisfying the QUORUM consistency level for the same reasons as with deletions (see discussion below).
> Now, there *should* be a problem for tombstones as well, but it's more subtle. Suppose A has:
>   1
>   2
>   3
>   4
>   5
>   6
> and B has:
>   1
>   del 2
>   del 3
>   del 4
>   5
>   6 
> If you now slice 1-6 with count=3, the tombstones from B will reconcile against the corresponding columns from A - fine. So you end up getting 1,5,6 back. This made it a bit difficult to trigger in a test case until I realized what was going on. At first I was "hoping" to see a "short" iteration result, which would mean that the process of iterating until you get a short result will cause spurious "end of columns" and thus make it impossible to iterate correctly.
> So, since columns 5-6 exist (and if they didn't, you would legitimately have reached end-of-columns), we do indeed get a result of size 3 which contains 1, 5 and 6. However, only node B would have contributed columns 5 and 6; so there is actually no QUORUM consistency on the co-ordinating node with respect to these columns. If nodes A and C also had 5 and 6, those copies would not have been considered.
> Am I wrong?
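
To spell out the reconciliation in the quoted example with a toy sketch (again not Cassandra
code, and under the assumption that each node's local slice counts live columns, carries along
relevant tombstones, and that B's tombstones simply win on timestamp): A contributes live
columns 1, 2, 3; B contributes 1, 5, 6 plus tombstones for 2-4; the merged, limit-trimmed
result is 1, 5, 6, with 5 and 6 supplied by B alone.

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TombstoneMergeExample
{
    public static void main(String[] args)
    {
        // Per-column state in each node's local slice result: true = live, false = tombstone.
        Map<Integer, Boolean> fromA = new LinkedHashMap<>();
        fromA.put(1, true); fromA.put(2, true); fromA.put(3, true);   // A's slice: 1, 2, 3

        Map<Integer, Boolean> fromB = new LinkedHashMap<>();
        fromB.put(1, true); fromB.put(2, false); fromB.put(3, false);
        fromB.put(4, false); fromB.put(5, true); fromB.put(6, true);  // B's slice: 1, del 2-4, 5, 6

        // Reconcile: B's tombstones (written later in this scenario) shadow A's live columns.
        Map<Integer, Boolean> merged = new TreeMap<>(fromA);
        merged.putAll(fromB);

        // Trim to the requested count of 3 live columns.
        List<Integer> result = new ArrayList<>();
        for (Map.Entry<Integer, Boolean> entry : merged.entrySet())
        {
            if (result.size() == 3)
                break;
            if (entry.getValue())   // skip tombstones
                result.add(entry.getKey());
        }
        System.out.println(result); // [1, 5, 6] -- yet only node B ever supplied 5 and 6
    }
}
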
> In any case, using the script I'm about to attach, you can trigger the over-delivery case very easily:
> (0) disable hinted hand-off to avoid that interacting with the test
> (1) start three nodes
> (2) create ks 'test' with rf=3 and cf 'slicetest'
> (3) ./slicetest.py hostname_of_node_C insert # let it run for a few seconds, then ctrl-c
> (4) stop node A
> (5) ./slicetest.py hostname_of_node_C insert # let it run for a few seconds, then ctrl-c
> (6) start node A, wait for B and C to consider it up
> (7) ./slicetest.py hostname_of_node_A slice # make A co-ordinator though it doesn't necessarily matter
> You can also pass 'delete' (random deletion of 50% of contents) or 'deleterange' (delete all in [0.2,0.8]) to slicetest, but you don't trigger a short read by doing that (see discussion above).

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
