cassandra-commits mailing list archives

From "Caleb Rackliffe (Jira)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-16721) Repaired data tracking on a read coordinator is susceptible to races between local and remote requests
Date Mon, 16 Aug 2021 19:45:00 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-16721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17399972#comment-17399972 ]

Caleb Rackliffe commented on CASSANDRA-16721:
---------------------------------------------

I've made a first pass at the patch, and I think it does solve the problem laid out in the
issue description. However, there are a few questions I'm struggling with:
 

1.) Why do we share any aspect of {{RepairedDataInfo}} across threads at all? If we didn't,
it seems like both the problem above and a class of other possible problems (read on) would
be sidestepped completely. More specifically, perhaps we could just indicate to the
{{ReadExecutionController}} whether we should track repaired status?

2.) If we follow the scenario above, and two remote reads return and indicate a mismatch while
the local read is still executing, is it possible that both the original local read (likely on
a Native Transport thread, but possibly on a ReadStage thread) and the second local read started
in {{startRepair()}} (and now on a ReadStage thread) use the same {{RepairedDataInfo}} instance
as they serialize their local data responses?

 
Even if the second item above isn't possible, it still seems like our implementation would
be less brittle if we could find a minimally invasive way to make the change in the first
item. I'm open to making a pass at it, but I want to make sure my starting assumptions are
correct.
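
To make the first item concrete, here is a minimal Java sketch of a thread-confined approach. It is not the patch and not Cassandra code: {{ConfinedTrackingSketch}}, {{LocalRepairedInfo}} and {{ReadExecution}} are hypothetical stand-ins. The only assumption is that each read execution is told once, up front, whether to track repaired status, and then owns its own accumulator.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for illustration; none of these are Cassandra classes.
final class ConfinedTrackingSketch {

    /** Per-execution accumulator that is never handed to another thread. */
    static final class LocalRepairedInfo {
        private final List<String> seen = new ArrayList<>();

        void extend(String partitionKey) { seen.add(partitionKey); }

        int size() { return seen.size(); }
    }

    /** One read execution: the tracking decision is fixed at construction. */
    static final class ReadExecution {
        private final LocalRepairedInfo info; // null means "not tracking"

        ReadExecution(boolean trackRepairedStatus) {
            this.info = trackRepairedStatus ? new LocalRepairedInfo() : null;
        }

        /** Iterates the keys and returns how many were recorded. */
        int execute(List<String> partitionKeys) {
            for (String key : partitionKeys)
                if (info != null) info.extend(key);
            return info == null ? 0 : info.size();
        }
    }

    /** Each call builds a fresh execution, so no tracking state is shared. */
    static int run(boolean track, List<String> keys) {
        return new ReadExecution(track).execute(keys);
    }
}
```

Because the flag is consumed once in the constructor and the accumulator is private to the execution, a second read started later gets its own instance instead of observing or mutating the first one's state.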

> Repaired data tracking on a read coordinator is susceptible to races between local and remote requests
> ------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-16721
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-16721
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Consistency/Coordination
>            Reporter: Sam Tunnicliffe
>            Assignee: Sam Tunnicliffe
>            Priority: Normal
>             Fix For: 4.0.x
>
>
> At read time on a coordinator which is also a replica, the local and remote reads can
> race such that the remote responses are received while the local read is executing. If the
> remote responses are mismatching, triggering a {{DigestMismatchException}} and subsequent
> round of full data reads and read repair, the local runnable may find the {{isTrackingRepairedStatus}}
> flag flipped mid-execution. If this happens after a certain point in execution, it would
> mean that the RepairedDataInfo instance in use is the singleton null object {{RepairedDataInfo.NULL_REPAIRED_DATA_INFO}}.
> If this happens, it can lead to an NPE when calling {{RepairedDataInfo::extend}} when the
> local results are iterated.
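
The failure mode in the description above can be modelled in isolation. The Java sketch below is a simplified, hypothetical model and not Cassandra code: {{FlagFlipSketch}}, {{Info}} and both read methods are invented for illustration, and the concurrent flip is simulated deterministically with a callback run between the two reads of the flag. {{snapshotRead}} shows one possible mitigation (reading the flag once); it is not a claim about what the actual patch does.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Simplified, hypothetical model of a check-then-act hazard; not Cassandra code.
final class FlagFlipSketch {

    static final class Info {
        private StringBuilder digest; // left null in the "not tracking" placeholder

        static final Info PLACEHOLDER = new Info();

        static Info tracking() {
            Info info = new Info();
            info.digest = new StringBuilder();
            return info;
        }

        void extend(String row) { digest.append(row); } // throws NPE on PLACEHOLDER
    }

    /** Reads the shared flag twice; a flip between the two reads causes an NPE. */
    static boolean unsafeRead(AtomicBoolean tracking, Runnable betweenReads) {
        Info info = tracking.get() ? Info.tracking() : Info.PLACEHOLDER; // first read
        betweenReads.run(); // stands in for another thread flipping the flag
        try {
            if (tracking.get()) info.extend("row"); // second read
            return true;
        } catch (NullPointerException e) {
            return false; // the placeholder was extended
        }
    }

    /** Reads the flag once and relies only on the local copy afterwards. */
    static boolean snapshotRead(AtomicBoolean tracking, Runnable betweenReads) {
        boolean track = tracking.get();
        Info info = track ? Info.tracking() : Info.PLACEHOLDER;
        betweenReads.run();
        if (track) info.extend("row");
        return true;
    }
}
```

With the flag starting as false and set to true by the callback, {{unsafeRead}} returns false because it extends the placeholder, while {{snapshotRead}} returns true because it never consults the shared flag a second time.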



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@cassandra.apache.org
For additional commands, e-mail: commits-help@cassandra.apache.org

