cassandra-commits mailing list archives

From "Marcus Eriksson (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (CASSANDRA-8316) "Did not get positive replies from all endpoints" error on incremental repair
Date Wed, 17 Dec 2014 10:28:14 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-8316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14249690#comment-14249690 ]

Marcus Eriksson edited comment on CASSANDRA-8316 at 12/17/14 10:27 AM:
-----------------------------------------------------------------------

I think we are simply timing out the Prepare message when TRACE is enabled (I can't even start
an 8-node cluster with TRACE on).

One solution could be to increase the timeout for this message, but we use the same timeout
for snapshot creation and that would be just as likely to fail on a heavily loaded cluster,
wdyt [~yukim]?

Also note that in your test you repair all ranges, meaning that when you repair node5, for example,
you actually include node3, 4, 5, 6 and 7, so you can't repair any of those at the same time.
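
To illustrate why repairing all ranges on node5 pulls in node3 through node7, here is a rough
sketch. It assumes RF=3, one token per node (no vnodes), SimpleStrategy-style placement (each
range lives on its owner plus the next RF-1 nodes clockwise), and that the node numbers follow
ring order. The function names are made up for this example; none of this is Cassandra code.

```python
def nodes_involved(ring, node, rf=3):
    """Nodes touched when repairing every range stored on `node`.

    Assumption: each primary range is replicated on its owner and on the
    next rf - 1 nodes clockwise in `ring`.
    """
    n = len(ring)
    i = ring.index(node)
    involved = set()
    # `node` stores its own primary range plus those of the rf - 1 nodes before it.
    for back in range(rf):
        owner = (i - back) % n
        # Every replica of each of those ranges takes part in the repair.
        for fwd in range(rf):
            involved.add(ring[(owner + fwd) % n])
    return sorted(involved, key=ring.index)


def can_repair_together(ring, a, b, rf=3):
    """True if all-range repairs of `a` and `b` share no participant."""
    return not set(nodes_involved(ring, a, rf)) & set(nodes_involved(ring, b, rf))


ring = [f"node{k}" for k in range(1, 9)]  # hypothetical 8-node ring
print(nodes_involved(ring, "node5"))
# ['node3', 'node4', 'node5', 'node6', 'node7']
```

Under these assumptions, two all-range repairs overlap unless the nodes are far enough apart
on the ring, so on a small cluster almost any two of them end up sharing participants.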



was (Author: krummas):
I think we are simply timing out the Prepare message when TRACE is enabled (I can't even start
an 8-node cluster with TRACE on).

One solution could be to increase the timeout, but we use the same timeout for snapshot creation
and that would be just as likely to fail on a heavily loaded cluster, wdyt [~yukim]?

Also note that in your test you repair all ranges, meaning that when you repair node5, for example,
you actually include node3, 4, 5, 6 and 7, so you can't repair any of those at the same time.


>  "Did not get positive replies from all endpoints" error on incremental repair
> ------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-8316
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8316
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: cassandra 2.1.2
>            Reporter: Loic Lambiel
>            Assignee: Marcus Eriksson
>             Fix For: 2.1.3
>
>         Attachments: 0001-patch.patch, 8316-v2.patch, CassandraDaemon-2014-11-25-2.snapshot.tar.gz,
>                      CassandraDaemon-2014-12-14.snapshot.tar.gz, test.sh
>
>
> Hi,
> I've got an issue with incremental repairs on our 15-node production 2.1.2 cluster (new cluster,
> not yet loaded, RF=3).
> After successfully performing an incremental repair (-par -inc) on 3 nodes, I started
> receiving "Repair failed with error Did not get positive replies from all endpoints." from
> nodetool on all remaining nodes:
> [2014-11-14 09:12:36,488] Starting repair command #3, repairing 108 ranges for keyspace xxxx (seq=false, full=false)
> [2014-11-14 09:12:47,919] Repair failed with error Did not get positive replies from all endpoints.
> All the nodes are up and running, and the local system log shows only that the repair commands
> were started.
> I've also noticed that soon after the repair, several nodes started showing higher CPU load
> indefinitely for no apparent reason (no tasks or queries, nothing in the logs). I then
> restarted C* on these nodes and retried the repair on several nodes, which succeeded
> until I hit the issue again.
> I tried to reproduce this on our 3-node preproduction cluster without success.
> It looks like I'm not the only one having this issue: http://www.mail-archive.com/user%40cassandra.apache.org/msg39145.html
> Any idea?
> Thanks
> Loic



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
