lucene-dev mailing list archives

From "Andrzej Bialecki (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-13376) Multi-node race condition to create/remove nodeLost markers
Date Tue, 09 Apr 2019 11:00:00 GMT

    [ https://issues.apache.org/jira/browse/SOLR-13376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16813271#comment-16813271 ]

Andrzej Bialecki  commented on SOLR-13376:
------------------------------------------

bq. it's expected that InactiveMarkersPlanAction is what will clean up the markers

It's expected to _eventually_ clean them - the trigger runs once a day. That's why the section
in {{OverseerTriggerThread.run()}} was removing them on overseer leader change, to clean the
markers that we know for sure are no longer needed. And apparently this creates the race condition.
 
bq.  you just re-enabled the test (w/o any modifications to it) and re-resolved this issue

Well, for the record, see 1cfbd3e1c84d35e741cfc068a8e88f0eff4ea9e1 where I tried to address
another source of the test's instability, and the test's reliability improved after that change.
The race condition that you discovered is something new that I wasn't aware of before, so
I'm going to fix it (and add the missing documentation on {{.scheduled_maintenance}} trigger).
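
For illustration only, here is a toy model of the ordering problem as the issue description puts it: a new Overseer cleans up nodeLost markers while another node's liveNodes listener re-creates them. This is a sketch under stated assumptions, not Solr code. A plain Python set stands in for the marker znodes in ZK, and every function and variable name below is hypothetical.

```python
# Toy model of the nodeLost marker race. A set of node names stands in
# for the marker znodes in ZooKeeper. All names are hypothetical
# illustrations, not Solr APIs.

def overseer_cleanup(markers):
    """New Overseer leader removes the markers it believes are no longer needed."""
    markers.clear()


def live_nodes_listener(markers, old_live, new_live):
    """A node's listener records a marker for each node it saw disappear."""
    for node in old_live - new_live:
        markers.add(node)


def run(order):
    """Apply the two steps in the given order and return the surviving markers."""
    markers = {"node1"}            # marker already present for the lost node
    old_live = {"node1", "node2"}  # the listener's previous view of live nodes
    new_live = {"node2"}           # node1 has gone away
    steps = {
        "cleanup": lambda: overseer_cleanup(markers),
        "listener": lambda: live_nodes_listener(markers, old_live, new_live),
    }
    for name in order:
        steps[name]()
    return markers


# Listener fires first, then the Overseer cleans up: no marker remains.
print(run(["listener", "cleanup"]))   # set()

# Overseer cleans up first, then a late listener re-creates the marker.
print(run(["cleanup", "listener"]))   # {'node1'}
```

The two calls differ only in ordering, which is the point: whether a marker survives depends on which side runs last, so a test asserting on marker state can pass or fail depending on timing.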

> Multi-node race condition to create/remove nodeLost markers
> -----------------------------------------------------------
>
>                 Key: SOLR-13376
>                 URL: https://issues.apache.org/jira/browse/SOLR-13376
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Hoss Man
>            Assignee: Andrzej Bialecki 
>            Priority: Major
>
> NodeMarkersRegistrationTest.testNodeMarkersRegistration is frequently failing on Jenkins
> builds in the same spot, with similar-looking logs.
> Although I haven't been able to reproduce these failures locally, I am fairly confident
> that the problem is a race condition bug that exists between when/how a new Overseer will
> process & clean up "nodeLost" markers in ZK, with how other nodes may (mistakenly) re-create
> those markers in their liveNodes listener.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


