lucene-dev mailing list archives

From "Erick Erickson (JIRA)" <j...@apache.org>
Subject [jira] [Created] (SOLR-7936) Bogus failure when deleting collections.
Date Mon, 17 Aug 2015 06:39:45 GMT
Erick Erickson created SOLR-7936:
------------------------------------

             Summary: Bogus failure when deleting collections.
                 Key: SOLR-7936
                 URL: https://issues.apache.org/jira/browse/SOLR-7936
             Project: Solr
          Issue Type: Bug
            Reporter: Erick Erickson
            Assignee: Erick Erickson


When looking at the CDCR test failures, we began to wonder whether the problem was in:
1> the CDCR code itself
2> the test framework
3> Solr

Some of the failures seem to be "impossible", assuming collection creation and deletion work correctly.

So I wrote a little program to exercise collection creation/deletion outside the test framework
by just adding and deleting the same collection over and over again. It started failing
regularly in OverseerCollectionMessageHandler.deleteCollection, at about line 780, where it
throws the "Could not fully remove collection" exception:

{code}
      TimeOut timeout = new TimeOut(30, TimeUnit.SECONDS);
      boolean removed = false;
      while (! timeout.hasTimedOut()) {
        Thread.sleep(100);
        // Works so far if the following call is uncommented:
        // zkStateReader.updateClusterState();
        removed = !zkStateReader.getClusterState().hasCollection(collection);
        if (removed) {
          Thread.sleep(500); // just a bit of time so it's more likely other
                             // readers see on return
          break;
        }
      }
      if (!removed) {
        throw new SolrException(ErrorCode.SERVER_ERROR,
            "Could not fully remove collection: " + collection);
      }
{code}

However, the collection really is gone from clusterstate. When I add the updateClusterState()
call shown above, it no longer seems to fail. Is it as simple as the updateClusterState() call?
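
To illustrate the hypothesis (and only the hypothesis), here is a minimal, Solr-free sketch of a
polling loop that reads a cached view of cluster state. FakeZk, CachedStateReader and
waitForRemoval are hypothetical names invented for this example; they are not Solr classes. The
sketch assumes the worst case, in which the cached copy changes only on an explicit refresh. It
does not claim that this is how ZkStateReader actually behaves.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;

public class StaleClusterStateDemo {

    /** Hypothetical stand-in for the authoritative store (ZooKeeper in the real system). */
    static final class FakeZk {
        private final Set<String> collections = ConcurrentHashMap.newKeySet();
        void create(String name) { collections.add(name); }
        void delete(String name) { collections.remove(name); }
        Set<String> snapshot() {
            return Collections.unmodifiableSet(new HashSet<>(collections));
        }
    }

    /** Hypothetical reader that answers from a cached snapshot until refresh() is called. */
    static final class CachedStateReader {
        private final FakeZk zk;
        private volatile Set<String> cached;
        CachedStateReader(FakeZk zk) { this.zk = zk; this.cached = zk.snapshot(); }
        void refresh() { cached = zk.snapshot(); }
        boolean hasCollection(String name) { return cached.contains(name); }
    }

    /** Polls until the reader no longer reports the collection, or the timeout expires. */
    static boolean waitForRemoval(CachedStateReader reader, String collection,
                                  boolean refreshEachPass, long timeoutMs)
            throws InterruptedException {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        while (System.nanoTime() - deadline < 0) {
            Thread.sleep(10);
            if (refreshEachPass) {
                reader.refresh(); // analogous in spirit to the updateClusterState() call
            }
            if (!reader.hasCollection(collection)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) throws InterruptedException {
        FakeZk zk = new FakeZk();
        zk.create("test");
        CachedStateReader reader = new CachedStateReader(zk);
        zk.delete("test"); // the collection is really gone from the store

        System.out.println("without refresh: " + waitForRemoval(reader, "test", false, 200));
        System.out.println("with refresh:    " + waitForRemoval(reader, "test", true, 200));
    }
}
```

In this toy model the first call times out even though the collection has been deleted, and the
second succeeds on its first pass. That matches the symptom described here, but it says nothing
about why the real cached state would be stale.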

Without the update in place, it failed within 20 reps very regularly. So far, with the update
in place we're at 132 and counting. Any comments?
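
For anyone who wants to reproduce this, the repetition loop has roughly the following shape. This
is a sketch, not the actual program: CollectionAdmin is a hypothetical interface standing in for
whatever client issues the create and delete calls, and the flaky implementation in main exists
only to exercise the loop.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class CreateDeleteLoop {

    /** Hypothetical abstraction over the client that issues the collection calls. */
    interface CollectionAdmin {
        void create(String name) throws Exception;
        void delete(String name) throws Exception;
    }

    /** Returns the 1-based repetition that first failed, or -1 if every repetition passed. */
    static int run(CollectionAdmin admin, String name, int reps) {
        for (int i = 1; i <= reps; i++) {
            try {
                admin.create(name);
                admin.delete(name);
            } catch (Exception e) {
                System.err.println("rep " + i + " failed: " + e.getMessage());
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Fake admin that fails on the 20th delete, purely to demonstrate the harness.
        AtomicInteger deletes = new AtomicInteger();
        CollectionAdmin flaky = new CollectionAdmin() {
            @Override public void create(String name) { }
            @Override public void delete(String name) throws Exception {
                if (deletes.incrementAndGet() == 20) {
                    throw new Exception("Could not fully remove collection: " + name);
                }
            }
        };
        System.out.println("first failing rep: " + run(flaky, "test", 1000));
    }
}
```

Stopping at the first failure and reporting its repetition number makes it easy to compare runs
with and without the extra update call.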

If this runs 1,000 times tonight, I'll check it in if there are no objections. I don't know
what it means for CDCR yet though.

I'm also suspicious of the 500ms sleep. Anyone have a clue what that's in there for?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
